0.1 Names, functions and debugging

0.1.1 Naming conventions

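  • Valid names contain letters (A-Z, a-z), digits (0-9), . and _. A name must start with a letter or a period, and a leading period cannot be followed by a digit.
  • Avoid F and T as names: they are shorthands for FALSE and TRUE and can be overwritten. Some names are by convention only used in a specific context.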
  • Notations:
    • Snake notation: name_of_function
    • Period notation: name.of.function
    • Camel notation: nameOfFunction (upper camel case: NameOfFunction)
  • Do not use period notation with S3 classes: S3 methods are named generic.class, so a period in a name can be mistaken for a method.
  • Use the same naming notation everywhere - be consistent!
  • Variable names beginning with a period will be hidden in the environment, but still available.
  • Split your big problem into small subproblems recursively, and encapsulate your code in functional blocks.
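
The points about hidden names and about T and F can be checked directly in R. The sketch below is only an illustration; the environment env and the variable names visible_var and .hidden_var are made up for the example.

```r
# Use a separate environment so the example leaves the workspace untouched
env <- new.env()
assign("visible_var", 1, envir = env)
assign(".hidden_var", 2, envir = env)

visible_names <- ls(env)                         # names starting with a period are omitted
all_names <- ls(env, all.names = TRUE)           # all.names = TRUE lists them as well
hidden_value <- get(".hidden_var", envir = env)  # hidden, but still available

# T is an ordinary variable and can be overwritten, unlike the reserved word TRUE
t_after_overwrite <- local({
  T <- 0
  isTRUE(T)
})

print(visible_names)
print(all_names)
print(t_after_overwrite)
```

Inside local(), T no longer behaves like TRUE, which is why spelling out TRUE and FALSE is the safer habit.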

0.1.2 Functions

  • Avoid accessing and modifying globals!
  • Make the data the very first argument, so that the function works with pipes
  • Give parameters sensible default values
  • Global defaults can be changed with options()
  • If you are re-using someone else’s function, write a wrapper. A wrapper is a design pattern in which you put the function inside another function.
  • Let the user turn off messages and similar output.
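
A minimal sketch of how these guidelines might look together. The function names scale_values and robust_center, and the option name "myproject.digits", are hypothetical and chosen only for this illustration.

```r
# Data first, defaults for everything else, and a switch for messages.
# "myproject.digits" is a hypothetical option name used only for this sketch.
scale_values <- function(data, factor = 2,
                         digits = getOption("myproject.digits", default = 3),
                         verbose = TRUE) {
  if (verbose) {
    message("Scaling ", length(data), " values by ", factor)
  }
  round(data * factor, digits = digits)
}

# Wrapper: someone else's function (here stats::median) is put inside our own,
# so the rest of the code depends only on robust_center()
robust_center <- function(data, ...) {
  stats::median(data, na.rm = TRUE, ...)
}

scaled_default <- scale_values(c(1.23456, 2), verbose = FALSE)
center <- robust_center(c(1, NA, 3))

# Changing the global default through options(), then restoring it
old_opts <- options(myproject.digits = 1)
scaled_one_digit <- scale_values(c(1.23456, 2), verbose = FALSE)
options(old_opts)

print(scaled_default)
print(center)
print(scaled_one_digit)
```

If the wrapped function is later replaced, only the body of robust_center() has to change.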

0.1.3 Debugging

20 percent of the code has 80 percent of the errors - Lowell Arthur

Beware of bugs in the above code; I have only proved it correct, not tried it - Donald Knuth

  • Types of bugs:
    • Syntax
    • Arithmetic
    • Type
    • Logic
  • To avoid bugs:
    • Encapsulate code in smaller functions that can be tested
    • Use classes and type checking
    • Test your functions with known data
  • Handle errors with tryCatch()
    • See the help page: ?tryCatch
  • The stop() function signals an error and stops the execution of the program
  • Debugging options:
    • print() statements
    • Traceback
      • traceback()
      • Shows the function calls and what parameters were used
    • Dumping frames
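      • options(error = quote(dump.frames("testdump", TRUE))) # on error, save the call stack to testdump.rda
      • options(error = NULL) # reset the error handler afterwards
      • load("testdump.rda")
      • # debugger(testdump)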
    • Step-by-step debugging
      • debug(h) - flag the function h; the next call to h() opens the browser
      • n - execute next line
      • c - continue execution to the end of the function
      • Q - quit debug mode
    • Profiling
      • t1 <- proc.time()
      • some.code()
      • t2 <- proc.time()
      • t2 - t1
      • The microbenchmark package measures execution times of short expressions
      • Advanced profiling: the profr package returns tables with profiling information; profvis provides an interactive visualization
  • Growing a vector element by element makes R copy it to another place in memory each time - better to specify the length of the vector from the start, rather than adding to it
    • Use vectorization!
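
The combination of type checking, stop() and tryCatch() mentioned in the list above can be sketched as follows. The function safe_log is a hypothetical example, not part of any package.

```r
# A small function with an explicit type check; stop() signals an error
safe_log <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric, got ", class(x)[1])
  }
  log(x)
}

# tryCatch() runs the expression and hands any error to the error handler
result <- tryCatch(
  safe_log("a"),
  error = function(e) {
    message("Caught error: ", conditionMessage(e))
    NA_real_  # fallback value returned instead of aborting
  },
  finally = message("Done")  # runs whether or not an error occurred
)

print(result)
print(safe_log(1))
```

Here the error is turned into a message and a missing value, so the surrounding script keeps running.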

0.1.4 Optimizing your code

  • Write more efficient code - use vectorization instead of loops
  • Preallocate memory to avoid repeated copying (copy-on-modify)
    • Identify address in memory:
      • pryr::address()
    • Modifying values in a matrix can cause the whole matrix to be copied to another location in memory
  • The bigmemory package is for huge data sets
  • GPU computations are possible
    • The gpuR package
    • Data can be created/stored on GPU memory, and run on the GPU (faster than sending it to the GPU from the main memory)
  • Multicore support
  • Use data.table or tibble instead of data.frame
  • Parallelization in R
    • parallel package
    • Can select the number of cores etc.
    • Compatible with Nextflow
    • FlowR package is also available
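
To make the advice on preallocation and vectorization concrete, the sketch below compares three ways of computing the same squares. The function names grow, prealloc and vectorized are made up for this illustration, and the timings depend on the machine, so none are quoted here.

```r
n <- 1e4

# 1. Growing the vector: each c() call builds a new, longer vector
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# 2. Preallocating: the vector has its final length from the start
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# 3. Vectorized: no explicit loop at all
vectorized <- function(n) seq_len(n)^2

# system.time() reports user, system and elapsed time for an expression
print(system.time(res_grow <- grow(n)))
print(system.time(res_prealloc <- prealloc(n)))
print(system.time(res_vectorized <- vectorized(n)))
```

All three return the same values; only the amount of copying and interpretation overhead differs.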

0.1.5 Tasks

0.1.5.1 1: Coding Style

0.1.5.1.1 1.1 Which of the following are valid/good variable names in R. What is wrong with the ones that are invalid/bad?
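Variable name            Validity
var1                     OK
3way_handshake           Invalid: starts with a digit
.password                OK, hidden
_test_                   Invalid: starts with an underscore
my-matrix-M              Invalid: - is parsed as the minus operator
three.dimensional.array  OK
3D.distance              Invalid: starts with a digit
.2objects                Invalid: a leading period cannot be followed by a digit
wz3gei92                 Not good: valid, but not descriptive
next                     Invalid: reserved word
P                        OK
Q                        OK
R                        OK
S                        OK
T                        Not good: shorthand for TRUE
X                        Not good
is.larger                Not good: period notation resembles an S3 method name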
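
Syntactic validity (as opposed to good style) can be checked mechanically. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The vector name candidates is made up for the example.

```r
candidates <- c("var1", "3way_handshake", ".password", "_test_", "my-matrix-M",
                "three.dimensional.array", "3D.distance", ".2objects",
                "wz3gei92", "next", "T", "is.larger")

# TRUE where the string is already a syntactically valid, non-reserved name
is_syntactic <- candidates == make.names(candidates)

print(data.frame(name = candidates, syntactic = is_syntactic))
```

Note that this only tests syntax: names such as T or wz3gei92 pass the check even though they are poor choices.
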
0.1.5.1.2 1.2 The code below works, but can be improved. Improve it!

Raw:

myIterAtoR.max <- 5
second_iterator.max<-7
col.NUM= 10
row.cnt =10
fwzy45 <- matrix(rep(1, col.NUM*row.cnt),nrow=row.cnt)
for(haystack in (2-1):col.NUM){
  for(needle in 1:row.cnt) {
if(haystack>=myIterAtoR.max){
fwzy45[haystack, needle]<-NA}
  }}

Formatted

iter_max <- 5
col_num <- 10
row_num <- 10
A <- matrix(rep(1, col_num * row_num), nrow = row_num)
# i indexes rows and j indexes columns, matching A[i, j]
for (i in 1:row_num) {
  for (j in 1:col_num) {
    if (i >= iter_max) {
      A[i, j] <- NA
    }
  }
}
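
Since the loop sets whole rows to NA, the same result can also be obtained without loops. This is a possible vectorized alternative, not the official solution to the task; the name A_vec is made up to keep it apart from A.

```r
iter_max <- 5
col_num <- 10
row_num <- 10

# Fill the matrix with ones directly instead of using rep()
A_vec <- matrix(1, nrow = row_num, ncol = col_num)

# Logical row index: every row whose number is at least iter_max becomes NA
A_vec[seq_len(row_num) >= iter_max, ] <- NA

print(A_vec)
```

The logical index also behaves sensibly when iter_max is larger than row_num, in which case no row is selected.
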
0.1.5.1.3 1.3 Improve formatting and style of the following code

Raw:

simulate_genotype <- function( q, N=100 ) {
  if( length(q)==1 ){
    p <- (1 - q)
    f_gt <- c(p^2, 2*p*q, q^2) # AA, AB, BB
  }else{
    f_gt<-q
  }
  tmp <- sample( c('AA','AB','BB'), size =N, prob=f_gt, replace=T )
  return(tmp)
}

Formatted:

simulate_genotype <- function(q, N = 100) {
  if (length(q) == 1) {
    p <- (1 - q)
    f_gt <- c(p^2, 2 * p * q, q^2) # AA, AB, BB
  } else {
    f_gt <- q
  }
  tmp <- sample(c("AA", "AB", "BB"),
                size = N,
                prob = f_gt,
                replace = TRUE)
  return(tmp)
}
0.1.5.1.4 1.4 Assign a vector of the last three months of the year (abbreviated in English) to a hidden variable my_months
.my_months <- c("Oct", "Nov", "Dec")
.my_months2 <- rev(rev(month.abb)[1:3])

.my_months
## [1] "Oct" "Nov" "Dec"
.my_months2
## [1] "Oct" "Nov" "Dec"
0.1.5.1.5 1.5 Pipeline-friendly function: Modify the function below so that it works with pipes:

Raw:

my_filter <- function(threshold = 1, data, scalar = 5) {
  data[data >= threshold] <- NA 
  data <- data * scalar
  return(data)
}

Formatted:

my_filter <- function(data, threshold = 1, scalar = 5) {
  data[data >= threshold] <- NA 
  data <- data * scalar
  return(data)
}
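
With data as the first argument, the function can be placed on the right-hand side of a pipe. The short usage sketch below assumes R 4.1 or later for the native pipe |>; with the magrittr pipe %>% the call would look the same. The input vector is made up.

```r
my_filter <- function(data, threshold = 1, scalar = 5) {
  data[data >= threshold] <- NA
  data <- data * scalar
  return(data)
}

# The left-hand side is passed as the first argument, i.e. as data
piped <- c(0.5, 2, 0.1) |> my_filter()
piped_custom <- c(0.5, 2, 0.1) |> my_filter(threshold = 3, scalar = 2)

print(piped)
print(piped_custom)
```

With the original argument order, the piped vector would have been interpreted as threshold instead of data.
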
0.1.5.1.6 1.6 Untidy code: Is the code below correct, can it be improved?

Raw:

simulate_phenotype <- function(pop_params, gp_map, gtype) {
  pop_mean <- pop_params[1]
  pop_var <- pop_params[2]
  pheno <- rnorm(n = N, mean = pop_mean, sd = sqrt(pop_var))
  effect <- rep(0, times = length(N))
  for (gt_iter in c('AA', 'AB', 'BB')) {
    effect[gtype == gt_iter] <- rnorm(n = sum(gtype == gt_iter), 
                                      mean = gp_map[gt_iter, 'mean_eff'], 
                                      sd = sqrt(gp_map[gt_iter, 'var_eff']))
  }
  dat <- data.frame(gt = gtype, raw_pheno = pheno, effect = effect, pheno = pheno + effect)
  return(dat)
}

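Formatted: The code is tidy, but not correct. N is neither an argument nor defined inside the function, so it is looked up in the global environment, and rep(0, times = length(N)) creates a vector of length 1 instead of N. Pass N as an argument or derive it from length(gtype), and use rep(0, times = N).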
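
Note that the raw function refers to N, which is neither an argument nor defined in the body. One possible way to remove that dependency on a global variable is sketched below, under the assumption that N is meant to equal length(gtype). The example inputs (gp_map values, pop_params, the seed) are invented for the illustration.

```r
simulate_phenotype <- function(pop_params, gp_map, gtype) {
  N <- length(gtype)  # assumption: one phenotype per genotype
  pop_mean <- pop_params[1]
  pop_var <- pop_params[2]
  pheno <- rnorm(n = N, mean = pop_mean, sd = sqrt(pop_var))
  effect <- rep(0, times = N)
  for (gt_iter in c("AA", "AB", "BB")) {
    idx <- gtype == gt_iter
    effect[idx] <- rnorm(n = sum(idx),
                         mean = gp_map[gt_iter, "mean_eff"],
                         sd = sqrt(gp_map[gt_iter, "var_eff"]))
  }
  data.frame(gt = gtype, raw_pheno = pheno, effect = effect,
             pheno = pheno + effect)
}

# Hypothetical inputs, only to show the call
set.seed(1)
gp_map <- matrix(c(0, 1, 2, 0.1, 0.1, 0.1), nrow = 3,
                 dimnames = list(c("AA", "AB", "BB"), c("mean_eff", "var_eff")))
gtype <- sample(c("AA", "AB", "BB"), size = 20, replace = TRUE)
dat <- simulate_phenotype(pop_params = c(10, 4), gp_map = gp_map, gtype = gtype)
print(head(dat))
```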

0.2 Structuring the code and reproducibility

0.2.1 Reproducible research

0.2.1.1 Benefits of reproducibility

  • Rerunning workflows
  • Easy to add new data

0.2.1.2 Solutions for reproducible research

  • Nextflow, GitHub, LaTeX, Jupyter, etc.
  • packrat and R Markdown in R
  • Automate the workflow as much as possible

0.2.1.3 Markdown

  • Raw HTML can be written inside the Markdown file, which is then rendered with knitr
  • Package documentation: pkgdown
  • Complex websites: blogdown

0.2.2 Tasks

0.2.2.1 Computing variance: Write modular code that computes the sample standard deviation of a vector of numbers

num_vec <- 1:10

# Sample variance: sum of squared deviations divided by (n - 1)
calc_var <- function(x) {
  mean_val <- sum(x) / length(x)
  sq_dev <- (x - mean_val)^2
  variance <- sum(sq_dev) / (length(x) - 1)
  return(variance)
}

# Sample standard deviation: square root of the sample variance
calc_sd <- function(x) {
  return(sqrt(calc_var(x)))
}

variance <- calc_var(num_vec)
std_dev <- calc_sd(num_vec)

variance
## [1] 9.166667
std_dev
## [1] 3.02765
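
For reference, the sample variance divides the sum of squared deviations from the mean by n - 1, and the sample standard deviation is its square root. A hand-written implementation can be cross-checked against the built-in var() and sd(), as in this small sketch; the names manual_var and manual_sd are made up for the example.

```r
num_vec <- 1:10
n <- length(num_vec)

# Direct computation from the definition
manual_var <- sum((num_vec - mean(num_vec))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions from the stats package
print(all.equal(manual_var, stats::var(num_vec)))
print(all.equal(manual_sd, stats::sd(num_vec)))
```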

0.2.2.2 See markdown_tasks.rmd