RaukR 2018 Notes - Day 1: Better Coding

0.1 Names, functions and debugging

0.1.1 Naming conventions

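Valid names contain letters (A-Z, a-z), digits (0-9), . and _. A name must start with a letter or a period; if it starts with a period, the period cannot be followed by a digit.
Avoid F and T: they are ordinary variables that default to FALSE and TRUE, and they can be overwritten. Some names are only used in a specific context.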
Notations:
- Snake notation: name_of_function
- Period notation: name.of.function
- Camel notation: NameOfFunction
Do not use period notation with S3 classes, since S3 method dispatch relies on names of the form generic.class.
Use the same naming notation everywhere - be consistent!
Variable names beginning with a period will be hidden in the environment, but still available.
Split your big problem into small subproblems recursively, and encapsulate your code in functional blocks.
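The naming rules above can be checked directly in R. This is a minimal sketch using base R only: make.names() returns its input unchanged only when the input is already a syntactically valid name, and ls() hides names that begin with a period unless all.names = TRUE. The helper is_valid_name and the variable .hidden_value are hypothetical names used for illustration.

```r
# Hypothetical helper: TRUE when a string is a syntactically valid R name.
# make.names() leaves valid names untouched and rewrites invalid ones.
is_valid_name <- function(x) {
  make.names(x) == x
}

candidates <- c("var1", ".password", "3way_handshake", "_test_", ".2objects", "next")
validity <- is_valid_name(candidates)
names(validity) <- candidates
print(validity)

# Names beginning with a period are hidden from ls() but still accessible.
env <- new.env()
assign(".hidden_value", 42, envir = env)
visible_names <- ls(env)                     # does not list .hidden_value
all_names <- ls(env, all.names = TRUE)       # lists .hidden_value
hidden_value <- get(".hidden_value", envir = env)
```

Note that "next" fails the check because it is a reserved word, not because of the characters it contains.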

0.1.2 Functions

Avoid accessing and modifying globals!
Use data as the very first argument
Give parameters sensible default values
Global defaults can be changed with options()
If you are re-using someone else's function, write a wrapper. A wrapper is a design pattern: call the function from inside another function that you control.
Let the user turn off messages etc.
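The guidelines above can be combined in a small sketch. All names here are hypothetical: rescale, median_no_na, and the option "notes.verbose" are assumptions made for the example and are not part of any package.

```r
# Hypothetical function: data comes first, every other parameter has a
# default, and the verbosity default is read from a global option.
rescale <- function(data, center = 0, scale = 1,
                    verbose = getOption("notes.verbose", default = FALSE)) {
  if (verbose) {
    message("Rescaling ", length(data), " values")
  }
  (data - center) / scale
}

# Hypothetical wrapper around stats::median() that always drops NA values.
median_no_na <- function(data) {
  stats::median(data, na.rm = TRUE)
}

scaled <- rescale(c(2, 4, 6), center = 2, scale = 2)

options(notes.verbose = TRUE)    # change the global default
scaled_verbose <- rescale(c(2, 4, 6), center = 2, scale = 2)
options(notes.verbose = NULL)    # restore the previous state

middle <- median_no_na(c(1, NA, 5, 3))
```

Because the function takes no information from global variables other than the documented option, the same call gives the same numeric result regardless of the surrounding workspace.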

0.1.3 Debugging

20 percent of the code has 80 percent of the errors - Lowell Arthur

Beware of bugs in the above code; I have only proved it correct, not tried it - Donald Knuth

Types of bugs:
- Syntax
- Arithmetic
- Type
- Logic
To avoid bugs:
- Encapsulate code in smaller functions that can be tested
- Use classes and type checking
- Test your functions with known data
Handling errors with tryCatch
- Check help page
The stop() function signals an error and halts the execution of the program
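A minimal sketch of how stop() and tryCatch() work together, using base R only. The function safe_sqrt is a hypothetical example: it signals an error for non-numeric input, and tryCatch() intercepts that error so the script can continue.

```r
# Hypothetical function that validates its input and signals an error.
safe_sqrt <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric, got ", class(x)[1])
  }
  sqrt(x)
}

# tryCatch() evaluates the expression and passes any error to the handler,
# whose return value becomes the result.
result <- tryCatch(
  safe_sqrt("nine"),
  error = function(e) {
    message("Caught error: ", conditionMessage(e))
    NA_real_
  }
)

# With valid input the handler is never called.
ok <- tryCatch(safe_sqrt(9), error = function(e) NA_real_)
```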
Debugging options:
- print() statements
- Traceback
  - traceback()
  - Shows the function calls and what parameters were used
- Dumping frames
  - options(error = quote(dump.frames("testdump", TRUE)))
  - Run the failing code, then reset with options(error = NULL)
  - load("testdump.rda")
  - # debugger(testdump)
- Step-by-step debugging
  - debug(h)
  - n - execute next line
  - c - continue execution to the end of the function
  - Q - quit debug mode
- Profiling
  - t1 <- proc.time()
  - some.code()
  - t2 <- proc.time()
  - t2 - t1
  - microbenchmark package for execution times
  - advanced profiling: profr package, returns tables with information, or profvis which returns interactive tables etc.
Growing a vector makes R copy it to another place in memory - better to specify the length of the vector from the start, rather than adding to it
- Use vectorization!
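The advice on profiling and preallocation can be tried out with a short timing comparison. This is a sketch using base R's system.time(); the function names grow, prealloc, and vectorized are hypothetical, and the timings depend on the machine, so none are shown here.

```r
n <- 1e4

# Growing a vector: each append may force R to copy the whole vector.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Preallocating: the vector has its final length from the start.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Vectorized: no explicit loop at all.
vectorized <- function(n) {
  seq_len(n)^2
}

t_grow <- system.time(res_grow <- grow(n))
t_prealloc <- system.time(res_prealloc <- prealloc(n))
t_vec <- system.time(res_vec <- vectorized(n))

# Elapsed wall-clock time in seconds for each approach.
elapsed <- c(grow = t_grow[["elapsed"]],
             prealloc = t_prealloc[["elapsed"]],
             vectorized = t_vec[["elapsed"]])
print(elapsed)
```

For more reliable numbers on very fast code, the microbenchmark package mentioned above repeats each expression many times.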

0.1.4 Optimizing your code

Write more efficient code - use vectorization instead of loops
Preallocate memory to avoid repeated copying (copy-on-modify)
- Identify address in memory:
  - pryr::address()
- Modifying values in a matrix can copy the whole matrix to another location in memory
The bigmemory package is for huge data sets
GPU computations are possible
- The gpuR package
- Data can be created/stored on GPU memory, and run on the GPU (faster than sending it to the GPU from the main memory)
Multicore support
Use data.table or tibble instead of data.frame
Parallelization in R
- parallel package
- Can select the number of cores etc.
- Compatible with Nextflow
- FlowR package is also available
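A minimal sketch of the parallel package, which ships with R. It assumes mclapply(), which forks worker processes on Unix-like systems; on Windows forking is not available, so the sketch falls back to a single core there. The worker function slow_square is a hypothetical stand-in for an expensive computation.

```r
library(parallel)

# Hypothetical worker function standing in for an expensive computation.
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

# Select the number of cores: forking is unavailable on Windows,
# and detectCores() may return NA on some systems.
available <- detectCores()
n_cores <- if (.Platform$OS.type == "windows" || is.na(available)) {
  1
} else {
  min(2, available)
}

parallel_result <- unlist(mclapply(1:8, slow_square, mc.cores = n_cores))
serial_result <- sapply(1:8, slow_square)
```

The parallel and serial versions return the same values; only the wall-clock time differs.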

0.1.5 Tasks

0.1.5.1 1: Coding Style

0.1.5.1.1 1.1 Which of the following are valid/good variable names in R. What is wrong with the ones that are invalid/bad?

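Variable name	Validity
var1	OK
3way_handshake	Invalid, begins with a digit
.password	OK, hidden
_test_	Invalid, begins with an underscore
my-matrix-M	Invalid, - is parsed as minus
three.dimensional.array	OK
3D.distance	Invalid, begins with a digit
.2objects	Invalid, period followed by a digit
wz3gei92	Valid but not good, uninformative
next	Invalid, reserved word
P	OK
Q	OK
R	OK
S	OK
T	Valid but not good, T is an alias for TRUE
X	Not good
is.larger	Not good, period notation looks like an S3 method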

0.1.5.1.2 1.2 The code below works, but can be improved. Improve it!

Raw:

myIterAtoR.max <- 5
second_iterator.max<-7
col.NUM= 10
row.cnt =10
fwzy45 <- matrix(rep(1, col.NUM*row.cnt),nrow=row.cnt)
for(haystack in (2-1):col.NUM){
  for(needle in 1:row.cnt) {
if(haystack>=myIterAtoR.max){
fwzy45[haystack, needle]<-NA}
  }}

Formatted:

iter_max <- 5
col_num <- 10
row_num <- 10
A <- matrix(1, nrow = row_num, ncol = col_num)
# i indexes rows and j indexes columns
for (i in seq_len(row_num)) {
  for (j in seq_len(col_num)) {
    if (i >= iter_max) {
      A[i, j] <- NA
    }
  }
}
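Since the notes recommend vectorization over loops, the nested loop can also be replaced by a single indexed assignment. This is a sketch, not part of the original exercise solution; the names A_loop and A_vec are introduced here only to compare the two versions.

```r
iter_max <- 5
col_num <- 10
row_num <- 10

# Loop version with the same effect as the solution above:
# every row from iter_max onward is set to NA.
A_loop <- matrix(1, nrow = row_num, ncol = col_num)
for (i in seq_len(row_num)) {
  for (j in seq_len(col_num)) {
    if (i >= iter_max) A_loop[i, j] <- NA
  }
}

# Vectorized version: one assignment over the selected rows.
A_vec <- matrix(1, nrow = row_num, ncol = col_num)
A_vec[seq_len(row_num) >= iter_max, ] <- NA

same_result <- identical(A_loop, A_vec)
```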

0.1.5.1.3 1.3 Improve formatting and style of the following code

Raw:

simulate_genotype <- function( q, N=100 ) {
  if( length(q)==1 ){
    p <- (1 - q)
    f_gt <- c(p^2, 2*p*q, q^2) # AA, AB, BB
  }else{
    f_gt<-q
  }
  tmp <- sample( c('AA','AB','BB'), size =N, prob=f_gt, replace=T )
  return(tmp)
}

Formatted:

simulate_genotype <- function(q, N = 100) {
  if (length(q) == 1) {
    p <- (1 - q)
    f_gt <- c(p^2, 2 * p * q, q^2) # AA, AB, BB
  } else {
    f_gt <- q
  }
  tmp <- sample(c('AA', 'AB', 'BB'),
                size = N,
                prob = f_gt,
                replace = TRUE)
  return(tmp)
}

0.1.5.1.4 1.4 Assign a vector of three last months (abbreviated in English) in a year to a hidden variable my_months

.my_months <- c("Oct", "Nov", "Dec")
.my_months2 <- rev(rev(month.abb)[1:3])

.my_months

## [1] "Oct" "Nov" "Dec"

.my_months2

## [1] "Oct" "Nov" "Dec"

0.1.5.1.5 1.5 Pipeline-friendly function: Modify the function below so that it works with pipes:

Raw:

my_filter <- function(threshold = 1, data, scalar = 5) {
  data[data >= threshold] <- NA 
  data <- data * scalar
  return(data)
}

Formatted:

my_filter <- function(data, threshold = 1, scalar = 5) {
  data[data >= threshold] <- NA 
  data <- data * scalar
  return(data)
}
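With data as the first argument, the function can sit on the right-hand side of a pipe. This usage sketch assumes the native pipe |>, which requires R 4.1.0 or later; with the magrittr package loaded, %>% would be used the same way. The input vector is made up for illustration.

```r
my_filter <- function(data, threshold = 1, scalar = 5) {
  data[data >= threshold] <- NA
  data <- data * scalar
  return(data)
}

# The piped value becomes the first argument, i.e. data.
piped <- c(0.2, 0.5, 1, 3) |> my_filter(threshold = 1, scalar = 10)

# Equivalent direct call.
direct <- my_filter(c(0.2, 0.5, 1, 3), threshold = 1, scalar = 10)
```

In the raw version the piped value would have been passed as threshold, which is why the argument order matters.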

0.1.5.1.6 1.6 Untidy code: Is the code below correct, can it be improved?

Raw:

simulate_phenotype <- function(pop_params, gp_map, gtype) {
  pop_mean <- pop_params[1]
  pop_var <- pop_params[2]
  pheno <- rnorm(n = N, mean = pop_mean, sd = sqrt(pop_var))
  effect <- rep(0, times = length(N))
  for (gt_iter in c('AA', 'AB', 'BB')) {
    effect[gtype == gt_iter] <- rnorm(n = sum(gtype == gt_iter), 
                                      mean = gp_map[gt_iter, 'mean_eff'], 
                                      sd = sqrt(gp_map[gt_iter, 'var_eff']))
  }
  dat <- data.frame(gt = gtype, raw_pheno = pheno, effect = effect, pheno = pheno + effect)
  return(dat)
}

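Formatted: The function uses N without defining it, so it only runs if a global N happens to exist, and rep(0, times = length(N)) creates a vector of length 1. Set N <- length(gtype) inside the function and use rep(0, times = N).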
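One point worth checking in the raw code is N, which is never defined inside the function. A possible corrected version is sketched below; it assumes that N is meant to be the number of individuals in gtype. The inputs gp_map, gtype, and pop_params are made-up values for illustration.

```r
simulate_phenotype <- function(pop_params, gp_map, gtype) {
  N <- length(gtype)  # assumption: one phenotype per genotype entry
  pop_mean <- pop_params[1]
  pop_var <- pop_params[2]
  pheno <- rnorm(n = N, mean = pop_mean, sd = sqrt(pop_var))
  effect <- rep(0, times = N)
  for (gt_iter in c("AA", "AB", "BB")) {
    is_gt <- gtype == gt_iter
    effect[is_gt] <- rnorm(n = sum(is_gt),
                           mean = gp_map[gt_iter, "mean_eff"],
                           sd = sqrt(gp_map[gt_iter, "var_eff"]))
  }
  data.frame(gt = gtype, raw_pheno = pheno, effect = effect,
             pheno = pheno + effect)
}

# Hypothetical inputs: genotype effects and a small simulated population.
set.seed(1)
gp_map <- matrix(c(0, 1, 2, 0.1, 0.1, 0.1), ncol = 2,
                 dimnames = list(c("AA", "AB", "BB"),
                                 c("mean_eff", "var_eff")))
gtype <- sample(c("AA", "AB", "BB"), size = 20, replace = TRUE)
dat <- simulate_phenotype(pop_params = c(10, 4), gp_map = gp_map,
                          gtype = gtype)
```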

0.2 Structuring the code and reproducibility

0.2.1 Reproducible research

0.2.1.1 Benefits of reproducibility

Rerunning workflows
Easy to add new data

0.2.1.2 Solutions for reproducible research

Nextflow, GitHub, LaTeX, Jupyter etc.
packrat and R Markdown in R
Automate the workflow as much as possible

0.2.1.3 Markdown

Raw HTML can be written inside the Markdown file, which is then rendered with knitr
Package documentation: pkgdown
Complex websites: blogdown

0.2.2 Tasks

0.2.2.1 Computing variance: Write a modular code that computes the sample standard deviation given a vector of numbers

num_vec <- rep(1:10, each = 1)

# Sample variance: sum of squared deviations divided by n - 1
calc_var <- function(x) {
  mean_val <- sum(x) / length(x)
  sq_dev <- (x - mean_val)^2
  sum(sq_dev) / (length(x) - 1)
}

# Sample standard deviation: square root of the sample variance
calc_sd <- function(x) {
  sqrt(calc_var(x))
}

variance <- calc_var(num_vec)
std_dev <- calc_sd(num_vec)

variance

## [1] 9.166667

std_dev

## [1] 3.02765
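Following the earlier advice to test functions with known data, a hand-written implementation can be compared against base R's var() and sd(). This is a sketch; sample_var and sample_sd are hypothetical helper names, written so that the block runs on its own.

```r
num_vec <- 1:10

# Hypothetical helpers: sample variance uses n - 1 in the denominator.
sample_var <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

sample_sd <- function(x) {
  sqrt(sample_var(x))
}

v <- sample_var(num_vec)
s <- sample_sd(num_vec)

# Compare with the built-in implementations, allowing for rounding error.
matches_var <- isTRUE(all.equal(v, var(num_vec)))
matches_sd <- isTRUE(all.equal(s, sd(num_vec)))
```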

Håkon Kaspersen

11 June 2018

0.2.2.2 See markdown_tasks.rmd