Reproducible research

19 October 2018

Project Setup

General considerations

General considerations for open and reproducible research:
Create a dedicated (private) GitHub repository for each project
Use a standardised folder and documentation structure
Prepare a dedicated Docker container for each project or task
Place raw data in public repositories such as figshare, under embargo if needed

Everything is collaborative, even if it is just between you and your future self!

Setting up GitHub

Structure considerations

.
+-- Code/
|   +-- Bash/
|   +-- Docker/
|   +-- R/
+-- Data/
+-- Notes/
+-- Results/
|   +-- Figures/
|   +-- Tables/
|   +-- Objects/
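
The layout above could be created in one step with a small shell function. This is only a sketch: the function name create_project_skeleton is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: create the standard project skeleton.
# Usage: create_project_skeleton <project-root>
create_project_skeleton() {
  root="${1:?project root required}"
  # mkdir -p creates missing parents and is harmless if a folder already exists
  mkdir -p \
    "$root/Code/Bash" \
    "$root/Code/Docker" \
    "$root/Code/R" \
    "$root/Data" \
    "$root/Notes" \
    "$root/Results/Figures" \
    "$root/Results/Tables" \
    "$root/Results/Objects"
}
```

Calling create_project_skeleton my_project in an empty directory would produce the tree shown above below my_project/.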

File names

Bad examples

abstract.doc
abstract final_new2.doc
fig 2.jpg
Some really nice description.txt

Good examples

2018.10.04-Abstract_for_PAG.txt
fig02-Scatterplot-time_vs_value.png
Some-really-nice-description.txt
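
Names like the good examples can be generated rather than typed by hand. The following is a minimal sketch with two hypothetical helper functions (sanitize_name and dated_name are illustrative names); for simplicity it replaces spaces with hyphens throughout, although underscores inside a description work just as well.

```shell
#!/bin/sh
# Hypothetical helpers for tidy file names (names are illustrative only).

# Replace every space in a name with a hyphen.
sanitize_name() {
  printf '%s\n' "$1" | tr ' ' '-'
}

# Prefix a sanitized name with a date in YYYY.MM.DD form.
# The date is passed in explicitly; "$(date +%Y.%m.%d)" would give today's date.
dated_name() {
  printf '%s-%s\n' "$1" "$(sanitize_name "$2")"
}

sanitize_name "Some really nice description.txt"   # prints Some-really-nice-description.txt
dated_name "2018.10.04" "Abstract for PAG.txt"     # prints 2018.10.04-Abstract-for-PAG.txt
```

A date prefix in year.month.day order has the side effect that an alphabetical listing is also a chronological one.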

Docker

Background

There is often some confusion about the difference between virtual machines (e.g. VirtualBox) and containerisation (e.g. Docker).
Both methods have the same aim:

Isolate an application and its dependencies into a self-contained unit that can run anywhere

However, there is a large difference in the architecture between containers and VMs:

Containers share the host system's kernel with other containers, whereas VMs rely on a hypervisor.

Difference in architecture

Hypervisor

A hypervisor is usually a piece of software that VMs run on top of
It runs on a physical computer, referred to as the host machine
The host machine provides the VMs with resources (like RAM and CPU)
These resources are divided between VMs and can be distributed as needed.
The hypervisor provides the VMs with a platform to manage and execute their guest OS.

Guest OS

The guest machine contains the application and whatever it needs to run that application.
It also carries an entire virtualized hardware stack of its own (virtualized network adapters, storage, and CPU)
From the inside, the guest machine behaves as its own unit with its own dedicated resources.
From the outside, we know that VMs share the resources provided by the host machine.

Container

A container provides operating-system-level virtualization
Each container gets its own isolated user-space to allow multiple containers to run
In principle, a container looks like a VM
BUT: Containers package up just the user-space, and not the kernel or virtual hardware like a VM does.
Only libraries and binaries are needed, which is why containers are so lightweight
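
One way to see the shared kernel in practice is to compare the kernel release reported by the host with the one reported inside a container. The sketch below assumes a Linux host with a working Docker installation; on macOS or Windows, Docker itself runs inside a Linux VM, so the container would report that VM's kernel instead. The function names and the DOCKER variable are hypothetical.

```shell
#!/bin/sh
# Sketch: compare kernel releases (function names are hypothetical).
# DOCKER can be overridden, e.g. DOCKER=echo for a dry run that only prints arguments.
DOCKER="${DOCKER:-docker}"

# Kernel release of the host system.
host_kernel() {
  uname -r
}

# Kernel release as seen from inside a throw-away container.
container_kernel() {
  "$DOCKER" run --rm ubuntu:16.04 uname -r
}
```

On a Linux host, host_kernel and container_kernel are expected to print the same release string, whereas a VM would report the kernel of its guest OS.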

Docker

Docker is an open-source project based on Linux containers
Containerisation is not a new concept, e.g. Google has its own container architecture
Other popular containerisation solutions include LXC, FreeBSD jails, AIX Workload Partitions and Solaris Containers.
However, Docker is easy, fast, community-driven and very modular
Docker images are built from reusable layers
The Docker Engine contains the Docker client and the Docker daemon
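
Both points can be inspected from the command line. The sketch below wraps two standard Docker subcommands in hypothetical helper functions (show_layers and show_versions are illustrative names); the DOCKER variable is an assumption that allows a dry run with DOCKER=echo.

```shell
#!/bin/sh
# Sketch: inspect image layers and the client/daemon pair.
# Function names and the DOCKER variable are hypothetical.
DOCKER="${DOCKER:-docker}"

# List the layers an image is built from (one row per layer).
show_layers() {
  "$DOCKER" history "$1"
}

# Print version details of the client and, if reachable, of the daemon (server).
show_versions() {
  "$DOCKER" version
}
```

For example, show_layers ubuntu:16.04 would list the layers of the base image, provided the image is available locally.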

Components

Dockerfile

Example of a Dockerfile (to create an Image)

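  # Base image, pinned to a fixed release for reproducibility
  FROM ubuntu:16.04
  # MAINTAINER is deprecated; a LABEL carries the same information
  LABEL maintainer="Daniel Fischer <daniel.fischer@luke.fi>"

  # Install download tools and clean the apt cache in the same layer
  RUN apt-get update && apt-get install -y \
        curl \
        unzip \
        wget \
      && rm -rf /var/lib/apt/lists/*

  # Download the STAR 2.5.2b release and unpack it below /bin
  RUN wget -qO- https://github.com/alexdobin/STAR/archive/2.5.2b.tar.gz | \
      tar -xz && mv /STAR-2.5.2b/ /bin/STAR-2.5.2b/

  # Make the precompiled static STAR binary available on the PATH
  ENV PATH="$PATH:/bin/STAR-2.5.2b/bin/Linux_x86_64_static/"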

Docker image handling

This builds a Docker image and pushes it to Docker Hub (account required); docker images then lists the locally available images

docker build -t fischuu/star:2.5.2b .

docker push fischuu/star:2.5.2b

docker images

Output:

REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
ubuntu               16.04               14f60031763d        15 months ago       120 MB
fischuu/star         2.5.2b              31d682b42362        10 months ago       321 MB

Basic Docker example

Run a basic hello-world Docker example

docker run hello-world

Output:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
d1725b59e92d: Pull complete 
Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

List all Docker containers, both running and stopped

docker ps -a

Output:

CONTAINER ID  IMAGE                COMMAND      CREATED         STATUS                     PORTS  NAMES
3813c13a6451  hello-world          "/hello"     14 minutes ago  Exited (0) 14 minutes ago         distracted_hugle
eafbdfe3ff79  fischuu/star:2.5.2b  "/bin/bash"  4 seconds ago   Up 3 seconds                      STAR

A more complex docker call

This more complex call starts a Docker container in the background (detached) and mounts two host folders into it

docker run -dit -v /path/on/host/:/mp1/ \
               -v /other/path/:/mp2/ \
               --name STAR fischuu/star:2.5.2b

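Send commands to this container (here: building a STAR genome index; the file locations below the mount points are placeholders)

docker exec STAR /bin/sh -c "STAR --runMode genomeGenerate --genomeDir /mp2/ --genomeFastaFiles /mp1/myFasta.fa"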

Stop and remove the container

docker stop STAR ; docker rm STAR
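
The start, exec and stop/remove commands above can be bundled into a small wrapper so that the container name and image are defined only once. This is a sketch, not a fixed recipe: the function names star_start, star_exec and star_cleanup and the DOCKER variable are hypothetical, and setting DOCKER=echo gives a dry run that only prints the arguments.

```shell
#!/bin/sh
# Sketch of a container lifecycle wrapper; function names are hypothetical.
# Set DOCKER=echo for a dry run that only prints the docker arguments.
DOCKER="${DOCKER:-docker}"
IMAGE="fischuu/star:2.5.2b"
NAME="STAR"

# Start the container detached, with two host directories mounted.
# Usage: star_start <host-dir-1> <host-dir-2>
star_start() {
  "$DOCKER" run -dit -v "$1":/mp1/ -v "$2":/mp2/ --name "$NAME" "$IMAGE"
}

# Run one shell command line inside the running container.
star_exec() {
  "$DOCKER" exec "$NAME" /bin/sh -c "$1"
}

# Stop the container and remove it only if stopping succeeded.
star_cleanup() {
  "$DOCKER" stop "$NAME" && "$DOCKER" rm "$NAME"
}
```

A typical session would call star_start /path/on/host/ /other/path/, then star_exec "ls /mp1/" to check the mount, and finally star_cleanup.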

Example cases of docker use

Docker can be applied to a wide range of applications

Running an RStudio session
Starting databases or web services as daemons
Trial-and-error and debugging approaches, as containers are disposable
Creating different testing environments on the fly
Scaling from development to production
Running fully functional production pipelines 'without' hands-on work

Advantages of Docker

Containers/Images are

small (often between 100 and 500 MB)
stable over time (each container starts as a fresh instance of its image)
easily distributed (see Docker hub)
versioned (Images are tagged)
highly productive (e.g. in combination with pipeline scripts)
easily scalable (pushing to cloud computing services)

Disadvantages of Docker

Containers/Images

are a possible security hazard (the Docker daemon runs with root rights)
require some level of discipline (especially in a hurry)
may not always be available from trusted sources (e.g. for RHEL 6)

Conclusion

Suggested Workflow

Work with plain text as much as possible:
- script files (.R, .sh)
- configuration (.docker)
- documentation (.Rmd)
Keep a documented orchestration/master script
Do not rely on saved workspaces; instead, store individual objects after calculation
Place all files in online repositories under version control
Keep all the raw data in a read-only repository
Run anything project-related only in the corresponding (Docker) container
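
A master script in this spirit can be as simple as a list of named steps that are run in order and logged. The following is a minimal sketch under stated assumptions: run_step, LOG_FILE and the default file name pipeline.log are hypothetical, and the example steps in the comments are placeholders.

```shell
#!/bin/sh
# Hypothetical master script skeleton: run steps in order and log each one.
LOG_FILE="${LOG_FILE:-pipeline.log}"

# run_step <name> <command> [args...]
# Appends START and OK/FAILED lines to LOG_FILE; returns non-zero if the command fails.
run_step() {
  step_name="$1"
  shift
  printf '%s START %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_name" >> "$LOG_FILE"
  if "$@"; then
    printf '%s OK %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_name" >> "$LOG_FILE"
  else
    status=$?
    printf '%s FAILED %s (exit %s)\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_name" "$status" >> "$LOG_FILE"
    return "$status"
  fi
}

# Example steps (placeholders, not executed here):
# run_step "create folders" mkdir -p Results/Objects
# run_step "list input data" docker exec STAR /bin/sh -c "ls /mp1/"
```

Because every step is a plain command line, the script itself stays readable plain text and can be kept under version control together with the rest of the project.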

Supplementary Material

Further reading / Materials used

Link to an overview of public data repositories:

https://www.nature.com/sdata/policies/repositories

Slides are based on the following materials:

https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b

https://www.learnitguide.net/2018/04/what-is-docker-get-started-from-basics.html