19 October 2018

Project Setup

General considerations

For open and reproducible research:

  • Create a dedicated (private) GitHub repository for each project

  • Use a standardised folder and documentation structure

  • Prepare a dedicated Docker container for each project/task

  • Place raw data in public repositories such as figshare, under embargo if necessary

Everything is collaborative, even if it is just between you and your future self!

Setting up GitHub

Structure considerations

.
+-- Code/
|   +-- Bash/
|   +-- Docker/
|   +-- R/
+-- Data/
+-- Notes/
+-- Results/
|   +-- Figures/
|   +-- Tables/
|   +-- Objects/
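
The layout above could be created in one step with a small shell function. This is only a sketch: the function name `make_project` and the `.gitkeep` placeholder files are assumptions added for illustration (Git does not track empty directories).

```shell
# Create the standard project skeleton shown above.
# make_project is a hypothetical helper, not part of any existing tool.
make_project() {
    project="$1"
    for d in Code/Bash Code/Docker Code/R \
             Data Notes \
             Results/Figures Results/Tables Results/Objects; do
        mkdir -p "$project/$d"
        # Placeholder file so that Git keeps the otherwise empty folder
        touch "$project/$d/.gitkeep"
    done
}
```

Calling `make_project my_project` inside a fresh repository would then give every project the same starting point.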

File names

Bad examples

abstract.doc
abstract final_new2.doc
fig 2.jpg
Some really nice description.txt

Good examples

2018.10.04-Abstract_for_PAG.txt
fig02-Scatterplot-time_vs_value.png
Some-really-nice-description.txt
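
As a rough illustration of the naming rules, the following sketch turns a free-form name into a space-free one and optionally prefixes the current date. Both function names (`clean_name`, `dated_name`) are made up for this example.

```shell
# Replace every run of whitespace in a file name with a single hyphen.
clean_name() {
    printf '%s\n' "$1" | sed -e 's/[[:space:]]\{1,\}/-/g'
}

# Prefix the cleaned name with today's date in YYYY.MM.DD form,
# so that files sort chronologically.
dated_name() {
    printf '%s-%s\n' "$(date +%Y.%m.%d)" "$(clean_name "$1")"
}

clean_name "Some really nice description.txt"
```

The last line prints `Some-really-nice-description.txt`, matching the good example above.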

Docker

Background

  • There is often some confusion about the difference between virtual machines (e.g. VirtualBox) and containerisation (e.g. Docker)

  • Both methods have the same aim:

Isolate an application and its dependencies into a self-contained unit that can run anywhere

  • However, there is a large difference in the architecture between containers and VMs:

Containers share the host system's kernel with other containers, whereas VMs rely on a hypervisor.

Difference in architecture

Hypervisor

  • A hypervisor is usually a piece of software that VMs run on top of

  • It runs on a physical computer, referred to as the host machine

  • The host machine provides the VMs with resources (like RAM and CPU)

  • These resources are divided between VMs and can be distributed as needed.

  • The hypervisor provides the VMs with a platform to manage and execute their guest OS.

Guest OS

  • The guest machine contains the application and whatever it needs to run that application.

  • It also carries an entire virtualized hardware stack of its own (virtualized network adapters, storage, and CPU)

  • From the inside, the guest machine behaves as its own unit with its own dedicated resources.

  • From the outside, we know that VMs share the resources provided by the host machine.

Container

  • A container provides operating-system-level virtualization

  • Each container gets its own isolated user-space to allow multiple containers to run

  • In principle, a container looks like a VM

  • BUT: Containers package up just the user-space, and not the kernel or virtual hardware like a VM does.

  • Only libraries and binaries are needed, which is why containers are so lightweight
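
The shared kernel can be observed directly. The sketch below compares the kernel release reported by the host with the one reported inside a container; it assumes Docker is installed and the `ubuntu:16.04` image can be pulled, and falls back to a message otherwise. On a Linux host both values should be identical, whereas on macOS or Windows the container reports the kernel of the Linux VM that Docker itself runs in.

```shell
# Kernel release as seen by the host
host_kernel=$(uname -r)

# Kernel release as seen from inside a throw-away container (--rm)
if command -v docker >/dev/null 2>&1 &&
   container_kernel=$(docker run --rm ubuntu:16.04 uname -r 2>/dev/null); then
    echo "host:      $host_kernel"
    echo "container: $container_kernel"
else
    echo "Docker not available; host kernel is $host_kernel"
fi
```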

Docker

  • Docker is an open-source project based on Linux containers

  • Containerisation is not a new concept, e.g. Google has its own container architecture

  • Other popular containerisation solutions include LXC, FreeBSD jails, AIX Workload Partitions and Solaris Containers.

  • However, Docker is easy, fast, community-driven and very modular

  • Docker images are built from reusable layers

  • The Docker Engine contains the Docker client and daemon

Components

Dockerfile

Example of a Dockerfile (to create an Image)

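  # Base image
  FROM ubuntu:16.04
  LABEL maintainer="Daniel Fischer <daniel.fischer@luke.fi>"

  # Install download tools; clear the apt lists to keep the layer small
  RUN apt-get update && apt-get install -y \
        curl \
        unzip \
        wget \
     && rm -rf /var/lib/apt/lists/*

  # Download STAR 2.5.2b, unpack it in / and move it to /bin
  RUN wget -qO- https://github.com/alexdobin/STAR/archive/2.5.2b.tar.gz | \
      tar -xz && mv /STAR-2.5.2b/ /bin/STAR-2.5.2b/

  # Put the precompiled static STAR binary on the PATH
  ENV PATH="$PATH:/bin/STAR-2.5.2b/bin/Linux_x86_64_static/"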

Docker image handling

This builds a Docker image, pushes it to Docker Hub (account required) and lists the local images

docker build -t fischuu/star:2.5.2b .

docker push fischuu/star:2.5.2b

docker images

Output:

REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
ubuntu               16.04               14f60031763d        15 months ago       120 MB
fischuu/star         2.5.2b              31d682b42362        10 months ago       321 MB

Basic Docker example

Just run a basic Hello-World docker example

docker run hello-world

Output:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
d1725b59e92d: Pull complete 
Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

List all Docker containers (running and stopped)

docker ps -a

Output:

CONTAINER ID  IMAGE                COMMAND      CREATED         STATUS                     PORTS  NAMES
3813c13a6451  hello-world          "/hello"     14 minutes ago  Exited (0) 14 minutes ago         distracted_hugle
eafbdfe3ff79  fischuu/star:2.5.2b  "/bin/bash"  4 seconds ago   Up 3 seconds                      STAR

A more complex docker call

This is a more complex call that starts a Docker container in the background (detached) with two host directories mounted

docker run -dit -v /path/on/host/:/mp1/ \
               -v /other/path/:/mp2/ \
                --name STAR fischuu/star:2.5.2b

Send commands to this container

docker exec STAR /bin/sh -c "STAR --runMode genomeGenerate --genomeDir /mp1/ --genomeFastaFiles /mp1/myFasta.fa"

Stop and remove the container

docker stop STAR ; docker rm STAR

Example cases of docker use

Docker can be applied to a wide range of use cases

  • Running an RStudio session
  • Starting databases/web services as daemons
  • Trial-and-error/debugging approaches, as containers are disposable
  • Creating different testing environments on the fly
  • Scaling from development to production
  • Fully functional production pipelines 'without' hands-on work

Advantages of Docker

Containers/Images are

  • small (often between 100 and 500 MB)
  • stable over time (each container is a new instance)
  • easily distributed (see Docker hub)
  • versioned (Images are tagged)
  • highly productive (e.g. in combination with pipeline scripts)
  • easily scalable (e.g. by pushing to cloud computing services)

Disadvantages of Docker

Containers/Images

  • are a possible security hazard (the Docker daemon runs with root rights)
  • require some level of discipline (especially in a hurry)
  • are not always available from trusted sources (e.g. for RHEL 6)

Conclusion

Suggested Workflow

  • Work with plain text as much as possible:
    • script files (.R, .sh)
    • configuration (.docker)
    • documentation (.Rmd)
  • Keep a documented orchestration/master script
  • Do not rely on saved workspaces; store individual objects after calculation instead
  • Place all files in online repositories under version control
  • Keep all the raw data in a read-only repository
  • Run anything project-related only in the corresponding (Docker) container
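
A master script following these points might look roughly like the sketch below. It is not a definitive implementation: the step scripts (`01-build-index.sh`, `02-align-reads.sh`), the mount points inside the container and the `DRY_RUN` switch are assumptions made for illustration. Raw data and code are mounted read-only (`:ro`), and only `Results/` is writable.

```shell
#!/bin/sh
# Master script sketch: every step runs inside the project container.
IMAGE="fischuu/star:2.5.2b"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the docker commands, 0 = execute them

run_step() {
    # Prepend the docker call to the command passed to this function.
    set -- docker run --rm \
        -v "$PWD/Data:/data:ro" \
        -v "$PWD/Code:/code:ro" \
        -v "$PWD/Results:/results" \
        "$IMAGE" "$@"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical pipeline steps, kept under version control in Code/Bash/
run_step sh /code/Bash/01-build-index.sh
run_step sh /code/Bash/02-align-reads.sh
```

With the default `DRY_RUN=1` the script only prints the `docker run` commands, which makes it easy to review the pipeline before running it with `DRY_RUN=0`.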

Supplementary Material

Further reading / Materials used