If you’ve spent any time at the command line, you know that feeling where you can’t figure out how you did something a few weeks ago and want to repeat it. You’ve probably hit ctrl-r and looked through your bash history, but alas, what you need is just out of reach. The next time you did something you thought you’d repeat, you opened up mysuperawesomeworkflow.sh and carefully copied each command over. And it sort of works…at least you have some command reference. Then someone asks you for your bash script…gulp…you really don’t want to hand over that garbage fire.
Enter CWL
But, there is a new kid on the block when it comes to repeatable workflows – The Common Workflow Language.
What the hell is this thing?
Technically, it’s just a spec that defines a file format. But there are some reference implementations which provide programs to parse this special format and run the commands that are specified. That’s it. Think of it as a souped-up form of bash. So instead of writing a .sh file, you’d write a .cwl file.
The most basic and common way to run these cwl files is with cwltool. It’s pip installable, so you can just fire up a terminal and install it by typing pip install cwltool.
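A minimal first session might look like this (hello.cwl is a hypothetical file name; --validate checks a CWL file without running it):

pip install cwltool
cwltool --version
cwltool --validate hello.cwl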
But, other tools will also take these cwl files as workflow specifications. For example, Arvados uses the CWL to define its pipelines, and there is talk of using it in Galaxy as well.
Combining CWL with Docker
But, the other thing about repeating past work is that it depends on how your computer was set up at the time.
- What operating system was used?
- What programs were installed?
- What version of the Java virtual machine was used?
Docker is a cool way to specify and create this context. It creates containers, which you can think of like little walled off computers that run on your computer. It does a lot of fancy things to make this efficient, but the key thing is that it creates a little sandbox for your work that looks the same every time you use the container.
So by using Docker, we can create a stable environment to run a tool in. And we can run the same sequence of commands using the CWL. All this adds up to reproducible workflows.
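As a tiny illustration (unrelated to our pipeline yet; openjdk:8-jdk just happens to be the base image we’ll use below), the same command in the same image sees the same environment no matter what’s installed on the host:

docker run --rm openjdk:8-jdk java -version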
Let’s go
Most of the examples on the web are either too simplistic to be useful or too complex to be understood. So I’ll try to explain a moderately complex workflow that showcases some interesting “features” that you might run into.
Let’s run SnpEff on some Thousand Genomes data. I trimmed the chr22 file down to just a few hundred variants and removed all the samples to clean up the VCF.
In this workflow we’ll see:
- How to run a command on the native system
- How to run a command in a Docker container
- How to pass in input files
- How to pass in arguments for a command
- How to capture stdout
- How to grab a file produced by a command
- How to specify a command flag
- How to use the outputs of one step as the input for the next step
First, define the context
The first thing we want to do is define the context for the analysis. To do that we create a Docker container.
Step 0. Install Docker
Head over to the Docker site to learn how.
Step 1. Create a Dockerfile
The Dockerfile defines what the container will look like. By convention, these files are named Dockerfile and live in the directory where you will build the container.
FROM openjdk:8-jdk

# Tell the JVM (via start.sh, below) to use 4 GB of memory
ENV JAVA_OPTS="-Xmx4g"

# Tools needed to fetch and unpack SnpEff
RUN apt-get update && apt-get install -y unzip wget

# Download and "install" SnpEff
RUN wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
RUN unzip snpEff_latest_core.zip
RUN rm snpEff_latest_core.zip
RUN cp snpEff/snpEff.jar .
RUN cp snpEff/snpEff.config .

# Pre-download the hg19 database at build time, not at run time
WORKDIR /snpEff
RUN java -jar snpEff.jar download -v hg19

# Copy in the startup script and run it when the container starts
COPY start.sh /opt/snpeff/
ENTRYPOINT ["sh", "/opt/snpeff/start.sh"]
So what does this do?
First, it gets a base container that has the Java Development Kit installed.
Then, it sets an environment variable JAVA_OPTS with the flag -Xmx4g. This flag will be used to tell the Java Virtual Machine to use 4 GB of memory.
Then, it installs unzip and wget, which will be used to download and install SnpEff. Next, it downloads and installs SnpEff. By install, we mean unzip and copy the .jar and .config files to the /snpEff directory in the container.
Then it switches to that directory and downloads the hg19 database. SnpEff would do this automatically at runtime, but doing it when we build the container means we don’t have to download the database each time we run the container.
Finally, it copies a script and tells the container to run it on startup.
To build this container and make an image, we can just cd into the directory containing the Dockerfile and run docker build --tag=andrewjesaitis/snpeff . (note the trailing dot). This builds the image and tags it with andrewjesaitis/snpeff. We could then push it to our container registry, but for now we just need it on our system.
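Putting that together (snpeff-docker is a hypothetical directory name; use wherever your Dockerfile and start.sh live):

cd snpeff-docker
docker build --tag=andrewjesaitis/snpeff .
docker images andrewjesaitis/snpeff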
Step 2. The start.sh script
So in the last step we told Docker to copy start.sh to /opt/snpeff. What’s in this script?
#!/usr/bin/env bash
# Copy the SnpEff jar, config, and pre-downloaded databases into the
# working directory, then run SnpEff with whatever arguments the
# container was given
cp -R /snpEff/* .
java $JAVA_OPTS -jar snpEff.jar "$@"
When this runs, the first thing that happens is that we copy the SnpEff files to our current directory. If you were to run pwd you’d see that we are in a temp directory here, so we need to copy SnpEff to where we will execute it. Then we just run SnpEff with the -Xmx4g flag that we specified as an environment variable in the Dockerfile.
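Because the entrypoint forwards its arguments straight to snpEff.jar, you can smoke test the image directly; a sketch using SnpEff’s -version flag:

docker run --rm andrewjesaitis/snpeff -version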
Define the Workflow
Step 3. Specifying the Pipeline
So what’s this pipeline going to do? We are going to take a gzipped 1kg VCF, unzip it, and run it through SnpEff.
Trivial, right? Yes and no.
The pipeline has a couple of steps, so we have to hand off files and link together a couple of CWL scripts into a pipeline (or, in CWL parlance, a Workflow). This is complex enough to show off some of the features that make the CWL useful, but it isn’t a 40-step production bioinformatics workflow to get buried in.
Let’s start at the high level and work our way down.
Here’s the workflow:
#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.0
inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzipped VCF file to annotate
outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes
steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]
doc: |
  Annotate variants provided in a gzipped VCF using SnpEff
Let’s take it line by line.
First, we label this a Workflow. We’ll see later that the actual execution of programs is done in CommandLineTool classes.
Next, notice that the cwlVersion is v1.0. You’ll see a lot of workflows around using the draft-3 spec. These versions are incompatible, so you can’t mix and match.
Now the inputs section:
inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzipped VCF file to annotate
Here is where you declare the “things” you’ll pass in at runtime. This section establishes the bindings that you can use to pass the parameters to workflow steps. genome is a string that tells SnpEff what reference to use (e.g. “hg18”, “hg19”, “mm9”, etc.). infile is the gzipped VCF. As you can see, you can document these parameters with the doc key.
Onto the outputs:
outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes
This section specifies what the workflow will return. We are going to capture 3 files produced by SnpEff. The file outfile will contain the annotated VCF, statsfile is an HTML document describing what SnpEff processed, and genesfile is a summary of the number of variants affecting each gene and transcript. Note that at the workflow level we don’t have to worry about the name or location of these files; we only need to know what step they are coming from.
Now for the heart of the file, the steps:
steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]
Each sub-block defines a step. The step is identified by the sub-block’s key (i.e. gunzip or snpeff). Each step specifies a CWL file that defines the command and the input and output file mappings. Note how you can define the file using a source key or directly with the parent key. The other quirk is that the output files need to be specified as an array. I believe this is a bug in the reference implementation, since according to the spec out: unzipped_vcf and out: [unzipped_vcf] are equivalent. However, in the snpeff step we are returning 3 files, so they must be specified as an array. Finally, note how the references to files are constructed. If you are using one of the workflow inputs, you don’t need to namespace the reference (i.e. you just say infile). But to use a step’s output as an input, you need to namespace the reference with the step (i.e. you say gunzip/unzipped_vcf, not unzipped_vcf).
Lastly, it’s always a good idea to add a quick bit of documentation at the bottom so you aren’t left scratching your head 6 months from now.
Step 4. The gunzip step
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gunzip, -c]
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
The first thing to notice is class: CommandLineTool. This line tells cwltool to parse and run this as an execution step.
Next, baseCommand: [gunzip, -c] is the command and any required args. This array is joined together with spaces to construct the first part of the command. It’s worth reading how commands are run in the spec. I could have also specified the -c flag as an argument:
baseCommand: gunzip
arguments: ["-c"]
In any case, gunzip -c unzips the file to stdout.
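In plain shell terms, this step does the equivalent of the following (using the input file we’ll pass in later):

gunzip -c test/chr22.truncated.nosamples.1kg.vcf.gz > unzipped.vcf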
We specify the file to unzip in the inputs block:
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
There are two things to note here. First, the sub-block key gzipfile matches the file identifier we used in the in block of the gunzip step in the workflow file. Second, we specify a position in the inputBinding. The position is used to sort the inputs when constructing the command, so the order in which inputs are specified in the CWL file doesn’t necessarily reflect the order they will appear in the command.
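To see that sorting in action, here’s a sketch of a hypothetical tool whose inputs are declared out of order; the constructed command still places first_arg before second_arg because position wins:

inputs:
  second_arg:
    type: string
    inputBinding:
      position: 2
  first_arg:
    type: string
    inputBinding:
      position: 1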
Finally, we specify the output file:
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
Since we want to capture the stdout stream from gunzip, we specify type: stdout. We could stop here and let cwltool create a file name, but these generated file names are not very pretty. Instead, we bind stdout to the file unzipped.vcf. In practice, we will never see this file. unzipped.vcf will be created in a temporary directory during processing and deleted automatically.
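A handy trick while developing: cwltool can run this one step on its own, generating command-line flags from the inputs block (a sketch, using the same test file):

cwltool gunzip.cwl --gzipfile test/chr22.truncated.nosamples.1kg.vcf.gz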
Step 5. The snpeff step
Now, we will take that unzipped VCF and run SnpEff on it. Since SnpEff requires a specific environment (an installed Java Development Kit), we’ll run it in a Docker container.
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.0
baseCommand: []
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
Again, this is an execution step, so we use class: CommandLineTool with v1.0 of the language.
We tell cwltool to use a Docker container in the requirements block:
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
There are a few ways of referencing a container. Here, we use the image we created earlier, called andrewjesaitis/snpeff. Alternatively, we could specify a repo using the dockerPull key as shown in the user guide. Or you can inline the Dockerfile in your CWL file using the dockerFile key; cwltool will then build the image using docker build.
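For example, the requirements block could instead pull an image from a registry (the image name here is hypothetical, just to show the shape of the key):

requirements:
  - class: DockerRequirement
    dockerPull: someuser/snpeff:latest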
Next we provide the necessary inputs to SnpEff:
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
The SnpEff command we run in start.sh looks like java -Xmx4g -jar snpEff.jar <genome> <input_vcf>. So genome is just a string that tells SnpEff what reference to use. We provide this string later in the 1kg-job.yml file that we will pass to the cwltool command. The input_vcf is going to come from the unzipped_vcf output of our gunzip step. And again, we tell cwltool what order the arguments should be in using the position key.
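Putting the pieces together, the command that effectively runs inside the container should look roughly like this (a sketch: cwltool actually stages the input under a temporary path; unzipped.vcf is the name we bound in the gunzip step):

java -Xmx4g -jar snpEff.jar hg19 unzipped.vcf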
Now, we want to collect our output files:
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
First, output will contain the annotated VCF produced by SnpEff. Since SnpEff sends this to stdout, we use the familiar stdout pattern we employed in the gunzip step. For the two summary files, we tell cwltool to perform a wildcard glob match (the same syntax that you use at the command line, i.e. rm *.doc). It will find an html file and a txt file and bind them to stats and genes respectively. If we expected more than one txt file, for example, we would need a more specific match or an array type to return multiple files, as sketched below.
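That array variant would look roughly like this (a sketch; reports is a hypothetical output name, and File[] is v1.0 shorthand for an array of files):

outputs:
  reports:
    type: File[]
    outputBinding:
      glob: "*.txt"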
Execution
Step 6. Let’s run this thing
Finally, we are ready to run this mini pipeline. We just need to specify the inputs (the genome and the zipped VCF) in our 1kg-job.yml file:
infile:
  class: File
  path: test/chr22.truncated.nosamples.1kg.vcf.gz
genome: hg19
So we bind the relative path test/chr22.truncated.nosamples.1kg.vcf.gz to infile. Remember, infile matches the reference in the in block in our snpeff-workflow.cwl file. Similarly, we bind the string hg19 to genome, to be passed to the SnpEff command during the second step.
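If you’d rather skip the job file, cwltool can also accept the same inputs as command-line flags generated from the workflow’s inputs section; a sketch of an equivalent invocation:

cwltool snpeff-workflow.cwl --infile test/chr22.truncated.nosamples.1kg.vcf.gz --genome hg19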
Now we run it with the command:
cwltool snpeff-workflow.cwl 1kg-job.yml
It will take a minute to run, during which time we’ll see the commands that are executed and finally some JSON describing the output files. After it completes, we’ll find these new files in our directory: output.vcf, snpEff_summary.html, and snpEff_genes.txt. Notice that we don’t find the intermediate unzipped.vcf file. Intermediate files that are not specified in the out block of the workflow are automatically deleted.
Final thoughts
So if you are like me, you might be thinking, “Andrew, this is a ton of work just to run two commands.” And I’d agree, but after you write a couple of workflows you can punch these out really fast. Also, there are a bunch of advantages that aren’t immediately obvious:
- Declarative style minimizes side effects compared to imperative style
- Modular; pop in the same tool to multiple workflows
- Can be produced programmatically
- No crazy string formatting antics
- No monitoring of subprocesses
- Easy to distribute your workload across a cluster
Ultimately, these advantages should mean that you spend less time troubleshooting existing workflows and that your tools can be picked up by colleagues more easily.
Finally, as you develop your own workflows, you are bound to run into problems. Two tips:
- Make sure you check the version in the URL on the CWL site. It’s easy to be digging through the spec or user’s guide only to find you are looking at draft-3 instead of v1.0.
- You can pass --leave-tmpdir to the cwltool command, as shown below. This is often helpful for figuring out whether the outputs from a step are what you think they should be.
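For example (the same invocation as before, with the debugging flag added):

cwltool --leave-tmpdir snpeff-workflow.cwl 1kg-job.yml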
All these scripts are available in a GitHub repo for you to play with.
Here are the commands to get you started:
git clone git@github.com:andrewjesaitis/cwl-tutorial
cd cwl-tutorial
docker build --tag=andrewjesaitis/snpeff .
pip install cwltool
cwltool snpeff-workflow.cwl 1kg-job.yml