If you’ve spent any time at the command line, you know that feeling where you can’t figure out how you did something a few weeks ago and want to repeat it. You’ve probably hit ctrl-r and looked through your bash history, but alas, what you need is just out of reach. The next time you did something you thought you’d repeat, you opened up mysuperawesomeworkflow.sh and carefully copied each command over. And it sort of works…at least you have some command reference. Then someone asks you for your bash script…gulp…you really don’t want to hand over that garbage fire.
Enter CWL
But, there is a new kid on the block when it comes to repeatable workflows – The Common Workflow Language.
What the hell is this thing?
Technically, it’s just a spec that defines a file format. But there are some reference implementations which provide programs to parse this special format and run the commands that are specified. That’s it. Think of it as a souped-up form of bash. So instead of writing a .sh file, you’d write a .cwl file.
The most basic and common way to run these cwl files is with cwltool. It’s pip installable, so you can just fire up a terminal and install it by typing pip install cwltool.
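A minimal first session might look like this (hello.cwl is a hypothetical file name; --validate checks a CWL file without running it):

pip install cwltool
cwltool --version
cwltool --validate hello.cwl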
But, other tools will also take these cwl files as workflow specifications. For example, Arvados uses the CWL to define its pipelines, and there is talk of using it in Galaxy as well.
Combining CWL with Docker
But, the other thing about repeating past work is that it depends on how your computer was set up at the time.
- What operating system was used?
- What programs were installed?
- What version of the Java virtual machine was used?
Docker is a cool way to specify and create this context. It creates containers, which you can think of like little walled off computers that run on your computer. It does a lot of fancy things to make this efficient, but the key thing is that it creates a little sandbox for your work that looks the same every time you use the container.
So by using Docker, we can create a stable environment to run a tool in. And we can run the same sequence of commands using the CWL. All this adds up to reproducible workflows.
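As a tiny illustration (unrelated to our pipeline yet; openjdk:8-jdk just happens to be the base image we’ll use below), the same command in the same image sees the same environment no matter what’s installed on the host:

docker run --rm openjdk:8-jdk java -version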
Let’s go
Most of the examples on the web are either too simplistic to be useful or too complex to be understood. So I’ll try to explain a moderately complex workflow that showcases some interesting “features” that you might run into.
Let’s run SnpEff on some Thousand Genomes data. I trimmed the chr22 file down to just a few hundred variants and removed all the samples to clean up the VCF.
In this workflow we’ll see:
- How to run a command on the native system
- How to run a command in a Docker container
- How to pass in input files
- How to pass in arguments for a command
- How to capture stdout
- How to grab a file produced by a command
- How to specify a command flag
- How to use the outputs of one step as the input for the next step
First, define the context
The first thing we want to do is define the context for the analysis. To do that we create a Docker container.
Step 0. Install Docker
Head over to the Docker site to learn how.
Step 1. Create a Dockerfile
The Dockerfile defines what the container will look like. By convention, these files are named Dockerfile and live in the directory where you will build the container.
FROM openjdk:8-jdk

# Tell the JVM (via start.sh, below) to use 4 GB of memory
ENV JAVA_OPTS="-Xmx4g"

# Tools needed to fetch and unpack SnpEff
RUN apt-get update && apt-get install -y unzip wget

# Download and "install" SnpEff
RUN wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
RUN unzip snpEff_latest_core.zip
RUN rm snpEff_latest_core.zip
RUN cp snpEff/snpEff.jar .
RUN cp snpEff/snpEff.config .

# Pre-download the hg19 database at build time, not at run time
WORKDIR /snpEff
RUN java -jar snpEff.jar download -v hg19

# Copy in the startup script and run it when the container starts
COPY start.sh /opt/snpeff/
ENTRYPOINT ["sh", "/opt/snpeff/start.sh"]
So what does this do?
First, it gets a base container that has the Java Development Kit installed.
Then, it sets an environment variable JAVA_OPTS with the flag -Xmx4g. This flag will be used to tell the Java Virtual Machine to use 4 GB of memory.
Then, it installs unzip and wget, which will be used to download and install SnpEff. Next, it downloads and installs SnpEff. By install, we mean unzip and copy the .jar and .config files to the /snpEff directory in the container.
Then it switches to that directory and downloads the hg19 database. SnpEff would do this automatically at runtime, but doing it when we build the container means we don’t have to download the database each time we run the container.
Finally, it copies a script and tells the container to run it on startup.
To build this container and make an image, we can just cd into the directory containing the Dockerfile and run docker build --tag=andrewjesaitis/snpeff . (note the trailing dot). This builds the image and tags it with andrewjesaitis/snpeff. We could then push it to our container registry, but for now we just need it on our system.
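Putting that together (snpeff-docker is a hypothetical directory name; use wherever your Dockerfile and start.sh live):

cd snpeff-docker
docker build --tag=andrewjesaitis/snpeff .
docker images andrewjesaitis/snpeff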
Step 2. The start.sh script
So in the last step we told Docker to copy start.sh to /opt/snpeff. What’s in this script?
#!/usr/bin/env bash
# Copy the SnpEff jar, config, and pre-downloaded databases into the
# working directory, then run SnpEff with whatever arguments the
# container was given
cp -R /snpEff/* .
java $JAVA_OPTS -jar snpEff.jar "$@"
When this runs, the first thing that happens is that we copy the SnpEff files to our current directory. If you were to run pwd you’d see that we are in a temp directory here, so we need to copy SnpEff to where we will execute it. Then we just run SnpEff with the -Xmx4g flag that we specified as an environment variable in the Dockerfile.
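Because the entrypoint forwards its arguments straight to snpEff.jar, you can smoke test the image directly; a sketch using SnpEff’s -version flag:

docker run --rm andrewjesaitis/snpeff -version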
Define the Workflow
Step 3. Specifying the Pipeline
So what’s this pipeline going to do? We are going to take a gzipped 1kg VCF, unzip it, and run it through SnpEff.
Trivial, right? Yes and no.
The pipeline has a couple of steps, so we have to hand off files and link together a couple of CWL scripts into a pipeline (or, in CWL parlance, a Workflow). This is complex enough to show off some of the features that make the CWL useful, but it isn’t a 40-step production bioinformatics workflow to get buried in.
Let’s start at the high level and work our way down.
Here’s the workflow:
#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.0
inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzipped VCF file to annotate
outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes
steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]
doc: |
  Annotate variants provided in a gzipped VCF using SnpEff
Let’s take it line by line.
First, we label this a Workflow. We’ll see later that the actual execution of programs is done in CommandLineTool classes.
Next, notice that the cwlVersion is v1.0. You’ll see a lot of workflows around using the draft-3 spec. These versions are incompatible, so you can’t mix and match.
Now the inputs section:
inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzipped VCF file to annotate
Here is where you declare the “things” you’ll pass in at runtime. This section establishes the bindings that you can use to pass the parameters to workflow steps. genome is a string that tells SnpEff what reference to use (e.g. “hg18”, “hg19”, “mm9”, etc.). infile is the gzipped VCF. As you can see, you can document these parameters with the doc key.
Onto the outputs:
outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes
This section specifies what the workflow will return. We are going to capture 3 files produced by SnpEff. The file outfile will contain the annotated VCF, statsfile is an HTML document describing what SnpEff processed, and genesfile is a summary of the number of variants affecting each gene and transcript. Note that at the workflow level we don’t have to worry about the name or location of these files; we only need to know what step they are coming from.
Now for the heart of the file, the steps:
steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]
Each sub-block defines a step. The step is identified by the sub-block’s key (i.e. gunzip or snpeff). Each step specifies a CWL file that defines the command and the input and output file mappings. Note how you can define the file using a source key or directly with the parent key. The other quirk is that the output files need to be specified as an array. I believe this is a bug in the reference implementation, since according to the spec out: unzipped_vcf and out: [unzipped_vcf] are equivalent. However, in the snpeff step we are returning 3 files, so they must be specified as an array. Finally, note how the references to files are constructed. If you are using one of the workflow inputs, you don’t need to namespace the reference (i.e. you just say infile). But to use a step’s output as an input, you need to namespace the reference with the step (i.e. you say gunzip/unzipped_vcf, not unzipped_vcf).
Lastly, it’s always a good idea to add a quick bit of documentation at the bottom so you aren’t left scratching your head 6 months from now.
Step 4. The gunzip step
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gunzip, -c]
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
The first thing to notice is class: CommandLineTool. This line tells cwltool to parse and run this as an execution step.
Next, baseCommand: [gunzip, -c] is the command and any required args. This array is joined together with spaces to construct the first part of the command. It’s worth reading how commands are run in the spec. I could have also specified the -c flag as an argument:
baseCommand: gunzip
arguments: ["-c"]
In any case, gunzip -c unzips the file to stdout.
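In plain shell terms, this step does the equivalent of the following (using the input file we’ll pass in later):

gunzip -c test/chr22.truncated.nosamples.1kg.vcf.gz > unzipped.vcf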
We specify the file to unzip in the inputs block:
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
There are two things to note here. First, the sub-block key gzipfile matches the file identifier we used in the in block of the gunzip step in the workflow file. Second, we specify a position in the inputBinding. The position is used to sort the inputs when constructing the command, so the order in which inputs are specified in the CWL file doesn’t necessarily reflect the order they will appear in the command.
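To see that sorting in action, here’s a sketch of a hypothetical tool whose inputs are declared out of order; the constructed command still places first_arg before second_arg because position wins:

inputs:
  second_arg:
    type: string
    inputBinding:
      position: 2
  first_arg:
    type: string
    inputBinding:
      position: 1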
Finally, we specify the output file:
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
Since we want to capture the stdout stream from gunzip, we specify type: stdout. We could stop here and let cwltool create a file name, but these generated file names are not very pretty. Instead, we bind stdout to the file unzipped.vcf. In practice, we will never see this file. unzipped.vcf will be created in a temporary directory during processing and deleted automatically.
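A handy trick while developing: cwltool can run this one step on its own, generating command-line flags from the inputs block (a sketch, using the same test file):

cwltool gunzip.cwl --gzipfile test/chr22.truncated.nosamples.1kg.vcf.gz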
Step 5. The snpeff step
Now, we will take that unzipped VCF and run SnpEff on it. Since SnpEff requires a specific environment (an installed Java Development Kit), we’ll run it in a Docker container.
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.0
baseCommand: []
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
Again, this is an execution step, so we use class: CommandLineTool with v1.0 of the language.
We tell cwltool to use a Docker container in the requirements block:
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
There are a few ways of referencing a container. Here, we use the image we created earlier, called andrewjesaitis/snpeff. Alternatively, we could specify a repo using the dockerPull key as shown in the user guide. Or you can inline the Dockerfile in your CWL file using the dockerFile key; cwltool will then build the image using docker build.
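For example, the requirements block could instead pull an image from a registry (the image name here is hypothetical, just to show the shape of the key):

requirements:
  - class: DockerRequirement
    dockerPull: someuser/snpeff:latest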
Next we provide the necessary inputs to SnpEff:
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
The SnpEff command we run in start.sh looks like java -Xmx4g -jar snpEff.jar <genome> <input_vcf>. So genome is just a string that tells SnpEff what reference to use. We provide this string later in the 1kg-job.yml file that we will pass to the cwltool command. The input_vcf is going to come from the unzipped_vcf output of our gunzip step. And again, we tell cwltool what order the arguments should be in using the position key.
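Putting the pieces together, the command that effectively runs inside the container should look roughly like this (a sketch: cwltool actually stages the input under a temporary path; unzipped.vcf is the name we bound in the gunzip step):

java -Xmx4g -jar snpEff.jar hg19 unzipped.vcf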
Now, we want to collect our output files:
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
First, output will contain the annotated VCF produced by SnpEff. Since SnpEff sends this to stdout, we use the familiar stdout pattern we employed in the gunzip step. For the two summary files, we tell cwltool to perform a wildcard glob match (the same syntax that you use at the command line, i.e. rm *.doc). It will find an html file and a txt file and bind them to stats and genes respectively. If we expected more than one txt file, for example, we would need a more specific match or an array type to return multiple files, as sketched below.
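That array variant would look roughly like this (a sketch; reports is a hypothetical output name, and File[] is v1.0 shorthand for an array of files):

outputs:
  reports:
    type: File[]
    outputBinding:
      glob: "*.txt"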
Execution
Step 6. Let’s run this thing
Finally, we are ready to run this mini pipeline. We just need to specify the inputs (the genome and the zipped VCF) in our 1kg-job.yml file:
infile:
  class: File
  path: test/chr22.truncated.nosamples.1kg.vcf.gz
genome: hg19
So we bind the relative path test/chr22.truncated.nosamples.1kg.vcf.gz to infile. Remember, infile matches the reference in the in block in our snpeff-workflow.cwl file. Similarly, we bind the string hg19 to genome, to be passed to the SnpEff command during the second step.
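If you’d rather skip the job file, cwltool can also accept the same inputs as command-line flags generated from the workflow’s inputs section; a sketch of an equivalent invocation:

cwltool snpeff-workflow.cwl --infile test/chr22.truncated.nosamples.1kg.vcf.gz --genome hg19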
Now we run it with the command:
cwltool snpeff-workflow.cwl 1kg-job.yml
It will take a minute to run, during which time we’ll see the commands that are executed and finally some JSON describing the output files. After it completes, we’ll find these new files in our directory: output.vcf, snpEff_summary.html, and snpEff_genes.txt. Notice that we don’t find the intermediate unzipped.vcf file. Intermediate files that are not specified in the out block of the workflow are automatically deleted.
Final thoughts
So if you are like me, you might be thinking, “Andrew, this is a ton of work just to run two commands.” And I’d agree, but after you write a couple of workflows you can punch these out really fast. Also, there are a bunch of advantages that aren’t immediately obvious:
- Declarative style minimizes side effects compared to imperative style
- Modular; pop in the same tool to multiple workflows
- Can be produced programmatically
- No crazy string formatting antics
- No monitoring of subprocesses
- Easy to distribute your workload across a cluster
Ultimately, these advantages should mean that you spend less time troubleshooting existing workflows and that your tools can be picked up by colleagues more easily.
Finally, as you develop your own workflows, you are bound to run into problems. Two tips:
- Make sure you check the version in the URL on the CWL site. It’s easy to be digging through the spec or user’s guide only to find you are looking at draft-3 instead of v1.0.
- You can pass --leave-tmpdir to the cwltool command, as shown below. This is often helpful for figuring out whether the outputs from a step are what you think they should be.
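For example (the same invocation as before, with the debugging flag added):

cwltool --leave-tmpdir snpeff-workflow.cwl 1kg-job.yml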
All these scripts are available in a GitHub repo for you to play with.
Here are the commands to get you started:
git clone git@github.com:andrewjesaitis/cwl-tutorial
cd cwl-tutorial
docker build --tag=andrewjesaitis/snpeff .
pip install cwltool
cwltool snpeff-workflow.cwl 1kg-job.yml