If you’ve spent any time at the command line you know that feeling where you can’t figure out how you did something a few weeks ago and want to repeat it. You’ve probably hit
ctrl-r and looked through your bash history, but alas what you need is just out of reach.
The next time you did something you thought you’d repeat, you opened up
mysuperawesomeworkflow.sh and carefully copied each command over. And it sort of works…at least you have some command reference. Then someone asks you for your bash script…gulp…you really don’t want to hand over that garbage fire.
But, there is a new kid on the block when it comes to repeatable workflows – The Common Workflow Language.
What the hell is this thing?
Technically, it’s just a spec that defines a file format. But, there are some reference implementations which provide programs to parse this special format and run the commands that are specified. That’s it. Think of it as a souped-up form of bash. So instead of writing a
.sh file, you’d write a .cwl file.
The most basic and common way to run these
cwl files is with
cwltool. It’s pip installable so you can just fire up a terminal and install it by typing
pip install cwltool.
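To get a feel for the format before diving into the full pipeline, here is about the smallest tool description you can write. This is a hypothetical echo wrapper of my own (the file name echo.cwl and the input name message are illustrative, not part of the workflow below):

```yaml
# echo.cwl -- a minimal, hypothetical CommandLineTool for trying out cwltool
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
```

Running `cwltool echo.cwl --message hello` should execute `echo hello` in a scratch directory and report an empty JSON output object, since the tool declares no outputs.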
But other tools will also take these
cwl files as workflow specifications. For example, Arvados uses the CWL to define its pipelines, and there is talk of using it in Galaxy as well.
Combining CWL with Docker
The other catch with repeating past work is that it depends on how your computer was set up at the time.
- What operating system was used?
- What programs were installed?
- What version of the Java virtual machine was used?
Docker is a cool way to specify and create this context. It creates containers, which you can think of like little walled off computers that run on your computer. It does a lot of fancy things to make this efficient, but the key thing is that it creates a little sandbox for your work that looks the same every time you use the container.
So by using Docker, we can create a stable environment to run a tool in. And we can run the same sequence of commands using the CWL. All this adds up to reproducible workflows.
Most of the examples on the web are either too simplistic to be useful or too complex to be understood. So I’ll try to explain a moderately complex workflow that showcases some interesting “features” that you might run into.
In this workflow we’ll see:
- How to run a command on the native system
- How to run a command in a Docker container
- How to pass in input files
- How to pass in arguments for a command
- How to capture stdout
- How to grab a file produced by a command
- How to specify a command flag
- How to use the outputs of one step as the input for the next step
First, define the context
The first thing we want to do is define the context for the analysis. To do that we create a Docker container.
Step 0. Install Docker
Head over to the Docker site to learn how.
Step 1. Create a Dockerfile
The Dockerfile defines what the container will look like. By convention these files are named
Dockerfile and live in the directory where you will build the container.
```dockerfile
FROM openjdk:8-jdk
ENV JAVA_OPTS="-Xmx4g"
RUN apt-get update && apt-get install -y unzip wget
RUN wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
RUN unzip snpEff_latest_core.zip
RUN rm snpEff_latest_core.zip
RUN cp snpEff/snpEff.jar .
RUN cp snpEff/snpEff.config .
WORKDIR /snpEff
RUN java -jar snpEff.jar download -v hg19
COPY start.sh /opt/snpeff/
ENTRYPOINT ["sh", "/opt/snpeff/start.sh"]
```
So what does this do?
First, it gets a base container that has the Java Development Kit installed.
Then, it sets an environment variable
JAVA_OPTS with the flag
-Xmx4g. This flag will be used to tell the Java Virtual Machine to use 4 GB of memory.
Then, it installs
unzip and wget, which are used to download and unpack SnpEff.
Next, it downloads and installs SnpEff. By install, we mean unzip the archive and copy the
.jar and .config files into the
/snpEff directory in the container.
Then it switches to that directory and downloads the hg19 database. SnpEff will do this automatically at runtime, but doing it when we build the container saves us from downloading the database each time we run the container.
Finally, it copies a script and tells the container to run it on startup.
To build this container and make an image we can just
cd into the directory containing the Dockerfile and run
docker build --tag=andrewjesaitis/snpeff .. This builds the image and tags it with
andrewjesaitis/snpeff. We could then push it to our container registry, but for now we just need it on our system.
Step 2. The start.sh script
So in the last step we told Docker to copy start.sh into
/opt/snpeff and run it on startup. What’s in this script?
```bash
#!/usr/bin/env bash
cp -R /snpEff/* .
java $JAVA_OPTS -jar snpEff.jar "$@"
```
When this runs, the first thing that happens is that we copy the SnpEff files to our current directory. If you were to run
pwd you’d see that we are in a temp directory here, so we need to copy SnpEff to where we will execute it. Then we just run SnpEff with the
-Xmx4g flag that we specified as an environment variable in the Dockerfile.
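Two bits of shell behavior do the heavy lifting here: the unquoted $JAVA_OPTS expansion word-splits into individual JVM flags, while "$@" forwards the container's arguments untouched. A quick sketch (echoing the command instead of invoking java; the argument values are made up):

```shell
#!/usr/bin/env bash
# Simulate the pieces start.sh works with:
JAVA_OPTS="-Xmx4g"        # set via ENV in the Dockerfile
set -- hg19 input.vcf      # stand-ins for the args docker passes to the entrypoint

# Unquoted $JAVA_OPTS word-splits into separate flags;
# quoted "$@" preserves each argument exactly as received.
echo java $JAVA_OPTS -jar snpEff.jar "$@"
# prints: java -Xmx4g -jar snpEff.jar hg19 input.vcf
```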
Define the Workflow
Step 3. Specifying the Pipeline
So what’s this pipeline going to do? We are going to take a gzipped 1kg VCF, unzip it, and run it through SnpEff.
Trivial, right? Yes and no.
The pipeline has a couple of steps, so we have to hand off files and link together a couple of CWL scripts into a pipeline (or, in CWL parlance, a Workflow). This is complex enough to show off some of the features that make the CWL useful, but it isn’t a 40-step bioinformatics core workflow to get buried in.
Let’s start at the high level and work our way down.
Here’s the workflow:
```yaml
#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.0

inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzip VCF file to annotate

outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes

steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]

doc: |
  Annotate variants provided in a gzipped VCF using SnpEff
```
Let’s take it line by line.
First, we label this a
Workflow. We’ll see later that the actual execution of programs is done in CommandLineTool files.
Next, notice that the cwlVersion is
v1.0. You’ll see a lot of workflows around using the
draft-3 spec. These versions are incompatible, so you can’t mix and match.
```yaml
inputs:
  genome:
    type: string
  infile:
    type: File
    doc: gzip VCF file to annotate
```
Here is where you declare the “things” you’ll pass in at runtime. This section establishes the bindings that you can use to pass the parameters to workflow steps.
genome is a string that tells SnpEff what reference to use (e.g. “hg18”, “hg19”, “mm9” etc).
infile is the gzipped VCF. As you can see, you can document these parameters with the doc key.
Onto the outputs:
```yaml
outputs:
  outfile:
    type: File
    outputSource: snpeff/output
  statsfile:
    type: File
    outputSource: snpeff/stats
  genesfile:
    type: File
    outputSource: snpeff/genes
```
This section specifies what the workflow will return. We are going to capture 3 files produced by SnpEff. The file
outfile will contain the annotated VCF,
statsfile is an html document describing what SnpEff processed, and finally SnpEff produces a summary of the number of variants affecting each gene and transcript which we will reference as
genesfile. Note that at the workflow level we don’t have to worry about the name or location of these files; we need only know what step they are coming from.
Now for the heart of the file, the steps section:
```yaml
steps:
  gunzip:
    run: gunzip.cwl
    in:
      gzipfile:
        source: infile
    out: [unzipped_vcf]
  snpeff:
    run: snpeff.cwl
    in:
      input_vcf: gunzip/unzipped_vcf
      genome: genome
    out: [output, stats, genes]
```
Each sub-block defines a step. The step is identified by the sub-block’s key (i.e.
snpeff). Each step specifies a CWL file that defines the command and the input and output file mappings. Note how you can define the file using a
source key or directly with the parent key. The other quirk is that the output files need to be specified as an array. I believe this is a bug in the reference implementation, since according to the spec
out: unzipped_vcf and
out: [unzipped_vcf] are equivalent. However, in the snpeff step we are returning 3 files, so they must be specified as an array. Finally, note how the references to files are constructed. If you are using one of the workflow inputs, you don’t need to namespace the reference (i.e. you just say
infile). But, to use a step’s output as an input, you need to namespace the reference to the step (i.e. you say gunzip/unzipped_vcf).
Lastly, it’s always a good idea to add a quick bit of documentation at the bottom so you aren’t left scratching your head 6 months from now.
Step 4. The gunzip step
```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gunzip, -c]
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
```
The first thing to notice is
class: CommandLineTool. This line tells
cwltool to parse and run this as an execution step.
baseCommand: [gunzip, -c] is the command and any required args. This array is joined together with spaces to construct the first part of the command. It’s worth reading how commands are run in the spec. I could have also specified the
-c flag as an argument:
```yaml
baseCommand: gunzip
arguments: ["-c"]
```
In any case
gunzip -c unzips the file to stdout.
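If you haven’t used the -c flag before, here’s a tiny demonstration with a throwaway file (the file names are mine, purely illustrative):

```shell
#!/usr/bin/env bash
printf 'demo line\n' > demo.vcf        # a one-line stand-in for a real VCF
gzip demo.vcf                           # produces demo.vcf.gz
gunzip -c demo.vcf.gz > unzipped.vcf    # -c streams the contents to stdout
cat unzipped.vcf
# prints: demo line
```

Note that -c also leaves demo.vcf.gz in place, whereas a plain gunzip would replace it with the decompressed file.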
We specify the file to unzip in the inputs section:
```yaml
inputs:
  gzipfile:
    type: File
    inputBinding:
      position: 1
```
There are two things to note here. First, the sub-block key
gzipfile matches the file identifier we used in the workflow file in the
in block of the gunzip step. Second, we specify a
position in the inputBinding. The
position is used to sort the inputs when constructing the command, so the order in which inputs are specified in the CWL file doesn’t necessarily reflect the order they will appear in the command.
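In other words, position, not file order, determines argument order. A hypothetical inputs fragment with its entries deliberately listed backwards illustrates this:

```yaml
# Hypothetical fragment: inputs listed out of order on purpose.
# The command line is still assembled as: <baseCommand> <first_file> <second_file>
inputs:
  second_file:
    type: File
    inputBinding:
      position: 2
  first_file:
    type: File
    inputBinding:
      position: 1
```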
Finally, we specify the output file:
```yaml
outputs:
  unzipped_vcf:
    type: stdout
stdout: unzipped.vcf
```
Since we want to capture the stdout stream from gunzip, we specify
type: stdout. We could stop here and let
cwltool create a file name, but these generated file names are not very pretty. Instead, we bind stdout to the file
unzipped.vcf. In practice, we will never see this file.
unzipped.vcf will be created in a temporary directory during processing and deleted automatically.
Step 5. The snpeff step
Now, we will take that unzipped vcf and run SnpEff on it. Since SnpEff requires a specific environment (an installed Java Development Kit), we’ll run it in a Docker container.
```yaml
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.0
baseCommand:
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
```
Again, this is an execution step, so we use
class: CommandLineTool with
v1.0 of the language. We tell
cwltool to use a Docker container in the requirements block:
```yaml
requirements:
  - class: DockerRequirement
    dockerImageId: andrewjesaitis/snpeff
```
There are a few ways of referencing a container. Here, we use the image we created earlier called
andrewjesaitis/snpeff. Alternatively, we could specify a repo using the
dockerPull key as shown in the user guide. Or you can inline the Dockerfile in your CWL file using the
dockerFile key. The
cwltool will then build the image from the inlined Dockerfile.
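For example, pulling a public image from a registry instead of using a locally built one would look something like this (the image name here is illustrative):

```yaml
requirements:
  - class: DockerRequirement
    # cwltool will pull this image at runtime if it isn't already present
    dockerPull: ubuntu:16.04
```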
Next we provide the necessary inputs to SnpEff:
```yaml
inputs:
  genome:
    type: string
    inputBinding:
      position: 1
  input_vcf:
    type: File
    inputBinding:
      position: 2
    doc: VCF file to annotate
```
The SnpEff command we run in
start.sh looks like
java -Xmx4g -jar snpEff.jar <genome> <input_vcf>. So
genome is just a string that tells SnpEff what reference to use. We provide this string later in the
1kg-job.yml file that we will pass into the
cwltool command. The
input_vcf is going to come from the
unzipped_vcf output of our gunzip step. And again, we tell
cwltool what order the arguments should be in using the position keys.
Now, we want to collect our output files:
```yaml
outputs:
  output:
    type: stdout
  stats:
    type: File
    outputBinding:
      glob: "*.html"
  genes:
    type: File
    outputBinding:
      glob: "*.txt"
stdout: output.vcf
```
output will contain the annotated vcf produced by SnpEff. Since SnpEff sends this to
stdout, we will use the familiar
stdout pattern we employed in the gunzip step. For the two summary files, we tell
cwltool to perform a wildcard glob match (the same syntax that you use at the command line – i.e.
rm *.doc). It will find an html and a txt file and bind them to
stats and genes respectively. If we expected more than one txt file, for example, we would need a more specific match or to use an array type to return multiple files.
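Outside of CWL, this is exactly the matching your shell does. A quick sketch with throwaway files (the names are mine, standing in for SnpEff’s outputs):

```shell
#!/usr/bin/env bash
mkdir -p glob_demo
touch glob_demo/stats.html glob_demo/genes.txt   # stand-ins for SnpEff's outputs

ls glob_demo/*.html    # the "*.html" pattern matches exactly one file here
ls glob_demo/*.txt
# prints:
# glob_demo/stats.html
# glob_demo/genes.txt
```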
Step 6. Let’s run this thing
Finally, we are ready to run this mini pipeline. We just need to specify the inputs – the genome and the zipped VCF – in our 1kg-job.yml file:
```yaml
infile:
  class: File
  path: test/chr22.truncated.nosamples.1kg.vcf.gz
genome: hg19
```
So we bind the relative path to
infile, which matches the reference in the
in block in our
snpeff-workflow.cwl file. Similarly, we bind the string hg19 to
genome, to be passed to the SnpEff command during the second step.
Now we run it with the command:
cwltool snpeff-workflow.cwl 1kg-job.yml
It will take a minute to run, during which time we’ll see the commands that are executed and finally some JSON showing the output files. After it completes, we’ll find the new files in our directory, including output.vcf and
snpEff_genes.txt. Notice that we don’t find the intermediate
unzipped.vcf file. Intermediate files that are not specified in the
out block in the workflow are automatically deleted.
So if you are like me, you might be thinking, “Andrew, this is a ton of work just to run two commands.” And I’d agree, but after you write a couple of workflows you can punch these out really fast. Also, there are a bunch of advantages that aren’t immediately obvious:
- Declarative style minimizes side effects compared to imperative style
- Modular; pop in the same tool to multiple workflows
- Can be produced programmatically
- No crazy string formatting antics
- No monitoring of subprocesses
- Easy to distribute your workload across a cluster
Ultimately, these advantages should mean that you spend less time troubleshooting existing workflows, and that your tools can be used by colleagues more easily.
Finally, as you develop your own workflows, you are bound to run into problems. Two tips:
- Make sure you check the version in the url on the CWL site. It’s easy to be digging through the spec or user’s guide only to find you are looking at the draft-3 docs instead of v1.0.
- You can pass extra flags to the
cwltool command to inspect intermediate output. This is often helpful to figure out if the outputs from a step are what you think they should be.
All these scripts are available in a GitHub repo for you to play with.
Here are the commands to get you started:
```bash
git clone git@github.com:andrewjesaitis/cwl-tutorial
cd cwl-tutorial
docker build --tag=andrewjesaitis/snpeff .
pip install cwltool
cwltool snpeff-workflow.cwl 1kg-job.yml
```