Developer - Example pipeline
This tutorial shows how to add a new pipeline to Biopet. The minimum requirements are:
- A clean biopet checkout from git
- A text editor or IntelliJ IDEA
Adding pipeline folder
Via the command line:
cd biopet/public/
mkdir -p mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline
Adding Maven project
Add a pom.xml to the biopet/public/mypipeline folder. The example below is the minimum required POM definition:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>Biopet</artifactId>
        <groupId>nl.lumc.sasc</groupId>
        <version>0.5.0-SNAPSHOT</version>
        <relativePath>../</relativePath>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <inceptionYear>2015</inceptionYear>
    <artifactId>MyPipeline</artifactId>
    <name>MyPipeline</name>
    <packaging>jar</packaging>
    <dependencies>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetCore</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetToolsExtensions</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
Initial pipeline code
In biopet/public/mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline, create a file named HelloPipeline.scala with the following contents:
package nl.lumc.sasc.biopet.pipelines.mypipeline
import nl.lumc.sasc.biopet.core.PipelineCommand
import nl.lumc.sasc.biopet.core.summary.SummaryQScript
import nl.lumc.sasc.biopet.extensions.Fastqc
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.queue.QScript
class HelloPipeline(val root: Configurable) extends QScript with SummaryQScript {
  def this() = this(null)

  /** Only required when using [[SummaryQScript]] */
  def summaryFile: File = new File(outputDir, "hello.summary.json")

  /** Only required when using [[SummaryQScript]] */
  def summaryFiles: Map[String, File] = Map()

  /** Only required when using [[SummaryQScript]] */
  def summarySettings: Map[String, Any] = Map()

  // This method can be used to initialize some classes where needed
  def init(): Unit = {
  }

  // This method is the actual pipeline
  def biopetScript: Unit = {
    // Executing a tool like FastQC, calling the extension in `nl.lumc.sasc.biopet.extensions.Fastqc`
    val fastqc = new Fastqc(this)
    fastqc.fastqfile = config("fastqc_input")
    fastqc.output = new File(outputDir, "fastqc.txt")
    add(fastqc)
  }
}

object HelloPipeline extends PipelineCommand
Looking at the pipeline, you can see that it inherits from QScript. QScript is the fundamental class which gives access to the Queue scheduling system. In addition, the SummaryQScript trait adds another layer of functions to handle and create summary files from pipeline output.
In the declaration class HelloPipeline(val root: Configurable), our pipeline is called HelloPipeline and takes a root holding the configuration options passed down to Biopet via a JSON file specified on the command line (--config).
def biopetScript: Unit = {
}
One can start adding pipeline components in biopetScript, which is the programmatic equivalent of the main method in most popular programming languages. For example, one can add a QC tool such as FastQC to the pipeline, as in the example shown above.
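To illustrate why biopetScript plays the role of main, the sketch below models the idea in plain Scala: calling add only registers a job, and the registered jobs are handed to a scheduler afterwards. The names MiniJob, MiniPipeline, script, and run are hypothetical stand-ins and are not part of Biopet or Queue. This is a simplified model under those assumptions, not a description of how Queue schedules jobs internally.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-ins for illustration only; not Biopet or Queue classes.
final case class MiniJob(name: String, command: String)

abstract class MiniPipeline {
  private val jobs = ListBuffer.empty[MiniJob]

  /** Registers a job; nothing is executed at this point. */
  def add(job: MiniJob): Unit = jobs += job

  /** Plays the role of `biopetScript`: declares the jobs of the pipeline. */
  def script(): Unit

  /** Simplified "scheduler": collects the declared jobs and returns their commands in order. */
  def run(): List[String] = {
    jobs.clear()
    script()
    jobs.toList.map(_.command)
  }
}

object MiniPipelineDemo {
  class Hello extends MiniPipeline {
    def script(): Unit = {
      add(MiniJob("fastqc", "fastqc input.fastq"))
      add(MiniJob("report", "echo done"))
    }
  }

  def main(args: Array[String]): Unit =
    new Hello().run().foreach(println)
}
```

The point of the design is that the pipeline body only declares work. Ordering and execution are left to whatever consumes the job list.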
The pipeline is set up within the pipeline code itself. Fine-tuning is always possible by overriding settings in the following way:
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9, wrapped in `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
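The reason for wrapping the value in Some() can be sketched with a small standalone example. An Option field distinguishes "not set" (None) from "set to a value" (Some), so a flag only needs to be emitted when the user chose a value. FastqcSettings and toArgs below are hypothetical and much simpler than Biopet's actual Fastqc extension; how Biopet renders its command line is an assumption here.

```scala
// Hypothetical, simplified settings class; not Biopet's actual `Fastqc` extension.
final case class FastqcSettings(input: String, kmers: Option[Int] = None)

object OptionFlagDemo {
  /** Builds an argument list, emitting `--kmers` only when a value is set. */
  def toArgs(s: FastqcSettings): List[String] = {
    val kmerArgs = s.kmers match {
      case Some(k) => List("--kmers", k.toString)
      case None    => Nil // unset: leave the tool's own default in place
    }
    List("fastqc") ++ kmerArgs ++ List(s.input)
  }

  def main(args: Array[String]): Unit = {
    println(toArgs(FastqcSettings("reads.fastq")).mkString(" "))
    println(toArgs(FastqcSettings("reads.fastq", kmers = Some(9))).mkString(" "))
  }
}
```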
Config setup
For our new pipeline, one should set up the (default) config options.
Since our pipeline is called HelloPipeline, the root of the namespace for our pipeline will be called hellopipeline (lowercase).
{
    "output_dir": "/home/user/mypipelineoutpt",
    "hellopipeline": {
    }
}
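The naming rule above can be sketched in plain Scala: the namespace key is the pipeline name in lowercase, and pipeline-specific options live in the object stored under that key. The Config type, the lookup helper, and the placement and value of fastqc_input are assumptions made for illustration; Biopet's real config resolution is not shown here.

```scala
object ConfigNamespaceDemo {
  // Hypothetical in-memory stand-in for a parsed config JSON.
  type Config = Map[String, Any]

  /** Derives the namespace key from a pipeline name by lowercasing it. */
  def namespaceOf(pipelineName: String): String = pipelineName.toLowerCase

  /** Looks up `key` inside the pipeline's namespace, if both exist. */
  def lookup(config: Config, pipelineName: String, key: String): Option[Any] =
    config.get(namespaceOf(pipelineName)) match {
      case Some(section: Map[_, _]) => section.asInstanceOf[Map[String, Any]].get(key)
      case _                        => None
    }

  def main(args: Array[String]): Unit = {
    val config: Config = Map(
      "output_dir" -> "/home/user/mypipelineoutpt",
      // Assumed placement and made-up path, for illustration only.
      "hellopipeline" -> Map("fastqc_input" -> "/path/to/reads.fastq")
    )
    println(namespaceOf("HelloPipeline"))
    println(lookup(config, "HelloPipeline", "fastqc_input"))
  }
}
```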
Test pipeline
Summary output
Any pipeline that mixes in SummaryQScript will produce a summary JSON.
This summary JSON usually contains statistics and some output results.
By mixing in SummaryQScript, the new pipeline needs to implement three functions:
summaryFile: File
summaryFiles: Map[String, File]
summarySettings: Map[String, Any]
Of those three, summaryFile is the most important one; it should point to the file the summary will be written to.
The summaryFiles function should contain any extra files one would like to add to the summary.
Files are listed in a separate files JSON object, which by default includes any executables used in the pipeline.
The summarySettings function should contain any extra settings one would like to add to the summary.
Settings are listed in a separate settings JSON object.
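As a rough picture of how the three functions relate to the summary, the sketch below defines a hypothetical MiniSummary trait with the same three members and groups the extra files and settings under separate files and settings keys. MiniSummary, layout, and the example values are illustrative assumptions; the exact structure Biopet writes to the summary file may differ.

```scala
import java.io.File

// Hypothetical trait mirroring the three methods named above; not Biopet's SummaryQScript.
trait MiniSummary {
  def summaryFile: File
  def summaryFiles: Map[String, File]
  def summarySettings: Map[String, Any]

  /** Groups the extra files and settings under separate top-level keys. */
  def layout: Map[String, Map[String, Any]] = Map(
    "files" -> summaryFiles.map { case (k, f) => k -> f.getPath },
    "settings" -> summarySettings
  )
}

object MiniSummaryDemo {
  // Example values are made up for illustration.
  val example: MiniSummary = new MiniSummary {
    def summaryFile: File = new File("out", "hello.summary.json")
    def summaryFiles: Map[String, File] = Map("fastqc_output" -> new File("out", "fastqc.txt"))
    def summarySettings: Map[String, Any] = Map("kmers" -> 9)
  }

  def main(args: Array[String]): Unit = {
    println(s"summary would be written to ${example.summaryFile.getPath}")
    println(example.layout)
  }
}
```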
Apart from these fields, the summary JSON will be populated with statistics from tool extensions that mix in Summarizable.
To populate these statistics, one has to call addSummarizable on the tool.
For instance, let's go back to the fastqc example. The original declaration was:
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9, wrapped in `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
To add the fastqc summary to our summary JSON all we have to do is write the following line afterwards:
addSummarizable(fastqc)
Summary statistics for fastqc will then end up in a stats JSON object in the summary.
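The registration step can be modelled with a small standalone sketch: only tools that were explicitly registered contribute to the stats object. MiniSummarizable, MiniSummaryCollector, FakeTool, and the metric name and value are hypothetical and exist only for this illustration; they are not Biopet's Summarizable or SummaryQScript.

```scala
import scala.collection.mutable

// Hypothetical stand-ins; not Biopet's Summarizable or SummaryQScript.
trait MiniSummarizable {
  def name: String
  def stats: Map[String, Any]
}

class MiniSummaryCollector {
  private val registered = mutable.LinkedHashMap.empty[String, MiniSummarizable]

  /** Registers a tool so that its statistics are included in the summary. */
  def addSummarizable(tool: MiniSummarizable): Unit = registered(tool.name) = tool

  /** Collects statistics of registered tools only, keyed by tool name under "stats". */
  def summary: Map[String, Map[String, Map[String, Any]]] =
    Map("stats" -> registered.map { case (n, t) => n -> t.stats }.toMap)
}

object MiniSummaryCollectorDemo {
  final case class FakeTool(name: String, stats: Map[String, Any]) extends MiniSummarizable

  def main(args: Array[String]): Unit = {
    val collector = new MiniSummaryCollector
    // Metric name and value are made up for illustration.
    collector.addSummarizable(FakeTool("fastqc", Map("example_metric" -> 42)))
    println(collector.summary)
  }
}
```

A tool that is added to the pipeline but never passed to addSummarizable would simply be absent from the stats object in this model.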
See the tool tutorial for how to make a tool extension produce any summary output.