05 December 2014

Divide-and-conquer in a #Makefile : recursivity and #parallelism.

This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of N VCF files. You want to create a list of:

  • common SNPs in vcf1 and vcf2
  • common SNPs in vcf3 and the previous list
  • common SNPs in vcf4 and the previous list
  • (...)
  • common SNPs in vcfN and the previous list
Yes, I know I can do this using: grep -v '^#' f.vcf | cut -f 1,2,4,5 | sort | uniq

Using a linear Makefile, it could look like this:

list2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list3: vcf3 list2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list4: vcf4 list3
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list5: vcf5 list4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
(...)

We can speed up the workflow by combining make's parallel option -j (number of parallel jobs) with a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently, in parallel.

list1_2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list3_4: vcf3 vcf4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list1_4: list1_2 list3_4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

By using the internal 'Make' functions $(eval), $(call) and $(shell), we can define a recursive method "recursive". This method takes two arguments: the 0-based indexes bounding a slice of the list of VCFs. Here is the Makefile:
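(What follows is a minimal re-sketch of that Makefile rather than the original file: it assumes the input files are named Sample*.vcf.gz and uses arbitrary helper names 'leaf', 'node', 'mid' and 'recursive'.)

# minimal sketch of the divide-and-conquer idea (not the original post's Makefile)
VCFS := $(wildcard Sample*.vcf.gz)
N    := $(words $(VCFS))

# integer midpoint of two 0-based indexes, computed by the shell
mid = $(shell echo $$(( ($(1)+$(2))/2 )))

# leaf rule: extract chrom/pos/ref/alt from the single VCF of a width-1 slice
# (the 1-based index of that VCF in the list is the upper bound of the slice)
define leaf
target$(1)_$(2) : $(word $(2),$(VCFS))
	gunzip -c $$< | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > $$@
endef

# node rule: intersect the two halves of the slice with 'comm -12'
define node
target$(1)_$(3) : target$(1)_$(2) target$(2)_$(3)
	LC_ALL=C comm -12 $$^ > $$@
endef

# recursive(low,high): emit a leaf when high-low==1, otherwise a node plus its two sub-trees
recursive = $(if $(filter 1,$(shell echo $$(( $(2)-$(1) )))),\
    $(eval $(call leaf,$(1),$(2))),\
    $(eval $(call node,$(1),$(call mid,$(1),$(2)),$(2)))\
    $(call recursive,$(1),$(call mid,$(1),$(2)))\
    $(call recursive,$(call mid,$(1),$(2)),$(2)))

all : target0_$(N)

$(eval $(call recursive,0,$(N)))

Running it with make -j lets the independent branches of the tree be built in parallel.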
Running the makefile:
$ make 
gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1
gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2
LC_ALL=C comm -12 target0_1 target1_2 > target0_2
gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3
gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4
gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5
LC_ALL=C comm -12 target3_4 target4_5 > target3_5
LC_ALL=C comm -12 target2_3 target3_5 > target2_5
LC_ALL=C comm -12 target0_2 target2_5 > target0_5
gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6
gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7
LC_ALL=C comm -12 target5_6 target6_7 > target5_7
gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8
gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9
gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10
LC_ALL=C comm -12 target8_9 target9_10 > target8_10
LC_ALL=C comm -12 target7_8 target8_10 > target7_10
LC_ALL=C comm -12 target5_7 target7_10 > target5_10
LC_ALL=C comm -12 target0_5 target5_10 > target0_10
gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11
gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12
LC_ALL=C comm -12 target10_11 target11_12 > target10_12
gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13
gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14
gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15
LC_ALL=C comm -12 target13_14 target14_15 > target13_15
LC_ALL=C comm -12 target12_13 target13_15 > target12_15
LC_ALL=C comm -12 target10_12 target12_15 > target10_15
gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16
gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17
LC_ALL=C comm -12 target15_16 target16_17 > target15_17
gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18
gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19
gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20
LC_ALL=C comm -12 target18_19 target19_20 > target18_20
LC_ALL=C comm -12 target17_18 target18_20 > target17_20
LC_ALL=C comm -12 target15_17 target17_20 > target15_20
LC_ALL=C comm -12 target10_15 target15_20 > target10_20
LC_ALL=C comm -12 target0_10 target10_20 > target0_20
And here is the generated workflow (drawn with make2graph).
That's it
Pierre.

04 December 2014

XML+XSLT = #Makefile -based #workflows for #bioinformatics

I've recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I've suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of the data, and that XSLT could transform this model into a Makefile-based workflow. I've already explored this idea in a previous post (Generating a pipeline of analysis (Makefile) for NGS with xslt), and in our lab we use JSON and jsvelocity to generate our workflows (e.g.: https://gist.github.com/lindenb/3c07ca722f793cc5dd60).
In the current post, I'll describe my github repository containing a complete XML+XSLT example: https://github.com/lindenb/ngsxml.

Download the data

git clone "https://github.com/lindenb/ngsxml.git"


The model
The XML model is self-explanatory:
<?xml version="1.0" encoding="UTF-8"?>
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>


Validating the model:
The XML document can be validated against an XML schema:
$ xmllint  --schema xsd/schema01.xsd --noout test/model01.xml
test/model01.xml validates


Generating the Makefile:
The XML document is transformed into a Makefile using the following XSLT stylesheet: https://github.com/lindenb/ngsxml/blob/master/stylesheets/model2make.xsl
xsltproc --output Makefile stylesheets/model2make.xsl test/model01.xml
The Makefile:
# Name
#  myProject
# Description:
#  my project
# 
include config.mk
OUTDIR=OUT
BINDIR=$(abspath ${OUTDIR})/bin


# if tools are undefined
bwa.exe ?=${BINDIR}/bwa-0.7.10/bwa
samtools.exe ?=${BINDIR}/samtools-0.1.19/samtools
bcftools.exe ?=${BINDIR}/samtools-0.1.19/bcftools/bcftools
tabix.exe ?=${BINDIR}/tabix-0.2.6/tabix
bgzip.exe ?=${BINDIR}/tabix-0.2.6/bgzip

.PHONY=all clean all_bams all_vcfs

all: all_vcfs


all_vcfs:  \
 $(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz


all_bams:  \
 $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam \
 $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam

#
# VCF for project 'Proj1'
# 
$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz : $(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam) \
 $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bgzip.exe} ${tabix.exe} ${bcftools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} mpileup -uf ${REFERENCE} $(basename $(filter %.bai,$^)) | \
 ${bcftools.exe} view -vcg - > $(basename $@)  && \
 ${bgzip.exe} -f $(basename $@) && \
 ${tabix.exe} -f -p vcf $@
 
    



#
# index final BAM for Sample 'Sample1'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample1'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@



#
# merge BAMs 
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam :  \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
  ${samtools.exe} merge -f $@ $(filter %.bam,$^)
  



 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools} index $<

#
# Align test/fastq/sample_1_01_R1.fastq.gz and test/fastq/sample_1_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_1_01_R1.fastq.gz  \
 test/fastq/sample_1_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20678948\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 ${samtools} index $<

#
# Align test/fastq/sample_1_02_R1.fastq.gz and test/fastq/sample_1_02_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam : \
 test/fastq/sample_1_02_R1.fastq.gz  \
 test/fastq/sample_1_02_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  ${REFERENCE} \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




#
# index final BAM for Sample 'Sample2'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample2'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools} index $<

#
# Align test/fastq/sample_2_01_R1.fastq.gz and test/fastq/sample_2_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_2_01_R1.fastq.gz  \
 test/fastq/sample_2_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20681172\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




$(addsuffix .fai,${REFERENCE}): ${REFERENCE} ${samtools.exe}
 ${samtools.exe} faidx $<

$(addsuffix .bwt,${REFERENCE}): ${REFERENCE} ${bwa.exe}
 ${bwa.exe} index $<


${BINDIR}/bwa-0.7.10/bwa :
 rm -rf $(BINDIR)/bwa-0.7.10/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/bwa-0.7.10.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/bwa-0.7.10.tar.bz2 && \
 make -C $(dir $@)

${BINDIR}/samtools-0.1.19/bcftools/bcftools: ${BINDIR}/samtools-0.1.19/samtools

${BINDIR}/samtools-0.1.19/samtools  : 
 rm -rf $(BINDIR)/samtools-0.1.19/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/samtools-0.1.19.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/samtools-0.1.19.tar.bz2 && \
 make -C $(dir $@)


${BINDIR}/tabix-0.2.6/bgzip : ${BINDIR}/tabix-0.2.6/tabix


${BINDIR}/tabix-0.2.6/tabix  : 
 rm -rf $(BINDIR)/tabix-0.2.6/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/tabix-0.2.6.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/tabix-0.2.6.tar.bz2 && \
 make -C $(dir $@) tabix bgzip

clean:
 rm -rf ${BINDIR}

Drawing the Workflow:
The workflow is drawn with https://github.com/lindenb/makefile2graph.
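The exact command is not shown here, but with makefile2graph it is typically something along these lines (Graphviz 'dot' renders the final image):

$ make -Bnd | make2graph > workflow.dot
$ dot -Tpng workflow.dot -o workflow.png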


Running Make:
And here is the output of make:
rm -rf /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/
bwa-0.7.10/
bwa-0.7.10/bamlite.c
(...)
gcc -g -Wall -Wno-unused-function -O2 -msse -msse2 -msse3 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o is.o bwtindex.o bwape.o kopen.o pemerge.o bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o bwtsw2_chain.o fastmap.o bwtsw2_pair.o main.o -o bwa -L. -lbwa -lm -lz -lpthread
make[1]: Entering directory `/home/lindenb/src/ngsxml'
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa index test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/
samtools-0.1.19/
samtools-0.1.19/.gitignore
samtools-0.1.19/AUTHORS
(...)
gcc -g -Wall -O2 -o bamcheck bamcheck.o -L.. -lm -lbam -lpthread -lz
make[3]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/misc'
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19'
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11671724\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  test/ref/ref.fa \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
  /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools merge -f OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11673828\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted.bam  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools faidx test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ tabix bgzip
tabix-0.2.6/
tabix-0.2.6/ChangeLog
tabix-0.2.6/Makefile
(...)
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6'
mkdir -p OUT/Projects/Proj1/VCF/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools mpileup -uf test/ref/ref.fa OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam | \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/bcftools/bcftools view -vcg - > OUT/Projects/Proj1/VCF/Proj1.vcf  && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/bgzip -f OUT/Projects/Proj1/VCF/Proj1.vcf && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/tabix -f -p vcf OUT/Projects/Proj1/VCF/Proj1.vcf.gz
make[1]: Leaving directory `/home/lindenb/src/ngsxml'
At the end, a VCF is generated:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr4_gl000194_random 1973 . C G 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,3,0;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2462 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2492 . G C 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2504 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
That's it,

Pierre

30 October 2014

Visualizing @GenomeBrowser liftOver/chain files using animated #SVG

I wrote a tool to visualize some UCSC "chain/liftOver" files as an animated SVG file. This tool is available on github at:

"A liftOver file is a chain file, where for each region in the genome the alignments of the best/longest syntenic regions are used to translate features from one version of a genome to another.".

SVG elements and CSS styles can be animated in an SVG file (see http://www.w3.org/TR/SVG/animate.html) using the <animate/> element.

For example the following SVG snippet

  • defines a rectangle (x=351, y=35, width=6, height=15)
  • at t=60s, the opacity changes from 0 to 0.7 over 2 seconds
  • the position 'x' moves from x=351 to x=350, starting at t=62s, over 16 seconds
  • the position 'y' moves from y=35 to y=36, starting at t=62s, over 16 seconds
  • the 'width' grows from width=6 to width=100, starting at t=62s, over 16 seconds
  • at t=78s, the opacity changes from 0.7 to 0 over 2 seconds
<rect x="351" y="35" width="6" height="15">
        <animate attributeType="CSS" attributeName="opacity" begin="60" dur="2" from="0" to="0.7" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="x" begin="62" dur="16" from="351" to="350" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="y" begin="62" dur="16" from="35" to="36" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="width" begin="62" dur="16" from="6" to="100" repeatCount="1" fill="freeze"/>
        <animate attributeType="CSS" attributeName="opacity" begin="78" dur="2" from="0.7" to="0" repeatCount="1" fill="freeze"/>
</rect>

A demo hg16-hg17-hg18-hg19-hg38 was posted here: http://cardioserve.nantes.inserm.fr/~lindenb/liftover2svg/hg16ToHg38.svg




That's it,

Pierre.

16 October 2014

IGVFox: Integrative Genomics Viewer control through mozilla Firefox

I've just pushed IGVFox 0.1, a Firefox add-on controlling IGV, the Integrative Genomics Viewer.
This add-on allows users to set the genomic position of IGV by simply clicking a hyperlink in an HTML page. The source code is available on github at https://github.com/lindenb/igvfox and a first release is available as a *.xpi file at https://github.com/lindenb/igvfox/releases.
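As a side note, and independently of how the add-on itself is implemented, IGV exposes an HTTP remote-control port (60151 by default), so a locus can also be set from the command line:

# ask a running IGV instance to jump to a locus (IGV's built-in remote-control port)
curl "http://localhost:60151/goto?locus=chr1:10000-20000"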


That's it,

Pierre

30 September 2014

Using the Ensembl Regulatory Build to annotate some VCF files

via UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states".
For example looking at chr21:33037019-33037021 returns the following screen:

Those new annotations are deployed by the Sanger Institute as a UCSC track hub. By the way, those files can be handled directly using the UCSC standalone tools:
$ bigWigSummary -type=mean -udcDir=.  \
  "http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/segmentation_summaries/Segway_17/1.bw" \
  chr1 1  110301 1

1.23587
I wrote a java tool to annotate VCFs with those files. This tool uses the BigWig library for Java (https://code.google.com/p/bigwig/) and is available at https://github.com/lindenb/jvarkit/wiki/VcfEnsemblReg.
Here is an example with the following VCF:
##fileformat=VCFv4.1
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . . GT:PL:DP:GQ 1/1:35,3,0:1:4
VcfEnsemblReg is invoked:
$  java -jar dist/vcfensemblreg.jar in.vcf > out.vcf
Here is the content of out.vcf:
##fileformat=VCFv4.1
##INFO=<ID=AP2ALPHA,Number=1,Type=Float,Description="Overlap summary of AP2ALPHA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2ALPHA.bw">
##INFO=<ID=AP2GAMMA,Number=1,Type=Float,Description="Overlap summary of AP2GAMMA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2GAMMA.bw">
##INFO=<ID=ATF3,Number=1,Type=Float,Description="Overlap summary of ATF3 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/ATF3.bw">
##INFO=<ID=BAF155,Number=1,Type=Float,Description="Overlap summary of BAF155 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF155.bw">
##INFO=<ID=BAF170,Number=1,Type=Float,Description="Overlap summary of BAF170 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF170.bw">
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;Segway_17_14=7.0;Segway_17_24=3.0;Segway_17_6=1.0;Segway_17_7=2.0;Segway_17_8=1.0;Segway_17_A549_projected=ctcf_45704|InactiveRegions;Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;Segway_17_DND41_projected=ctcf_45704|InactiveRegions;Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;Segway_17_GM12878_projected=ctcf_45704|InactiveRegions;Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite;Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF;Segway_17_HELAS3_projected=ctcf_45704|InactiveRegions;Segway_17_HELAS3_segments=6_distal_76536|DistalEnhancer;Segway_17_HEPG2_projected=ctcf_45704|InactiveRegions;Segway_17_HEPG2_segments=1_distal_21535|DistalEnhancer;Segway_17_HMEC_projected=ctcf_45704|InactiveRegions;Segway_17_HMEC_segments=14_gene_44998|TranscriptionAssociated;Segway_17_HSMMT_projected=ctcf_45704|InactiveRegions;Segway_17_HSMMT_segments=24_gene_70780|TranscriptionAssociated;Segway_17_HSMM_projected=ctcf_45704|InactiveRegions;Segway_17_HSMM_segments=24_gene_80902|TranscriptionAssociated;Segway_17_HUVEC_projected=ctcf_45704|InactiveRegions;Segway_17_K562_projected=ctcf_45704|InactiveRegions;Segway_17_K562_segments=14_gene_68692|TranscriptionAssociated;Segway_17_MONO_projected=ctcf_45704|InactiveRegions;Segway_17_MONO_segments=14_gene_35200|TranscriptionAssociated;Segway_17_NHA_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_segments=14_gene_57366|TranscriptionAssociated;Segway_17_NHEK_projected=ctcf_45704|InactiveRegions;Segway_17_NHEK_segments=24_gene_95458|TranscriptionAssociated;Segway_17_NHLF_projected=ctcf_45704|InactiveRegions;Segway_17_NHLF_segments=14_gene_59524|TranscriptionAssociated;Segway_17_OSTEO_projected=ctcf_45704|InactiveRegions;Segway_17_OSTEO_segments=14_gene_61575|TranscriptionAssociated GT:PL:DP:GQ 1/1:35,3,0:1:4
Here are the new fields in the INFO column:
Segway_17_1 3.0
Segway_17_14 7.0
Segway_17_24 3.0
Segway_17_6 1.0
Segway_17_7 2.0
Segway_17_8 1.0
Segway_17_A549_projected ctcf_45704|InactiveRegions
Segway_17_A549_segments 14_gene_79558|TranscriptionAssociated
Segway_17_DND41_projected ctcf_45704|InactiveRegions
Segway_17_DND41_segments 1_distal_17115|DistalEnhancer
Segway_17_GM12878_projected ctcf_45704|InactiveRegions
Segway_17_GM12878_segments 1_distal_29075|DistalEnhancer
Segway_17_H1HESC_projected ctcf_45704|ActiveCTCFBindingSite
Segway_17_H1HESC_segments 8_ctcf_27831|DistalCTF
Segway_17_HELAS3_projected ctcf_45704|InactiveRegions
Segway_17_HELAS3_segments 6_distal_76536|DistalEnhancer
Segway_17_HEPG2_projected ctcf_45704|InactiveRegions
Segway_17_HEPG2_segments 1_distal_21535|DistalEnhancer
Segway_17_HMEC_projected ctcf_45704|InactiveRegions
Segway_17_HMEC_segments 14_gene_44998|TranscriptionAssociated
Segway_17_HSMMT_projected ctcf_45704|InactiveRegions
Segway_17_HSMMT_segments 24_gene_70780|TranscriptionAssociated
Segway_17_HSMM_projected ctcf_45704|InactiveRegions
Segway_17_HSMM_segments 24_gene_80902|TranscriptionAssociated
Segway_17_HUVEC_projected ctcf_45704|InactiveRegions
Segway_17_K562_projected ctcf_45704|InactiveRegions
Segway_17_K562_segments 14_gene_68692|TranscriptionAssociated
Segway_17_MONO_projected ctcf_45704|InactiveRegions
Segway_17_MONO_segments 14_gene_35200|TranscriptionAssociated
Segway_17_NHA_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_segments 14_gene_57366|TranscriptionAssociated
Segway_17_NHEK_projected ctcf_45704|InactiveRegions
Segway_17_NHEK_segments 24_gene_95458|TranscriptionAssociated
Segway_17_NHLF_projected ctcf_45704|InactiveRegions
Segway_17_NHLF_segments 14_gene_59524|TranscriptionAssociated
Segway_17_OSTEO_projected ctcf_45704|InactiveRegions
Segway_17_OSTEO_segments 14_gene_61575|TranscriptionAssociated

OK, now I've got a VCF containing those 'Ensembl Regulatory' annotations. What can I do with it? I currently have no idea :-)

That's it,
Pierre

08 September 2014

It's a kind of magic(5)

The following question was recently asked on biostars.org (Tool that detects data types): "More specifically I am interested for detecting between SAM, BAM, FASTA, FASTQ (if possible descriminating between these types), BED, BED12, BED15, GFF, GFF2, GFF3). The detection should be performed without just looking at the extension of the file...."
I suggested creating a new magic file for the Linux file command. As an example, a BAM file starts (at position 0) with the magic bytes BAM\1. A magic-file definition 'bam' for BAM would be:

0 string BAM\1 BAM file v1.0
Compile the new magic file:
file -C -m bam
Now use this magic:
file -z -m bam.mgc file.bam
file.bam: BAM file v1.0 (data)

Now, I've started a github repo containing some 'magic' patterns for bioinformatics (Fastq, blast, bam, bigwig, etc.):
(My current problem is to prioritize some results, like differentiating a 'Nucleotide' and a 'Protein' Fasta sequence (http://unix.stackexchange.com/questions/154096/file1-and-magic5-prioritizing-a-result).)

That's it,
Pierre

03 September 2014

Parallelizing GNU #Make 4 in a #SLURM infrastructure/cluster

SLURM (https://computing.llnl.gov/linux/slurm/slurm.html) is the Simple Linux Utility for Resource Management: an open-source, fault-tolerant, and highly scalable cluster management and job-scheduling system for large and small Linux clusters.

A patch for GNU Make version 3.81 is available as part of the SLURM distribution at https://github.com/SchedMD/slurm/blob/master/contribs/make.slurm.patch. This patch uses SLURM to launch tasks across a job's current resource allocation.

The patch for Make 'just' wraps each command in srun (srun is the command used to submit a parallel job on a cluster managed by SLURM):

Index: job.c
===================================================================
--- job.c (revision 321)
+++ job.c (working copy)
@@ -1959,6 +1959,22 @@
void
child_execute_job (int stdin_fd, int stdout_fd, char **argv, char **envp)
{
+/* PARALLEL JOB LAUNCH VIA SLURM */
+ if (getenv("SLURM_JOB_ID")) {
+ int i;
+ static char *argx[128];
+ argx[0] = "srun";
+ argx[1] = "-N1";
+ argx[2] = "-n1";
+ for (i=0; ((i<124)&&(argv[i])); i++) {
+ argx[i+3] = argv[i];
+ }
+ if (i<124) {
+ argx[i+3] = NULL;
+ argv = argx;
+ }
+ }
+/* END OF SLURM PATCH */
if (stdin_fd != 0)
(void) dup2 (stdin_fd, 0);
if (stdout_fd != 1)

GNU Make version 4 was recently released. This new version comes with a number of improvements such as GNU Guile integration and loadable objects (see http://plindenbaum.blogspot.fr/2014/08/a-gnu-make-plug-in-for-illumina-fastqs.html). It also lets you specify the default shell to be invoked (see http://plindenbaum.blogspot.fr/2014/01/parallelizing-rstats-using-make.html).

From http://www.gnu.org/software/make/manual/make.html: "The program used as the shell is taken from the variable SHELL. If this variable is not set in your makefile, the program /bin/sh is used as the shell. The argument(s) passed to the shell are taken from the variable .SHELLFLAGS. The default value of .SHELLFLAGS is -c normally, or -ec in POSIX-conforming mode."

So, if you want to parallelize GNU Make with SLURM, you can wrap the shell in srun using SHELL and .SHELLFLAGS. Here is an example that creates and concatenates 100 files, each containing the hostname of the node where it was created:

ifdef SLURM_JOB_ID
SHELL=srun
.SHELLFLAGS= -N1 -n1  bash -c 
endif
NUMBERS=$(shell seq 1 100 )
TARGETS=  $(addsuffix .test,${NUMBERS} )


.PHONY:  all clean

define TEST

$(addsuffix .test,$(1)) : 
        echo -n  $(1) " " > $$@ && hostname >> $$@
        @sleep 5

endef


all: ${TARGETS}
        cat $^

$(foreach N,$(NUMBERS), $(eval $(call TEST,$(N) ) ) )

clean:
        rm -f ${TARGETS}

Now invoke Make with SLURM and the option -j N (allow N jobs at once):

$ make -j 10
echo -n  1  " " > 1.test && hostname >> 1.test
echo -n  2  " " > 2.test && hostname >> 2.test
echo -n  3  " " > 3.test && hostname >> 3.test
echo -n  4  " " > 4.test && hostname >> 4.test
echo -n  5  " " > 5.test && hostname >> 5.test
echo -n  6  " " > 6.test && hostname >> 6.test
echo -n  7  " " > 7.test && hostname >> 7.test
(...)
echo -n  100  " " > 100.test && hostname >> 100.test
cat 1.test 2.test 3.test 4.test 5.test 6.test 7.test 8.test (...)
1  node004
2  node003
3  node001
4  node002
5  node002
6  node001
7  node002
8  node001
9  node001
10  node002
(...)
92  node004
93  node001
94  node001
95  node001
96  node001
97  node002
98  node001
99  node001
100  node001
That's it,
Pierre

08 August 2014

A GNU-make plug-in for the #Illumina FASTQs.

The latest version of GNU-Make http://www.gnu.org/software/make/ provides many advanced capabilities, including many useful functions. However, it does not contain a complete programming language and so it has limitations. Sometimes these limitations can be overcome through use of the shell function to invoke a separate program, although this can be inefficient. On systems which support dynamically loadable objects, you can write your own extension in any language (which can be compiled into such an object) and load it to provide extended capabilities ( see http://www.gnu.org/software/make/manual/make.html#Loading-Objects )

Building a plug-in for the Illumina FASTQs.

from http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz

Here I'm writing a set of new make functions to extract the different parts (sample, lane, ...) of a FASTQ file name:

The code is available on github.com at

First, a struct holding the parts of the file name is created:

enum E_IlluminaComponent
    {
    E_sampleName,
    E_barcodeSequence,
    E_lane,
    E_readNumber,
    E_setNumber
    };

typedef struct illumina_scheme_t
    {
    char* filename;
    char* components[NUM_ILLUMINA_COMPONENTS];
    } IlluminaScheme,*IlluminaSchemePtr ;

and a function parsing the filenames is created:

IlluminaSchemePtr IlluminaSchemeNew(const char* filename)
    {
    ...
    }

When the plugin illumina is loaded as a dynamic C library, the function illumina_gmk_setup is called,
and we tell make about the new functions with gmk_add_function(name, callback, min_args, max_args, no_expand_content):

int illumina_gmk_setup ()
  {
   gmk_add_function ("illumina_sample",illumina_sample, 1, 1, 0);
   gmk_add_function ("illumina_lane",illumina_lane, 1, 1, 0);
   (...)
  }

A function registered with make must match the gmk_func_ptr type.
It will be invoked with three parameters: name (the name of the function), argc (the number of arguments to the function), and argv (an array of pointers to arguments to the function). The last pointer (that is, argv[argc]) will be null (0).
The return value of the function is the result of expanding the function.

char* illumina_sample(const char *function_name, unsigned int argc, char **argv)
    {
    /** extract the filename(s), build and return a string containing the samples */
    }

Compiling

The plugin must be compiled as a dynamic C library.

Note: the manual says the plugin can also be built from the final 'Makefile' itself (via 'load ./illumina.so'), but I was not able to get make to build the missing library (illumina.so: cannot open shared object file: No such file or directory).
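For reference, the pattern described in the GNU make manual would look roughly like this (a sketch only; as noted above, I could not get the automatic build to work):

# '-load' (with a leading dash) means a failed load is not an error;
# if a rule can build the plug-in, make rebuilds it and then re-executes itself with the object loaded
-load ./illumina.so

illumina.so : illumina.c
	gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o $@ $<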

so I compiled it by hand:

gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o illumina.so illumina.c

Test

here is the makefile:

SAMPLES=  NA10831_ATCACG_L002_R1_001.fastq.gz \
      hello \
      NA10832_ATCACG_L002_R1_001.fastq.gz \
      NA10831_ATCACG_L002_R2_001.fastq.gz \
      NA10832_ATCACG_L002_R2_001.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      ERROR_ATCAGG_x003_R3_0z3.fastq.gz \
      false

all:
    @echo "SAMPLES: " $(illumina_sample  ${SAMPLES} )
    @echo "BARCODES: " $(illumina_barcode  ${SAMPLES} )
    @echo "LANE: " $(illumina_lane  ${SAMPLES} )
    @echo "READ: " $(illumina_read  ${SAMPLES} )
    @echo "SET: " $(illumina_set  ${SAMPLES} )

output:

$ make
SAMPLES:  NA10831 NA10832 NA10833
BARCODES:  ATCACG ATCAGG
LANE:  L002 L003
READ:  R1 R2
SET:  001 003

That's it,

Pierre

30 July 2014

writing #rstats bindings for bwa-mem, my notebook.

I wanted to learn how to bind a C library to R, so I've created the following bindings for BWA. My code is available on github at:

Most of the C code was inspired by Heng Li's code https://github.com/lh3/bwa/blob/master/example.c.

A short description of the C code

In https://github.com/lindenb/rbwa/blob/master/rbwa.c:
struct RBwaHandler
This structure holds a pointer to the bwa-index (bwaidx_t) and to the options for bwa (mem_opt_t).
RBwaOpen(filename)
This method opens the bwa index, wraps the pointer in an R structure using R_MakeExternalPtr and registers a destructor '_RBwaFinalizer' to be called by the garbage collector.
_RBwaFinalizer(handler)
This is the destructor called by the garbage collector. It calls 'RBwaClose'.
RBwaClose(handler)
retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', disposes of the resources, and frees the RBwaHandler using 'R_ClearExternalPtr'.
RBwaMap(handler,DNAsequence)
This is the workhorse of the code: it retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', creates a "data.frame" with 6 columns (chromosome, position, strand, mapq, NM, secondary), maps the DNA sequence by calling bwa::mem_align1 and inserts the hits into the data.frame.

The R code

See https://github.com/lindenb/rbwa/blob/master/rbwa.R. This R code loads the dynamic C library and declares the R functions that call the C functions above using .Call:
  • bwa.open(filename)
  • bwa.close(bwt)
  • bwa.map(bwt,dnasequence)

Example

source("rbwa.R")
bwt <- bwa.open("test_files/chrM.fa")

for(s in c(
 "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
 "TACTAAACCC",
 "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
 ))
 {
 print(paste("QUERY:",s));
 hits<-bwa.map(bwt,s)
 print(hits)
 }

bwa.close(bwt);
here is the R output:
> source("rbwa.R")
> 
> bwt <- bwa.open("test_files/chrM.fa")
> 
> for(s in c(
+   "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
+   "TACTAAACCC",
+   "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
+   ))
+  {
+  print(paste("QUERY:",s));
+  hits<-bwa.map(bwt,s)
+  print(hits)
+  }
[1] "QUERY: GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  0         0
[1] "QUERY: GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  2         0
[1] "QUERY: CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT"
  chrom  pos strand mapq NM secondary
1  chrM 3100      0   60  4         0
[1] "QUERY: TACTAAACCC"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
[1] "QUERY: GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
> 
> bwa.close(bwt);
[1] TRUE

That's it,

Pierre

29 July 2014

Including the hash for the current git-commit in a C program

Say you wrote the following simple C program:
It includes a file "githash.h" that would contain the hash for the current commit in Git:



Because you're working with a Makefile, the file "githash.h" is generated by invoking 'git rev-parse HEAD':
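A minimal sketch of such a rule (the macro name GIT_HASH and the program name are illustrative, not the original snippet):

# (re)generate the header from the current commit
githash.h :
	echo "#define GIT_HASH \"$$(git rev-parse HEAD)\"" > $@

# the program depends on the header, so the header is created before compilation
prog : prog.c githash.h
	gcc -Wall -o $@ prog.c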



the file "githash.h" loooks like this:




But wait, it's not that simple: once the file 'githash.h' has been created, it won't be regenerated by Make because it already exists. This file should be removed each time 'git commit' is invoked. We can do this by creating a POST-COMMIT git hook: a bash script named ".git/hooks/post-commit" that removes 'githash.h':
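A minimal version of the hook is just:

#!/bin/bash
# .git/hooks/post-commit : remove the header so that Make regenerates it at the next build
rm -f githash.h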



Don't forget to make it executable: `chmod +x .git/hooks/post-commit`

Now, each time 'git commit' is called, the file githash.h for the previous git-commit will be deleted!


That's it,

Pierre








05 July 2014

Pushed: makefile2graph, creating a graph of dependencies from GNU-Make.

I pushed makefile2graph at https://github.com/lindenb/makefile2graph. This is a standalone 'C' implementation of a program I first wrote in Java in 2012. The program creates a graph of dependencies from GNU-Make and its output is a graphviz-dot file or a Gexf/Gephi XML file.

Usage

$ make -Bnd | make2graph > output.dot
$ make -Bnd | make2graph -x > gephi.gexf.xml 
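The resulting dot file can then be rendered with Graphviz, for example:

$ dot -Tpng output.dot -o output.png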

Example

Here is the output of makefile2graph for Tabix:
$ cd tabix-0.2.5
$ make -Bnd |make2graph
digraph G {
n1[label="", color="green"];
n2[label="Makefile", color="green"];
n4[label="all", color="red"];
n3[label="all-recur", color="red"];
n23[label="bedidx.c", color="green"];
n22[label="bedidx.o", color="red"];
n9[label="bgzf.c", color="green"];
n10[label="bgzf.h", color="green"];
n8[label="bgzf.o", color="red"];
n27[label="bgzip", color="red"];
n29[label="bgzip.c", color="green"];
n28[label="bgzip.o", color="red"];
n18[label="index.c", color="green"];
n17[label="index.o", color="red"];
n20[label="khash.h", color="green"];
n16[label="knetfile.c", color="green"];
n11[label="knetfile.h", color="green"];
n15[label="knetfile.o", color="red"];
n24[label="kseq.h", color="green"];
n21[label="ksort.h", color="green"];
n13[label="kstring.c", color="green"];
n14[label="kstring.h", color="green"];
n12[label="kstring.o", color="red"];
n6[label="lib", color="red"];
n7[label="libtabix.a", color="red"];
n26[label="main.c", color="green"];
n25[label="main.o", color="red"];
n5[label="tabix", color="red"];
n19[label="tabix.h", color="green"];
n2 -> n1 ; 
n4 -> n1 ; 
n3 -> n1 ; 
(..)
}

That's it
Pierre

02 July 2014

Pushed today: verticalize, an everyday Linux command to verticalize tab-delimited files

FYI: this morning I pushed verticalize on github. Verticalize is a simple everyday command to 'verticalize' delimited files: https://github.com/lindenb/verticalize.

$ curl "https://raw.githubusercontent.com/ekg/vcflib/master/samples/sample.vcf" |\
  grep -vE "^##" |\
  verticalize

>>> 2
$1    #CHROM : 19
$2       POS : 111
$3        ID : .
$4       REF : A
$5       ALT : C
$6      QUAL : 9.6
$7    FILTER : .
$8      INFO : .
$9    FORMAT : GT:HQ
$10  NA00001 : 0|0:10,10
$11  NA00002 : 0|0:10,10
$12  NA00003 : 0/1:3,3
<<< 2

>>> 3
$1    #CHROM : 19
$2       POS : 112
$3        ID : .
$4       REF : A
$5       ALT : G
$6      QUAL : 10
$7    FILTER : .
$8      INFO : .
$9    FORMAT : GT:HQ
$10  NA00001 : 0|0:10,10
$11  NA00002 : 0|0:10,10
$12  NA00003 : 0/1:3,3
<<< 3




That's it,
Pierre

28 May 2014

Uniprot → SVG

My colleague @Solena recently asked me to write a tool that would help her prepare a large number of figures for an article. The tool I wrote fetches the entry for a given Uniprot accession and creates an SVG diagram, editable in Inkscape. It is available at:


That's it,

Pierre

22 May 2014

Breaking the " same origin security policy" with CORS. An example with @GenomeBrowser / DAS.

Jerven Bolleman recently taught me about CORS (Cross-Origin Resource Sharing):



"Cross-origin resource sharing (CORS) is a mechanism that allows many resources (e.g. fonts, JavaScript, etc.) on a web page to be requested from another domain outside of the domain the resource originated from. In particular, JavaScript's AJAX calls can use the XMLHttpRequest mechanism. Such "cross-domain" requests would otherwise be forbidden by web browsers, per the same origin security policy."

I've created a page testing whether some bioinformatics web services support CORS. This page is available at: http://lindenb.github.io/pages/cors/index.html

Interestingly, NCBI, Uniprot and UCSC support CORS. As an example, the following <form> fetches a DNA sequence using the DAS server of the UCSC and displays it:
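As a rough command-line illustration of the same idea (not the original form/script), the CORS headers returned by the UCSC DAS server can be inspected with curl:

# send an Origin header and look for Access-Control-* in the response headers
curl -s -D - -o /dev/null \
  -H 'Origin: http://example.org' \
  'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,100100' | \
  grep -i 'access-control'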



The script:



That's it
Pierre

20 May 2014

A nodejs-based REST server for the UCSC @GenomeBrowser



Node.js provides a simple mechanism to write a REST server. As an exercise, I wrote a REST server for the MySQL server of the UCSC genome browser. The code is available on github at:


Starting the server


$ cd bionode
$ node ucsc/ucsc.js
Server running at http://localhost:8080/
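The endpoints described below can be queried with any HTTP client, for example from the shell:

$ curl -s 'http://localhost:8080/schemas/databases'
$ curl -s 'http://localhost:8080/ucsc/anoCar1/ensGene/name/ENSACAT00000004346'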

METHOD: /schemas/databases



Lists the available databases, e.g. http://localhost:8080/schemas/databases


[
"information_schema",
"ailMel1",
"allMis1",
"anoCar1",

(...)
"visiGene",
"xenTro1",
"xenTro2",
"xenTro3"
]


This method accepts a parameter callback for JSON-P : e.g: http://localhost:8080/schemas/databases?callback=handle


handle([
"information_schema",
"ailMel1",
"allMis1",
"anoCar1",
(...)
"visiGene",
"xenTro1",
"xenTro2",
"xenTro3"
]);


METHOD: /schemas/:database/tables


Lists the available tables for a given database, e.g. http://localhost:8080/schemas/anoCar1/tables


[
"all_mrna",
"author",
"blastHg18KG",
"cds",
(...)
"xenoRefFlat",
"xenoRefGene",
"xenoRefSeqAli"
]

This method accepts a parameter callback for JSON-P : e.g: http://localhost:8080/schemas/anoCar1/tables?callback=handle


handle([
"all_mrna",
"author",
"blastHg18KG",
"cds",
"cell",
(...)
"xenoRefFlat",
"xenoRefGene",
"xenoRefSeqAli"
]);

METHOD: /schemas/:database/:table


Returns a schema for the given database.table. E.g: http://localhost:8080/schemas/anoCar1/xenoMrna



{"database":"anoCar1","table":"xenoMrna","fields":[{"name":"bin","type":"smallint(5) unsigned","key":""},{"name":"matches","type":"int(10) unsigned","key":""},{"name":"misMatches","type":"int(10) unsigned","key":""},{"name":"repMatches","type":"int(10) unsigned","key":""},{"name":"nCount","type":"int(10) unsigned","key":""},{"name":"qNumInsert","type":"int(10) unsigned","key":""},{"name":"qBaseInsert","type":"int(10) unsigned","key":""},{"name":"tNumInsert","type":"int(10) unsigned","key":""},{"name":"tBaseInsert","type":"int(10) unsigned","key":""},{"name":"strand","type":"char(2)","key":""},{"name":"qName","type":"varchar(255)","key":"MUL"},{"name":"qSize","type":"int(10) unsigned","key":""},{"name":"qStart","type":"int(10) unsigned","key":""},{"name":"qEnd","type":"int(10) unsigned","key":""},{"name":"tName","type":"varchar(255)","key":"MUL"},{"name":"tSize","type":"int(10) unsigned","key":""},{"name":"tStart","type":"int(10) unsigned","key":""},{"name":"tEnd","type":"int(10) unsigned","key":""},{"name":"blockCount","type":"int(10) unsigned","key":""},{"name":"blockSizes","type":"longblob","key":""},{"name":"qStarts","type":"longblob","key":""},{"name":"tStarts","type":"longblob","key":""}]}


This method accepts a parameter callback for JSON-P : e.g: http://localhost:8080/schemas/anoCar1/xenoMrna?callback=handler


handler({"database":"anoCar1","table":"xenoMrna","fields":[{"name":"bin","type":"smallint(5) unsigned","key":""},{"name":"matches","type":"int(10) unsigned","key":""},{"name":"misMatches","type":"int(10) unsigned","key":""},{"name":"repMatches","type":"int(10) unsigned","key":""},{"name":"nCount","type":"int(10) unsigned","key":""},{"name":"qNumInsert","type":"int(10) unsigned","key":""},{"name":"qBaseInsert","type":"int(10) unsigned","key":""},{"name":"tNumInsert","type":"int(10) unsigned","key":""},{"name":"tBaseInsert","type":"int(10) unsigned","key":""},{"name":"strand","type":"char(2)","key":""},{"name":"qName","type":"varchar(255)","key":"MUL"},{"name":"qSize","type":"int(10) unsigned","key":""},{"name":"qStart","type":"int(10) unsigned","key":""},{"name":"qEnd","type":"int(10) unsigned","key":""},{"name":"tName","type":"varchar(255)","key":"MUL"},{"name":"tSize","type":"int(10) unsigned","key":""},{"name":"tStart","type":"int(10) unsigned","key":""},{"name":"tEnd","type":"int(10) unsigned","key":""},{"name":"blockCount","type":"int(10) unsigned","key":""},{"name":"blockSizes","type":"longblob","key":""},{"name":"qStarts","type":"longblob","key":""},{"name":"tStarts","type":"longblob","key":""}]});


METHOD: /ucsc/:database/:table/:column/:key


Fetch the rows of a given database.table having :column == :key. The :column must be indexed. E.g: http://localhost:8080/ucsc/anoCar1/ensGene/name/ENSACAT00000004346


[
{"bin":592,"name":"ENSACAT00000004346","chrom":"scaffold_111","strand":"-","txStart":991522,"txEnd":996396,"cdsStart":991522,"cdsEnd":996396,"exonCount":3,"exonStarts":"991522,995669,995976,","exonEnds":"991954,995972,996396,","score":0,"name2":"PELO","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,0,0,"}
]

This method accepts a parameter callback for JSON-P : e.g: http://localhost:8080/ucsc/anoCar1/ensGene/name/ENSACAT00000004346?callback=handler


handler([
{"bin":592,"name":"ENSACAT00000004346","chrom":"scaffold_111","strand":"-","txStart":991522,"txEnd":996396,"cdsStart":991522,"cdsEnd":996396,"exonCount":3,"exonStarts":"991522,995669,995976,","exonEnds":"991954,995972,996396,","score":0,"name2":"PELO","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,0,0,"}
]);

METHOD: /ucsc/:database/:table?chrom=?&start=?&end=?




Fetch the rows of a given genomic database.table overlapping the given range. This method uses the UCSC bin index if it is available. E.g: http://localhost:8080/ucsc/anoCar1/ensGene?chrom=scaffold_111&start=600000&end=900000


[
{"bin":589,"name":"ENSACAT00000003906","chrom":"scaffold_111","strand":"-","txStart":594783,"txEnd":614216,"cdsStart":595000,"cdsEnd":614201,"exonCount":9,"exonStarts":"594783,601291,601744,603640,604745,604865,609139,611740,614097,","exonEnds":"595105,601406,601813,603736,604771,604942,609173,611840,614216,","score":0,"name2":"DPM1","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,2,2,2,0,1,0,2,0,"},
{"bin":589,"name":"ENSACAT00000003908","chrom":"scaffold_111","strand":"+","txStart":614382,"txEnd":615600,"cdsStart":614382,"cdsEnd":615600,"exonCount":1,"exonStarts":"614382,","exonEnds":"615600,","score":0,"name2":"MOCS3","cdsStartStat":"incmpl","cdsEndStat":"cmpl","exonFrames":"0,"},
{"bin":589,"name":"ENSACAT00000003918","chrom":"scaffold_111","strand":"-","txStart":638920,"txEnd":642127,"cdsStart":638920,"cdsEnd":642127,"exonCount":2,"exonStarts":"638920,641368,","exonEnds":"639691,642127,","score":0,"name2":"KCNG1","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,0,"},
{"bin":591,"name":"ENSACAT00000003920","chrom":"scaffold_111","strand":"+","txStart":814576,"txEnd":826972,"cdsStart":814576,"cdsEnd":826972,"exonCount":3,"exonStarts":"814576,825125,826845,","exonEnds":"814594,825247,826972,","score":0,"name2":"ENSACAG00000003945","cdsStartStat":"incmpl","cdsEndStat":"cmpl","exonFrames":"0,0,2,"},
{"bin":591,"name":"ENSACAT00000004042","chrom":"scaffold_111","strand":"-","txStart":849731,"txEnd":881887,"cdsStart":849731,"cdsEnd":881887,"exonCount":24,"exonStarts":"849731,851343,855421,856165,857842,858090,861054,861943,862949,863773,865029,865639,867414,868216,872220,873601,874396,876850,877105,877711,878919,879681,881320,881738,","exonEnds":"849809,851460,855511,856279,857947,858201,861157,862027,863026,863866,865171,865722,867525,868368,872360,873738,874600,876994,877263,877850,878993,879847,881471,881887,","score":0,"name2":"ITGA2","cdsStartStat":"incmpl","cdsEndStat":"incmpl","exonFrames":"0,0,0,0,0,0,2,2,0,0,2,0,0,1,2,0,0,0,1,0,1,0,2,0,"},
{"bin":591,"name":"ENSACAT00000004050","chrom":"scaffold_111","strand":"-","txStart":883724,"txEnd":897808,"cdsStart":883724,"cdsEnd":897808,"exonCount":5,"exonStarts":"883724,885433,889264,889742,897701,","exonEnds":"883858,885548,889356,889852,897808,","score":0,"name2":"ENSACAG00000004086","cdsStartStat":"incmpl","cdsEndStat":"incmpl","exonFrames":"1,0,1,2,0,"}
]

This method accepts a parameter callback for JSON-P : e.g: http://localhost:8080/ucsc/anoCar1/ensGene?chrom=scaffold_111&start=600000&end=900000&callback=handler


handler([
{"bin":589,"name":"ENSACAT00000003906","chrom":"scaffold_111","strand":"-","txStart":594783,"txEnd":614216,"cdsStart":595000,"cdsEnd":614201,"exonCount":9,"exonStarts":"594783,601291,601744,603640,604745,604865,609139,611740,614097,","exonEnds":"595105,601406,601813,603736,604771,604942,609173,611840,614216,","score":0,"name2":"DPM1","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,2,2,2,0,1,0,2,0,"},
{"bin":589,"name":"ENSACAT00000003908","chrom":"scaffold_111","strand":"+","txStart":614382,"txEnd":615600,"cdsStart":614382,"cdsEnd":615600,"exonCount":1,"exonStarts":"614382,","exonEnds":"615600,","score":0,"name2":"MOCS3","cdsStartStat":"incmpl","cdsEndStat":"cmpl","exonFrames":"0,"},
{"bin":589,"name":"ENSACAT00000003918","chrom":"scaffold_111","strand":"-","txStart":638920,"txEnd":642127,"cdsStart":638920,"cdsEnd":642127,"exonCount":2,"exonStarts":"638920,641368,","exonEnds":"639691,642127,","score":0,"name2":"KCNG1","cdsStartStat":"cmpl","cdsEndStat":"cmpl","exonFrames":"0,0,"},
{"bin":591,"name":"ENSACAT00000003920","chrom":"scaffold_111","strand":"+","txStart":814576,"txEnd":826972,"cdsStart":814576,"cdsEnd":826972,"exonCount":3,"exonStarts":"814576,825125,826845,","exonEnds":"814594,825247,826972,","score":0,"name2":"ENSACAG00000003945","cdsStartStat":"incmpl","cdsEndStat":"cmpl","exonFrames":"0,0,2,"},
{"bin":591,"name":"ENSACAT00000004042","chrom":"scaffold_111","strand":"-","txStart":849731,"txEnd":881887,"cdsStart":849731,"cdsEnd":881887,"exonCount":24,"exonStarts":"849731,851343,855421,856165,857842,858090,861054,861943,862949,863773,865029,865639,867414,868216,872220,873601,874396,876850,877105,877711,878919,879681,881320,881738,","exonEnds":"849809,851460,855511,856279,857947,858201,861157,862027,863026,863866,865171,865722,867525,868368,872360,873738,874600,876994,877263,877850,878993,879847,881471,881887,","score":0,"name2":"ITGA2","cdsStartStat":"incmpl","cdsEndStat":"incmpl","exonFrames":"0,0,0,0,0,0,2,2,0,0,2,0,0,1,2,0,0,0,1,0,1,0,2,0,"},
{"bin":591,"name":"ENSACAT00000004050","chrom":"scaffold_111","strand":"-","txStart":883724,"txEnd":897808,"cdsStart":883724,"cdsEnd":897808,"exonCount":5,"exonStarts":"883724,885433,889264,889742,897701,","exonEnds":"883858,885548,889356,889852,897808,","score":0,"name2":"ENSACAG00000004086","cdsStartStat":"incmpl","cdsEndStat":"incmpl","exonFrames":"1,0,1,2,0,"}
]);


That's it,

Pierre