07 January 2011

SVG image in wikipedia + Zoom.it =a Zoomable "Tree of life"

Zoom.it is a free service for viewing and sharing high-resolution imagery. You give us the link to any image ( including SVG, pdfs) on the web along with a nice short URL.. As a test, I used the SVG file "Tree of life with genome size.svg" on wikipedia and here is the awesome result generated by Zoom.it:



That's it,
Piere

05 January 2011

Template:Infobox biodatabase

I've just started creating a wikipedia infobox to annotate the biological databases in wikipedia. If many articles use this template, then it will be possible to parse the them and to create a list of the databases providing some web services, some SPARQL endpoints, having a download area etc...
The infobox itself is still a draft, so feel free to modify it or to suggest some other fields in the 'Talk' page.



that's it,

Pierre

Coding a CXF web service translating a DNA to a protein. My notebook

Apache CXF is a Web Services framework. In this post, I'll will describe how I implemented a Web Service translating a DNA to a protein using the web server Apache Tomcat and the CXF libraries.

Defining the interface

First a simple java interface bio.Translate is needed to describe the service. This simple service receives a string (the dna) and returns a string (the peptide). The annotations will be used by CXF to name the parameters in the WSDL file (see later):

Implementing the service

bio.TranslateImpl implements bio.Translate. The setter/getter for ncbiString will be used by a configuration file to specify a genetic code (standard, mitochondrial) for this service. The methods initIt and cleanUp could be used to acquire and to release some resources for the service when it is created and/or disposed.

Configuring the service

CXF uses the libraries of the Spring framework (I blogged about spring here ). A XML config file beans.xml makes it easy to configure two java beans for the 'standard genetic code' and the 'mitochondrial code'. In this config file, we also tell Spring about the two methods initIt and cleanUp. Those two beans will be used by two Web Services

Defining the CXF application for Tomcat

The following web.xml file only tells tomcat, the web server, to use the CXFServlet to listen to the SOAP queries.

Compile & Deploy

Installing a CXF web service requires many libraries and at the end, the size of the deployed 'war' file was 8.5Mo(!). Currently, my structure for the current project is:
./translate/WEB-INF/classes/bio/TranslateImpl.java
./translate/WEB-INF/classes/bio/Translate.java
./translate/WEB-INF/beans.xml
./translate/WEB-INF/web.xml
The service was compiled and deployed using the following Makefile:
cxf.lib=apache-cxf-2.3.1/lib
all:
mkdir -p translate/WEB-INF/lib
javac -d translate/WEB-INF/classes -sourcepath translate/WEB-INF/classes translate/WEB-INF/classes/bio/TranslateImpl.java
cp ${cxf.lib}/cxf-2.3.1.jar \
${cxf.lib}/geronimo-activation_1.1_spec-1.1.jar \
${cxf.lib}/geronimo-annotation_1.0_spec-1.1.1.jar \
${cxf.lib}/geronimo-javamail_1.4_spec-1.7.1.jar \
${cxf.lib}/geronimo-servlet_3.0_spec-1.0.jar \
${cxf.lib}/geronimo-ws-metadata_2.0_spec-1.1.3.jar \
${cxf.lib}/geronimo-jaxws_2.2_spec-1.0.jar \
${cxf.lib}/geronimo-stax-api_1.0_spec-1.0.1.jar \
${cxf.lib}/jaxb-api-2.2.1.jar \
${cxf.lib}/jaxb-impl-2.2.1.1.jar \
${cxf.lib}/neethi-2.0.4.jar \
${cxf.lib}/saaj-api-1.3.jar \
${cxf.lib}/saaj-impl-1.3.2.jar \
${cxf.lib}/wsdl4j-1.6.2.jar \
${cxf.lib}/XmlSchema-1.4.7.jar \
${cxf.lib}/xml-resolver-1.2.jar \
${cxf.lib}/aopalliance-1.0.jar \
${cxf.lib}/spring-core-3.0.5.RELEASE.jar \
${cxf.lib}/spring-beans-3.0.5.RELEASE.jar \
${cxf.lib}/spring-context-3.0.5.RELEASE.jar \
${cxf.lib}/spring-web-3.0.5.RELEASE.jar \
${cxf.lib}/commons-logging-1.1.1.jar \
${cxf.lib}/spring-asm-3.0.5.RELEASE.jar \
${cxf.lib}/spring-expression-3.0.5.RELEASE.jar \
${cxf.lib}/spring-aop-3.0.5.RELEASE.jar \
translate/WEB-INF/lib
jar cvf translate.war -C translate .
mv translate.war path-to-tomcat/webapps

Checking the URL

We can see that the service was correctly deployed by pointing a web browser at http://localhost:8080/translate/, where we can see the two services:
Available SOAP services:
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateMit
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateStd
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/

Here, the URLs link to the WSDL definition for the web service:

Creating a client

For creating a client consuming this service, I first used the code generated by CXF's wsdl2java but there was a bug with one of the generated classe (it is a known bug feature) so here, I'm going to use the standard ${JAVA_HOME}/bin/wsimport.
> wsimport -p generated -d client -keep "http://localhost:8080/translate/translateStd?wsdl"
parsing WSDL...
generating code...
compiling code...
I wrote a java client MyClient.java using this generated API:

Compiling

> cd client
> javac MyClient.java

Running

> java MyClient
EFIDHSIAC*


That's it,

Pierre

Don't mask this sequence, please.

I recently asked on Biostar if it would be worth to mask the non-genic sequence before aligning the short reads on the reference after an exome sequencing. Although I was convinced by the answer of lh3, I was curious to observe the difference with some real data.

I've downloaded two fastqs files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA20772/sequence_read/ERR004053_*.recal.fastq.gz and the sequence for the human chromosome chr1 from the UCSC.

One copy of chr1.fa was masked using the UCSC knownGene table +/- 10kb using a custom software. The bases were replaced by a 'N' if they were not contained in a genic region+/-10kb.

  • Number of 'N' in chr1 without masking:22,250,000
  • Number of 'N' in chr1 with masking:108,399,153


Then, the two fastq were aligned on each chr1 (masked/not masked) and the mutations were called with 'samtools pileup':
bwa-0.5.9rc1/bwa index chr1.fa
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_1.recal.fastq.gz > aln1.sai
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_2.recal.fastq.gz > aln2.sai
bwa-0.5.9rc1/bwa sampe chr1.fa aln1.sai aln2.sai ERR004053_1.recal.fastq.gz ERR004053_2.recal.fastq.gz > file.sam
samtools-0.1.10/samtools faidx chr1.fa
samtools-0.1.10/samtools view -b -t chr1.fa.fai file.sam > file.bam
samtools-0.1.10/samtools sort file.bam sorted
samtools-0.1.10/samtools index sorted.bam
samtools-0.1.10/samtools pileup -vcf chr1.fa sorted.bam |\
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' |\
cut -d ' ' -f 1-4 |\
sort | uniq > pileup.txt


At the end:
  • Number of mutations (no masking): 24921

  • Number of mutations (masking): 26062

  • Number of mutations common in 'masking'/'no masking': 13573

  • Number of mutations unique in 'no masking': 12489

  • Number of mutations unique in 'masking': 11348


'chr1:100005960 c/A': a mutation from the 'masked' sequence but not found in 'not-masked':

chr1 masked

  100005921 100005931 100005941  100005951 100005961 100005971
tgctaattggtcagattggagatggaatca*tggggggtcgacgtgaggttttcttgctgtcttct
....G.........G............... MM.......R.A....RK.................
.. ,,,,,,,,,,,,,,cac,,,,,,,,,a,,,,a,,, ,,,,,,,,,,,,,
.... ..........*CA.......A.A.......N....... ,,,,,,
.. ..........*CA.......A.A............... ,,,,
G...G..G.. ..........*CA.......A.A...............
....G.........G..T.. ...........T.................
....G....... .....A.....T.................
,,,,g,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,, ..............G........

chr1 NOT masked

  100005931 100005941 100005951 100005961 100005971 100005981
tcagattggagatggaatcatggggggtcgacgtgaggttttcttgctgtcttctgttcctgggtg
..........CA.......R.A.....K............................
..........CA.......A.A.......N.......
...........T.........................
.....A.....T.........................
.A..................................
..............G...................
in this case, it is visiblethat the reads have been more correctly aligned on the non-masked sequence.


That's it

Pierre

01 January 2011

My tool to annotate VCF files.

The Variant Call Format is a text file format generated by many tools for NGS. It contains meta-information lines, a header line, and then data lines describing how the mutations were called. I don't like this format because it cannot be used to store some hierachical annotations (like json or xml), nevertheless it is a de facto standard.

I wrote a tool called vcfannotator to append a set of annotations from the UCSC database to a VCF file. As I wanted to keep this tool simple and without any dependencies, it only uses the flat files available from the download area at the UCSC.

This tools appends several informations:

  • A prediction of the mutation: is it in the cDNA ? in an intron ? in an exon ? is it a non-synonymous mutation ? was a stop codon lost or gained ? is there any consequence on the splicing process ? Here, the UCSC DAS server and the KnonwGenes table are used to retrieve the genomic DNA and the structure of the gene.
  • dbSNP: was this mutation found before in dbSNP ?
  • Personal genomes: was this mutation found before in Venter's genome ? In Watson's genome ? etc...
  • The mapability (Uniqueness of Reference Genome ): gives an idea about how the context is unique in the genome
  • genomicSuperDups: is this mutation located in a segmental genomic duplication ?
  • tfbsConsSites: is this mutation located in the site of a transcription factor ?
  • phastCons44way: how is the mutation conserved within a multiple alignments of 44 vertebrate genomes to the human genome ?


The tool is available on github at: https://github.com/lindenb/jsandbox in jsandbox/src/sandbox/VCFAnnotator.java.

Compilation

> cd jsandbox
> ant vcfannotator

Execution & Options

> java -jar dist/vcfannotator.jar -h
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>

Example

Input

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)

Command:

java -jar dist/vcfannotator.jar -pg -m -g -p -c -t -snp snp130 -log ALL input.vcf
## wait !

Result

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
Interestingly, for this last mutation chr1:113068276, there are two transcripts at the same position with two different translation frames, so there are two predictions: one synonymous mutation and one non-synonymous mutation.


That's it,

Pierre