23 February 2011

Creating a custom shape for DIA: my notebook.


In the this post, I'll describe how to create a custom shape for DIA, a diagram creation program.
  • The logo of the NCBI (149x183) was downloaded from commons.wikimedia.org.
  • Inkscape was used to transform this picture to a vectorial drawing (SVG).
  • In ${HOME}/.dia/sheet, I wrote the following sheet describing the package Bio and containing one object named "Bio - NCBI"
    <?xml version="1.0" encoding="iso-8859-1"?>
    <sheet xmlns="http://www.lysator.liu.se/~alla/dia/dia-sheet-ns">
    <name>Bio</name>
    <description>Bio</description>
    <contents>
    <object name="Bio - NCBI">
    <description>Shape</description>
    </object>
    </contents>
    </sheet>

  • I moved the icon ncbi.png to ${HOME}/.dia/shapes/ncbi.png,
  • The object "Bio - NCBI" was defined in ${HOME}/.dia/shapes/ncbi.shape. This object contains the SVG drawing and defines 8 points for connecting the objects.
    <?xml version="1.0" encoding="UTF-8" ?>
    <shape xmlns="http://www.daa.com.au/~james/dia-shape-ns" xmlns:svg="http://www.w3.org/2000/svg">
    <name>Bio - NCBI</name>
    <description>NCBI</description>
    <icon>ncbi.png</icon>
    <connections>
    <point x="0" y="0"/>
    <point x="149" y="0"/>
    <point x="149" y="183"/>
    <point x="0" y="183"/>
    <point x="0" y="91"/>
    <point x="149" y="91"/>
    <point x="74" y="0"/>
    <point x="74" y="183"/>
    </connections>
    <aspectratio type="fixed"/>
    <textbox x1="0" y1="0" x2="149" y2="183"/>
    <svg:svg version="1.0" width="149" height="183" style="stroke-width:0.1;">
    <svg:rect x="0" y="0" width="149" height="183" fill="#326598" stroke="black"/>
    <svg:path fill-rule="nonzero"
    d="M 32.313683,145 L 30.030665,145 L 30.545674,138.75 C 30.828929,135.3125 31.265318,130.22805 31.515428,127.45121 L 31.970174,122.40243 L 30.428383,120.20121 L 28.886592,118 L 31.717493,118 L 34.548394,118 L 38.811071,123.75 C 41.155544,126.9125 45.126964,131.75 47.636449,134.5 L 52.19915,139.5 L 51.763672,128.75 L 51.328193,118 L 53.08395,118 L 54.839708,118 L 54.291101,127.75 C 53.989367,133.1125 53.450636,139.29908 53.093921,141.49795 L 52.445348,145.4959 L 42.972674,134.37203 L 33.5,123.24817 L 33.208078,125.37408 C 33.047521,126.54334 33.294279,131.4375 33.756429,136.25 L 34.596701,145 L 32.313683,145 z M 74.569665,144.99626 L 70.5,144.99252 L 67.366714,143.08194 C 65.643407,142.03112 63.505907,140.13267 62.616714,138.86317 L 61,136.55499 L 61,131.5 L 61,126.44501 L 62.704109,123.97251 C 63.641368,122.61263 65.103868,120.9448 65.954109,120.26622 C 66.804349,119.58765 69.23818,118.54415 71.362623,117.94735 L 75.225246,116.86226 L 78.931032,117.48835 L 82.636819,118.11444 L 82.196448,121.11487 L 81.756078,124.1153 L 79.839084,122.05765 L 77.92209,120 L 73.017749,120 L 68.113408,120 L 66.556704,122.22251 L 65,124.44501 L 65,130.08614 L 65,135.72727 L 67.636364,138.36364 L 70.272727,141 L 74.702105,141 L 79.131483,141 L 81.065741,139.96482 C 82.129584,139.39547 83,139.16704 83,139.45721 C 83,139.74738 82.018849,141.11321 80.819665,142.49239 L 78.639329,145 L 74.569665,144.99626 z M 94.5,144.99626 L 88.5,145 L 88.5,131.60088 L 88.5,118.20176 L 94.125,117.56537 L 99.75,116.92897 L 102.29025,118.08638 L 104.83049,119.2438 L 105.3639,121.28357 L 105.89732,123.32335 L 105.00326,124.99392 C 104.51152,125.91273 103.39759,127.25506 102.52786,127.97688 L 100.94653,129.28927 L 104.00685,131.15036 L 107.06716,133.01145 L 107.64143,134.50796 L 108.21569,136.00447 L 107.58101,138.53326 L 106.94633,141.06204 L 103.72316,143.02728 L 100.5,144.99252 L 94.5,144.99626 z M 98.55,142.92105 L 101.6,143 L 102.8,141.8 L 104,140.6 L 104,137.87258 L 104,135.14516 L 101.36514,133.07258 L 98.73028,131 L 95.86514,131 L 93,131 L 93,136.41667 L 93,141.83333 L 94.25,142.33772 C 94.9375,142.61513 96.8725,142.87763 98.55,142.92105 z M 95.9433,130 L 98.88659,130 L 100.4433,127.77749 C 101.29948,126.55512 102,124.98012 102,124.27749 C 102,123.57487 101.1,122.1 100,121 L 98,119 L 95.5,119 L 93,119 L 93,124.5 L 93,130 L 95.9433,130 z M 115.9613,145.00071 L 113.42259,145 L 114.02313,138.25 L 114.62367,131.5 L 114.03912,124.75 L 113.45457,118 L 115.97729,118.00113 L 118.5,118.00225 L 118.5,131.50184 L 118.5,145.00143 L 115.9613,145.00071 z M 76.75,101.46083 L 73,102.09063 L 73,97.564366 L 73,93.038106 L 75.25,92.582276 C 76.4875,92.331566 79.975,91.875246 83,91.568236 C 86.025,91.261226 90.21672,90.528556 92.31493,89.940086 L 96.12987,88.870136 L 98.06493,86.935066 L 100,85 L 100,83.328583 L 100,81.657166 L 95.41708,77.265957 L 90.83416,72.874747 L 94.88928,74.943513 C 97.11959,76.081334 100.5319,78.406133 102.4722,80.109732 L 106,83.207184 L 106,85.669334 C 106,87.023516 105.54172,88.987776 104.98161,90.034366 L 103.96322,91.937246 L 99.73161,94.409336 L 95.5,96.881426 L 88,98.856226 C 83.875,99.942366 78.8125,101.11444 76.75,101.46083 z M 56.5,85 C 56.225,85 56,84.775 56,84.5 C 56,84.225 56.225,84 56.5,84 C 56.775,84 57,84.225 57,84.5 C 57,84.775 56.775,85 56.5,85 z M 54.885619,84 C 54.613887,84 52.271243,82.655334 49.679742,81.011853 L 44.967924,78.023706 L 43.857046,75.585595 L 42.746168,73.147483 L 43.383017,70.610074 L 44.019866,68.072665 L 48.259933,65.595618 L 52.5,63.11857 L 60,61.089291 L 67.5,59.060011 L 77,57.97532 C 82.225,57.378739 88.83491,56.235687 91.68868,55.435205 L 96.87736,53.979783 L 98.43868,52.418463 L 100,50.857143 L 100,48.773735 L 100,46.690328 L 95.75,42.646974 L 91.5,38.603621 L 96.84169,41.510641 L 102.18337,44.417662 L 104.09169,46.843689 L 106,49.269717 L 106,52.062055 L 106,54.854392 L 102.25,57.955856 L 98.5,61.05732 L 90.7182,63.522397 L 82.936408,65.987474 L 71.089745,67.528685 C 64.574081,68.376351 57.776383,69.49054 55.983749,70.004661 C 54.191116,70.518781 51.886422,71.697799 50.862208,72.6247 L 49,74.309975 L 49,76.03633 L 49,77.762685 L 52.189838,80.881343 C 53.944249,82.596604 55.15735,84 54.885619,84 z M 89.5,73 C 89.225,73 89,72.775 89,72.5 C 89,72.225 89.225,72 89.5,72 C 89.775,72 90,72.225 90,72.5 C 90,72.775 89.775,73 89.5,73 z M 58.5,53 C 58.225,53 58,52.775 58,52.5 C 58,52.225 58.225,52 58.5,52 C 58.775,52 59,52.225 59,52.5 C 59,52.775 58.775,53 58.5,53 z M 56.929999,51.990616 C 56.693499,51.985455 54.183802,50.652796 51.352893,49.029152 L 46.205786,46.077073 L 44.102893,43.403678 L 42,40.730283 L 42,39.053703 L 42,37.377123 L 44.038155,35.18942 C 45.159141,33.986183 47.850527,32.09658 50.019013,30.990301 L 53.961714,28.978884 L 62.230857,26.996448 C 66.778886,25.906108 71.5125,25.010859 72.75,25.007006 L 75,25 L 75,29.338975 L 75,33.67795 L 65.916942,34.796298 L 56.833885,35.914646 L 53.166942,37.707323 L 49.5,39.5 L 49.195591,42.125505 L 48.891182,44.751009 L 53.12559,48.375505 C 55.454514,50.368977 57.166498,51.995777 56.929999,51.990616 z M 90.5,39 C 90.225,39 90,38.775 90,38.5 C 90,38.225 90.225,38 90.5,38 C 90.775,38 91,38.225 91,38.5 C 91,38.775 90.775,39 90.5,39 z"
    fill="white" stroke="none" />


    </svg:svg>
    </shape>
  • run dia ...



Et voila!


That's it,

Pierre

22 February 2011

A flex scanner extracting the metadata from a PDF file.

4 years ago, I played with the adobe XMP library to extract the XMP metadata from a set of PDF files.

Today, I was suprised to simply display the XMP data contained in a PDF from Nature by using the following command line:

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
strings |\
grep -A 100 "<x:xmpmeta"


I've generalized this process by implementing a GNU-flex scanner that prints the XML content between two <xmp:xmpmeta/> tags. The source code is available on github at: https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.

Compilation

flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c

Testing with "PLOS"

Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:

curl -s "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0016517&representation=PDF" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP toolkit 2.9.1-13, framework 1.6">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:iX="http://ns.adobe.com/iX/1.0/">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" pdf:Producer="Acrobat Distiller 7.0 (Windows)"/>
<rdf:Description xmlns:xap="http://ns.adobe.com/xap/1.0/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xap:CreateDate="2011-02-14T08:21:58+08:00" xap:CreatorTool="3B2 Total Publishing System 7.51n/W" xap:ModifyDate="2011-02-17T14:32:08+08:00" xap:MetadataDate="2011-02-17T14:32:08+08:00"/>
<rdf:Description xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xapMM:DocumentID="uuid:f98295b5-0980-42f2-884e-6cecd2d75c90" xapMM:InstanceID="uuid:d226393a-bfaf-4a2b-8c28-08215008694e"/>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" dc:format="application/pdf">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">pone.0016517 1..16</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

Hum, nothing really interesting here.

Testing with "Nature Reviews Cardiology"

A test with Percutaneous coronary intervention in the elderly. Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor


<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">&#xA; © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

That's more interesting isn't it ?

That's it,

Pierre

21 February 2011

Tabix: Fast retrieval of sequence features from generic TAB-delimited files

This post is about Tabix whose paper has been recently published in Bioinformatics:Bioinformatics. 2011 Jan 5.
Tabix: Fast retrieval of sequence features from generic TAB-delimited files.
Li H.
Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA.


Download &Install

  • Download tabix from
  • Extract & compile:
    > bunzip2 tabix-0.2.3.tar.bz2
    > tar xf tabix-0.2.3.tar
    > cd tabix-0.2.3
    > make

Testing with UCSC knownGene

I downloaded the table KnownGene from the UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
The file was g-unzipped and re-zipped with tabix-0.2.3/bgzip
gunzip knownGene.txt.gz
tabix-0.2.3/bgzip knownGene.txt

knownGene was then indexed with tabix. In this file, chromosome is column 2, txStart is column 4 and txEnd is column 5
tabix-0.2.3/tabix -s 2 -b 4 -e 5 knownGene.txt.gz

Listing the chromosomes:

tabix-0.2.3/tabix knownGene.txt.gz -l
chr1
chr10
chr11
chr12
chr13
chr13_random
(...)
Dumping by region(s):
tabix-0.2.3/tabix knownGene.txt.gz chr2:200000-250000
uc002qvu.1 chr2 - 208154 239852 208154 208154 8 208154,214863,219965,221022,223100,224159,237537,239730, 209001,214920,220044,221191,223229,224272,237602,239852, uc002qvu.1
uc002qvv.1 chr2 - 208154 246690 208810 232800 12 208154,214863,219965,221022,223100,224159,232797,233502,237537,239730,243004,246206, 209001,214920,220044,221191,223229,224272,232871,233562,237602,239844,243115,246690, Q96HL8-3 uc002qvv.1
uc002qvw.1 chr2 - 208154 250702 208154 208154 11 208154,219965,220546,223100,224159,232797,233502,237537,239730,243004,250084, 209001,220044,221191,223229,224272,232871,233562,237602,239844,243115,250702, uc002qvw.1
uc002qvx.1 chr2 - 208154 254024 208810 253984 10 208154,214863,219965,221022,223100,224159,237537,239730,243004,253983, 209001,214920,220044,221191,223229,224272,237602,239844,243115,254024, Q96HL8 uc002qvx.1
uc002qvy.1 chr2 - 208154 254024 208810 253984 9 208154,219965,221022,223100,224159,237537,239730,243004,253983, 209001,220044,221191,223229,224272,237602,239844,243115,254024, Q96HL8-2 uc002qvy.1
uc002qvz.1 chr2 - 208154 254392 208154 208154 10 208154,214867,219965,221022,223100,224159,237537,239730,243004,254083, 209001,214920,220044,221191,223229,224272,237602,239844,243115,254392, uc002qvz.1
uc002qwa.1 chr2 - 208154 254743 208154 208154 12 208154,214863,219965,221022,223100,224159,237537,239730,243004,250084,252200,254702, 209001,214920,220044,221191,223229,224272,237602,239844,243115,251130,252786,254743, uc002qwa.1
uc010ewe.1 chr2 - 208154 254810 208810 232800 11 208154,219965,221022,223100,224159,232797,233502,237537,239730,243004,254781, 209001,220044,221191,223229,224272,232871,233562,237602,239844,243115,254810, Q96HL8-4 uc010ewe.1
uc002qwb.2 chr2 - 229562 232178 229562 229562 1 229562, 232178, uc002qwb.2
uc002qwc.1 chr2 - 233502 252786 233502 233502 6 233502,237537,239730,243004,250084,252630, 233562,237602,239844,243115,250702,252786, uc002qwc.1

Using the JAVA API

The java API for tabix is quite immature: there is only one java file, no package has been declared and some useful functions (chr2tid... )are private. The following java program uses this API to find the 'knownGenes' overlaping the SNPs in UCSC/snp130 :
import java.io.BufferedReader;
import java.io.InputStreamReader;

import java.net.URL;
import java.util.zip.GZIPInputStream;

public class Test {
public static void main(String[] args) {
try {
BufferedReader in=new BufferedReader(

new InputStreamReader(new GZIPInputStream(new URL(
"http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz").openStream())
));
TabixReader tabix=new TabixReader("knownGene.txt.gz");
String line;
while((line=in.readLine())!=null)
{
String tokensSnp[]=line.split("[\t]");
TabixReader.Iterator iter=tabix.query(tokensSnp[1]+":"+tokensSnp[2]+"-"+tokensSnp[3]);//UGLY !
String lineKg;
while (iter != null && (lineKg = iter.next()) != null)
{
String tokensKg[]=lineKg.split("[\t]");
System.out.println(tokensKg[0]+"\t"+tokensSnp[4]+"\t"+tokensSnp[1]+"\t"+tokensSnp[2]+"\t"+tokensSnp[3]);
}
}
} catch (Exception e)
{
e.printStackTrace();
}
}
}
Compilation:
javac -cp /path/to/sam/library/sam-1.36.jar TabixReader.java Test.java

Execution:
java -cp /path/to/sam/library/sam-1.36.jar:. Test

Result:
uc001aaa.2 rs2441671 chr1 1271 1272
uc009vip.1 rs2441671 chr1 1271 1272
uc001aaa.2 rs9803797 chr1 1271 1272
uc009vip.1 rs9803797 chr1 1271 1272
uc001aaa.2 rs2758124 chr1 1319 1320
uc009vip.1 rs2758124 chr1 1319 1320
uc001aaa.2 rs3950659 chr1 1332 1333
uc009vip.1 rs3950659 chr1 1332 1333
uc001aaa.2 rs3877545 chr1 1370 1371
uc009vip.1 rs3877545 chr1 1370 1371
(...)

... and for an unknown reason, the program crashed later...
(...)
uc009ybq.1 rs35852236 chr10 135341235 135341236
uc009ybq.1 rs36132849 chr10 135341235 135341236
uc009ybq.1 rs71238288 chr10 135341235 135341236
uc009ybq.1 rs71238289 chr10 135341255 135341256
uc009ybq.1 rs36155773 chr10 135341256 135341257
uc009ybq.1 rs71189011 chr10 135341267 135341268
uc009ybq.1 rs34231788 chr10 135341287 135341288
java.lang.ArrayIndexOutOfBoundsException: -1
at TabixReader.query(TabixReader.java:325)
at TabixReader.query(TabixReader.java:373)
at Test.main(Test.java:20)


That's it,

Pierre

18 February 2011

A Data Scraper for Amazonia (expression)


Today, we had a lecture about the "human induced pluripotent stem cells", presented by John De Vos. He introduced Amazonia, a free web atlas that allows an easy query of public human transcriptome data. Although there is no web service (REST/SOAP) to access this data, I was interested in getting some profiles of expression from this database as it is something I've failed to achieve with NCBI/GEO.

I wrote the following java scraper:

  • Line 84: we search for a gene name
  • 88: if there is a http redirection, the gene has been found
  • 96: the HTML page is downloaded
  • 100-112: fix the HTML to create a valid XML document
  • 133: transform the HTML page to a DOM document
  • 135-151: use XPATH to find the images and the labels
  • 189-211; put the data into a java/SWING Dialog

Compilation

javac AmazoniaRobot.java

Execution

java AmazoniaRobot EIF4G1
Et voilà:


That's it !

Pierre

16 February 2011

Generating a WIG file containing the coverage data of a BAM file

The following (experimental) C-code generates a 'Wiggle Track Format' (WIG) file containing the coverage/depth data extracted from a BAM file.

It uses the C API of samtools to loop over the mapped short reads of the BAM file, and for each short-read, it scans the CIGAR description and increase the number of hits seen on the chromosome at a given position.

It takes as an optional UCSC knownGene.txt file to restrict the output to the known translated regions.

The C code



Test


This Makefile generates a random pair of FASTQ files for the human chr22, maps them with BWA , creates a BAM and generates the WIG file:

Result

The resulting WIG file was uploaded as a custom track in the UCSC genome browser:


That's it,

Pierre

08 February 2011

Visualizing my twitter network with Zoom.it

I wrote a small Java tool to download my twitter network as a GEXF file. This tool is available on github at:


java -jar twittergraph.jar -o twittergraph.gexf 7431072 #my twitter ID


This tool doesn't use the OAuth API, so it have to wait for a few minutes, and retry to connect, every times it reaches the twitter API quotas (150 requests per hour). In the end it took one night to download the data from my network (~390 friends).

<gexf
xmlns="http://www.gexf.net/1.1draft"
xmlns:viz="http://www.gexf.net/1.1draft/viz"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.1"
xsi:schemaLocation="http://www.gexf.net/1.1draft http://www.gexf.net/1.1draft/gexf.xsd">

<meta lastmodifieddate="2011-02-04">
<creator>Gephi 0.7</creator>
<description/>
</meta>
<graph defaultedgetype="directed" timeformat="double" mode="dynamic">
<attributes class="node" mode="static">
<attribute id="name" title="name" type="string"/>
<attribute id="screenName" title="screenName" type="string"/>
<attribute id="imageUrl" title="imageUrl" type="string">
<default>http://a3.twimg.com/sticky/default_profile_images/default_profile_1_reasonably_small.png</default>
</attribute>
<attribute id="location" title="location" type="string"/>
<attribute id="description" title="description" type="string"/>
<attribute id="protectedProfile" title="protectedProfile" type="boolean"/>
<attribute id="friends" title="friends" type="integer"/>
<attribute id="followers" title="followers" type="integer"/>
<attribute id="listed" title="listed" type="integer"/>
<attribute id="utc_offset" title="utc offset" type="integer"/>
<attribute id="statuses_count" title="statuses count" type="integer"/>
</attributes>
<nodes>
<node id="6612402" label="sciencebase">
<attvalues>
<attvalue for="name" value="David Bradley"/>
<attvalue for="screenName" value="sciencebase"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/1142396198/twitter-blue-bradley_normal.jpg"/>
<attvalue for="location" value="Cambridge, UK"/>
<attvalue for="description" value="Science Writer David Bradley based in Cambridge, UK. Physical and life sciences news and views + technology, internet, web commentary."/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="2022"/>
<attvalue for="followers" value="9197"/>
<attvalue for="listed" value="1065"/>
<attvalue for="utc_offset" value="0"/>
<attvalue for="statuses_count" value="7526"/>
</attvalues>
</node>
<node id="19344270" label="EMBOcomm">
<attvalues>
<attvalue for="name" value="Suzanne Beveridge"/>
<attvalue for="screenName" value="EMBOcomm"/>
<attvalue for="imageUrl" value="http://a0.twimg.com/profile_images/1189685782/S_Beveridge5100_normal.JPG"/>
<attvalue for="location" value="Heidelberg"/>
<attvalue for="description" value="Follow me for the latest from EMBO, the European Molecular Biology Organization"/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="396"/>
<attvalue for="followers" value="697"/>
<attvalue for="listed" value="59"/>
<attvalue for="utc_offset" value="3600"/>
<attvalue for="statuses_count" value="632"/>
</attvalues>
</node>
<node id="20153702" label="walshtp">
<attvalues>
<attvalue for="name" value="Tom Walsh"/>
<attvalue for="screenName" value="walshtp"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/644287976/IMG_0815_normal.JPG"/>
<attvalue for="location" value="Dundee, Scotland"/>
<attvalue for="description" value="Scientific programmer and sysadmin. "/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="129"/>
<attvalue for="followers" value="99"/>
<attvalue for="listed" value="8"/>
<attvalue for="utc_offset" value="0"/>
<attvalue for="statuses_count" value="783"/>
</attvalues>
</node>
<node id="15150655" label="konradfoerstner">
<attvalues>
<attvalue for="name" value="Konrad Förstner"/>
<attvalue for="screenName" value="konradfoerstner"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/643611092/konrad_avantar2_normal.jpeg"/>
<attvalue for="location" value="here and there"/>
<attvalue for="description" value="Idealist, Scientist, Includist, Data analyst, Open Source|Data|Access, Coder, Command line friend, CouchSurfer, Konrad"/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="266"/>
<attvalue for="followers" value="167"/>
<attvalue for="listed" value="17"/>
<attvalue for="utc_offset" value="3600"/>
<attvalue for="statuses_count" value="1948"/>
</attvalues>
</node>

(...)

<edge id="E3811" source="14899756" target="14295341">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4816" source="14899756" target="19542750">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4830" source="14899756" target="60065276">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E339" source="14899756" target="617133">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4807" source="14899756" target="15276911">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4822" source="14899756" target="26506721">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4824" source="14899756" target="27023131">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4819" source="14899756" target="22406785">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4808" source="14899756" target="16170580">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E1237" source="14899756" target="4339911">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4828" source="14899756" target="56564230">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4826" source="14899756" target="33838201">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4815" source="14899756" target="19002481">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
</edges>
</graph>
</gexf>



The GEXF file was then opened with Gephi, processed with the ForceAtlas algorithm and exported as a PDF file.

The PDF file was uploaded on scribd: http://www.scribd.com/doc/48415306/My-Twitter-Network


I then, downloaded the PDF from scribd.com, quickly copied the URL of the generated PDF and pasted it into http://zoom.it/.

Here is the result ! :-)



That's it !

Pierre

03 February 2011

Using the #Gephi toolkit to draw a graph from PSI-MI data, my notebook

GEPHI, interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. Recently, a java library , the Gephi toolkit has been released and was used by LinkedIn for generating its inMaps.

I've been playing with the Toolkit, to generate a graph from a PSI-MI file downloaded from EMBL-Strings. The source code is available on github:



Shortly: the program reads the PSIMI-XML, uses XPATH to find the nodes and the edges, creates the graph (I could also have used XSLT too, to create an internal GEXF (the native file format for Gephi) file from the PIS-MI file), applies a layout algorithm (YifanHuLayout) and outputs the result (SVG, PDF, GEXF... ). Nevertheless it was not clear how I could change the output to insert a hyperlink, to change the background color, etc... The online javadoc was missing many informations.

Compilation

javac -cp /path/to/gephi-toolkit.jar -sourcepath src -d src src/sandbox/PsimiWithGephi.java

Generating a PDF for HOPX

I've downloaded the psi-mi.xml for HOPX from the embl-strings database and run the program.
java -cp /path/to/gephi-toolkit.jar:src sandbox.PsimiWithGephi -o ~/result.pdf file.xml

Result


Viewing the result with Flash

The API can export a GEXF file too and it can be visualized using a flash application named GEXF Explorer:

Note: "Sigma" is another viewer available for gexf: http://ofnodesandedges.com/sigma-neighborhoods-exploration/

That's it,

Pierre

Testing if a BAM file is sorted using the samtools C API.

"Is my BAM file sorted ? is there any tool (in samtools ?) to quickly know whether a BAM file has been sorted or not ? should I run `samtools sort` ?": this is the question I asked yesterday on BioStar.
Brad Chapman suggested to use 'samtools index' as an unsorted BAM would return an error.
On my side I wrote the following 'C' program using the Samtools API:

This program opens one or more BAM file , loop over all the reads and test they are sorted on chromosome/position. The program returns 'EXIT_SUCCESS (0)' if all the files were sorted, or EXIT_FAILURE.

Compilation


FLAG64= -m64
gcc ${FLAG64} -o bamsorted -O3 -I ${SAMDIR} -L ${SAMDIR} bamsorted.c -lbam -lz

Test


Using samtools index

> for F in `find . -name "*.bam"`; do echo $F; samtools index ${F} ; done
GALAXY/galaxy-dist/test-data/sam_merge_out2.bam
GALAXY/galaxy-dist/test-data/sam_merge_in1.bam
GALAXY/galaxy-dist/test-data/srma_out1.bam
GALAXY/galaxy-dist/test-data/srma_out2.bam
GALAXY/galaxy-dist/test-data/sam_merge_in3.bam
GALAXY/galaxy-dist/test-data/sam_merge_in2.bam
GALAXY/galaxy-dist/test-data/1.bam
GALAXY/galaxy-dist/test-data/srma_in1.bam
GALAXY/galaxy-dist/test-data/sam_to_bam_out1.bam
GALAXY/galaxy-dist/test-data/3.bam
[bam_index_core] the alignment is not sorted (HWI-EAS91_1_30788AAXX:1:1:799:192): 14955 > 8420 in 46-th chr
GALAXY/galaxy-dist/test-data/srma_in3.bam
GALAXY/galaxy-dist/test-data/sam_pileup_in1.bam
GALAXY/galaxy-dist/test-data/sam_to_bam_out2.bam
GALAXY/galaxy-dist/lib/galaxy/datatypes/test/1.bam
GALAXY/galaxy-dist/lib/galaxy/datatypes/test/3.bam
[bam_index_core] the alignment is not sorted (HWI-EAS91_1_30788AAXX:1:1:799:192): 14955 > 8420 in 46-th chr

Using my C code

> for F in `find . -name "*.bam"`; do ./bamsorted ${F} ; done
GALAXY/galaxy-dist/test-data/sam_merge_out2.bam OK
GALAXY/galaxy-dist/test-data/sam_merge_in1.bam OK
GALAXY/galaxy-dist/test-data/srma_out1.bam OK
GALAXY/galaxy-dist/test-data/srma_out2.bam OK
GALAXY/galaxy-dist/test-data/sam_merge_in3.bam OK
GALAXY/galaxy-dist/test-data/sam_merge_in2.bam OK
GALAXY/galaxy-dist/test-data/1.bam OK
GALAXY/galaxy-dist/test-data/srma_in1.bam OK
GALAXY/galaxy-dist/test-data/sam_to_bam_out1.bam OK
GALAXY/galaxy-dist/test-data/3.bam Unsorted: On Reference[45]="chrM" position=8420 after 14955.
GALAXY/galaxy-dist/test-data/srma_in3.bam OK
GALAXY/galaxy-dist/test-data/sam_pileup_in1.bam OK
GALAXY/galaxy-dist/test-data/sam_to_bam_out2.bam OK
GALAXY/galaxy-dist/lib/galaxy/datatypes/test/1.bam OK
GALAXY/galaxy-dist/lib/galaxy/datatypes/test/3.bam Unsorted: On Reference[45]="chrM" position=8420 after 14955.

That's it,

Pierre