YOKOFAKUN: February 2009

24 February 2009

1000 human genomes ! How !#@$* do you manage 1000 genomes ??!

The 1000 genomes project (http://www.1000genomes.org/page.php) has started to release its data. The 1000 Genomes Project is an international research consortium formed to create the most detailed and medically useful picture to date of human genetic variation. The project involves sequencing the genomes of approximately 1200 people from around the world

The data are already available on the ftp site. TONS of short reads are here. For example here are the first lines of a FASTQ file ftp://ftp.1000genomes.ebi.ac.uk/data/NA12878/sequence_read/SRR001113.recal.fastq.gz.

@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:123:322
TTTCTAAACAGAAAATATTATGAAAGCTGCCATTTTTAATTCATTTT
+
AACCDCCDEDEDDFFDDDDDDCGDFEEC=>@'B@BBA-7<'###+1'
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:114:773
TACTCAGTGTAGTGAATTCACGTCAGAGTGAGAGCTATACTTTTTAT
+
AACCDCCDEDEDDFFDDDD>DCGDDDDCEDCAB@4.A??<231%#5#
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:121:413
GGAACTTTACAAGCTGATTTTCAAATTTATGTAGATATACCCCCTTC
+
AACCDCCDEDDDDFFDDDDDDC1>@DDC8DC<#%%#*'#
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:115:199
TAAAAAAAAAATATTAAAGCAAGTTAGAATCTTCAACTATCTTAATT
+
AACCDCCDEDEDDFFDDDDDDCGDDDDCEDCAB@BBA??<*31'%#1
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:121:735
CTCTATTTCCAATAGACATAAATATATTGAATATTTCCCATTACCCT
+
AACCDCCDEDE>DFFDDDDDDCGDDDDC@DCAB@BBA??<2))(1##
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:117:705
TGCAGTGCTAATTTTATTTTTTTTAATAATTTATTTAATTTCTCCTT
+
AACCDCCDEDEDDFFDDDDDDCGDDDDCEDCAB@BBA??<2&1#(11
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:122:436
TGATGGGCAGGAAGGAGATAGCCTTCTTCACTGTTCTCTGTTTTTCT
+
AACC=CC@ED1&:.6-8>6?D:)FD+)C%46$=@E',?,<&&%&(##
@BI:080213_SL-XAH_0001_FC2037KAAXX:7:1:119:679
TAATATCTTGTCAGTGGCAGTCGCGGAATTGTCTCTGTGGTTTTTTC
(...)

size of the file: 1.325.489.979 octets
number of lines: 36.193.608
number of short reads: 9.048.402
number of distinct short reads without any N : 8.591.003
number of short reads without any ‘N’: 6.301.673
number of distinct short reads without any ‘N’: 6.260.140

How many fastq files are there ? 8215 fastq files.
(Just curious: how do you backup this amount of data ?)

And, now, here HE comes: the USER.
The USER and his !#@$ questions.
AND HE WANTS SOME ANSWERS.

For an given rs number , where is it located on those 1000 genomes ?

Here is a new SNP in my favorite gene. Is it a true SNP? I want to see the short reads and their qualities aligned for this SNP

For a given SNP, what is its location on the reference sequence ? on Watson's sequence ? on celera assembly ?

I want to see all the snps in a given region for a given individual and his parents

Is this mutation (e.g. substitution) on this individual is the same on another individual (e.g. insertion) ?

How do you store and manage this amount of information ? In a classical RDBM ?! My colleague Mario Foglio is currently looking for another alternative(s).

BIG DATA !

Pierre

21 February 2009

Genes for Translation mapped on a Hilbert Curve

In this post I describe how I mapped a list of genes involved in the Translational process on a Hilbert Curve .
This post was mostly inspired by Gareth Palidwor's recent post titled "Mapping genomes to a Hilbert Curve" see http://www.palidwor.com/blog/?p=123 (and yes, Paulo is right: , a blog can be a source of inspiration)

Wikipedia: A Hilbert curve is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891.

I've played with this idea and written a simple program mapping a list of genes on a Curve.

Finding the genes

First we need to get the name of genes involved in the Translational process. I've used the anonymous mysql GO server at EBI. I'm not a GO guru, so just tell me if the query below was not correct.
The position of those genes was then found using the anonymous mysql server of the UCSC. This mapping information is roughly available in the hg18.refGene table. I've restricted the search to the chromosome chr1.
At the end, my shell looks like this:

mysql -N -hmysql.ebi.ac.uk -ugo_select -pamigo -P4085 go_latest -A -e "SELECT DISTINCT
        gene_product.symbol
FROM
        term
INNER JOIN graph_path
        ON (term.id=graph_path.term1_id)
INNER JOIN association
        ON (graph_path.term2_id=association.term_id)
INNER JOIN gene_product
        ON (association.gene_product_id=gene_product.id)
INNER JOIN species
        ON (gene_product.species_id=species.id)
WHERE
        term.name = 'translation'  AND
        species.ncbi_taxa_id = 9606" |\
gawk 'BEGIN {printf("select distinct name2,txStart,txEnd from refGene where chrom=\"chr1\" and name2 in (\"_\" ");}
        {
        printf(",\"%s\"",$1);
        }
END     {
        printf(") order by txStart;\n");
        }' |\
mysql -h genome-mysql.cse.ucsc.edu -D hg18 -u genome -A

Drawing the map

Then I mapped this list of gene on a Hilbert Curve drawn as SVG. I won't show the code here, it was mostly copied from the code provided on the wikipedia-page. The only major modification was the one used to convert a position on the genome to the curve. For those interested, I put the code here:http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/HilbertSequence.java

Result

The curve is the human chromosome chr1.
The red lines are the genes.
Moving the mouse over a red line shows a balloon with the name of the gene

That's it.

'

19 February 2009

I get email

Dear Dr. Pierre Lindenbaum,

We are great fun of your blog and inspired several times by your ideas!

We, the Database Center for Life Science (DBCLS) in Japan, had been working on the standardization and integration of bioinformatics resources, and organized a BioHackathon meeting last year in Japan.

Recently, we decided to organize the 2nd hackathon that is more focused on end-user clients and mash-up of the existing services (to fully utilize the effort made by previous hackathon).

For this purpose, we think your insight is invaluable for fruitful discussions, therefore, I'd like to invite you to the BioHackathon 2009, which will be held in Tokyo and Okinawa (Southern-most island in Japan) from March 15th to 21th, 2009.

We will cover all the expenses for your flights and accommodations.

Please let me know if your schedule is available for the week.

Thanks in advance.

Wooooooooooooooooooooo !!! :-)

You know what ?

but... ;-)

12 February 2009

Au revoir Albert Barillé

Au revoir Albert Barillé (1921-2009), et merci pour tout.

Darwin et l'évolution partie 1
envoyé par dimi1000

11 February 2009

My tribute to Charles. Darwin Day 12 Feb 2009.

http://www.darwinday.org/

Ah, shhh...., there is a type. It's Hari, not Hary... Saperlipopette !!!

10 February 2009

A standalone XUL application translating a DNA. My notebook

In this post, I will present how I wrote a standalone XUL application: it's a simple GUI translating a DNA sequence to a proteic sequence. (Wikipedia:) XUL stands for "XML User Interface Language". It's a language developed by the Mozilla project which operates in Mozilla cross-platform applications, it relies on multiple existing web standards and technologies, including CSS, JavaScript, and DOM. Such reliance makes XUL relatively easy to learn for people with a background in web-programming and design.
A XUL standalone application has not the barreers of security of the web browsers: It can open/save/write/read a file, create a database (via sqlite/mozstorage), etc...

This post was mostly written with the help of the xulrunner tutorial (https://developer.mozilla.org/en/Getting_started_with_XULRunner).

The complete source code of this application is available at: http://code.google.com/p/lindenb/source/browse/#svn/trunk/proj/tinyxul/translate.

File hierarchy

tinyxul/
  +translate/
    application.ini
    +chrome/
       chrome.manifest
       +translate
          translate.xul
    +defaults
          +preferences/
             prefs.js

application.ini: provides metadata that allows XULRunner to launch the application properly.
prefs.js: the preferences file
translate.xul: the xul file containing the layout of the application and, in this case, the javascript methods

translate/application.ini

This file provides metadata that allows XULRunner to launch the application properly.

[App]
Vendor=lindenb
Name=translate
Version=0.1
BuildID=20090209
ID=translation@pierre.lindenbaum.fr
[Gecko]
MinVersion=1.8

translate/defaults/preferences/prefs.js

The preference file contains among other things, the URI of the main xul window to be opened when the application starts

pref("toolkit.defaultChromeURI", "chrome://translate/content/translate.xul");

translate/chrome/translate/translate.xul

The window is defined with the XML/XUL layout. At the end, it will look like this:

The application is a <window>, it contains a <menubar>, two <textbox> (for the DNA and the proteic sequences) and a <menulist> where the user select the Genetic Code (standard, mitochondrial...)

<?xml version="1.0"?>
<?xml-stylesheet href="chrome://global/skin/" type="text/css"?>
<window id="main" title="Translate" width="800" height="600"
        xmlns:html="http://www.w3.org/1999/xhtml"
        xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
        onload="init()"
        >
        <script>(...)
        </script>
        <toolbox flex="1">
                <menubar id="sample-menubar">
                        <menu label="File">
                                <menupopup id="file-popup">
                                        <menuitem label="New" oncommand="doNewWindow();"/>
                                        <menuitem label="Save As..." oncommand="doSaveAs();"/>
                                        <menuseparator/>
                                        <menuitem label="Exit" oncommand="window.close();"/>
                                </menupopup>
                        </menu>
                </menubar>
        </toolbox>
        <vbox flex="13">
                <hbox><label control="dnaseq" value="Your DNA Sequence" label="Your DNA Sequence" /><label id="dnalength"  flex="1" value="0 pb"/></hbox>
                <textbox flex="6" id="dnaseq"  rows="5" multiline="true" onchange="doTranslate()" oninput="doTranslate()"/>

                <hbox>
                        <label flex="1" control="protseq" value="The Translated Sequence"/>
                        <label id="protlength" flex="1" value="0 AA"/>
                        <menulist flex="1" oncommand="currentGeneticCode=GeneticCode[selectedIndex];doTranslate();">
                            <menupopup id="gcpopup">
                            </menupopup>
                        </menulist>
                </hbox>
                <textbox  flex="6" id="protseq" rows="5"  multiline="true" readOnly="true"/>

        </vbox>
</window>

An array of genetic codes is stored as a javascript array (BTW, thanks to Brad Chapman and PJ Davis for their suggestions):

/* via ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt */
 var GeneticCode=[
{
name: "Standard" ,
ncbieaa :  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
},
{
name: "Vertebrate Mitochondrial" ,
ncbieaa :  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
},
{
name: "Yeast Mitochondrial" ,
ncbieaa :  "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
},
{
name: "Bacterial and Plant Plastid" ,
ncbieaa :  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}
];

Each time the DNA sequence is modified, the method doTranslate is called.

function doTranslate()
        {
        var ncbieaa= currentGeneticCode.ncbieaa ;
        var dna=document.getElementById('dnaseq').value;
        var prot="";
        var i=0;
        var aa="";
        var dnalength=0;
        var protlength=0;
        while(i&lt; dna.length)
                {
                var c= dna.charAt(i++);
                if("\n\t \r".indexOf(c)!=-1)
                        {
                        continue;
                        }
                dnalength++;
                aa+=c;
                if(aa.length==3)
                        {
                        prot+=translation(ncbieaa,aa);
                        protlength++;
                        if(protlength % 50==0) { prot+="\n";}
                        aa="";
                        }
                }
        document.getElementById('protseq').value = prot;
        document.getElementById('protlength').value = protlength+" AA";
        document.getElementById('dnalength').value = dnalength+" pb";
        }

And because, this is a standalone application, the user can SAVE the sequence of the protein.

function doSaveAs()
        {
        try {
        const nsIFilePicker = Components.interfaces.nsIFilePicker;

        var fp = Components.classes["@mozilla.org/filepicker;1"]
                   .createInstance(nsIFilePicker);
        fp.init(window, "Save As...", nsIFilePicker.modeSave);


        var rv = fp.show();
        if (!(rv == nsIFilePicker.returnOK || rv == nsIFilePicker.returnReplace) ) return;
        var file = fp.file;


        var foStream = Components.classes["@mozilla.org/network/file-output-stream;1"].
                                 createInstance(Components.interfaces.nsIFileOutputStream);


        foStream.init(file, 0x02 | 0x08 | 0x20, 0666, 0);

        var data=document.getElementById('protseq').value;
        foStream.write(data, data.length);
        foStream.close();


        } catch(err){ alert(err.message);}
        }

Testing the application

To launch the application, call xulrunner

xulrunner translate/application.ini

Or firefox with the app option

firefox -app translate/application.ini

I guess there should have a way to package this application in a jar/zip, but I still haven't found a way to to this.

That's it !
Pierre

YOKOFAKUN