30 November 2009

Playing with Erlang (II)

transcripting a DNA to a RNA sequence with an anonymous function


lists:map(fun($T)->$U;(OTHER)->OTHER end,"AATAGCTGATCGACAATGTTAGCTAGGC").
>"AAUAGCUGAUCGACAAUGUUAGCUAGGC"

Testing if a symbol is an acid nucleic

> ACIDNUCLEICS="ATGCatgc".
"ATGCatgc"
> lists:member($A,ACIDNUCLEICS).
true
> lists:member($P,ACIDNUCLEICS).
false

Filtering a sequence with lists:filter

> IsAcidNucleic=fun(BASE)->lists:member(BASE,ACIDNUCLEICS) end.
#Fun<erl_eval.6.13229925>
> lists:filter( IsAcidNucleic, "1 ACGT ACGT ACGT ACGT ACGT ACGT 36").
"ACGTACGTACGTACGTACGTACGT"

Generating an array of numbers

> lists:seq(1,10).
[1,2,3,4,5,6,7,8,9,10]
> lists:seq(65,65+24).
"ABCDEFGHIJKLMNOPQRSTUVWXY"
The following notation says: add 32 to each item 'X' of the sequence 65 to 89.
> [X+32 || X<-lists:seq(65,65+24)].
"abcdefghijklmnopqrstuvwxy"
Create a list of structures {number,'X'} for each item 'X' of the sequence 1 to 5.
> [{number,X}|| X<-lists:seq(1,5)].
[{number,1},{number,2},{number,3},{number,4},{number,5}]
Create a list of structures {number,'X'} for each item 'X' of the sequence 1 to 100 where the modulus of X/15 is '0'.
> [{number,X}|| X<-lists:seq(1,100), X rem 15 =:= 0].
[{number,15},
{number,30},
{number,45},
{number,60},
{number,75},
{number,90}]
Create a list of all the pairs(x,y) for x and y between 1 and 4:
> [{pair,X,Y}|| X<-lists:seq(1,4),Y<-lists:seq(1,4)].
[{pair,1,1},
{pair,1,2},
{pair,1,3},
{pair,1,4},
{pair,2,1},
{pair,2,2},
{pair,2,3},
{pair,2,4},
{pair,3,1},
{pair,3,2},
{pair,3,3},
{pair,3,4},
{pair,4,1},
{pair,4,2},
{pair,4,3},
{pair,4,4}]
Create a list of all the pairs(x,y) for x and y between 1 and 4 having X==Y:
> [{pair,X,Y}|| X<-lists:seq(1,4),Y<-lists:seq(1,4),X=:=Y].
[{pair,1,1},{pair,2,2},{pair,3,3},{pair,4,4}]
A few is_XXXX functions
> is_integer("AAA").
false
> is_integer($A).
true
> is_list(1).
false
> is_list("AZAZAZ").
true
> is_atom(1).
false
> is_atom(hello).
true
> is_tuple(1).
false
> is_tuple({person,{name,"Pierre"}}).
true
A few functions:
% get the header of a list
> hd([100,99,98,97]).
100
% get the 3rd element of a list
> lists:nth(3,["EcoRI","HindIII","BamHI","PstI"]).
"BamHI"
>size({test,{name,"Hello"},{age,3}}).
3
> element(2,{test,{name,"Hello"},{age,3}}).
{name,"Hello"}
> abs(-99).
99
Translating a DNA to a protein.
The file bio.erl:
-module(bio).
-export([translate/1]).

%The standard genetic code table
geneticCode()->"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG".

%convert a base to an index in the table
base2index($T)-> 0;
base2index($C)-> 1;
base2index($A)-> 2;
base2index($G)-> 3.

%convert a codon to an index, lookup the table and return the amino acid
codon2AminoAcid(A1,A2,A3) -> lists:nth(
(base2index(A1)*16 + base2index(A2)*4 + base2index(A3))+1,
geneticCode())
.

%return an empty array if the argument is an empty sequence or a sequence with less than 3 bases
translate([])->[];
translate([_])->[];
translate([_,_])->[];
%translate the 3 first bases and append the remaining translation.
translate([A1,A2,A3|REMAIN])->[codon2AminoAcid(A1,A2,A3)|translate(REMAIN)].

Invoking bio:translate:
c(bio).

bio:translate("ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAG").
"MERQKRKADIEK"

Records


A record file defines the structure of an object. Example: the following file snp.hrl defines the 'class' snp with its default values:
-record(snp,{
name="rs",
chrom="?",
chromStart=0,
chromEnd=0,
avHet= -1
}).
We can now use this class in an erlang shell:
%load the structure of a SNP
rr("snp.hrl").
[snp]
%create an empty instance of a snp
EMPTY_SNP= #snp{}.
#snp{name = "rs",chrom = "?",chromStart = 0,chromEnd = 0,
avHet = -1}

%create and fill an new instance of a snp
RS84=#snp{name="rs84",chrom="chr7",chromStart=25669317,chromEnd=25669318}.
#snp{name = "rs84",chrom = "chr7",chromStart = 25669317,
chromEnd = 25669318,avHet = -1}

%re-use the content of RS84 and fill the value of 'avHet'
RS84_2=RS84#snp{avHet=0.475045}.
#snp{name = "rs84",chrom = "chr7",chromStart = 25669317,
chromEnd = 25669318,avHet = 0.475045}

%extract the name and the avHet for RS84_2
#snp{name=NAME_RS84,avHet=AVHET_RS84}=RS84_2.
NAME_RS84.
"rs84"
AVHET_RS84.
0.475045


That's it
Pierre

26 November 2009

Playing with Erlang (I)

I'm currently reading Joe Armstrong's "Programming Erlang". Here are a couple of notes about ERLANG.

Starting and stopping the Erlang shell


:~> erl
Erlang R13B01 (erts-5.7.2) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.2 (abort with ^G)
1> halt().
:~>

Simple Math

Input:
2*(3+4).
PI=3.14159.
R=2.
SURFACE=PI*R*R.
R=3.
Output:
1> 14
2> 3.14159
3> 2
4> 12.56636
##Variables in erlang are immutable R3 cannot be changed
5> ** exception error: no match of right hand side value 3

Two structure defining a SNP and a Gene are created with a Tuple:
RS94={snp,
{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}
}.
RS47={snp,
{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}
}.
NP_056019={gene,
{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}
}.

Values are extracted using '_' as a placeholder for the unwanted variables.
{_,{_,_},{_,_},{_,_},{_,NAME_OF_RS94},{_,_},{_,_},{_,_}}=RS94.
NAME_OF_RS94.
"rs94"

Create a list of SNP:
LIST_OF_SNP1=[RS94,RS47].
Add P_056019 to this list and create a list of genomic objects:
LIST2=[NP_056019|LIST_OF_SNP1].
Extract the first and second element of LIST2, put the remaining list in LIST3:

[ITEM1,ITEM2|LIST3]=LIST2.
ITEM1.
{gene,{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}}

ITEM2.
{snp,{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}}

LIST3.
[{snp,{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}}]

A String is just an Array of integer:
ALPHABET=[65,66,67,68,69,70].
"ABCDEF"

The dollar( $ ) notation is used to get the 'int' value of a 'char'.

HELLO=[$H,$e,$l,$l,$o].
"Hello"

Methods & Functions


A file named "bio.erl" is created. This file is a erlang module that contains a kind of polymorphic function distance returning the length of a given object (its number of arguments is '/1'). If this object is identified as a atom=snp the value "chromEnd-chromStart" is returned. Else, if atom=gene the value "txEnd-txStart" is returned.
-module(bio).
-export([distance/1]).
distance({snp,
{chrom,_},
{chromStart,START},
{chromEnd,END},
{name,_},
{strand,_},
{func,_},
{avHet,_}}
)-> END - START;
distance({gene,
{chrom,_},
{txStart,START},
{txEnd,END},
{cdsStart,_},
{csdEnd,_},
{name,_},
{strand,_},
{exonCount,_}
})-> END - START.

We now use this module:
c(bio).
RS94={snp,
{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}
}.


RS47={snp,
{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}
}.

NP_056019={gene,
{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}
}.

bio:distance(RS94).
1
bio:distance(RS47).
1
bio:distance(NP_056019).
457654

We now want to calculate the GC percent of a DNA: the bio.erl file is modified as follow:
-module(bio).
-export([gcPercent/1]).
-export([distance/1]).
(...)

gc($A) -> 0.0;
gc($T) -> 0.0;
gc($C) -> 1.0;
gc($G) -> 1.0;
gc([])->0;
gc([BASE|REMAIN])->gc(BASE)+gc(REMAIN).

gcPercent(ADN)->100.0*(gc(ADN)/erlang:length(ADN)).

Here the method gc returns '1' or '0' if the argument is a base; returns 0 if the array is empty, or return the sum of the gc(first character of the string) plus the gc(remaining string). The method gcPercent divide the sum of gc by the length of the string and multiply it by 100.
c(bio).
bio:gcPercent("GCATG").
60.0



That's it.
Pierre

A Java implementation of Jan Aerts' LocusTree

This post is a description of my implementation of Jan Aerts' LocusTree algorithm (I want to thank Jan, our discussion and his comments were as great source of inspiration) based on BerkeleyDB-JE, a Key/Value datastore. This implementation has been used to build a genome browser displaying its data with the SVG format. In brief: splicing each chromosome using a dichotomic approach allows to quickly find all the features in a given genomic region for a given resolution. A count of the total number of objects in the descent of each child node is used to produce a histogram of the number of objects smaller than the given resolution.
Your browser does not support the <CANVAS> element !

JSON/Metadata

All the information is stored in BerkeleyDB and I've used JSON to add some metadata about each object. The JSON is serialized, gzipped and stored in BerkeleyDB.
Your browser does not support the <CANVAS> element !

Organism

Each organism is defined by an ID and a Name. The Key of the BerkeleyDB is the organism.id.
Your browser does not support the <CANVAS> element !

The organisms are loaded in the database using a simple XML file:
<organisms>
<organism id="36">
<name>hg18</name>
<description>Human Genome Build v.36</description>
<metadata>{"taxon-id":9606}</metadata>
</organism>
</organisms>


Chromosome

Each chromosome is defined by an ID, a Name, its length and its organism-ID. The Key in berkeleyDB is the chromosome ID.
Your browser does not support the <CANVAS> element !

The chromosomes are loaded in the database using a simple XML file:
<chromosomes organism-id="36">
<chromosome id="1">
<name>chr1</name>
<metadata>{"size":247249719,"type":"autosomal"}</metadata>
</chromosome>
<chromosome id="10">
<name>chr10</name>
<metadata>{"size":135374737,"type":"autosomal"}</metadata>
</chromosome>
(...)
</chromosomes>



Track

Each track is defined by an ID and a Name. The Key in berkeleyDB is the track ID.
Your browser does not support the <CANVAS> element !

The descriptions of the tracks are loaded in the database using a simple XML file:
<tracks>
<track id="1">
<name>cytobands</name>
<description>UCSC cytobands</description>
</track>
<track id="2">
<name>knownGene</name>
<description>UCSC knownGene</description>
</track>
<track id="3">
<name>snp130</name>
<description>dbSNP v.130</description>
</track>
<track id="4">
<name>snp130Coding</name>
<description>UCSC coding Snp</description>
</track>
<track id="5">
<name>all_mrna</name>
<description>UCSC All mRNA</description>
</track>
</tracks>

Nodes


Each LocusTree Node (LTNode) is linked to a Chromosome and a Track using a database named 'TrackChrom'. Here the Key of the BerkeleyDB is a composite key (chromosome/track).
Your browser does not support the <CANVAS> element !

The structure of a LTNode is described below. Each node contains a link to its parent, the links to its children as well as a set of genomic entities whose length is greater or equals that 'this.length'.
Your browser does not support the <CANVAS> element !


To load the content of each LocusTree, I've defined a simple java interface called LTStreamLoader which looks like this:
public interface LTLoader
{
public MappedObject getMappedObject();
public String getChromosome();
public Set<String> getKeywords();
}
public interface LTStreamLoader
extends LTLoader
{
public void open(String uri) throws IOException;
public void close() throws IOException;
public boolean next() throws IOException;
}
An instance of this interface is used to load the content of a tab delimited file as defined in the following XML file:
<loaders>
<load organism-id="36" track-id="5" class-loader="fr.cephb.locustree.loaders.UCSCAllMrnaLoader" limit="10000">
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/all_mrna.txt.gz
</load>
<load organism-id="36" track-id="4" class-loader="fr.cephb.locustree.loaders.UCSCSnpCodingLoader" limit="10000">
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130CodingDbSnp.txt.gz
</load>
<load organism-id="36" track-id="1" class-loader="fr.cephb.locustree.loaders.UCSCCytoBandsLoader">
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
</load>
<load organism-id="36" track-id="2" class-loader="fr.cephb.locustree.loaders.UCSCKnownGeneLoader">
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
</load>
<load organism-id="36" track-id="3" class-loader="fr.cephb.locustree.loaders.UCSCSnpLoader" limit="10000">
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz
</load>
</loaders>
It took about 3H00 to load 'snp130.txt.gz' and the size of the indexed BerkeleyDB/LocusTree database was 16Go (ouch!).

Building the Genome Browser

The locus tree database was used to create (yet another) Genome Browser. My current implementation runs smoothly under Apache Tomcat. The SVG vectorial format was used to draw and hyperlink the data. Here is a screenshot of the first version I wrote one week ago. As you can see, the objects that were too small to be drawn, were displayed within a histogram.
Later, I've added some labels.
And my latest version uses the JSON metadata available in each objet to display the spliced structure of the genes:
The browser is fast (sorry, I cannot show it at this time) but I need to play with the config of BerkeleyDB to speed up the insertions and reduce the size of the database.

That's it.
Pierre

NB: The figures of this post were created using SVGToCanvas.

22 November 2009

A tool converting SVG to Canvas

This blogging platform, Blogger.com, does not display its content in XHTML: in consequence, it is not possible to embed a SVG picture in the body of a post. Solution: I've created a tool called SVGToCanvas converting a SVG document to a Canvas script. The software is available here:

. For example, the picture below was a SVG file converted to canvas (yes, have a look at the html source).

That's it.
Pierre

Inspired from "Retour du Bal", Alfred Roll (1846-1919). Musee des Beaux-Arts de Nantes.