29 May 2010

From mRNA to GO

In this post, I've copied the solutions I gave on http://biostar.stackexchange.com/ for the following problem: "How do I do a simple GO term lookup given a gene (or mRNA) identifier?".
The best answer was given by Neil, who suggested using the biomaRt package.

On my side, I submitted two solutions:

1st solution: use a recursive XSLT stylesheet

The following XSLT stylesheet extracts the GeneID from the mRNA record, downloads the XML document describing this gene, and extracts the Gene Ontology identifiers:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
<xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
<xsl:variable name="genid" select="."/>
<xsl:variable name="url" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&amp;retmode=xml&amp;id=',$genid)"/>
<xsl:apply-templates select="document($url)//Other-source" mode="go"/>
</xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
<xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
<xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text> </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="text()">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Running the stylesheet with xsltproc

Result:

GO:5524 ATP binding
GO:8026 ATP-dependent helicase activity
GO:3725 double-stranded RNA binding
GO:4519 endonuclease activity
GO:4386 helicase activity
GO:16787 hydrolase activity
GO:46872 metal ion binding
GO:166 nucleotide binding
GO:5515 protein binding
GO:4525 ribonuclease III activity
GO:6396 RNA processing
GO:1525 angiogenesis
GO:48754 branching morphogenesis of a tube
GO:35116 embryonic hindlimb morphogenesis
GO:31047 gene silencing by RNA
GO:30324 lung development
GO:31054 pre-microRNA processing
GO:30422 production of siRNA involved in RNA interference
GO:19827 stem cell maintenance
GO:30423 targeting of mRNA for destruction involved in RNA interference
GO:16442 RNA-induced silencing complex
GO:5737 cytoplasm
GO:5622 intracellular

2nd solution: using the ensembl.org mysql server

After a little reverse engineering of the ensembl.org SQL schema, I wrote the following SQL query:
use homo_sapiens_core_48_36j;

select distinct
GENE_ID.stable_id as "ensembl.gene",
RNA_ID.stable_id as "ensembl.transcript",
PROT_ID.stable_id as "ensembl.translation",
GO.acc as "go.acc",
GO.name as "go.name",
GOXREF.linkage_type as "evidence"
from

ensembl_go_54.term as GO,
external_db as EXTDB0,
external_db as EXTDB1,
object_xref as OX0,
object_xref as OX1,
xref as XREF0,
xref as XREF1,
transcript as RNA,
transcript_stable_id as RNA_ID,
gene as GENE,
gene_stable_id as GENE_ID,
translation as PROT,
translation_stable_id as PROT_ID,
go_xref as GOXREF
where
XREF0.dbprimary_acc="NM_030621" and
XREF0.external_db_id=EXTDB0.external_db_id and
EXTDB0.db_name="RefSeq_dna" and
OX0.xref_id=XREF0.xref_id and
RNA.gene_id=GENE.gene_id and
GENE.gene_id= GENE_ID.gene_id and
RNA.transcript_id=OX0.ensembl_id and
RNA_ID.transcript_id=RNA.transcript_id and
PROT.transcript_id = RNA.transcript_id and
OX1.ensembl_id=PROT.translation_id and
PROT.translation_id=PROT_ID.translation_id and
OX1.ensembl_object_type='Translation' and
OX1.xref_id=XREF1.xref_id and
GOXREF.object_xref_id=OX1.object_xref_id and
XREF1.external_db_id=EXTDB1.external_db_id and
EXTDB1.db_name="GO" and
GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result

ensembl.gene    ensembl.transcript ensembl.translation go.acc     go.name
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166 nucleotide binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525 angiogenesis
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676 nucleic acid binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677 DNA binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723 RNA binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725 double-stranded RNA binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386 helicase activity
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519 endonuclease activity
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525 ribonuclease III activity
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515 protein binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524 ATP binding
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622 intracellular
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396 RNA processing
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026 ATP-dependent helicase activity
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787 hydrolase activity
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827 stem cell maintenance
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324 lung development
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422 RNA interference, production of siRNA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423 RNA interference, targeting of mRNA for destruction
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047 gene silencing by RNA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116 embryonic hindlimb morphogenesis
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196 gene silencing by miRNA, production of miRNAs
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754 branching morphogenesis of a tube


Note: my solutions don't list all the GO terms visible at http://www.ensembl.org/Homo_sapiens/Transcript/(...)/ENSG00000100697. E.g. I can see GO:0016442 on the web page, but not in my results. The Ensembl schema is complex, and I might have missed some links ("left join"?) between the tables (does anybody know?). UPDATE: the results are the same as Ensembl's when using "homo_sapiens_core_58_37c" (thank you @joachimbaran).

That's it

Pierre

27 May 2010

Binning the genome.

Some tables in the UCSC genome mysql database contain a column named bin.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18

mysql> select bin,name,chrom,chromStart,chromEnd from snp130 where chromStart>10000 and chromEnd<20000 limit 10;
+-----+------------+-------+------------+----------+
| bin | name | chrom | chromStart | chromEnd |
+-----+------------+-------+------------+----------+
| 585 | rs12354060 | chr1 | 10003 | 10004 |
| 585 | rs56336884 | chr1 | 10003 | 10004 |
| 585 | rs3950650 | chr1 | 10020 | 10021 |
| 585 | rs3960288 | chr1 | 10020 | 10021 |
| 585 | rs3971686 | chr1 | 10028 | 10029 |
| 585 | rs4030269 | chr1 | 10028 | 10029 |
| 585 | rs3950649 | chr1 | 10053 | 10054 |
| 585 | rs3960289 | chr1 | 10053 | 10054 |
| 585 | rs3962424 | chr1 | 10053 | 10054 |
| 585 | rs12239663 | chr1 | 10068 | 10069 |
+-----+------------+-------+------------+----------+
10 rows in set (0.21 sec)
I wondered what that column was (it is not a primary key) and, after asking for more information on http://biostar.stackexchange.com, I eventually found that this column is an index described by Jim Kent in "The Human Genome Browser at UCSC" (Genome Res. 2002, 12:996-1006, doi:10.1101/gr.229102) and on the UCSC Wiki. Citing: "Binning scheme for optimizing database accesses for genomic annotations that cover a particular region of the genome (...) Features are put in the smallest bin in which they fit. (...) Typically, almost all features are in the smaller bins, and in the most common usage scenarios only the contents of a few of these smaller bins need to be examined. (...) Since all of these bins are in sizes of powers of two, the calculation of the bin number is a simple matter of bit shifting of the chromStart and chromEnd coordinates".


Kent et al., "The Human Genome Browser at UCSC", Genome Res. 2002, 12:996-1006, doi:10.1101/gr.229102


Jim Kent's C implementation is small and elegant (it uses some clever bit-shift operators; see the source code). I've written my own Java version: less elegant (it uses a recursive function, and I'm not Jim Kent) but more readable (at least to me).
/** Given start,end in chromosome coordinates assign it
 * a bin. A range goes into the smallest bin it will fit in. */
public int getBin(int chromStart, int chromEnd)
    {
    int genomicLength = getMaxGenomicLengthLevel();
    return calcBin(chromStart, chromEnd, 0, 0, 0, 0, 1, 0, genomicLength);
    }

/** length for level 0 */
protected int getMaxGenomicLengthLevel() { return 536870912; /* 2^29 */ }
/** maximum level in Jim Kent's algorithm */
protected int getMaxLevel() { return 4; }
/** how many children for one node ? */
protected int getChildrenCount() { return 8; }

private int calcBin(
    final int chromStart,
    final int chromEnd,
    int binId,
    int level,
    int binRowStart,
    int rowIndex,
    int binRowCount,
    int genomicPos,
    int genomicLength
    )
    {
    if( chromStart >= genomicPos &&
        chromEnd <= (genomicPos + genomicLength))
        {
        if(level >= getMaxLevel()) return binId;

        int childLength = genomicLength / getChildrenCount();
        int childBinRowCount = binRowCount * getChildrenCount();
        int childRowBinStart = binRowStart + binRowCount;
        int firstChildIndex = rowIndex * getChildrenCount();
        int firstChildBin = childRowBinStart + firstChildIndex;
        for(int i = 0; i < getChildrenCount(); ++i)
            {
            int n = calcBin(
                chromStart,
                chromEnd,
                firstChildBin + i,
                level + 1,
                childRowBinStart,
                firstChildIndex + i,
                childBinRowCount,
                genomicPos + i * childLength,
                childLength
                );
            if(n != -1)
                {
                return n;
                }
            }
        return binId;
        }
    return -1;
    }
The same kind of function can also be used to find all the bins that should be scanned for finding the objects lying under a given genomic region.
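The bit-shifting calculation Kent describes, together with the "find all bins overlapping a region" variant, can also be sketched without recursion. The following is a minimal illustration of the classic 5-level scheme (bin offsets 585/73/9/1/0, shifts 17 and 3), not Jim Kent's original code; the class and method names are my own:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the classic 5-level UCSC binning scheme.
 *  Illustration only: bin offsets 512+64+8+1, 64+8+1, 8+1, 1, 0;
 *  first shift 17 (128kb bins), then 3 (8 children per node). */
public class BinRange {
    private static final int[] BIN_OFFSETS = {512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0};
    private static final int BIN_FIRST_SHIFT = 17; // log2 of the smallest bin size (128kb)
    private static final int BIN_NEXT_SHIFT = 3;   // log2 of the fan-out (8 children)

    /** smallest bin fully containing the half-open range [start,end) */
    public static int binFromRange(int start, int end) {
        int startBin = start >> BIN_FIRST_SHIFT;
        int endBin = (end - 1) >> BIN_FIRST_SHIFT;
        for (int offset : BIN_OFFSETS) {
            if (startBin == endBin) return offset + startBin;
            startBin >>= BIN_NEXT_SHIFT;
            endBin >>= BIN_NEXT_SHIFT;
        }
        throw new IllegalArgumentException("range out of bounds: " + start + "-" + end);
    }

    /** all bins that may contain a feature overlapping [start,end) */
    public static List<Integer> overlappingBins(int start, int end) {
        List<Integer> bins = new ArrayList<>();
        int startBin = start >> BIN_FIRST_SHIFT;
        int endBin = (end - 1) >> BIN_FIRST_SHIFT;
        for (int offset : BIN_OFFSETS) {
            for (int b = startBin; b <= endBin; ++b) bins.add(offset + b);
            startBin >>= BIN_NEXT_SHIFT;
            endBin >>= BIN_NEXT_SHIFT;
        }
        return bins;
    }

    public static void main(String[] args) {
        System.out.println(binFromRange(10000, 20000));    // 585
        System.out.println(overlappingBins(10000, 20000)); // [585, 73, 9, 1, 0]
    }
}
```

For chr1:10000-20000 this returns bin 585 and the overlapping bins {585, 73, 9, 1, 0}, the same set as below.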

Testing this 'bin' thing


Connecting to the UCSC:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18

Count the number of SNPs on chr1 between 10000 and 20000.
select count(*) from snp130 where chrom="chr1" and chromStart>10000 and chromEnd<20000 ;
+----------+
| count(*) |
+----------+
| 487 |
+----------+
1 row in set (7.35 sec)


Using my piece of java code, I've found that the bins covering the region chr1:10000-20000 are:
  • 0
  • 1
  • 585
  • 9
  • 73

OK, now let's redo the first query using the bin.
select count(*) from snp130 where chrom="chr1" and chromStart>10000 and chromEnd<20000 and bin in (0,1,585,9,73);
+----------+
| count(*) |
+----------+
| 487 |
+----------+
1 row in set (0.21 sec)
"0.21 sec" vs. "7.35 sec". Conclusion: using the bin speeds up the query about 35-fold.

That's it
Pierre

19 May 2010

First rule of Bioinfo-Club

The first rule of Bioinfo-Club is:

If you don't like a
Standard File Format,
create a new one


and, I don't like the VCF format. VCF is a text file format (most likely stored compressed). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. A VCF file looks like this:
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)


The problem is that I want to add some information about each variation in this file: things started to become messy when I wanted to annotate the mutations (position in the protein? in the cDNA? is it non-synonymous? which exon? which gene? what are the GO terms? etc.). So I created a tool named 'VCFToXML' (source available here) that transforms the VCF format into XML:
cat file.vcf | java -jar vcf2xml.jar > vcf.xml

The XML-VCF now looks like this:
[XML-VCF example elided]
Of course, it is less compact and less easy to process with the common unix tools (cut/sort/awk/etc.). But with the help of a second tool named 'vcfannot' (source available here), I can now add some structured annotations (here, some information about UCSC/knownGenes and UCSC/segDups):

A first XSLT stylesheet: back to VCF

Great! I can use XSLT to transform this XML into other file formats. My first XSLT stylesheet, xvcf2vcf.xsl, transforms the XML-VCF back to the plain-text format:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2vcf.xsl vcf.xml

##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02

Second XSLT stylesheet: insert into SQL


The second XSLT stylesheet, xvcf2sql.xsl, generates a MySQL script for inserting the VCF into a database:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml |\
mysql -D test


Then, for example, the following query scans the database and returns the genotypes/info for each sample:
select
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;

Result:


(xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml; cat query.sql )|\
mysql -D test

chrom position ref alt name Format content
chrN 10109 A T chr.markdup.bam Genotype... 1/1
chrN 10109 A T chr.markdup.bam Read Depth... 4
chrN 10109 A T chr.markdup.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chr.markdup.bam Genotype Q... 45.90
chrN 10109 A T chrN.sorted.bam Genotype... 0/1
chrN 10109 A T chrN.sorted.bam Read Depth... 4
chrN 10109 A T chrN.sorted.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chrN.sorted.bam Genotype Q... 45.90
chrN 10578 G A chr.markdup.bam Genotype... 0/1
chrN 10578 G A chr.markdup.bam Read Depth... 2
chrN 10578 G A chr.markdup.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chr.markdup.bam Genotype Q... 6.02
chrN 10578 G A chrN.sorted.bam Genotype... 0/1
chrN 10578 G A chrN.sorted.bam Read Depth... 2
chrN 10578 G A chrN.sorted.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chrN.sorted.bam Genotype Q... 6.02


That's it
Pierre

17 May 2010

Elementary school for Bioinformatics: indexing/extracting a sequence in the FASTA format

Not exactly rocket science: inspired by the faidx command in samtools, I've created a simple tool for indexing FASTA sequences. Since all the lines of a FASTA sequence should (hopefully) have the very same length, it is straightforward to extract any sub-sequence once the start offset of each named sequence in the file is known.
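The offset arithmetic behind such an index boils down to one line. A minimal sketch (my own, hypothetical class and method names), assuming a fixed line length and a single '\n' terminating each line:

```java
/** Minimal sketch of faidx-style offset arithmetic. Assumes every
 *  sequence line (except possibly the last) has the same length
 *  and lines end with a single '\n'. Illustration only. */
public class FastaOffset {
    /** byte offset in the file of the base at 0-based position pos,
     *  for a sequence whose first base starts at byte seqStart */
    public static long offsetOf(long seqStart, int lineSize, long pos) {
        long row = pos / lineSize;                    // which fasta line
        long col = pos % lineSize;                    // position within that line
        return seqStart + row * (lineSize + 1) + col; // +1 accounts for the newline
    }

    public static void main(String[] args) {
        // with 70 bases per line and seqStart=14, base 100 is on
        // row 1 (bytes 14..84 are row 0 plus its newline), column 30
        System.out.println(offsetOf(14, 70, 100)); // 115
    }
}
```

A random-access read can then seek to that offset and copy bytes, skipping one newline every lineSize bases, which is exactly what the getSequence() method below does.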
The first tool (source code available here) indexes the sequences in one or more FASTA files and produces an RDF file (more is better) describing each entry:

java -jar fastaindexer.jar ~/rotavirus.fa

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fi="uri:fr.inserm.umr915:fastaidx:1.0:" xmlns:dc="http://purl.org/dc/elements/1.1/">
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611561</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>14</fi:start>
<fi:end>1104</fi:end>
<fi:length>1075</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|122938370</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>1120</fi:start>
<fi:end>1728</fi:end>
<fi:length>600</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|78064499</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>1743</fi:start>
<fi:end>20623</fi:end>
<fi:length>18615</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|66774335</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>20638</fi:start>
<fi:end>39518</fi:end>
<fi:length>18615</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|10242107</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>39533</fi:start>
<fi:end>40208</fi:end>
<fi:length>666</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611570</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>40224</fi:start>
<fi:end>40897</fi:end>
<fi:length>664</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611569</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>40913</fi:start>
<fi:end>42583</fi:end>
<fi:length>1647</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611567</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>42599</fi:start>
<fi:end>43673</fi:end>
<fi:length>1059</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611565</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>43689</fi:start>
<fi:end>45505</fi:end>
<fi:length>1791</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611563</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>45521</fi:start>
<fi:end>47252</fi:end>
<fi:length>1707</fi:length>
</fi:Index>
</rdf:RDF>

From this index, any substring can be extracted for a given sequence:

FastaIndex.java
/**
 * returns a sequence from a RandomAccessFile for this index
 * @param in a fasta file opened with a RandomAccessFile
 * @param start the start offset of the sequence (first base=0)
 * @param length number of symbols to read
 * @return an array of bytes containing the sequence
 * @throws IOException
 */
public byte[] getSequence(
    RandomAccessFile in,
    int start,
    int length
    ) throws IOException
    {
    if(start < 0) throw new IllegalArgumentException("start<0:" + start);
    if(start + length > getLength()) throw new IllegalArgumentException("start=" + start + " length=" + length + "> size=" + getLength());
    /* the array of bytes where we store the final sequence */
    byte seq[] = new byte[length];

    /** the fasta row index */
    int row_index = start / this.lineSize;
    /** index of 'start' in this current row */
    int index_in_row = start - row_index * this.lineSize;
    /** prepare a buffer to read some bytes from the fasta file */
    byte buffer[] = new byte[FastaIndex.getBufferSize()];
    /** number of bytes in the buffer */
    int buffer_length = 0;
    /** current position in the buffer */
    int index_in_buffer = 0;
    /** current position in the final array of bytes (sequence) */
    int index_in_seq = 0;
    /** move the IO cursor to the beginning of the sequence */
    in.seek(
        this.getSeqStart() +
        row_index * (this.lineSize + 1) +
        index_in_row
        );
    while(length > 0)
        {
        /* buffer empty ? fill it */
        if(buffer_length == 0)
            {
            buffer_length = in.read(buffer);
            index_in_buffer = 0;
            if(buffer_length <= 0) throw new IOException("cannot fill buffer");
            }
        /* number of bytes to copy into the sequence */
        int n_to_copy = Math.min(buffer_length - index_in_buffer, Math.min(length, this.lineSize - index_in_row));
        //System.err.println("n_to_copy="+n_to_copy+" index_in_row="+index_in_row+" \""+new String(buffer,index_in_buffer,n_to_copy)+"\"");

        /* copy the bytes from the buffer to the sequence */
        System.arraycopy(buffer, index_in_buffer, seq, index_in_seq, n_to_copy);

        /* move the offsets */
        index_in_seq += n_to_copy;
        length -= n_to_copy;
        index_in_buffer += n_to_copy;
        index_in_row = (index_in_row + n_to_copy) % this.lineSize;

        /* check that the next input is a newline */
        if(length > 0)
            {
            if(index_in_row == 0)
                {
                /* buffer exhausted: read directly from the input stream */
                if(index_in_buffer == buffer_length)
                    {
                    if(in.read() != '\n')
                        {
                        throw new IOException("I/O error expected a carriage return at offset " + in.getFilePointer());
                        }
                    }
                else /* check the buffer */
                    {
                    if(buffer[index_in_buffer] != '\n')
                        {
                        throw new IOException("I/O error expected a carriage return at offset " + in.getFilePointer());
                        }
                    index_in_buffer++;
                    }
                }

            /* reset the buffer once it has been fully consumed */
            if(index_in_buffer == buffer_length)
                {
                buffer_length = 0;
                index_in_buffer = 0;
                }
            }
        }
    return seq;
    }


Example:
echo -e "gi|10242107\t100\t278" | java -jar genomeindex.jar -f ~/jeter.xml
>gi|10242107|100-278
AACTCTTTCTGGAAAATCTATTGGTAGGAGTGAACAGTACATTTCACCAG
ATGCAGAAGCATTCAGTAAATACATGCTGTCAAAGTCTCCAGAAGATATT
GGACCATCTGATTCTGCTTCAAACGATCCACTCACCAGCTTTTCGATTAG
ATCGAATGCAGTTAAGACAAATGCAGAC



That's it
Pierre

The poor state of the java web services for Bioinformatics

In his latest post, Brad Chapman cited Jessica Kissinger, who wished the Galaxy community could access the web services listed at http://www.biocatalogue.org/. This reminded me of a thread I started on http://biostar.stackexchange.com/: "Anyone using 'Biomart + Java Web Services'?", where Michael Dondrup and I realized that support for the Biomart Java web services API was poor.

I wanted to test ${JAVA_HOME}/bin/wsimport against all the services in the BioCatalogue: I wrote a small Java program using the BioCatalogue API (see below) to extract the web services having a WSDL file. Each WSDL URI was processed with wsimport, and I checked whether any class was generated. The wsimport '-version' was JAX-WS RI 2.1.6 in JDK 6.

The result is available as a Google spreadsheet:

Result


Number of services: 1644
  Can't access the service, something went wrong: 6
  No WSDL: 6
  Found a WSDL: 1590


Number of services where wsimport failed to parse the WSDL: 1179 (74%)

Common Errors:
  690 : [ERROR] rpc/encoded wsdls are not supported in JAXWS 2.0.
  119 : [ERROR] undefined simple or complex type 'soapenc:Array'
  96 : [ERROR] 'EndpointReference' is already defined
  7 : [ERROR] only one "types" element allowed in "definitions"
  6 : [ERROR] undefined simple or complex type 'apachesoap:DataHandler'
  4 : [ERROR] only one of the "element" or "type" attributes is allowed in part "inDoc"


Number of services successfully parsed by wsimport: 411 (26%)

Count by host:


Source Code




That's it
Pierre

12 May 2010

BigJoin: joining large files

This page was copied from http://code.google.com/p/code915/wiki/BigJoin.

BigJoin is a Java tool I wrote for my lab (primarily for joining VCF files). It joins large files on one or more columns.

Synopsis

java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin [options] {files(gz)|url(gz)}

Usage

This tool merges large files using one or more columns as the common key for each file. It temporarily stores and sorts its data using BerkeleyDB. There can be two or more files.
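The idea can be sketched in a few lines of Java. This is a toy illustration only, not BigJoin's actual code: in-memory maps stand in for the BerkeleyDB store used by the real tool, and all names here are hypothetical. It mimics the "-L" behavior, emitting NULL when a key is missing from one of the files:

```java
import java.util.*;

/** Toy sketch of the BigJoin idea: index each input by its key columns,
 *  then emit one row per key with the matching line from every file.
 *  (In-memory maps stand in for the BerkeleyDB store of the real tool;
 *  duplicate keys within a file are not handled in this sketch.) */
public class ToyJoin {
    /** builds the composite key from 1-based column indexes */
    static String key(String line, int[] cols) {
        String[] tokens = line.split("\t");
        StringBuilder k = new StringBuilder();
        for (int c : cols) k.append(tokens[c - 1]).append('|');
        return k.toString();
    }

    /** joins several "files" (lists of tab-delimited lines) on key columns,
     *  filling missing sides with nullValue, output sorted by key */
    static List<String> join(int[] cols, String nullValue, List<List<String>> files) {
        List<Map<String, String>> maps = new ArrayList<>(); // one key->line map per file
        TreeSet<String> allKeys = new TreeSet<>();           // keys kept sorted
        for (List<String> file : files) {
            Map<String, String> m = new HashMap<>();
            for (String line : file) {
                String k = key(line, cols);
                m.put(k, line);
                allKeys.add(k);
            }
            maps.add(m);
        }
        List<String> out = new ArrayList<>();
        for (String k : allKeys) {
            StringBuilder row = new StringBuilder();
            for (Map<String, String> m : maps) {
                if (row.length() > 0) row.append('\t');
                row.append(m.getOrDefault(k, nullValue));
            }
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("chr1\t100\tgeneA", "chr1\t200\tgeneB");
        List<String> b = Arrays.asList("chr1\t100\tNM_1");
        // join on columns 1 and 2 (chrom, position)
        join(new int[]{1, 2}, "NULL", Arrays.asList(a, b)).forEach(System.out::println);
    }
}
```

The real tool additionally handles duplicate keys, gzipped/URL inputs, and spills data to disk via BerkeleyDB instead of holding everything in memory.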

Requirements


Download

Download bigjoin.jar at https://code.google.com/p/code915/downloads/list

Source code

https://code.google.com/p/code915/source/browse/trunk/tools/src/java/fr/inserm/umr915/tools/BigJoin.java

Options

  • -h help; this screen.
  • -i case-insensitive
  • -u expect unique keys per file (faster)
  • -d regex-delim (default: tab)
  • -c key columns, separated by commas (first column is '1')
  • -g ignore empty trim(key)
  • -bdb bdb directory (default: ${HOME})
  • -t delimiter (default: tab)
  • -s smart sorting (slower) (default: true)
  • -z zip values (slower, uses less space) (default: false)
  • -L do a 'left join' if data is missing
  • -null [string] value if data is missing (default: NULL)
  • -p print key(s) as the very first columns (default: false)
  • --log-level level; optional, one of java.util.logging.Level

Examples


The following example joins ensGene.txt.gz and refGene.txt.gz using the chromosome/start/end as the common key. If a key is missing from one of the files, the missing values are replaced by "NULL". The output is ordered by chrom/start/end:
java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -s \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz |\
grep -v NULL | head -n 20 | tr " " ";"
(...)
594;ENST00000323275;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;ENSG00000127054;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,;594;NM_017871;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;CPSF3L;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,
594;ENST00000343938;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000215792;cmpl;cmpl;-1,0,2,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000360706;chr1;+;1250005;1254139;1253433;1254099;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000187488;cmpl;cmpl;-1,-1,0,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000338338;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300396,;1299145,1299688,1300033,1300443,;0;ENSG00000175756;cmpl;cmpl;0,1,0,-1,;594;NM_001127229;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300239,;1299145,1299688,1300033,1300443,;0;AURKAIP1;cmpl;cmpl;0,1,0,-1,

Same, count the lines, use default ordering, join the missing values:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

93252 2984064 36481577

real 0m10.229s
user 0m12.697s
sys 0m0.548s


Count the lines, use default ordering, do not join the missing values:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

11519 368608 7775937

real 0m8.966s
user 0m11.925s
sys 0m0.456s


Count the lines, use default ordering, do not join the missing values, zip the data (slower but uses less space):
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -z \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

11519 368608 7775937

real 0m16.631s
user 0m19.181s
sys 0m0.500s


count the lines, use default ordering, join the missing values, assume columns are unique:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -u \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

76958 2462656 27531602

real 0m8.466s
user 0m11.233s
sys 0m0.348s


count the lines, use default ordering, join the missing values, smart sorting:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -s \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

93252 2984064 36481577

real 0m10.018s
user 0m12.353s
sys 0m0.496s


TODO


use some different column indexes for each file.


That's it
Pierre

11 May 2010

Playing with the C API for SQLite: inserting the HapMap data, my notebook.



In this post, I'll insert a HapMap genotype file into a SQLite database using the C API for SQLite. Of course, this is just a test and I would not use SQLite to store this kind of information; moreover, insertion into SQLite was painfully slow here.

The Genotype file

The first line contains the names of the individuals; each of the other rows contains one SNP and the genotype of each individual:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode NA06984 NA06985 NA06986 NA06989 NA06991 NA06993 NA06994 NA06995 NA06997 NA07000 NA07014 NA07019 NA07022 NA07029 NA07031 NA07034 NA07037 NA07045 NA07048 NA07051 NA07055 NA07056 NA07345 NA07346 NA07347 NA07348 NA07349 NA07357 NA07435 NA10830 NA10831 NA10835 NA10836 NA10837 NA10838 NA10839 NA10840 NA10843 NA10845 NA10846 NA10847 NA10850 NA10851 NA10852 NA10853 NA10854 NA10855 NA10856 NA10857 NA10859 NA10860 NA10861 NA10863 NA10864 NA10865 NA11829 NA11830 NA11831 NA11832 NA11839 NA11840 NA11843 NA11881 NA11882 NA11891 NA11892 NA11893 NA11894 NA11917 NA11918 NA11919 NA11920 NA11930 NA11931 NA11992 NA11993 NA11994 NA11995 NA12003 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12056 NA12057 NA12144 NA12145 NA12146 NA12154 NA12155 NA12156 NA12234 NA12236 NA12239 NA12248 NA12249 NA12264 NA12272 NA12273 NA12275 NA12282 NA12283 NA12286 NA12287 NA12335 NA12336 NA12340 NA12341 NA12342 NA12343 NA12344 NA12347 NA12348 NA12375 NA12376 NA12383 NA12386 NA12399 NA12400 NA12413 NA12489 NA12546 NA12707 NA12708 NA12716 NA12717 NA12718 NA12739 NA12740 NA12748 NA12749 NA12750 NA12751 NA12752 NA12753 NA12760 NA12761 NA12762 NA12763 NA12766 NA12767 NA12775 NA12776 NA12777 NA12778 NA12801 NA12802 NA12812 NA12813 NA12814 NA12815 NA12817 NA12818 NA12827 NA12828 NA12829 NA12830 NA12832 NA12842 NA12843 NA12864 NA12865 NA12872 NA12873 NA12874 NA12875 NA12877 NA12878 NA12889 NA12890 NA12891 NA12892
rs6423165 A/G chrY 109805 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8572888:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG AG GG AG AG AG AA AA GG AA AG NN AG AA AG NN AA AG NN GG GG GG GG AA GG GG AG AG AA AG GG AG AG AG AA AG AG AG GG GG AA AA NN GG AG NN AA AG NN GG NN GG GG GG AA AG AA AA AG GG AG AG GG AG AA AG GG AG GG GG AG GG AG GG GG AG GG GG AA NN AG GG AG GG GG GG AG GG GG AA AA GG GG GG NN AG AG AG GG AG AG AA AA GG AA GG AA AG AA AG GG AG GG GG AG AG GG GG AG AG AG NN AG AG AG AG AG NN AA AA AG AG AG AG AG AA AA AG AG AA AG GG AG GG GG AG AG AG AG AG AG AA AG AG AG AG AA AA AG GG GG GG GG AG AG GG GG AG AA GG AG AG GG AG
rs6608381 C/T chrY 109922 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8528859:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT CC CT CT CT TT TT CC TT CT NN CT TT CT NN TT CC NN CC CC CC CC TT CC CC CT CT TT CT CC CT CT CT TT CT CT CT CC CC TT TT NN CC CT NN TT CT NN CC NN CC CC CC TT CT TT TT CT CC CT CT CC CT TT CT CC CT CC CC CT CC CT CC CC CT CC CC TT NN CT CC CT CC CC CC CT CC CC TT TT CC CC CC NN CT CT CT CC CT CT TT TT CC TT CC TT CT TT CT CC CT CC CC CT CT CC CC CT CT CT NN CT CT CT CT CT NN TT TT CT CT CT CT CT TT TT CT CT TT CT CC CT CC CC CT CT CC CT CT CC TT CT CT CT CT TT TT CT CC CC CC CC CT CT CC CC CT TT CC CT CT CC CT
rs6644972 A/G chrY 118624 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6644972:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG AG GG AG AG AG GG AG GG AG NN GG GG AG NN GG GG NN AG AG GG GG GG GG GG GG GG GG AG AA GG GG GG AG GG GG GG GG GG GG GG NN GG AG GG GG GG NN AG NN AG AG AA GG GG GG GG GG GG GG GG GG AG GG GG AA AG GG AG GG GG GG GG GG GG GG AG AG NN AG GG AG GG GG GG AG GG GG GG GG AA AA AG NN GG GG GG GG GG GG GG AG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG NN GG GG AG GG GG AG GG AG GG AG AA GG GG GG GG AG GG GG GG AG GG AG GG GG GG AG GG GG AG GG AG GG GG GG GG GG GG AG GG GG GG GG GG GG GG GG
rs6655397 A/G chrY 141935 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6655397:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG GG GG AG AG GG GG GG GG GG NN GG GG NN NN GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN AG GG GG GG GG NN AG NN GG GG GG NN AG GG GG GG AG GG GG AG GG GG GG GG AG GG AG GG GG GG GG AG AG GG GG GG NN AG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG AG GG AG NN GG GG GG GG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG GG
rs6603172 C/T chrY 142439 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8527413:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN CT TT TT TT TT NN CT NN TT TT TT TT CT TT TT TT CT TT TT CT TT TT TT TT CT TT CT TT TT TT TT CT CT TT TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT CT NN TT TT TT TT TT TT TT TT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT
rs6644970 A/G chrY 142664 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-4207883:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ AG AA AA AG AG GG AG AG AA AA GG NN AG AA GG NN AG AG NN AG AG AG GG AG AG AG AA AG GG GG AA AG AG AG AG AG AA AA AG GG AG GG NN GG AG NN AG AG NN AG NN AA AG AA AA AG GG GG AG AG AG AG AG AG AA AG AA AG GG GG AG AG AA GG AG GG AA AG AG NN AG AG GG AG GG GG AG GG GG AA AG AA AG GG NN GG AA GG AG AG GG GG AA GG AA AG AG GG GG AG AG GG AG AG AG GG AG AG AG AG AG NN AG GG AG AG GG NN AA AG AA AA GG AG AG AG GG AG AG AG GG AG AA AG GG AA AA AG GG AA AG AG GG AA GG AA AA AG AG AG AG AG GG AG GG GG AG AG AG AA AA GG AG AG
rs11019 A/G chrY 159978 + ncbi_B36 affymetrix urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2 urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-4207841:2 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ NN AG NN NN AG GG AA NN NN AG NN AG GG AA NN AG NN NN GG NN GG AG GG NN NN GG NN GG NN GG AG GG NN NN GG AG NN NN NN AG AG NN NN NN NN AA AG GG GG AA AG GG GG NN NN GG GG AG GG AG AG NN AA AA NN NN NN NN NN NN NN NN NN NN GG AG GG AG AG GG AG AG GG GG NN GG GG AG AG GG GG AG AG AG AG AG GG GG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN AG NN AA GG NN NN GG NN NN GG GG AG AA AG GG AA AA NN NN NN NN NN NN AG GG GG AG GG AG NN NN NN NN NN NN NN NN NN GG GG AG AG GG GG NN GG NN NN GG AG
rs1132353 C/T chrY 160116 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs1132353:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT TT CT CT TT NN CT TT CT CC NN TT CC CT NN TT CT NN CT TT CT CT TT TT CC TT CT CC TT CT TT TT TT TT CT TT TT TT CT CT TT NN TT CT NN CT TT NN CC NN TT TT CC NN TT TT CT TT CT CT CC CC CC CT CT CC CT CC CT CT TT TT TT TT CT TT CT CT NN CT CT TT TT TT TT TT CC CT TT TT CT CT CT NN CT TT TT TT TT TT TT TT TT TT TT CT CT CC TT CT TT CT CT CT CT CT CC CC CC CC TT CC CT CT CT CC NN TT CT TT TT CT TT CT CT CC CT TT CC CC TT TT TT CT TT TT CT TT TT CT TT CT TT TT CT TT CT TT CT CT CT TT TT CT CT TT TT CT TT TT CC TT CT
rs35047434 C/T chrY 169118 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs35047434:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT CT TT TT CT TT NN CT TT TT TT TT NN TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT CT TT TT TT CT TT CT CT TT TT TT TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT NN TT CT TT TT TT CT CT TT TT TT CT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT CT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT CT TT TT TT CT TT TT TT TT CT TT TT TT CC TT TT TT TT TT TT
(...)

Opening the SQLite Database

if(sqlite3_open(env->fileout,&(env->connection))!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot open sqlite file %s.\n",env->fileout);
    exit(EXIT_FAILURE);
    }

Closing the SQLite Database

sqlite3_close(env->connection);

Creating a table

if(sqlite3_exec(env->connection,
    "create table markers(id INTEGER PRIMARY KEY,name varchar(20) unique,chrom varchar(10),position integer)",
    NULL,NULL,&error
    )!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot create table markers: %s\n",error);
    exit(EXIT_FAILURE);
    }

Filling a prepared statement and inserting some data

if(sqlite3_prepare(env->connection,
    "insert into markers(name,chrom,position) values(?,?,?)",
    -1,&pstmt_insert_marker,NULL
    )!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot compile insert into markers.\n");
    exit(EXIT_FAILURE);
    }
(...)

if(sqlite3_bind_text(
    pstmt_insert_marker,1,
    name,
    strlen(name),
    NULL)!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot bind marker's name.\n");
    exit(EXIT_FAILURE);
    }
if(sqlite3_bind_text(
    pstmt_insert_marker,2,
    chrom,
    strlen(chrom),
    NULL)!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot bind marker's chrom.\n");
    exit(EXIT_FAILURE);
    }
if(sqlite3_bind_int(
    pstmt_insert_marker,3,
    genomic_position
    )!=SQLITE_OK)
    {
    fprintf(stderr,"Cannot bind marker's position.\n");
    exit(EXIT_FAILURE);
    }

if (sqlite3_step(pstmt_insert_marker) != SQLITE_DONE) {
fprintf(stderr,"Could not step insert marker.\n");
exit(EXIT_FAILURE);
}

marker_id = sqlite3_last_insert_rowid(env->connection);

sqlite3_reset(pstmt_insert_marker);

Creating a custom function for sqlite

Here I implemented a custom function named homozygous: it returns true if a genotype is a string of two identical characters (other than 'N').
void is_homozygous(
sqlite3_context* context, /* context */
int argc, /* number of arguments */
sqlite3_value** argv /* arguments */
)
{
/* one text arg */
if(argc==1 &&
sqlite3_value_type(argv[0])==SQLITE_TEXT)
{
const char *alleles=(const char*)sqlite3_value_text(argv[0]);
if( alleles!=NULL &&
strlen(alleles)==2 &&
alleles[0]==alleles[1] &&
alleles[0]!='N')
{
/* set result to TRUE */
sqlite3_result_int(context,1);
return;
}
}
/* set result to FALSE */
sqlite3_result_int(context,0);
}

Telling sqlite about the new custom function

sqlite3_create_function(env->connection,
"homozygous",/* function name */
1,/* number of args */
SQLITE_UTF8,/* encoding */
NULL,
is_homozygous,/* the implementation callback */
NULL,
NULL
);

Implementing a callback for a simple 'SELECT' query

The following callback prints the results to stdout.
static int callback_select_homozygous(
void* notUsed,
int argc,/* number of args */
char** argv,/* arguments as string */
char** columns /* labels for the columns */
)
{
int i;
/* prints all the columns to stdout*/
for(i=0; i<argc; i++)
{
printf( "\"%s\": \"%s\"\n",
columns[i],
argv[i]!=NULL? argv[i] : "NULL"
);
}
printf("\n");
return 0;
}

Executing a SELECT query

Here, we select all the homozygous genotypes:
if(sqlite3_exec(
env->connection, /* An open database */
"select individuals.name,"
"markers.name,"
"markers.chrom,"
"markers.position,"
"genotypes.alleles "
"from "
"genotypes,"
"markers,"
"individuals "
"where "
"markers.id=genotypes.marker_id and "
"individuals.id=genotypes.individual_id and "
"homozygous(genotypes.alleles)=1"
, /* SQL to be evaluated */
callback_select_homozygous, /* Callback function */
NULL, /* 1st argument to callback */
&errormsg /* Error msg written here */
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot select.\n");
exit(EXIT_FAILURE);
}

The full source code



Executing the C program

[LOG]inserted marker ID.1
[LOG]inserted marker ID.2
[LOG]inserted marker ID.3
[LOG]inserted marker ID.4
[LOG]inserted marker ID.5
(... WAIT !!! )
"name": "NA12878"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12890"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12892"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"


That's it

Pierre

04 May 2010

Mapability vs Sequencing Depth

I'm currently trying to see if there is a correlation between the MAPABILITY and the SEQUENCING DEPTH.

The MAPABILITY, generated for the ENCODE project, is the level of sequence uniqueness of the reference hg18 genome. Scores are normalized between 0 and 1, with 1 representing a completely unique sequence and 0 a sequence that occurs more than four times in the genome. A score of 0.5 indicates the sequence occurs exactly twice; likewise, 0.33 for three times and 0.25 for four times.

The SEQUENCING DEPTH represents the (often average) number of nucleotides contributing to a portion of an NGS assembly. On a genome basis, it means that, on average, each base has been sequenced a certain number of times.

I've been trying to plot the relation between the mapability*100.0 (X-axis) and the sequencing depth (log Y-axis, cut-off: 100) for my data on the human chr1. The radius of the circles is proportional to the number of times a pair (mapability/depth) was found.
Hum... as far as I can see, nothing really obvious is visible here, but on the other hand, I'm not sure I used the best visualization...


(BTW the image below was created using a base64 encoding: src="data:image/png;base64,iVBORw0KG...". I hope it will be displayed within blogger.com)

mappability.vs.depth



That's it

Pierre

03 May 2010

My new position at INSERM/UMR915

I'm pleased to announce that I've started a new position as a postdoc in Nantes, France at INSERM/U915, the Thorax Institute, for 3 years. The Institute faces the Loire and works in close collaboration with the neighboring Hotel-Dieu Hospital.


My team is led by Dr Richard Redon, who worked at the Sanger Institute on Copy Number Variations. I'll be part of a project studying the genetics of the Brugada Syndrome using NGS. I'm just starting with this technology, so I'm currently learning how to use all those new tools such as R, MAQ, SAMTOOLS, BWA, etc. biostar.stackexchange.com has been a useful source of information here.

I applied for this position a few days before leaving for Biohackathon 2010 and I started to work in the lab on April 1st. I've been told that my professional network helped me get this job, especially Jan Aerts (my new best friend! ;-) ) (thanks Jan!), who worked with Dr Redon at the Sanger Institute.


More to come... ! :-)

That's it

Pierre

Setting-up a GIT remote repository : my notebook

Here are my notes for setting up a GIT remote repository on ubuntu. It basically requires creating a new user, gituser, on a server. gituser will be used as the central repository and people will be allowed to publish their changes using their public SSH keys. The problems I encountered were due to our internal firewall and to my poor knowledge of SSH.

Initialize the SSH key for 'lindenb'


As user 'lindenb'
mkdir -p .ssh
ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/home/lindenb/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/lindenb/.ssh/id_rsa.
Your public key has been saved in /home/lindenb/.ssh/id_rsa.pub.
The key fingerprint is:
xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx lindenb@yokofakun


more .ssh/id_rsa.pub
ssh-rsa M3d(...)BuQ0J9x== plindenbaum@yahoo.fr

Create a 'gituser' user


As 'root':
/usr/sbin/useradd --shell /usr/bin/git-shell -c "GIT Version-Control-System" -m gituser
cd ~gituser
mkdir .ssh

Add lindenb's public key to the list of gituser's authorized_keys


As 'root'
cat ~lindenb/.ssh/id_rsa.pub >> ~gituser/.ssh/authorized_keys

Change the permissions to 'gituser' in .ssh


As 'root'
chown -R gituser:gituser ~gituser/.ssh

Create a new empty project in gituser


As far as I understand, this should be done for every project.
As 'root'
mkdir -p ~gituser/src/project01.git
cd ~gituser/src/project01.git
git --bare init
Initialized empty Git repository in /home/gituser/src/project01.git
chown -R gituser:gituser ~gituser/src

Create my git project


as 'lindenb'
mkdir /home/lindenb/src/mygitproject
cd /home/lindenb/src/mygitproject
git init
Initialized empty Git repository in /home/lindenb/src/mygitproject/.git/

Add some files into the project


echo -e "hello:\n\tgcc -o \$@ hello.c\n" > Makefile
echo -e '#include<stdio.h>\nint main(int argc,char **argv)\n{\nprintf("Hello.\\n");\nreturn 0;\n}\n' > hello.c

git status
# On branch master
#
# Initial commit
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# Makefile
# hello.c
nothing added to commit but untracked files present (use "git add" to track)


git add .

git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
# (use "git rm --cached <file>..." to unstage)
#
# new file: Makefile
# new file: hello.c
#


git commit -m 'initial commit'
[master (root-commit) b1ed0a2] initial commit
2 files changed, 10 insertions(+), 0 deletions(-)
create mode 100644 Makefile
create mode 100644 hello.c


## modify Makefile
echo -e "\nclean:\n\trm -f hello *.o\n" >> Makefile

git status
# On branch master
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: Makefile
#
no changes added to commit (use "git add" and/or "git commit -a")


git diff
diff --git a/Makefile b/Makefile
index a329f85..7b0f9d1 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,7 @@
hello:
gcc -o $@ hello.c

+
+clean:
+ rm -f hello *.o
+



git commit
# On branch master
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: Makefile
#
no changes added to commit (use "git add" and/or "git commit -a")


git commit -a -m "added clean target in Makefile"
[master 8662812] added clean target in Makefile
1 files changed, 4 insertions(+), 0 deletions(-)


Tell the world, publish the project


git remote add origin gituser@localhost:src/project01.git
git push origin master

ssh: connect to host localhost port 22: Connection refused
fatal: The remote end hung up unexpectedly


#huhoo, I'm missing the ssh server
sudo apt-get install openssh-server
git push origin master
Agent admitted failure to sign using the key.

#I fixed the problem with https://bugs.launchpad.net/ubuntu/+source/gnome-keyring/+bug/201786
export SSH_AUTH_SOCK=0

git push --verbose origin master
Pushing to gituser@yokofakun:src/project01.git
Enter passphrase for key '/home/lindenb/.ssh/id_rsa':
Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (7/7), 674 bytes, done.
Total 7 (delta 0), reused 0 (delta 0)
To gituser@yokofakun:src/project01.git
* [new branch] master -> master
updating local tracking ref 'refs/remotes/origin/master'


Clone the project


cd /home/lindenb/src/
git clone gituser@yokofakun:src/project01.git
Initialized empty Git repository in /home/lindenb/src/project01/.git/
Enter passphrase for key '/home/lindenb/.ssh/id_rsa':
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (7/7), done.


ls -la project01
total 20
drwxr-xr-x 3 lindenb lindenb 4096 2010-05-03 11:42 .
drwxr-xr-x 7 lindenb lindenb 4096 2010-05-03 11:41 ..
drwxr-xr-x 8 lindenb lindenb 4096 2010-05-03 11:42 .git
-rw-r--r-- 1 lindenb lindenb 84 2010-05-03 11:42 hello.c
-rw-r--r-- 1 lindenb lindenb 53 2010-05-03 11:42 Makefile


Do some changes in project01
echo -e "test:hello\n\t./hello\n" >> Makefile

commit the changes:
git commit -a -m "added test target in Makefile"
[master 1de291c] added test target in Makefile
1 files changed, 3 insertions(+), 0 deletions(-)

and tell the world
git push origin master
Enter passphrase for key '/home/lindenb/.ssh/id_rsa':
Counting objects: 5, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 353 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To gituser@yokofakun:src/project01.git
8662812..1de291c master -> master

go back to 'mygitproject' and update the code
cd /home/lindenb/src/mygitproject/
git pull origin master
Enter passphrase for key '/home/lindenb/.ssh/id_rsa':
From gituser@yokofakun:src/project01
* branch master -> FETCH_HEAD
Updating 8662812..1de291c
Fast forward
Makefile | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)




That's it
Pierre