31 January 2012

Inside the Variation Toolkit: Tools for Gene Ontology

GeneOntologyDbManager is a C++ tool that is part of my experimental Variation Toolkit.
This program is a set of tools for GeneOntology, it is based on the sqlite3 library.

Download

Download the sources from Google-Code using subversion:....
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/congig.mk file.

Dependencies

Compilation

Define "SQLITE_LIB" and "SQLITE_CFLAGS" in config.mk (see HowToInstall )
$ cd variationtoolkit/src/
$ make ../bin/godbmgr 

if ! [ -z "-lsqlite3" ] ;then g++ -o ../bin/godbmgr godatabasemgr.cpp xsqlite.cpp application.o xstream.o xxml.o -g `xml2-config --cflags `  /usr/include/sqlite3.h   -lz -lsqlite3 `xml2-config  --libs` ; else g++ -o ../bin/godbmgr godatabasemgr.cpp  -DNOSQLITE -O3 -Wall  ; fi

Usage

godbmgr (program-name) -f database.sqlite [options] (file1.vcf file2... | stdin )

Sub-Program: loadrdf

Loads the RDF/XML GO database (http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz) into the sqlite3 database.

Usage

godbmgr loadrdf -f database.sqlite (stdin|file)

Options

  • -f (filename) the sqlite3 database

Example

$ curl -s "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
  gunzip -c |\
  godbmgr loadrdf -f database.sqlite
list the content of the database:
$ sqlite3 -separator '  ' -header  database.sqlite 'select * from TERM where acn="GO:0000007"'
acn xml
GO:0000007 <go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000007">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000007</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">low-affinity zinc ion transmembrane transporter activity</go:name>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Catalysis of the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+ = Zn2+, probably powered by proton motive force. In low affinity transport the transporter is able to bind the solute only if it is present at very high concentrations.</go:definition>
            <go:is_a xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="http://www.geneontology.org/go#GO:0005385"/>
        </go:term>


$ sqlite3 -separator '  ' -header  database.sqlite 'select * from TERM2REL where acn="GO:0000007"'
acn rel target
GO:0000007 is_a GO:0005385

Sub-Program: loadgoa

inserts the database for GOA into a sqlite3 database (e.g: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz)

Usage

godbmgr loadgoa -f database.sqlite (stdin|file)

Options

  • -f (filename) the sqlite3 database

Examples

$  curl -s "ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz" |\
     gunzip -c |\
     godbmgr loadgoa -f database.sqlite
list the content of the database:
$ sqlite3 -line   database.sqlite 'select * from GOA where term="GO:0005385" limit 2' 
              DB = UniProtKB
    DB_Object_ID = B3KU87
DB_Object_Symbol = SLC30A6
            term = GO:0005385
  DB_Object_Name = cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA
         Synonym = B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49
  DB_Object_Type = protein

              DB = UniProtKB
    DB_Object_ID = B5MCR8
DB_Object_Symbol = SLC30A6
            term = GO:0005385
  DB_Object_Name = Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b
         Synonym = B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292
  DB_Object_Type = protein

Sub-Program: desc

print the descendants (children) of a given GO node.

Usage

godbmgr desc -f db.sqlite [options] term1 term2 ... termn

Options

Examples

Default output

$ godbmgr desc -f database.sqlite "GO:0005385"
GO:0000006
GO:0000007
GO:0005385
GO:0015341
GO:0015633
GO:0016463
GO:0022883

xml/rdf output

$ godbmgr desc -f database.sqlite  -t xml "GO:0005385" | head

<go:go xmlns:go='http://www.geneontology.org/dtds/go.dtd#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:RDF>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000006">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000006</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">high affinity zinc uptake transmembrane transporter activity</go:name>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Catalysis of the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+(out) = Zn2+(in), probably powered by proton motive force. In high affinity transport the transporter is able to bind the solute even if it is only present at very low concentrations.</go:definition>
            <go:is_a xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="http://www.geneontology.org/go#GO:0005385"/>
        </go:term>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000007">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000007</go:accession>

GOA output

$godbmgr desc -f database.sqlite -t goa  "GO:0005385"

UniProtKB B3KU87 SLC30A6 GO:0005385 cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49 protein
UniProtKB B5MCR8 SLC30A6 GO:0005385 Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292 protein
(..)
UniProtKB Q99726 SLC30A3 GO:0015633 Zinc transporter 3 ZNT3_HUMAN|ZNT3|SLC30A3|IPI00293793|Q8TC03protein

TSV output

$ godbmgr desc -f database.sqlite  -t tsv "GO:0022857" |\
    cut -c 1-100 |\
    head
#go:accession go.name go.def
GO:0000006 high affinity zinc uptake transmembrane transporter activity Catalysis of the transfer of
GO:0000007 low-affinity zinc ion transmembrane transporter activity Catalysis of the transfer of a s
GO:0000064 L-ornithine transmembrane transporter activity Catalysis of the transfer of L-ornithine f
GO:0000095 S-adenosylmethionine transmembrane transporter activity Catalysis of the transfer of S-ad
GO:0000099 sulfur amino acid transmembrane transporter activity Catalysis of the transfer of sulfur 
GO:0000100 S-methylmethionine transmembrane transporter activity Catalysis of the transfer of S-meth
GO:0000102 L-methionine secondary active transmembrane transporter activity Catalysis of the transfe
GO:0000227 oxaloacetate secondary active transmembrane transporter activity Catalysis of the transfe
GO:0000269 toxin export channel activity Enables the energy independent passage of toxins, sized les
(...)

Sub-Program: asc

prints the ascendants (parents) of a given node.

Usage

godbmgr asc -f db.sqlite [options] term1 term2 ... termn

Options

Examples

Default output

$ godbmgr asc -f database.sqlite "GO:0022857"
GO:0003674
GO:0005215
GO:0022857
all

xml/rdf output

$ godbmgr asc -f database.sqlite  -t xml "GO:0022857" | head

<go:go xmlns:go='http://www.geneontology.org/dtds/go.dtd#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:RDF>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0003674">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0003674</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular_function</go:name>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0005554</go:synonym>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular function</go:synonym>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular function unknown</go:synonym>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Elemental activities, such as catalysis or binding, describing the actions of a gene product at the molecular level. A given gene product may exhibit one or more molecular functions.</go:definition>
            <go:comment xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. Note that when this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code ND, no data, is used to indicate this.</go:comment>

GOA output

$godbmgr desc -f database.sqlite -t goa  "GO:0005385"

UniProtKB B3KU87 SLC30A6 GO:0005385 cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49 protein
UniProtKB B5MCR8 SLC30A6 GO:0005385 Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292 protein
(..)
UniProtKB Q99726 SLC30A3 GO:0015633 Zinc transporter 3 ZNT3_HUMAN|ZNT3|SLC30A3|IPI00293793|Q8TC03protein

TSV output

$ godbmgr asc -f database.sqlite  -t tsv "GO:0022857"   
#go:accession go.name go.def
GO:0003674 molecular_function Elemental activities, such as catalysis or binding, describing the actions of a gene product at the molecular level. A given gene product may exhibit one or more molecular functions.
GO:0005215 transporter activity Enables the directed movement of substances (such as macromolecules, small molecules, ions) into, out of or within a cell, or between cells.
GO:0022857 transmembrane transporter activity Enables the transfer of a substance from one side of a membrane to the other.
all all .

Sub-program: goa

Annotate a TSV file with the GOA annotation.

Usage

godbmgr goa -f db.sqlite [options] (stdin|files)

Options

  • -f (filename) the sqlite3 database
  • -c (column index) REQUIRED. The observed column.

Example

$ echo -e "#MyGene\nHello\nNOTCH2" |\
   godbmgr goa -c 1 -f database.sqlite  |\
   head -n 4 |\
   verticalize

>>> 2
$1 #MyGene                Hello
$2 DB                     .
$3 DB_Object_ID           .
$4 DB_Object_Symbol       .
$5 term                   .
$6 DB_Object_Name,Synonym .
$7 DB_Object_Type         .
$8   ???                    .
<<< 2

>>> 3
$1 #MyGene                NOTCH2
$2 DB                     UniProtKB
$3 DB_Object_ID           Q04721
$4 DB_Object_Symbol       NOTCH2
$5 term                   GO:0001709
$6 DB_Object_Name,Synonym Neurogenic locus notch homolog protein 2
$7 DB_Object_Type         NOTC2_HUMAN|NOTCH2|IPI00297655|Q5T3X7|Q99734|Q9H240
$8   ???                    protein
<<< 3

>>> 4
$1 #MyGene                NOTCH2
$2 DB                     UniProtKB
$3 DB_Object_ID           Q04721
$4 DB_Object_Symbol       NOTCH2
$5 term                   GO:0004872
$6 DB_Object_Name,Synonym Neurogenic locus notch homolog protein 2
$7 DB_Object_Type         NOTC2_HUMAN|NOTCH2|IPI00297655|Q5T3X7|Q99734|Q9H240
$8   ???                    protein

Sub-Program: grep

filters the line having an identifier (gene...) that is a children of a given GO term.

Usage

godbmgr grep -f db.sqlite [options] (stdin|files)

Options

  • -f (filename) the sqlite3 database
  • -c (column index) REQUIRED. The column containing a GO:acn
  • -v inverse selection
  • -t (GO:acn) add a GO term in the filter (One is REQUIRED).
  • -r (rel) add a go relationship (http://www.obofoundry.org/ro/) (OPTIONAL, default: it adds "is_a").

Example

$ 
$ echo -e "#MyACN\nGO:0003674\nGO:0001618\n" |\
  godbmgr grep -f database.sqlite -c 1 -t GO:0004872 -t GO:0004879 

#MyACN
GO:0001618
$ echo -e "#MyACN\nGO:0003674\nGO:0001618\n" |\
  godbmgr grep -f database.sqlite -c 1 -t GO:0004872 -t GO:0004879 -v

#MyACN
GO:0003674


That's it,

Pierre

28 January 2012

Inside the variation toolkit: VCF2XML

vcf2xml is C++ tool that is part of my Variation Toolkit.
It transforms a "Variant Call Format document" to XML, so it can be later processed with xslt, xquery, etc...

Dependencies

Download

Download the sources from Google-Code using subversion:....
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/congig.mk file.

Compiling:

$ cd variationtoolkit/src/
$ make ../bin/vcf2xml

g++ -o ../bin/vcf2xml vcf2xml.cpp application.o -O3 -Wall `xml2-config --cflags --libs` -lz

Usage:

vcf2xml (file.vcf | stdin)

Example:

$ vcf2xml input.vcf | xmllint --format -

<?xml version="1.0" encoding="UTF-8"?>
<vcf>
  <head>
    <meta key="fileformat">VCFv4.1</meta>
    <meta key="samtoolsVersion">0.1.17 (r973:277)</meta>
    <infos>
      <info>
        <id>DP</id>
        <number>1</number>
        <type>Integer</type>
        <description>Raw read depth</description>
      </info>
      <info>
        <id>DP4</id>
        <number>4</number>
        <type>Integer</type>
        <description># high-quality ref-forward bases</description>
      </info>
      <info>
        <id>MQ</id>
(...)
      </calls>
    </variation>
    <variation>
      <chrom>chr1</chrom>
      <pos>112697</pos>
      <ref>T</ref>
      <alt>G</alt>
      <qual>10.4</qual>
      <infos>
        <info key="DP">1</info>
        <info key="AF1">1</info>
        <info key="AC1">2</info>
        <info key="DP4">0,0,0,1</info>
        <info key="MQ">60</info>
        <info key="FQ">-30</info>
      </infos>
      <calls>
        <call sample="input.bam">
          <prop key="GT">1/1</prop>
          <prop key="PL">40,3,0</prop>
          <prop key="GQ">5</prop>
        </call>
      </calls>
    </variation>
  </body>
</vcf>
That's it,
Pierre

Insert your VCFs in a sqlite database.

vcf2sqlite is C++ tool that is part of my Variation Toolkit.
It inserts a "Variant Call Format document" (VCF) into a sqlite3 database.

Download

Download the sources from Google-Code using subversion:....
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/congig.mk file.

Dependencies

http://www.sqlite.org/ : libraries and headers for sqlite3.

Compilation

Define "SQLITE_LIB" and "SQLITE_CFLAGS" in config.mk (see HowToInstall )
$ cd variationtoolkit/src/
$ make ../bin/vcf2sqlite 

if ! [ -z "$(SQLITE_LIB)" ] ;then g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp xsqlite.cpp application.o -O3 -Wall -lz   ; else g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp  -DNOSQLITE -O3 -Wall  ; fi

Usage

vcf2sqlite -f database.sqlite (file1.vcf file2... | stdin )

Options

  • -f (file) sqlite3 database (REQUIRED).

Schema


Example:

$ vcf2sqlite -f db.sqlite file.vcf
$ sqlite3 -line db.sqlite  "select * from VCFCALL LIMIT 4"

       id = 1
   nIndex = 0
vcfrow_id = 1
sample_id = 1
     prop = GT
    value = 1/1

       id = 2
   nIndex = 1
vcfrow_id = 1
sample_id = 1
     prop = PL
    value = 46,6,0

       id = 3
   nIndex = 2
vcfrow_id = 1
sample_id = 1
     prop = GQ
    value = 10

       id = 4
   nIndex = 0
vcfrow_id = 2
sample_id = 1
     prop = GT
    value = 1/1

$ sqlite3 -column -header  db.sqlite \
   "select SAMPLE.name,VCFCALL.value,count(*) from VCFCALL,SAMPLE where SAMPLE.id=VCFCALL.sample_id and prop='GT' group by SAMPLE.id,VCFCALL.value"

name         value       count(*)  
-----------  ----------  ----------
rmdup_1.bam  0/1         545       
rmdup_1.bam  1/1         429       
rmdup_2.bam  0/1         625       
rmdup_2.bam  1/1         349       
rmdup_3.bam  0/1         595       
rmdup_3.bam  1/1         379       
rmdup_4.bam  0/1         548       
rmdup_4.bam  1/1         426       
rmdup_5.bam  0/1         564       
rmdup_5.bam  1/1         410       
rmdup_6.bam  0/1         724       
rmdup_6.bam  1/1         250
That's it
Pierre

07 January 2012

A CGI-version of samtools tview.

I've created a lightweight CGI-based web-application for samtools tview. This C++ program named ngsproject.cgi uses the samtools api, it allows any user to visualize all the alignments in a given NGS project. The projects and their BAMS are defined on the server side using a simple XML document. e.g:

<?xml version="1.0"?>
<projects>
 <reference id="hg19">
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.fa</path>
 </reference>
 <bam id="b1">
  <sample>Sample 1</sample>
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.bam</path>
 </bam>
 <bam id="b2">
  <sample>Sample 2</sample>
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.bam</path>
 </bam>
 <project id="1">
  <name>Test 1</name>
  <description>Test</description>
  <bam ref="b1"/>
  <bam ref="b2"/>
  <reference ref="hg19" />
 </project>
 <project id="2">
  <name>Test 2</name>
  <description>Test</description>
  <bam ref="b2"/>
  <reference ref="hg19" />
 </project>
</projects>

Once the CGI has been installed, the user can visualize the reads of each samples.

This tool is available in the variation toolkit at http://code.google.com/p/variationtoolkit/.

That's it.

Pierre

05 January 2012

The Variation Toolkit

During the last weeks, I've worked on an experimental C++ package named The Variation Toolkit (varkit). It was originally designed to provide some command lines equivalent to knime4bio but I've added more tools over time. Some of those tools are very simple-and-stupid ( fasta2tsv) , reinvent the wheel ("numericsplit"), are part of an answer to biostar, are some old tools (e.g. bam2wig) that have been moved to this package, but some others like "samplepersnp", "groupbygene" might be useful to people.
The package is available at : http://code.google.com/p/variationtoolkit/.

Here is the current documentation (05 Jan 2012):




That's it,

Pierre