09 January 2013

An XML schema (XSD) for GeneOntology

The GeneOntology can be downloaded as an RDF/XML file from http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz.
Although it is an RDF file, its structure stays the same from one release to the next, which is why it is shipped with a DTD that describes the structure of the document ( http://www.geneontology.org/dtd/go.dtd ).
I've just written an XML schema (XSD) for this RDF file. This schema is available on GitHub at:
https://github.com/lindenb/xsd-sandbox/tree/master/schemas/bio/go.

Validation with xmllint

The RDF file validates successfully against my XSD:
$ curl "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
 gunzip -c | grep -v "<!DOCTYPE " > go.xml

$ xmllint --noout --schema go.xsd go.xml
go.xml validates
Note: the schema leaves out the elements that are defined in the DTD but absent from the RDF file.
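
The same check can also be run from Java with the JDK's javax.xml.validation API. This is only a minimal sketch: the class name ValidateGo and its validate() helper are made up for illustration, and main() assumes that go.xsd and go.xml sit in the current directory unless other paths are given as arguments.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the xsd-sandbox repository.
public class ValidateGo {

    /**
     * Validates an XML document against a W3C XML Schema.
     * Returns null when the document validates, otherwise the message
     * of the first error (an unreadable schema is reported the same way).
     */
    static String validate(Source xsd, Source xml) throws IOException {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(xsd);
            // With no ErrorHandler set, the first error is thrown as a SAXException.
            schema.newValidator().validate(xml);
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default paths; override them on the command line.
        String xsdPath = args.length > 0 ? args[0] : "go.xsd";
        String xmlPath = args.length > 1 ? args[1] : "go.xml";
        // Building the sources from File objects sets a system id, so that
        // relative imports inside the schema can be resolved.
        String error = validate(new StreamSource(new File(xsdPath)),
                                new StreamSource(new File(xmlPath)));
        System.out.println(error == null
            ? xmlPath + " validates"
            : xmlPath + " fails to validate: " + error);
    }
}
```

Returning the error message rather than throwing keeps the helper easy to call from a script or a unit test.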

Code Generation with XJC

XJC can be used to generate the Java classes for this schema:
$ xjc go.xsd
parsing a schema...
compiling a schema...
org/w3/_1999/_02/_22_rdf_syntax_ns_/ObjectFactory.java
org/w3/_1999/_02/_22_rdf_syntax_ns_/RDF.java
org/w3/_1999/_02/_22_rdf_syntax_ns_/package-info.java
org/geneontology/dtds/go/AbstractRelation.java
org/geneontology/dtds/go/Go.java
org/geneontology/dtds/go/IsA.java
org/geneontology/dtds/go/NegativelyRegulates.java
org/geneontology/dtds/go/ObjectFactory.java
org/geneontology/dtds/go/PartOf.java
org/geneontology/dtds/go/PositivelyRegulates.java
org/geneontology/dtds/go/Regulates.java
org/geneontology/dtds/go/package-info.java

Java Parsing

We can now parse the terms of GO with Java without writing a new parser and without any external dependencies. For example, the following code parses the whole ontology and prints it to stdout as XML:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;

public class TestGo
    {
    public static void main(String[] args) throws Exception
        {
        // Both generated packages are listed, separated by a colon.
        JAXBContext jaxbCtxt = JAXBContext.newInstance(
            "org.geneontology.dtds.go:org.w3._1999._02._22_rdf_syntax_ns_");
        Unmarshaller unmarshaller = jaxbCtxt.createUnmarshaller();
        Marshaller marshaller = jaxbCtxt.createMarshaller();
        // Same property as "jaxb.formatted.output": indent the output.
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        // Load the whole ontology into the generated classes...
        Object go = unmarshaller.unmarshal(new File("go.xml"));
        // ...and write it back to stdout as XML.
        marshaller.marshal(go, System.out);
        }
    }
Compile and run:
$ javac TestGo.java \
  org/w3/_1999/_02/_22_rdf_syntax_ns_/ObjectFactory.java \
  org/geneontology/dtds/go/ObjectFactory.java

$ java TestGo  | head -n 100
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<go xmlns="http://www.geneontology.org/dtds/go.dtd#" xmlns:ns2="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <ns2:RDF>
        <term ns2:about="http://www.geneontology.org/go#GO:0000001">
            <accession>GO:0000001</accession>
            <name>mitochondrion inheritance</name>
            <synonym>mitochondrial inheritance</synonym>
            <definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</definition>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0048308"/>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0048311"/>
        </term>
        <term ns2:about="http://www.geneontology.org/go#GO:0000002">
            <accession>GO:0000002</accession>
            <name>mitochondrial genome maintenance</name>
            <definition>The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</definition>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0007005"/>
            <dbxref ns2:parseType="Resource">
                <database_symbol>InterPro</database_symbol>
                <reference>IPR009446</reference>
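
When only a couple of fields are needed, the file can also be streamed with the JDK's StAX API instead of being loaded as a whole. The following is a rough sketch and not the JAXB approach described above: the class name ListGoTerms and the readTerms() helper are made up for illustration, and the code assumes that each term carries its accession before its name, as in the output shown here.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, written only to illustrate the document layout.
public class ListGoTerms {
    // Namespace of the GO elements, as seen in the marshalled output.
    static final String GO_NS = "http://www.geneontology.org/dtds/go.dtd#";

    /** Maps each accession to its term name, in document order. */
    static Map<String, String> readTerms(InputStream in) throws XMLStreamException {
        Map<String, String> names = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String accession = null;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
            if (!GO_NS.equals(reader.getNamespaceURI())) continue;
            String local = reader.getLocalName();
            if ("term".equals(local)) {
                accession = null; // a new term starts
            } else if ("accession".equals(local)) {
                accession = reader.getElementText();
            } else if ("name".equals(local) && accession != null) {
                // Assumption: <accession> precedes <name> inside a term.
                names.put(accession, reader.getElementText());
            }
        }
        reader.close();
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumed default path; override it on the command line.
        String path = args.length > 0 ? args[0] : "go.xml";
        try (InputStream in = new FileInputStream(path)) {
            for (Map.Entry<String, String> e : readTerms(in).entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

The trade-off is that this reader knows nothing about the schema, so relations such as is_a would have to be handled by hand, which is exactly the work the generated classes avoid.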

That's it,

Pierre

PS: many thanks to @bdoughan for his help on SO.
