15 December 2006

Java 1.6 Mustang, JAXB and Bioinformatics

JAXB provides a convenient way to bind an XML schema to a representation in Java code. It makes it easy to incorporate XML data and processing functions into Java applications without having to know much about XML itself. The Java Architecture for XML Binding (JAXB) is now included in the new Java 1.6. I wanted to test JAXB to see how it could be used to parse the NCBI TinySeq XML format. First I created an XSD description of a TSeq:

Source tinyseq.xsd
(...)
<xs:element name="TSeqSet">
<xs:annotation>
<xs:documentation>Set of sequences</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element ref="TSeq" maxOccurs="unbounded">
</xs:element>
</xs:sequence>
</xs:complexType>
(...)
I then invoked the binding compiler, xjc.
xjc generates Java classes corresponding to the elements of the schema. It
parsed my tinyseq.xsd schema and created three files:
pierre@linux:~> xjc org/lindenb/sandbox/tinyseq.xsd -d ./ -p org.lindenb.sandbox.tinyseq
parsing a schema...
compiling a schema...
org/lindenb/sandbox/tinyseq/ObjectFactory.java
org/lindenb/sandbox/tinyseq/TSeq.java
org/lindenb/sandbox/tinyseq/TSeqSet.java
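To give an idea of what xjc produces, here is a hand-written approximation of the two data classes. It is only a sketch, not the generated source: the JAXB annotations are left out so that it compiles on its own, the class names carry a Sketch suffix to make that clear, and because the TSeq part of the schema is elided above, the three fields shown (with their names and types) are assumptions. The lazily created list behind getTSeq() is the usual xjc pattern for an element declared with maxOccurs="unbounded".

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written stand-in for the TSeq class (JAXB annotations omitted). */
class TSeqSketch {
    // Assumed fields: the TSeq part of tinyseq.xsd is not shown in the post.
    private String accver;
    private String orgname;
    private String sequence;

    public String getAccver() { return accver; }
    public void setAccver(String value) { this.accver = value; }
    public String getOrgname() { return orgname; }
    public void setOrgname(String value) { this.orgname = value; }
    public String getSequence() { return sequence; }
    public void setSequence(String value) { this.sequence = value; }
}

/** Hand-written stand-in for the TSeqSet class. */
class TSeqSetSketch {
    private List<TSeqSketch> tSeq;

    /** Returns the live list, created on first access; there is no setter. */
    public List<TSeqSketch> getTSeq() {
        if (tSeq == null) {
            tSeq = new ArrayList<TSeqSketch>();
        }
        return tSeq;
    }
}

class GeneratedShapeDemo {
    /** Builds a one-record set by hand, the way application code would. */
    static TSeqSetSketch build() {
        TSeqSketch seq = new TSeqSketch();
        seq.setAccver("CB017399.1");
        seq.setOrgname("Gallus gallus");
        seq.setSequence("GGAAGGGCTGCC"); // first bases only, for brevity
        TSeqSetSketch set = new TSeqSetSketch();
        set.getTSeq().add(seq);
        return set;
    }

    public static void main(String[] args) {
        for (TSeqSketch seq : build().getTSeq()) {
            System.out.println(seq.getAccver() + " " + seq.getOrgname()
                    + " " + seq.getSequence());
        }
    }
}
```

Because the getter hands back the live list, a record is added with getTSeq().add(...) rather than through a setter.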
I then wrote a Java class using the JAXB API to read and then write a TinySeq file.

[source is here]
/** find the JAXB context in the defined path */
JAXBContext jc = JAXBContext.newInstance("org.lindenb.sandbox.tinyseq");
Unmarshaller u = jc.createUnmarshaller();
/** read the sequence */
TSeqSet seqSet = (TSeqSet)u.unmarshal(new FileInputStream("org/lindenb/sandbox/tinyseq.xml"));

Marshaller m = jc.createMarshaller();
/** echo the sequence to stdout */
m.marshal(seqSet, System.out);
Compiling and running:
pierre@linux> javac org/lindenb/sandbox/JAXBTinySeq.java org/lindenb/sandbox/tinyseq/ObjectFactory.java
pierre@linux> java -cp . org.lindenb.sandbox.JAXBTinySeq tinyseq.xml

<?xml version="1.0" ?>
<TSeqSet><TSeqSet>
<TSeq_seqtype value="nucleotide"/>
<TSeq_gi>27592135</TSeq_gi>
<TSeq_accver>CB017399.1</TSeq_accver>
<TSeq_sid>gnl|dbEST|16653996</TSeq_sid>
<TSeq_taxid>9031</TSeq_taxid>
<TSeq_orgname>Gallus gallus</TSeq_orgname>
<TSeq_defline>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 similar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_193795.1| similar to Rotavirus X as></TSeq_defline>
<TSeq_length>671</TSeq_length>
<TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
(...)


That's it: all the classes and methods needed to parse and store the XML were generated by xjc, and everything was ready for direct use.
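For contrast, here is a rough sketch of the kind of code that would otherwise be written by hand, using only the DOM parser that ships with the JDK. The class TinySeqByHand and its two helper methods are hypothetical names made up for this illustration, and the sample record reuses a few values from the output above; it is one possible approach, not a reference implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TinySeqByHand {
    /** Text of the first descendant element with the given tag, or null if absent. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    /** Returns one "accver TAB organism TAB length" line per TSeq element. */
    static List<String> summarize(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Matches elements named exactly "TSeq", not "TSeq_gi" and friends.
            NodeList seqs = doc.getElementsByTagName("TSeq");
            List<String> lines = new ArrayList<String>();
            for (int i = 0; i < seqs.getLength(); i++) {
                Element seq = (Element) seqs.item(i);
                lines.add(childText(seq, "TSeq_accver") + "\t"
                        + childText(seq, "TSeq_orgname") + "\t"
                        + childText(seq, "TSeq_length"));
            }
            return lines;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed XML all end up here.
            throw new IllegalArgumentException("cannot parse TinySeq XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<TSeqSet><TSeq>"
                + "<TSeq_accver>CB017399.1</TSeq_accver>"
                + "<TSeq_orgname>Gallus gallus</TSeq_orgname>"
                + "<TSeq_length>671</TSeq_length>"
                + "</TSeq></TSeqSet>";
        for (String line : summarize(xml)) {
            System.out.println(line);
        }
    }
}
```

Every field needs its own lookup by tag name here, and nothing checks the document against the schema, which is exactly the bookkeeping the generated classes take care of.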

Pierre
