05 September 2012

Customizing the Java classes generated by XJC for the NCBI

Reminder: XJC is the JAXB (Java Architecture for XML Binding) binding compiler. It generates Java classes from an XML schema and so automates the mapping between XML documents and Java objects:

XSD (aka XML-schema)
+
XJC
=
JAVA Classes

The code generated by XJC lets you:
  • Unmarshal XML content into a Java representation
  • Access and update the Java representation
  • Marshal the Java representation of the XML content into XML content
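
To give an idea of the work XJC saves, here is a minimal hand-written sketch of the "unmarshal" and "access" steps for two fields of a TinySeq record. It uses the JDK's DOM parser rather than JAXB, and the class ManualTSeq and its parse method are hypothetical names that XJC does not produce.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical hand-written stand-in for a generated class (not produced by XJC). */
final class ManualTSeq {
    long gi;
    String defline;

    /** Reads TSeq_gi and TSeq_defline from the first TSeq element of the document. */
    static ManualTSeq parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element tseq = (Element) doc.getElementsByTagName("TSeq").item(0);
            ManualTSeq seq = new ManualTSeq();
            seq.gi = Long.parseLong(text(tseq, "TSeq_gi"));
            seq.defline = text(tseq, "TSeq_defline");
            return seq;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("not a readable TinySeq document", e);
        }
    }

    /** Text content of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

Every new field would need another line of this plumbing, which is exactly the part that the generated classes take care of.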

For example, the following XML-Schema (tinyseq.xsd) describes a TinySeq-XML document returned by the NCBI.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:documentation> XML schema for NCBI tinyseq format</xs:documentation>
  </xs:annotation>
  <xs:complexType name="TSeqSet_t">
    <xs:annotation>
      <xs:documentation>Set of sequences</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element ref="TSeq" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="TSeq_t">
    <xs:annotation>
      <xs:documentation>A Tiny Sequence</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="TSeq_seqtype">
        <xs:complexType>
          <xs:attribute name="value">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:enumeration value="nucleotide"/>
                <xs:enumeration value="protein"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:complexType>
      </xs:element>
      <xs:element name="TSeq_gi" type="xs:long"/>
      <xs:element name="TSeq_accver" type="xs:string"/>
      <xs:element name="TSeq_sid" type="xs:string"/>
      <xs:element name="TSeq_taxid" type="xs:long"/>
      <xs:element name="TSeq_orgname" type="xs:string"/>
      <xs:element name="TSeq_defline" type="xs:string"/>
      <xs:element name="TSeq_length" type="xs:nonNegativeInteger"/>
      <xs:element name="TSeq_sequence" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="TSeqSet" type="TSeqSet_t"/>
  <xs:element name="TSeq" type="TSeq_t"/>
</xs:schema>

This xml-schema can be compiled with XJC:

$ ${JAVA_HOME}/bin/xjc -d . -p generated tinyseq.xsd
parsing a schema...
compiling a schema...
generated/ObjectFactory.java
generated/TSeqSetT.java
generated/TSeqT.java

$ more generated/TSeqT.java

package generated;
(...)
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "TSeq_t", propOrder = {  "tSeqSeqtype",  "tSeqGi",  "tSeqAccver","tSeqSid",(...),"tSeqSequence"})
public class TSeqT {
    @XmlElement(name = "TSeq_seqtype", required = true)
    protected TSeqT.TSeqSeqtype tSeqSeqtype;
    @XmlElement(name = "TSeq_gi")
    protected long tSeqGi;
    @XmlElement(name = "TSeq_accver", required = true)
    protected String tSeqAccver;
    @XmlElement(name = "TSeq_sid", required = true)
    protected String tSeqSid;
    @XmlElement(name = "TSeq_taxid")
    protected long tSeqTaxid;
    @XmlElement(name = "TSeq_orgname", required = true)
    protected String tSeqOrgname;
    @XmlElement(name = "TSeq_defline", required = true)
    protected String tSeqDefline;
    @XmlElement(name = "TSeq_length", required = true)
    @XmlSchemaType(name = "nonNegativeInteger")
    protected BigInteger tSeqLength;
    @XmlElement(name = "TSeq_sequence", required = true)
    protected String tSeqSequence;
(...)
    }

However, XJC does not generate standard Java methods such as 'hashCode', 'equals' or 'toString', and by default it offers no way to add custom methods to the generated classes.
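
For reference, this is roughly what those missing methods look like when written by hand on a class identified by a numeric 'gi'. GiKey is a hypothetical stand-in, not one of the generated classes, and identifying a record by its gi alone is an assumption of this sketch.

```java
/** Hypothetical stand-in for a generated class; only the gi field is modelled. */
final class GiKey {
    private final long gi;

    GiKey(long gi) {
        this.gi = gi;
    }

    long getGi() {
        return gi;
    }

    /** Two keys are equal when they are of the same class and carry the same gi. */
    @Override
    public boolean equals(Object o) {
        if (o == this) return true;
        if (o == null || o.getClass() != getClass()) return false;
        return gi == ((GiKey) o).gi;
    }

    /** Folds the 64-bit gi into 32 bits by XOR-ing its high and low halves. */
    @Override
    public int hashCode() {
        return (int) (gi ^ (gi >>> 32));
    }

    @Override
    public String toString() {
        return "gi:" + gi;
    }
}
```

Without the two overrides, Java falls back to identity comparison, so two objects unmarshalled from the same record would count as different keys in a HashSet or a HashMap.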

Fortunately, the standard distribution of XJC ships with a plugin, enabled with -Xinject-code, which injects custom code into the classes generated by XJC.

XSD (aka XML-schema)
+
java xml binding (jxb)
+
XJC
=
Customized JAVA Classes

For example, to add a toString method to the class TSeqT, we write the following "java xml binding file" (jxb), which customizes how XJC compiles the initial XML schema:

<?xml version="1.0" encoding="UTF-8"?>
<jxb:bindings 
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xjc="http://java.sun.com/xml/ns/jaxb/xjc"
xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
xmlns:ci="http://jaxb.dev.java.net/plugin/code-injector"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
jxb:extensionBindingPrefixes="ci"
jxb:version="2.1"

>

<jxb:bindings schemaLocation="tinyseq.xsd">
        <!-- here we use an XPath expression to tell XJC which part
 of the XML schema we want to customize -->
 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeq_t']">
  <ci:code>

 /** toString : returns the gi and the defline  */
 public String toString()
  {
  return "gi:"+getTSeqGi()+"|"+getTSeqDefline();
  }

</ci:code>
 </jxb:bindings>
</jxb:bindings>

</jxb:bindings>

Below is a larger JXB file, 'tinyseq.jxb', which injects the following methods:
  • an 'equals' method for TSeqT
  • a 'hashCode' method for TSeqT
  • a 'toString' method for TSeqT
  • a 'printAsFasta' method for TSeqT
  • a static 'getTSeqSetbyId' method for TSeqSetT, which fetches the TinySeq record for a given 'gi' from the NCBI
  • a 'main' method for TSeqSetT, which loops over a list of 'gi's, fetches the sequences (using NCBI-EFetch) and prints them as FASTA

<?xml version="1.0" encoding="UTF-8"?>
<jxb:bindings 
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xjc="http://java.sun.com/xml/ns/jaxb/xjc"
xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
xmlns:ci="http://jaxb.dev.java.net/plugin/code-injector"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
jxb:extensionBindingPrefixes="ci"
jxb:version="2.1"

>

<jxb:bindings schemaLocation="tinyseq.xsd">
 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeq_t']">
  <ci:code>
 /* print this sequence as fasta */
 public void printAsFasta(java.io.PrintStream out)
  {
  String s=getTSeqSequence();
  out.print(">gi:"+getTSeqGi()+"|"+ getTSeqAccver() +"|"+getTSeqDefline());
  for(int i=0;i &lt; s.length();++i)
   {
   if(i%60==0) out.println();
   out.print(s.charAt(i));
   }
  out.println();
  }

 /** equals: two  TSeq are equal if they have the same gi */
 @Override
 public boolean equals(Object o)
  {
  if(o==this) return true;
  if(o==null || o.getClass()!=this.getClass()) return false;
  return this.getTSeqGi()==TSeqT.class.cast(o).getTSeqGi();
  }

 /** hashCode : use gi */
 @Override
 public int hashCode()
  {
  return  (int)(this.getTSeqGi()^(this.getTSeqGi()>>>32));
  }

 /** toString : returns the gi and the defline  */
 public String toString()
  {
  return "gi:"+getTSeqGi()+"|"+getTSeqDefline();
  }

</ci:code>
 </jxb:bindings>




 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeqSet_t']">
  <ci:code>
 /** get TSeqSetT from a given gi */
 public static TSeqSetT getTSeqSetbyId(long gi)
  throws javax.xml.bind.JAXBException, javax.xml.bind.UnmarshalException , java.io.IOException
  {
  /** create a JAXB context for the two generated classes */
  javax.xml.bind.JAXBContext jc = javax.xml.bind.JAXBContext.newInstance(TSeqSetT.class,TSeqT.class);
  javax.xml.bind.Unmarshaller u = jc.createUnmarshaller();
  String uri="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&amp;rettype=fasta&amp;retmode=xml&amp;id="+gi;
  /** read the sequence */
  return u.unmarshal(new javax.xml.transform.stream.StreamSource(uri),TSeqSetT.class).getValue();
  }

 /** main: takes a list of gi and prints the sequences as fasta */
 public static void main(String args[]) throws Exception
  {
  for(int optind=0;optind &lt; args.length;++optind)
   {
   TSeqSetT tss=TSeqSetT.getTSeqSetbyId(Long.parseLong(args[optind]));
   for(generated.TSeqT seq:tss.getTSeq()) seq.printAsFasta(System.out);
   }
  }

</ci:code>

</jxb:bindings>

</jxb:bindings>

</jxb:bindings>
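
One detail of the binding file above is easy to miss: because the injected Java sits inside an XML document, '<' is written as '&lt;' and '&' as '&amp;'. The XML parser decodes these entities when XJC reads the file, so the generated source contains the plain characters. Here is a minimal sketch of how the URL-building code reads as ordinary Java; EFetchUrl and forGi are hypothetical names used only for this illustration.

```java
/** Hypothetical helper mirroring the URL built inside getTSeqSetbyId. */
final class EFetchUrl {
    static final String BASE =
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    /** Builds the EFetch URL that returns one record as TinySeq XML. */
    static String forGi(long gi) {
        // Inside the .jxb file every '&' of this literal is written as '&amp;'.
        return BASE + "?db=nucleotide&rettype=fasta&retmode=xml&id=" + gi;
    }

    /** Builds one URL per identifier; the '<' here is '&lt;' inside the .jxb file. */
    static String[] forAll(long[] gis) {
        String[] urls = new String[gis.length];
        for (int i = 0; i < gis.length; ++i) {
            urls[i] = forGi(gis[i]);
        }
        return urls;
    }
}
```

The sketch only builds strings and never opens a connection, so it can be tried without network access.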

Compile the schema with the binding file, compile the Java classes and run the program:

$ xjc -target 2.1 -verbose -Xinject-code -extension -d . -p generated -b tinyseq.jxb tinyseq.xsd
parsing a schema...
compiling a schema...
[INFO] generating code
unknown location

generated/ObjectFactory.java
generated/TSeqSetT.java
generated/TSeqT.java

$ javac generated/*.java
$ java generated.TSeqSetT 25 26 27
>gi:25|X53813.1|Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTG
GGGGTCCAGCCATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCT
AAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACACTAGAGTCTTAGAA
ATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTAGGGTTAG
GGTTAGGGTACGGCTTAGGGTACTGTTTCGGGGAGGGGTTCAGGTACGGCGTAGGGTATG
GGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAG
GATTAGGGTACGTGTTAGGGTTAGGGTAGGGCTTAGGGTTAGGGTACGTGTTAGGGTTAG
GG
>gi:26|X53814.1|Blue Whale heavy satellite DNA
TAGTTATTAAACCTATCCCACTCTCTAGATACACCTTAGCACGTAAAGGAATATTATTTG
GGGGTCCAGACATGGAGAAGAGTTTAGACACTAGGATAAGATAAGGAACACACCCATTCT
AAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACTCTAGTGCTTAGGAA
ATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTATGGTTAG
GGCTAGGGTACGGCTTAGGGTACGGATTCGGGGAGGGGTTCGGGTACGGCGTAGGGTATG
GGTTAGGGTTAGCGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAGGATTAG
GGTACGTGTTAGGGTTAGGGTAGGGGTTAGGGTTAGGGTACGCGTTAGGGTTAGGG
>gi:27|Z18633.1|B.physalus gene for large subunit rRNA
AACCAGTATTAGAGCACTGCCTGCCCGGTGACTAATCGTTAAACGGCCGCGGTATCCTGA
CCGTGCAAAGGTAGCATAATCACTTGTTCTCTAATTAGGGACTTGTATGAATGGCCACAC
GAGGGTTTTACTGTCTCTTACTTTTAATCAGTGAAATTGACCTCTCCGTGAAGAGGCGGA
GATAACAAAATAAGACGAGAAGACCCTATGGAGCTTCAATTAATCAACCCAAAAACCATA
ACCTTAAACCACCAAGGGATAACAAAACCTTATATGGGCTGACAATTTCGGTTGGGGTGA
CCTCGGAGTACAAAAAACCCTCCGAGTGATTAAAACTTAGGCCCACTAGCCAAAGTACAA
TATCACTTATTGATCCAATCCTTTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGC
AATCCTATTCTAGAGTCCATATCGACAATAGGGTTTACGACCTCGATGTTGGATCAGGAC
ATCCTAATGGTGCAGCTGCTATTAAGGGTTCGTTTGTT
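
The 60-column layout of this output comes from the loop in printAsFasta, which starts a new line before every 60th base. The same logic can be sketched as a standalone method that returns a String; FastaSketch and format are hypothetical names, and the sketch uses '\n' where the injected code relies on println and therefore on the platform line separator.

```java
/** Hypothetical standalone version of the wrapping loop used by printAsFasta. */
final class FastaSketch {
    /**
     * Formats one record: a '>' header line followed by the sequence,
     * broken into lines of at most 'width' characters (width must be positive).
     */
    static String format(String header, String sequence, int width) {
        StringBuilder sb = new StringBuilder(">").append(header);
        for (int i = 0; i < sequence.length(); ++i) {
            // A line break before character 0 ends the header line,
            // and one before every further multiple of 'width' wraps the sequence.
            if (i % width == 0) sb.append('\n');
            sb.append(sequence.charAt(i));
        }
        // Terminate the last sequence line (or the header if the sequence is empty).
        return sb.append('\n').toString();
    }
}
```

For example, FastaSketch.format("x", "ACGTAC", 4) returns the header line followed by the lines ACGT and AC.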

That's it,


Pierre
