18 February 2015

Automatic code generation for @knime with XSLT: An example with two nodes: fasta reader and writer.



KNIME is a Java- and Eclipse-based graphical workflow manager.


Biologists in my lab often use this tool to filter VCFs or other tabular data. A software development kit (SDK) is provided to build new nodes. My main problem with this SDK is that you need to write a large number of similar files and you also have to interact with a graphical interface. I wanted to automate the generation of the Java code for a new node. In this post I describe how I wrote two new nodes for reading and writing FASTA files.


The nodes are described in an XML file, and the Java code is generated with an XSLT stylesheet, which is available on GitHub at:



Example


We're going to create two nodes for FASTA:


  • a fasta reader
  • a fasta writer

We define a plugin.xml file that uses XInclude to include the definitions of the two nodes. The base package will be com.github.lindenb.xsltsandbox. The nodes will be displayed in the KNIME workbench under /community/bio/fasta.


<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns:xi="http://www.w3.org/2001/XInclude"
        package="com.github.lindenb.xsltsandbox"
        >
  <category name="bio">
    <category name="fasta" label="Fasta">
      <xi:include href="node.read-fasta.xml"/>
      <xi:include href="node.write-fasta.xml"/>
    </category>
  </category>
</plugin>
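
As a side note, here is a minimal Java sketch of what the XInclude step does: the parser replaces each xi:include element with the content of the referenced file before any further processing. The class name XIncludeSketch and the trimmed-down XML are made up for the illustration; the Makefile further down uses xsltproc --xinclude, not Java, for this step.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Illustration only: XIncludeSketch is a hypothetical name, not part of the generator. */
final class XIncludeSketch {
    /** Writes a tiny plugin.xml plus one included node file, then counts the node elements after inclusion. */
    static int countIncludedNodes() {
        try {
            Path dir = Files.createTempDirectory("xinclude-sketch");
            // Simplified stand-in for node.read-fasta.xml
            Files.write(dir.resolve("node.read-fasta.xml"),
                "<node name=\"readfasta\"/>".getBytes(StandardCharsets.UTF_8));
            // Simplified stand-in for plugin.xml
            Files.write(dir.resolve("plugin.xml"), (
                "<plugin xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                + "<category name=\"bio\"><xi:include href=\"node.read-fasta.xml\"/></category>"
                + "</plugin>").getBytes(StandardCharsets.UTF_8));
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XInclude processing requires namespace awareness
            factory.setXIncludeAware(true);  // resolve xi:include while parsing
            Document doc = factory.newDocumentBuilder().parse(dir.resolve("plugin.xml").toFile());
            return doc.getElementsByTagName("node").getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("node elements after inclusion: " + countIncludedNodes());
    }
}
```

The relative href is resolved against the location of plugin.xml, which is why the node files sit next to it.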

node.read-fasta.xml: it takes a file reader (for the input FASTA file), an integer to limit the number of FASTA sequences to be read, and a boolean to convert the sequences to uppercase. The output will be a table with two columns (title/sequence). We only have to write the code that reads the FASTA file.


<?xml version="1.0" encoding="UTF-8"?>
<node name="readfasta" label="Read Fasta" description="Reads a Fasta file">
  <property type="file-read" name="fastaIn">
    <extension>.fa</extension>
    <extension>.fasta</extension>
    <extension>.fasta.gz</extension>
    <extension>.fa.gz</extension>
  </property>
  <property type="int" name="limit" label="max sequences" description="number of sequences to be fetched. 0 = ALL" default="0">
  </property>
  <property type="bool" name="upper" label="Uppercase" description="Convert to Uppercase" default="false">
  </property>
  <outPort name="output">
    <column name="title" label="Title" type="string"/>
    <column name="sequence" label="Sequence" type="string"/>
  </outPort>
  <code>
    <import>
      import java.io.*;
      </import>
    <body>
    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData, final ExecutionContext exec) throws Exception
        {
        int limit = this.getPropertyLimitValue();
        String url = this.getPropertyFastaInValue();
        boolean to_upper = this.getPropertyUpperValue();
        getLogger().info("reading "+url);
        java.io.BufferedReader r= null;
        int n_sequences = 0;
        try
            {
            r = this.openUriForBufferedReader(url);

            DataTableSpec dataspec0 = this.createOutTableSpec0();
            BufferedDataContainer container0 = exec.createDataContainer(dataspec0);

            String seqname="";
            StringBuilder sequence=new StringBuilder();
            for(;;)
                {
                exec.checkCanceled();
                exec.setMessage("Sequences "+n_sequences);
                String line= r.readLine();
                if(line==null || line.startsWith("&gt;"))
                    {
                    if(!(sequence.length()==0 &amp;&amp; seqname.trim().isEmpty()))
                        {
                          container0.addRowToTable(new  org.knime.core.data.def.DefaultRow(
                          org.knime.core.data.RowKey.createRowKey(n_sequences),
                        this.createDataCellsForOutTableSpec0(seqname,sequence)
                                                                                                                                    ));
                        ++n_sequences;
                        }
                    if(line==null) break;
                    if( limit!=0 &amp;&amp; limit==n_sequences) break;
                    seqname=line.substring(1);
                    sequence=new StringBuilder();
                    }
                else
                    {
                    line= line.trim();
                    if( to_upper ) line= line.toUpperCase();
                    sequence.append(line);
                    }
                }
            container0.close();
            BufferedDataTable out0 = container0.getTable();
            return new BufferedDataTable[]{out0};
            }
        finally
            {
            if(r!=null) r.close();
            }
        }
      </body>
  </code>
</node>
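
To try the parsing logic outside KNIME, the loop above can be sketched as plain Java. This is only an illustration under my own assumptions: FastaParseSketch and parse() are hypothetical names, the records go into a List instead of a BufferedDataContainer, and the generated getters are replaced by method parameters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of the reader loop; not the generated KNIME code. */
final class FastaParseSketch {
    /** Returns {title, sequence} pairs; limit == 0 means "read everything". */
    static List<String[]> parse(BufferedReader r, int limit, boolean toUpper) {
        List<String[]> records = new ArrayList<>();
        String title = "";
        StringBuilder seq = new StringBuilder();
        try {
            for (;;) {
                String line = r.readLine();
                if (line == null || line.startsWith(">")) {
                    // A header line or the end of the file closes the current record.
                    if (!(seq.length() == 0 && title.trim().isEmpty())) {
                        records.add(new String[] { title, seq.toString() });
                    }
                    if (line == null) break;
                    if (limit != 0 && limit == records.size()) break;
                    title = line.substring(1); // drop the leading '>'
                    seq = new StringBuilder();
                } else {
                    line = line.trim();
                    seq.append(toUpper ? line.toUpperCase() : line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nacgt\nACGT\n>seq2\nttga\n";
        for (String[] rec : parse(new BufferedReader(new StringReader(fasta)), 0, true)) {
            System.out.println(rec[0] + "\t" + rec[1]);
        }
    }
}
```

In the node itself the '>' and '&&' characters are written as entities because the Java code is embedded in an XML file.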

node.write-fasta.xml: it needs an input data table with two columns (title/sequence), an integer to set the length of the lines, and a file writer to write the FASTA file.


<?xml version="1.0" encoding="UTF-8"?>
<node name="writefasta" label="Write Fasta" description="Write a Fasta file">
  <inPort name="input">
  </inPort>
  <property type="file-save" name="fastaOut">
  </property>
  <property type="column" name="title" label="Title" description="Fasta title" data-type="string">
  </property>
  <property type="column" name="sequence" label="Sequence" description="Fasta Sequence" data-type="string">
  </property>
  <property type="int" name="fold" label="Fold size" description="Fold sequences greater than..." default="60">
  </property>
  <code>
    <import>
      import org.knime.core.data.container.CloseableRowIterator;
      import java.io.*;
      </import>
    <body>
    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData, final ExecutionContext exec) throws Exception
        {
        CloseableRowIterator iter=null;
        BufferedDataTable inTable=inData[0];
        int fold = this.getPropertyFoldValue();
        int tIndex = this.findTitleRequiredColumnIndex(inTable.getDataTableSpec());
        int sIndex = this.findSequenceRequiredColumnIndex(inTable.getDataTableSpec());
        PrintWriter w =null;
        try
            {
            w= openFastaOutForPrinting();

            int nRows=0;
            double total=inTable.getRowCount();
            iter=inTable.iterator();
            while(iter.hasNext())
                {
                DataRow row=iter.next();
                DataCell tCell =row.getCell(tIndex);

                DataCell sCell =row.getCell(sIndex);

                w.print("&gt;");
                if(!tCell.isMissing())
                    { 
                    w.print(StringCell.class.cast(tCell).getStringValue());
                    }
                if(!sCell.isMissing())
                    {
                    String sequence = StringCell.class.cast(sCell).getStringValue();
                    for(int i=0;i&lt;sequence.length();++i)
                        {
                        if(i%fold == 0) w.println();
                        w.print(sequence.charAt(i));
                        exec.checkCanceled();
                        }
                    }
                w.println();

                exec.checkCanceled();
                exec.setProgress(nRows/total,"Saving Fasta");
                ++nRows;
                }
            w.flush();
            return new BufferedDataTable[0];
            }
        finally
            {
            if(iter!=null) iter.close();
            if(w!=null) w.close();
            }
        }
      </body>
  </code>
</node>
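
The folding done by the writer can also be checked on its own. The following is a minimal sketch, assuming a hypothetical FastaFoldSketch.format() helper that builds one record as a String; it cuts the sequence into chunks rather than printing character by character, and it always uses '\n' where the node uses println().

```java
/** Standalone sketch of the folding logic; not the generated KNIME code. */
final class FastaFoldSketch {
    /** Formats one FASTA record, starting a new line every 'fold' characters. */
    static String format(String title, String sequence, int fold) {
        if (fold <= 0) {
            throw new IllegalArgumentException("fold must be greater than 0");
        }
        StringBuilder sb = new StringBuilder(">").append(title);
        for (int i = 0; i < sequence.length(); i += fold) {
            // Each chunk goes on its own line; the last chunk may be shorter.
            sb.append('\n').append(sequence, i, Math.min(i + fold, sequence.length()));
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(format("seq1", "ACGTACGTAC", 4));
    }
}
```

The explicit check on fold is my addition: with a fold of 0, the i%fold test in the node above would throw an ArithmeticException.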

The following Makefile generates the code, compiles it and installs the new plugin in the ${knime.root}/plugins directory:


.PHONY:all clean install run

knime.root=${HOME}/package/knime_2.11.2

all: install

run: install
        ${knime.root}/knime -clean

install:
        rm -rf generated
        xsltproc --xinclude \
                --stringparam base.dir generated \
                knime2java.xsl plugin.xml
        $(MAKE) -C generated install knime.root=${knime.root}


clean:
        rm -rf generated

The code generated by this Makefile:

$ find generated/ -type f
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeFactory.xml
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodePlugin.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeFactory.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeDialog.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/AbstractReadfastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeFactory.xml
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodePlugin.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeFactory.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeDialog.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/AbstractWritefastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/CompileAll__.java
generated/src/com/github/lindenb/xsltsandbox/AbstractNodeModel.java
generated/MANIFEST.MF
generated/Makefile
generated/plugin.xml
generated/dist/com_github_lindenb_xsltsandbox.jar
generated/dist/com.github.lindenb.xsltsandbox_2015.02.18.jar

The file generated/dist/com.github.lindenb.xsltsandbox_2015.02.18.jar is the file to move to ${knime.root}/plugins.


(At the time of writing I put the jar at http://cardioserve.nantes.inserm.fr/~lindenb/knime/fasta/ )
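
Judging only from the file listing above, each node ends up in a directory built from the base package, the category path and the node name, and its classes are prefixed with the capitalized node name. The sketch below reproduces that naming; GeneratedLayoutSketch and filesFor() are hypothetical names, and the rule is inferred from the listing, not taken from the stylesheet.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration of the naming pattern seen in the listing; inferred, not read from knime2java.xsl. */
final class GeneratedLayoutSketch {
    static List<String> filesFor(String basePackage, String categoryPath, String nodeName) {
        // e.g. generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/
        String dir = "generated/src/" + basePackage.replace('.', '/') + "/" + categoryPath + "/" + nodeName + "/";
        // e.g. readfasta -> Readfasta
        String prefix = Character.toUpperCase(nodeName.charAt(0)) + nodeName.substring(1);
        return Arrays.asList(
            dir + prefix + "NodeFactory.xml",
            dir + prefix + "NodePlugin.java",
            dir + prefix + "NodeFactory.java",
            dir + prefix + "NodeDialog.java",
            dir + "Abstract" + prefix + "NodeModel.java",
            dir + prefix + "NodeModel.java");
    }

    public static void main(String[] args) {
        for (String f : filesFor("com.github.lindenb.xsltsandbox", "bio/fasta", "writefasta")) {
            System.out.println(f);
        }
    }
}
```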


Open KNIME: the new nodes are now displayed in the Node Repository.


Node Repository


You can now use the nodes; the code is displayed in the documentation:


Workbench



That's it,

Pierre

