30 October 2008

The EBI/IntAct Web-Service API, my notebook

This post covers my experience with the IntAct API at EBI. IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.

This web service is used to search for binary interactions. It is described (but not documented...) by a WSDL file at http://www.ebi.ac.uk/intact/binary-search-ws/binarysearch?wsdl

GlassFish, the Java application server from Sun, comes with a tool called wsimport. From the WSDL file, it generates a set of Java files used to call this web service.

Here are the generated java files :

AFAIK, the WSDL file contains almost no documentation about this service, but Eclipse helped me to find the correct methods thanks to the code completion of its editor.
Here is the short program I just wrote: it connects to the web service and retrieves all the binary interactions with NSP3:
import uk.ac.ebi.intact.binarysearch.wsclient.generated.BinarySearch;
import uk.ac.ebi.intact.binarysearch.wsclient.generated.BinarySearchService;
import uk.ac.ebi.intact.binarysearch.wsclient.generated.SimplifiedSearchResult;

public class IntActClient {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            final String query="NSP3";
            // stub classes generated by wsimport from the WSDL
            BinarySearchService service=new BinarySearchService();
            BinarySearch port=service.getBinarySearchPort();
            // ask for at most 500 results, starting from the first one
            SimplifiedSearchResult ssr= port.findBinaryInteractionsLimited(query, 0,500);
            System.out.println("#first-result "+ssr.getFirstResult());
            System.out.println("#max-results "+ssr.getMaxResults());
            System.out.println("#total-results "+ssr.getTotalResults());
            System.out.println("#luceneQuery "+ssr.getLuceneQuery());
            // each interaction comes back as one tab-delimited line
            for(String line:ssr.getInteractionLines()) {
                System.out.println(line);
            }
        }
        catch(Throwable err) {
            // assumption: the body of this handler is not shown in the post
            err.printStackTrace();
        }
    }
}

The result:
#first-result 0
#max-results 500
#total-results 7
#luceneQuery identifiers:nsp3 pubid:nsp3 pubauth:nsp3 species:nsp3 type:nsp3 detmethod:nsp3 interact
uniprotkb:Q8N5H7|intact:EBI-745980 uniprotkb:O43281|intact:EBI-718488 uniprotkb:SH2D3C uniprotkb:EFS
intact:EBI-1263954 uniprotkb:Q00721|intact:EBI-1263962 - uniprotkb:S7 - uniprotkb:NCVP4|uniprotkb:vn....
uniprotkb:Q00721|intact:EBI-1263962 intact:EBI-1263971 uniprotkb:S7 - uniprotkb:NCVP4|uniprotkb:vn34....
uniprotkb:Q00721|intact:EBI-1263962 uniprotkb:Q00721|intact:EBI-1263962 uniprotkb:S7 uniprotkb:S7 un...
uniprotkb:Q00721|intact:EBI-1263962 uniprotkb:Q9UGR2|intact:EBI-948845 uniprotkb:S7 uniprotkb:ZC3H7B....
uniprotkb:Q04637|intact:EBI-73711 uniprotkb:Q00721|intact:EBI-1263962 uniprotkb:EIF4G1 uniprotkb:S7....
uniprotkb:Q04637|intact:EBI-73711 uniprotkb:P03536|intact:EBI-296448 uniprotkb:EIF4G1 uniprotkb:S7 u...

OK, it was easy, but I'm a little bit disappointed because the result is 'just' a set of tab-delimited lines (and where is the documentation about those columns??) whereas I would have expected a set of XML objects.
Update: the format of the columns is described here: ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab/README
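
Since the service returns plain lines rather than objects, a small helper is enough to get at the columns. The sketch below is not part of the wsimport-generated client: the class MitabLine and its methods are hypothetical names, and it only assumes what is visible in the output above, i.e. tab-separated columns, '|' between several values of one column and '-' for an empty column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the generated IntAct client. */
final class MitabLine {
    private final String[] columns;

    MitabLine(String line) {
        // limit -1 keeps trailing empty columns instead of dropping them
        this.columns = line.split("\t", -1);
    }

    int columnCount() {
        return columns.length;
    }

    /** Values of one column (0-based index); "-" is read as "no value". */
    List<String> values(int column) {
        String cell = columns[column];
        if (cell.isEmpty() || cell.equals("-")) {
            return new ArrayList<>();
        }
        return Arrays.asList(cell.split("\\|"));
    }

    public static void main(String[] args) {
        // the first two columns of the first line returned for NSP3, plus an empty column
        MitabLine row = new MitabLine(
            "uniprotkb:Q8N5H7|intact:EBI-745980\tuniprotkb:O43281|intact:EBI-718488\t-");
        System.out.println("interactor A: " + row.values(0));
        System.out.println("interactor B: " + row.values(1));
    }
}
```

The meaning of each column index should be taken from the README above rather than guessed from the data.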

That's it for tonight....


29 October 2008

A survey of the Proteins in Wikipedia

Via FriendFeed, I've started a collaboration with Andrew Su on the state of the articles about proteins in Wikipedia.
I've used the MediaWiki API (http://en.wikipedia.org/w/api.php) to extract the revisions and the sizes of all the articles containing Template:PBB_Summary.
A result of this survey is available here.
We first tried to use IBM/Many Eyes to display the results, but the applet ran out of memory: http://services.alphaworks.ibm.com/manyeyes/view/SkWN8RsOtha6qaEpsJGCR2~.
I then wrote a custom Java interface, inspired by Many Eyes, to display the results. This interface is available as an applet here (or you can run it as a javaws application here, with the previous dataset, where you can save the image as SVG).
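
For the record, here is roughly what such a request to the MediaWiki API can look like. This is a sketch and not the code used for the survey: the class name is hypothetical, and the choice of parameters (generator=embeddedin to list the pages embedding the template, prop=info to get the length and last revision id of each page) is my assumption about one reasonable way to ask the question. Paging through the results is not shown.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical URL builder; not the code used for the survey. */
final class PbbSurveyQuery {
    static final String API = "http://en.wikipedia.org/w/api.php";

    /** URL listing the pages that embed a template, with their basic info. */
    static String embeddedInUrl(String template, int limit) {
        return API
            + "?action=query"
            + "&generator=embeddedin"
            + "&geititle=" + URLEncoder.encode(template, StandardCharsets.UTF_8)
            + "&geilimit=" + limit
            + "&prop=info"
            + "&format=xml";
    }

    public static void main(String[] args) {
        System.out.println(embeddedInUrl("Template:PBB_Summary", 50));
    }
}
```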

This is an ongoing project, and any suggestion will be appreciated. :-)


EMBL/STRING: find interactors at 2 degrees of separation, my notebook.

Thanks (again) to the Life Scientists on FriendFeed, I've discovered the API of STRING 8 ("STRING 8—a global view on proteins and their functional interactions in 630 organisms", NAR 2008). STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions.

I've used this API to find the partners of a protein at two degrees of separation. Here is my notebook:
First, download the network for each protein using their HTTP-based API, e.g. http://string.embl.de/api/psi-mi/interactions?identifier=Roxan. An Ensembl protein ID seems to be the most reliable (unambiguous) identifier (e.g. http://string.embl.de/api/psi-mi/interactions?identifier=ENSP00000263243). Note that the STRING database is also available for download.

I also wrote a basic XSLT stylesheet transforming the PSI/XML to the Graphviz DOT format. The stylesheet is available here: http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/psi2dot.xslt. E.g.:

xsltproc psi2dot.xslt ROXAN.xml | dot -opicture.png -Tpng

Another XSLT stylesheet, psi2sql.xslt, creates the statements to insert one or more PSI files into a mysql database:
xsltproc --stringparam temporary "" psi2sql.xslt interaction1.xml | mysql -u login --password=password -D database -N
xsltproc --stringparam temporary "" psi2sql.xslt interaction2.xml | mysql -u login --password=password -D database -N
xsltproc --stringparam temporary "" psi2sql.xslt interaction3.xml | mysql -u login --password=password -D database -N

The parameter temporary is an argument for the stylesheet telling mysql not to work with temporary tables.

Two of the tables created (interactor and interaction) are described below:
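mysql> desc interactor;
+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| id         | int(11)      | NO   | PRI | NULL    | auto_increment |
| pk         | varchar(50)  | NO   | UNI | NULL    |                |
| shortLabel | varchar(255) | YES  |     | NULL    |                |
| fullName   | text         | YES  |     | NULL    |                |
+------------+--------------+------+-----+---------+----------------+

mysql> desc interaction;
+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int(11)      | NO   | PRI | NULL    | auto_increment |
| interactor1_id | int(11)      | NO   | MUL | NULL    |                |
| interactor2_id | int(11)      | NO   | MUL | NULL    |                |
| unitLabel      | varchar(50)  | YES  |     | NULL    |                |
| unitFullName   | varchar(100) | YES  |     | NULL    |                |
| confidence     | float        | YES  |     | NULL    |                |
| experiment_id  | int(11)      | NO   | MUL | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)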

And here are the mysql statements finding the proteins linked to EIF4G1 at two degrees of separation.
First, create a temporary table containing the 2-degree interactions:
create temporary table t1
(
id1 int,
id2 int,
id3 int
);

insert into t1(id1,id2,id3)
select distinct
P1.id,
P2.id,
P3.id
from
interactor as P1,
interactor as P2,
interactor as P3,
interaction as I1,
interaction as I2
where
P1.shortLabel="EIF4G1" and
P3.shortLabel!="EIF4G1" and
((P1.id= I1.interactor1_id AND P2.id= I1.interactor2_id) or (P2.id= I1.interactor1_id AND P1.id= I1.interactor2_id)) and
((P2.id= I2.interactor1_id and P3.id= I2.interactor2_id) or (P3.id= I2.interactor1_id and P2.id= I2.interactor2_id));

Remove the direct (one-degree) interactions from the temporary table:
delete t1 from
t1,
interactor as P1,
interactor as P3,
interaction as I1
where
((t1.id1=P1.id and t1.id3=P3.id) or (t1.id1=P3.id and t1.id3=P1.id)) and
((P1.id= I1.interactor1_id and P3.id= I1.interactor2_id) or (P3.id= I1.interactor1_id and P1.id= I1.interactor2_id));

And dump the results:
select
P1.shortLabel as "Partner1",
P2.shortLabel as "Partner2",
P3.shortLabel as "Partner3"
from
t1,
interactor as P1,
interactor as P2,
interactor as P3
where
t1.id1 = P1.id and
t1.id2 = P2.id and
t1.id3 = P3.id;

Here is the result:
Partner1 Partner2 Partner3
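
The same logic can be checked in memory, without mysql. The sketch below is only an illustration of the idea behind the statements above (collect the neighbours of the neighbours, then drop the start protein and its direct partners); the class name is hypothetical and the edges in main are toy data with placeholder labels, not real interactions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory version of the two-degrees query. */
final class TwoDegrees {
    /** Builds an undirected adjacency map from pairs of labels. */
    static Map<String, Set<String>> graph(String[][] edges) {
        Map<String, Set<String>> g = new HashMap<>();
        for (String[] e : edges) {
            g.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            g.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return g;
    }

    /** Labels reachable in exactly two steps that are not direct partners. */
    static Set<String> secondDegree(Map<String, Set<String>> g, String start) {
        Set<String> direct = g.getOrDefault(start, Collections.emptySet());
        Set<String> result = new TreeSet<>();
        for (String middle : direct) {
            result.addAll(g.getOrDefault(middle, Collections.emptySet()));
        }
        result.remove(start);      // same role as P3.shortLabel != "EIF4G1"
        result.removeAll(direct);  // same role as the delete statement
        return result;
    }

    public static void main(String[] args) {
        // toy edges: A and C are direct partners of EIF4G1, B is only linked to A
        String[][] edges = {
            {"EIF4G1", "A"}, {"A", "B"}, {"EIF4G1", "C"}, {"C", "A"}
        };
        System.out.println(secondDegree(graph(edges), "EIF4G1"));
    }
}
```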

That's it

21 October 2008

Javadoc is not enough: java2dot

I just wrote a tiny tool used to draw a graph for a java hierarchy. The input of the program is a set of jar files and the name of the classes to be displayed.

The source code is available here:

The information about each class is obtained using the java.lang.reflect API and the classes are dynamically loaded using a URLClassLoader. The output is a DOT file which is then piped into Graphviz dot.
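
To give an idea of the principle, here is a minimal sketch of walking a class hierarchy by reflection and printing it as DOT. It is not the source of java2dot: the class name is hypothetical, none of the options of the tool are implemented, and for brevity the class is looked up on the default class path with Class.forName instead of being loaded from a list of jars through a URLClassLoader.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of the idea; not the java2dot source. */
final class HierarchyToDot {
    private static String edge(Class<?> from, Class<?> to) {
        return "\"" + from.getName() + "\" -> \"" + to.getName() + "\";";
    }

    /** Recursively records the links to the superclass and the interfaces. */
    private static void collect(Class<?> c, Set<String> edges) {
        Class<?> parent = c.getSuperclass(); // null for Object and for interfaces
        if (parent != null && edges.add(edge(c, parent))) {
            collect(parent, edges);
        }
        for (Class<?> itf : c.getInterfaces()) {
            if (edges.add(edge(c, itf))) {
                collect(itf, edges);
            }
        }
    }

    static String toDot(Class<?> root) {
        Set<String> edges = new LinkedHashSet<>();
        collect(root, edges);
        StringBuilder sb = new StringBuilder("digraph G {\n");
        for (String e : edges) {
            sb.append("  ").append(e).append('\n');
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String name = args.length > 0 ? args[0] : "java.util.ArrayList";
        System.out.print(toDot(Class.forName(name)));
    }
}
```

The output can be piped into dot in the same way as the output of the real tool.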

As an example, the command lines below show the usage of the tool and how the hierarchy of com.hp.hpl.jena.rdf.model.Model was generated:
java -jar ./java2dot.jar
Pierre Lindenbaum PhD. pindenbaum@yahoo.fr
Java2Dot : Compiled by lindenb on 2008-10-21 at 17:40:52 in /home/lindenb/src/lindenb/proj/tinytools
-h this screen
-jar <dir0:jar1:jar2:dir1:...> add a jar in the jar list. If directory, will add all the *ar files
-r add a pattern of classes to be ignored.
-i ignore interfaces
-m ignore classes iMplementing interfaces
-d ignore declared-classes (classes with $ in the name)
-o output file

class-1 class-2 ... class-n

java -jar ./java2dot.jar -jar ${JENADIR}/Jena-2.5.6/lib -d com.hp.hpl.jena.rdf.model.Model |\
dot -Tjpeg -ojenamodel.jpeg

Update: A jar is available here http://lindenb.googlecode.com/files/java2dot.jar.


13 October 2008

Creating DIA diagrams from mysql via XSLT

During a conversation on FriendFeed about using Inkscape and Dia, Chris Lasher asked me if I had tried to use Inkscape to create diagrams in SVG format. This gave me the idea to take a fresh look at Dia and see if I could use it for my own needs (I will soon have to manage a mysql database with plenty of tables, and I have no diagram of its schema). Dia (http://www.gnome.org/projects/dia/) can be used to draw many different kinds of diagrams. It currently has special objects to help draw entity relationship diagrams, UML diagrams, flowcharts, network diagrams, and many other diagrams. A Dia diagram is stored as a gzipped XML file. Today I created an XSLT stylesheet transforming the XML description of a mysql table into a basic (no layout, no links) Dia diagram. This stylesheet, sql2dia, is available here:

Usage: in the following example, I ask for the structure of four tables at the UCSC. Mysql adds an XML declaration for each query, so we need to remove these headers with grep -v and surround the results with an extra root element:
(echo "<root>";
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'desc snp129; desc snpSeq; desc snpArrayAffy5; desc knownGene' -X |\
grep -v "<?xml" ;\
echo "</root>") > /tmp/tmp.xml
xsltproc sql2dia.xsl /tmp/tmp.xml |\
gzip -c > ~/file.dia
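
If you'd rather do the clean-up step in Java than in the shell, the sketch below does the same thing as the echo/grep part of the pipeline: it drops every line containing an XML declaration and wraps what is left in a root element. The class and method names are hypothetical; running mysql, the XSLT transformation and the gzip step are not shown.

```java
/** Hypothetical equivalent of the echo / grep -v part of the pipeline above. */
final class WrapMysqlXml {
    /** Removes the XML declarations and surrounds the result sets with one root element. */
    static String wrap(String mysqlXmlOutput) {
        StringBuilder sb = new StringBuilder("<root>\n");
        for (String line : mysqlXmlOutput.split("\n")) {
            if (line.contains("<?xml")) {
                continue; // same filter as: grep -v "<?xml"
            }
            sb.append(line).append('\n');
        }
        return sb.append("</root>\n").toString();
    }

    public static void main(String[] args) {
        // two fake result sets, each preceded by its own XML declaration
        String twoQueries =
            "<?xml version=\"1.0\"?>\n<resultset/>\n"
            + "<?xml version=\"1.0\"?>\n<resultset/>\n";
        System.out.print(wrap(twoQueries));
    }
}
```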

And here is a screenshot of the output.

That's it


09 October 2008

What is in a list of SNP ? Again, but a GUI.

In a previous post I described how I generated some wrappers in java to map the tables of the mysql database at the UCSC, and I wrote a tool to get the data about a set of snp (cytoband, genes, hapmap...). Today I was asked if I could transform this application into a GUI (the fear of the infamous command line.. again...)

It was straightforward to embed my code into an interactive program. I just transformed it into a Java Web Start application available here:

About SNP

The source code is available here.

The "mysql source" is the anonymous mysql server of the UCSC (hg18). If you have one, you can also select a local mysql database from the combo menu (host=localhost login=anonymous password={empty} db=hg18). There is a quota: don't play with thousands and thousands of markers, people at UCSC may not like this, so use it at your own risk.

In case of a mysql error, the problem may come from your firewall.

Just upload a list of SNPs and then press "Fetch info to File...". This tool is not multithreaded, so the window may seem frozen while the data are downloaded.

That's it

08 October 2008

Building a presentation with inkscape + batik. My notebook.

OK, I hate PowerPoint ...

... and I hate OpenOffice/Impress

Next week, I'll present a talk about how to handle a bibliography with the tools available on the web (RSS, social bookmarking, zotero, etc...). Today I tried to build the slides using inkscape (the SVG editor) and apache batik (a Java-based toolkit for applications that want to use images in the SVG format).

Each slide was drawn using inkscape. The background was designed using this hack. Each slide was then converted to PDF using the Batik rasterizer. (Problem: inkscape already supports SVG 1.2 with new tags such as flowRoot, but batik does not. This problem was solved by converting the texts to paths. Also, be careful before moving the files: the paths to the pictures are relative in inkscape.)
java -jar ${BATIK_PATH}/batik-rasterizer.jar -bg -m "application/pdf" *.svg

All the slides were then merged into a single file using ghostscript.
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=slides.pdf -dBATCH *.pdf

Here is the result (this is still a draft, sorry the references of the pictures are still missing):


03 October 2008

Java Wrappers for the tables of the UCSC/GoldenPath

Years ago, Jim Kent (UCSC), the author of the BLAT algorithm, published "autoSql and autoXml: Code Generators from the Genome Project" in the Linux Journal. These tools generate database definitions for SQL, write C header files with your data definitions and function prototypes, write C code to get data to and from C structures, and generate C code for an XML parser.

For example, the following 'as' file (http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/lib/cytoBand.as), the definition of the table called cytoBand:

table cytoBand
"Describes the positions of cytogenetic bands with a chromosome"
    (
    string chrom; "Reference sequence chromosome or scaffold"
    uint chromStart; "Start position in genoSeq"
    uint chromEnd; "End position in genoSeq"
    string name; "Name of cytogenetic band"
    string gieStain; "Giemsa stain results"
    )

will be used to generate the following sql definition
# cytoBand.sql was originally generated by the autoSql program, which also 
# generated cytoBand.c and cytoBand.h. This creates the database representation of
# an object which can be loaded and saved from RAM in a fairly
# automatic way.

#Describes the positions of cytogenetic bands with a chromosome
CREATE TABLE cytoBand (
    chrom varchar(255) not null, # Human chromosome number
    chromStart int unsigned not null, # Start position in genoSeq
    chromEnd int unsigned not null, # End position in genoSeq
    name varchar(255) not null, # Name of cytogenetic band
    gieStain varchar(255) not null, # Giemsa stain results
    PRIMARY KEY(chrom(12),chromStart)
);

the C code, and the C header.

As a Java programmer, I wanted to create my own wrappers to use the data of the UCSC. I wrote a custom ANT task (the code is available here) using the public mysql server of the UCSC to get the structure of each table (e.g. desc cytoBand) and the description of each table (e.g. select autoSqlDef from tableDescriptions where tableName="cytoBand"). Each structure is parsed and injected into an Apache Velocity template (the template is available here).
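
One step of such a generator is deciding which Java type to use for each column returned by "desc". The sketch below shows what that mapping could look like; it is a guess and not the code of the ANT task. The class name and the exact mapping are assumptions (for example, mapping "int unsigned" to int only matches the setInt calls used further down, and could overflow for very large values).

```java
import java.util.Locale;

/** Hypothetical type mapping; not the code of the ANT task. */
final class SqlTypeMapper {
    /** Returns a Java type name for the "Type" column of a mysql "desc" row. */
    static String javaTypeFor(String sqlType) {
        String t = sqlType.trim().toLowerCase(Locale.ROOT);
        if (t.startsWith("enum(")) return "Enum";       // one generated enum per column
        if (t.startsWith("set(")) return "Set<Enum>";   // a set of such enum values
        if (t.startsWith("bigint")) return "long";
        if (t.startsWith("tinyint") || t.startsWith("smallint")
            || t.startsWith("mediumint") || t.startsWith("int")) return "int";
        if (t.startsWith("float")) return "float";
        if (t.startsWith("double")) return "double";
        if (t.contains("blob")) return "byte[]";
        return "String"; // char, varchar, text and anything not handled above
    }

    public static void main(String[] args) {
        String[] examples = {"varchar(255)", "int(10) unsigned", "enum('+','-')", "longblob"};
        for (String type : examples) {
            System.out.println(type + " -> " + javaTypeFor(type));
        }
    }
}
```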

Here is an example of a source generated by the ant task:

As you can see, 'enum' and 'set' columns are transformed into Java enums, and the getters and a TableModel (for GUI/Swing) are created. Each class also comes with some useful static methods creating the instances from a SQL query. For example, here is what I wrote today to grab the information about the genes/cytobands/hapmap for a set of SNPs.

for(RsId rsid: rsSet)
    {
    PreparedStatement pstmt=con.prepareStatement("select * from snp129 where name=?");
    pstmt.setString(1, rsid.getName());
    Hg18Snp129 snp= Hg18Snp129.selectOneOrZero(pstmt.executeQuery());
    if(snp==null)
        {
        cout().println(rsid.getName()+TAB+"##Not FOUND");
        continue;
        }
    // (... print information about this snp ...)

    // cytobands overlapping the snp
    pstmt=con.prepareStatement("select * from cytoBand where chrom=? and chromStart<=? and chromEnd>=?");
    pstmt.setString(1, snp.getChrom());
    pstmt.setInt(2, snp.getChromStart());
    pstmt.setInt(3, snp.getChromEnd());
    for(Hg18CytoBand band:Hg18CytoBand.select(pstmt.executeQuery()))
        {
        // (.. print info about this cytoband...)
        }

    // genes overlapping the snp, printed as a comma-separated list
    pstmt=con.prepareStatement("select * from refGene where chrom=? and txStart<=? and txEnd>=?");
    pstmt.setString(1, snp.getChrom());
    pstmt.setInt(2, snp.getChromStart());
    pstmt.setInt(3, snp.getChromEnd());
    int i=0;
    for(Hg18RefGene gene : Hg18RefGene.select(pstmt.executeQuery()))
        {
        if(i>0) cout().print(",");
        // (... print info about this gene ...)
        i++;
        }

    // one hapmap table per population
    for(String hapmapDb:HAPMAPDB)
        {
        pstmt=con.prepareStatement("select * from "+hapmapDb+" where name=?");
        pstmt.setString(1, snp.getName());
        Hg18HapmapSnps hapmap= Hg18HapmapSnps.selectOneOrZero(pstmt.executeQuery());
        if(hapmap==null)
            {
            // (...print empty fields...)
            }
        else
            {
            // (... print result)
            }
        }
    }

Note: Hibernate is also a popular tool to map objects to databases. But here everything is read-only (we don't need any transactions) and the relationships between the tables are too complicated to be described using a mapping file (e.g. see the numerous "Connected Tables and Joining Fields" for the table knownGene).

That's it