19 September 2008

Center for the Study of Human Polymorphisms: Week 3

In my previous post I showed how I used Apache Velocity to generate some 'C' code for the Operon project, based on BerkeleyDB. I also generated the Makefiles and some Lex and Yacc files to create a simple language for querying each database. Today I've compiled and linked my first applications. Each application uses this simple language to query its database, so I don't have to write a new piece of code for each new kind of query.

For example, the database called 'snpIds' contains a series of records, each defined as:

typedef struct snpIds_t
{
char* featureid;
char* rs_number;
}snpIds,*snpIdsPtr;


I can now query this database like this:
snpiddump -q "OR( EQ({rs_number},\"rs10043098\"), EQ({rs_number},\"rs2377171\") ) " -f xml

(OK, the syntax looks ugly, but this design was the simplest way to avoid the shift/reduce conflicts in the yacc parser). The query part is broken into tokens by the lexer and interpreted by the yacc parser. The parser builds a parse tree which can be drawn like this:

                  "rs2377171"
                 /
           EQUALS
          /      \
         /        {rs_number}
    --OR
         \        {rs_number}
          \      /
           EQUALS
                 \
                  "rs10043098"

This tree is then evaluated against each record in the database. When a record matches, it is printed as XML, JSON or text, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
<op:operon xmlns:op="http://operon.cng.fr">
<op:SnpIds>
<op:featureid>101051105133288</op:featureid>
<op:rs_number>rs10043098</op:rs_number>
</op:SnpIds>
<op:SnpIds>
<op:featureid>101161015120774</op:featureid>
<op:rs_number>rs2377171</op:rs_number>
</op:SnpIds>
</op:operon>
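
To make the evaluation step more concrete, here is a minimal sketch of how such a tree could be tested against one record. It is not the generated Operon code: the names Node, NodeType, leaf_value and eval_node are hypothetical and only illustrate the idea of walking the tree recursively, with OR and EQ as inner nodes and {field} references or string literals as leaves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Record layout, as shown earlier in this post. */
typedef struct snpIds_t
{
    char* featureid;
    char* rs_number;
} snpIds, *snpIdsPtr;

/* Hypothetical node kinds; the real generated code may differ. */
typedef enum { NODE_OR, NODE_EQ, NODE_FIELD, NODE_STRING } NodeType;

typedef struct node_t
{
    NodeType type;
    const char* text;            /* field name or string literal (leaves only) */
    const struct node_t* left;   /* children (inner nodes only) */
    const struct node_t* right;
} Node;

/* Resolve a leaf to a string: a {field} reads the record,
   a string literal returns its own text. */
const char* leaf_value(const Node* node, const snpIds* rec)
{
    assert(node != NULL && rec != NULL);
    if (node->type == NODE_STRING) return node->text;
    if (node->type == NODE_FIELD)
    {
        if (strcmp(node->text, "featureid") == 0) return rec->featureid;
        if (strcmp(node->text, "rs_number") == 0) return rec->rs_number;
    }
    return NULL; /* unknown field, or not a leaf */
}

/* Return 1 if the record matches the tree, 0 otherwise. */
int eval_node(const Node* node, const snpIds* rec)
{
    assert(node != NULL && rec != NULL);
    switch (node->type)
    {
        case NODE_OR:
            return eval_node(node->left, rec) || eval_node(node->right, rec);
        case NODE_EQ:
        {
            const char* a = leaf_value(node->left, rec);
            const char* b = leaf_value(node->right, rec);
            return a != NULL && b != NULL && strcmp(a, b) == 0;
        }
        default:
            return 0; /* a bare leaf is not a boolean expression */
    }
}
```

With this layout, the query above becomes one NODE_OR node whose two children are NODE_EQ nodes, each comparing a NODE_FIELD leaf named rs_number with a NODE_STRING leaf. A dump program would then call eval_node once per record read from the database and print only the records for which it returns 1.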

Again, most of the code was generated from a Velocity template [here].

Pierre
