17 June 2010

A custom C++ streambuf for BIODAS-DNA/XML

(via cplusplus.com): streambuf objects are in charge of providing reading and writing functionality to/from certain types of character sequences, such as external files or strings. streambuf objects are usually associated with one specific character sequence, from which they read and write data through an internal memory buffer. The buffer is an array in memory which is expected to be synchronized when needed with the physical content of the associated character sequence.
A streambuf has a virtual method named underflow, which is called when the read buffer is exhausted in order to make additional characters available for reading.
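
To make the role of underflow concrete, here is a minimal sketch that is not taken from cclindenb. The class name chunked_string_buf and its chunk_size parameter are invented for the illustration, and the code assumes a C++11 compiler. It serves an in-memory string through a small buffer, so underflow has to run every chunk_size characters.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical example class: exposes a std::string through a small
// internal buffer that is refilled by underflow().
class chunked_string_buf : public std::streambuf
{
public:
    chunked_string_buf(const std::string& data, std::size_t chunk_size)
        : data_(data), pos_(0), buffer_(chunk_size)
    {
        assert(chunk_size > 0);
        // Start with an empty get area: the first read calls underflow().
        char* p = buffer_.data();
        setg(p, p, p);
    }

protected:
    int_type underflow() override
    {
        // Characters are still available: just return the current one.
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        // Nothing left in the underlying sequence: report end of file.
        if (pos_ >= data_.size()) return traits_type::eof();
        // Copy the next chunk into the buffer and publish it with setg().
        std::size_t n = data_.copy(buffer_.data(), buffer_.size(), pos_);
        pos_ += n;
        setg(buffer_.data(), buffer_.data(), buffer_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string data_;          // the whole character sequence
    std::size_t pos_;           // offset of the next chunk in data_
    std::vector<char> buffer_;  // the get area handed to the base class
};
```

Wrapped in a std::istream, such an object can be read with std::getline or operator>> like any other stream; the stream only sees the characters that underflow publishes.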

A first streambuf: reading the content of a URL

My first class extends std::streambuf and is a wrapper for the CURL C API. The source code is available on GitHub at http://github.com/lindenb/cclindenb/blob/master/src/core/lindenb/net/curl_streambuf.h. Here, each time underflow is called, the CURL API is asked to download a new chunk of data from the URL, and that chunk becomes the new buffer for this streambuf.

Usage


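#include <iostream>
#include <string>
#include "lindenb/net/curl_streambuf.h"

static const char *url1="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,100100";

static void test0001()
    {
    /* the streambuf downloads the content of the URL chunk by chunk */
    lindenb::net::curl_streambuf curl(url1);
    /* wrap it in a std::istream to read it like any other stream */
    std::istream in(&curl);
    std::string line;
    while(std::getline(in,line,'\n'))
        {
        std::cout << line << std::endl;
        }
    }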

Output

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="100000" stop="100100" version="1.00">
<DNA length="101">
cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaa
attgtactgagactcttgcagtcacacaggctgacatgtaagcatcgcca
t
</DNA>
</SEQUENCE>
</DASDNA>
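
The "each downloaded chunk becomes the new buffer" idea used by curl_streambuf can be sketched without libcurl. The code below is not the cclindenb implementation: pull_streambuf and the chunk_source callback are hypothetical names, and the callback merely stands in for the step that asks CURL for more data. It assumes a C++11 compiler.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical stand-in for the download step: fills `chunk` with the next
// block of bytes and returns false once the source is exhausted.
typedef std::function<bool(std::string&)> chunk_source;

// Hypothetical example class: every call to underflow() pulls one chunk
// from the source and uses that chunk as the new get area.
class pull_streambuf : public std::streambuf
{
public:
    explicit pull_streambuf(chunk_source next) : next_(std::move(next))
    {
        assert(static_cast<bool>(next_));  // a callable source is required
    }

protected:
    int_type underflow() override
    {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        // Ask the source for more data, ignoring empty chunks.
        do
        {
            chunk_.clear();
            if (!next_(chunk_)) return traits_type::eof();
        } while (chunk_.empty());
        // The chunk just received becomes the buffer of this streambuf.
        char* begin = &chunk_[0];
        setg(begin, begin, begin + chunk_.size());
        return traits_type::to_int_type(*gptr());
    }

private:
    chunk_source next_;  // produces the next chunk on demand
    std::string chunk_;  // storage for the current get area
};
```

Because the chunks may have any size, a line or an XML tag can be split across two of them; the std::istream on top hides this, which is why std::getline in the usage above still returns whole lines.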

A second streambuf: extracting the DNA from a BIODAS/DNA XML

For this second streambuf, I've used the pull parser of libxml (the xmlTextReader API). This class is available at http://github.com/lindenb/cclindenb/blob/master/src/core/lindenb/bio/das/dna_streambuf.h. Here, the first time underflow is called, the streambuf creates a new xmlTextReaderPtr and skips all the XML elements until it finds a <DNA> tag. During the remaining calls to underflow, the internal buffer of this streambuf is filled with the text content of the DNA element, until the closing tag </DNA> is reached.

Usage

#include <iostream>
#include "lindenb/bio/das/dna_streambuf.h"
#include "lindenb/net/curl_streambuf.h"

static const char *url1="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,100100";

static void test0002()
    {
    /* first layer: download the DAS/DNA XML document */
    lindenb::net::curl_streambuf curl(url1);
    std::istream in_net(&curl);
    /* second layer: keep only the text content of <DNA> */
    lindenb::bio::das::dna_streambuf dasdna(in_net);
    std::istream in_das(&dasdna);
    char c;
    /* get() fails at the end of the sequence, which ends the loop */
    while(in_das.get(c))
        {
        std::cout << c;
        }
    }

Output

cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaaattgtactgagactcttgcagtcacacaggctgacatgtaagcatcgccat
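
The layering used by dna_streambuf (a streambuf that reads from another std::istream and publishes only part of it) can be sketched as follows. This is a simplified illustration, not the cclindenb code: it replaces the libxml pull parser with a naive scan for the <DNA> tag, so it would be fooled by comments or CDATA sections. The names dna_filter_buf and extract_dna are hypothetical, dropping whitespace is my assumption (based on the single-line output above), and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical example class: reads XML from `in` and exposes only the
// non-blank characters found between <DNA ...> and the next tag.
class dna_filter_buf : public std::streambuf
{
public:
    explicit dna_filter_buf(std::istream& in, std::size_t buffer_size = 64)
        : in_(in), state_(BEFORE), buffer_(buffer_size)
    {
        assert(buffer_size > 0);
    }

protected:
    int_type underflow() override
    {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        // First call: skip every tag until <DNA> is found.
        if (state_ == BEFORE) state_ = skip_to_dna() ? INSIDE : DONE;
        std::size_t n = 0;
        while (state_ == INSIDE && n < buffer_.size())
        {
            int_type c = in_.get();
            if (traits_type::eq_int_type(c, traits_type::eof()) || c == '<')
                state_ = DONE;  // reached </DNA> (or a truncated input)
            else if (!std::isspace(c))
                buffer_[n++] = traits_type::to_char_type(c);
        }
        if (n == 0) return traits_type::eof();
        setg(buffer_.data(), buffer_.data(), buffer_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    enum state { BEFORE, INSIDE, DONE };

    // Naive tag scan: read "<...>" items until one is named DNA.
    bool skip_to_dna()
    {
        std::string text, tag;
        while (std::getline(in_, text, '<') && std::getline(in_, tag, '>'))
        {
            if (tag == "DNA" || tag.compare(0, 4, "DNA ") == 0) return true;
        }
        return false;
    }

    std::istream& in_;
    state state_;
    std::vector<char> buffer_;
};

// Convenience wrapper for the sketch: returns the sequence held in `xml`.
std::string extract_dna(const std::string& xml)
{
    std::istringstream source(xml);
    dna_filter_buf filter(source);
    std::istream in(&filter);
    std::string dna;
    char c;
    while (in.get(c)) dna += c;
    return dna;
}
```

Since the filter only needs a std::istream as its source, it can sit on top of a file, a string or a network stream without any change, which is the same composition as in test0002.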


That's it
Pierre

1 comment:

Aaron Quinlan said...

Thank you, this is incredibly helpful!