23 April 2010

Short post: Plain text vs Binary data.

This week, I was asked the difference between storing some data using a plain text format and a binary format. I wrote the following code C++ to illustrate how to store some genotypes in a ( possibly huge )table saved as a binary file. The first bytes of this file tells the number of individuals, the number of markers, the name of the individuals. Then for each marker, we get the name of the marker, its position and an array of genotypes for each individual.

The first invocation of this program (-p write ) creates a random table and a second call (-p read ) answers the genotypes for some random individuals/markers.



That's it.

Pierre

20 April 2010

Reading/Writing SAM/BAM files with the picard API for java.

I'm currently playing with the BAM/SAM Java API available in the picard package. This API can be used for creating new programs that read and write SAM files, the generic format for storing large nucleotide sequence alignments. Both SAM text format and SAM binary (BAM) format are supported. The source code comes with a simple example showing how to read and write a SAM/BAM file.

(...)
public void convertReadNamesToUpperCase(
final File inputSamOrBamFile,
final File outputSamOrBamFile
)
{
final SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
final SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(),
true, outputSamOrBamFile);

for (final SAMRecord samRecord : inputSam) {
// Convert read name to upper case.
samRecord.setReadName(samRecord.getReadName().toUpperCase());
outputSam.addAlignment(samRecord);
}
outputSam.close();
inputSam.close();
}
(...)
toay, I've played with this SAM API to build two simple java tools:

SbamGrep

SbamGrep is available at http://code.google.com/p/code915/wiki/SbamGrep. It filters a SAM/BAM file using the SAM flags.

Option

 -o (filename-out) or default is stdout SAM
-v inverse selection
-f [flag] add bam/sam flag for filtering. multiple separeted with a comma. One of:
READ_PAIRED or 1 or 0x1
PROPER_PAIR_ or 2 or 0x2
READ_UNMAPPED or 4 or 0x4
MATE_UNMAPPED or 8 or 0x8
READ_STRAND or 16 or 0x10
MATE_STRAND or 32 or 0x20
FIRST_OF_PAIR or 64 or 0x40
SECOND_OF_PAIR or 128 or 0x80
NOT_PRIMARY_ALIGNMENT or 256 or 0x100
READ_FAILS_VENDOR_QUALITY_CHECK or 512 or 0x200
DUPLICATE_READ or 1024 or 0x400

Example

The following command lists all the reads having a flag (READ_PAIRED and READ_UNMAPPED and MATE_UNMAPPED and FIRST_OF_PAIR):
java -cp sam-1.16.jar:dist/sbamgrep.jar fr.inserm.umr915.sbamtools.SbamGrep -f 0x4,0x1,0x8,0x40 file.bam
Result (masked):
@HD VN:1.0 SO:unsorted
@SQ SN:chrT LN:349250
IL_XXXX1 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (94**0-)*7=0688855555522@86;;;5;6:;63:4?-622647..-.5.%
IL_XXXX2 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*+*2396,@5+5:@@@;;5)50)6960684;58;86*5102)0*+8:*137;
IL_XXXX3 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/999-00328:88984@@=8??@@:-66,;8;5;6+;255,1;788883676'
IL_XXXX4 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (916928.82@@50854;33222224;@25?5522;5=;;858888555*0666
IL_XXXX5 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*4*5-**32989+::@82;;853+39;80.53)-)79)..'55.8988*200
IL_XXXX6 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (*+**,14265;@@??@8?9@@@5@488?8666260.)-*9;;;88:8'05418
IL_XXXX7 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9136242-2@@@;96.888@@@@80$585882623.':**+3*03137..--.
IL_XXXX8 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/89255855@88?585557..())@@@;5552286526755@@5888..3;/$
(...)

SbamStats

The second tool, SbamStats is available at http://code.google.com/p/code915/wiki/SbamStats. It provides a simple report about all the SAM flags used in one or more files.

Example

:
java -cp sam-1.16.jar:dist/sbamstats.jar fr.inserm.umr915.sbamtools.SBamStats file.bam
Output:
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 10827
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 10827
READ_PAIRED:0x1|FIRST_OF_PAIR:0x40 13951
READ_PAIRED:0x1|SECOND_OF_PAIR:0x80 13951
READ_PAIRED:0x1|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 23846
READ_PAIRED:0x1|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 23846
READ_PAIRED:0x1|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 33606
READ_PAIRED:0x1|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 33606
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|SECOND_OF_PAIR:0x80 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 174429
READ_PAIRED:0x1|READ_UNMAPPED:0x4|FIRST_OF_PAIR:0x40 174429
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 176219
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 176219
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 2623345
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 2623345
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 26232123
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 26232123



That's it !
Pierre

19 April 2010

A stateful C function for R: parsing Fasta sequences

In the following post, I'll create a C extension for R. This extension will iterate over all the FASTA sequences in a file and will return a pair(name,sequence) for each sequence, that is to say that I won't store all the sequences in memory.

The C code

The structure FastaHandler holds the state of the fasta parser. It contains a pointer to the FILE, the current sequence and fasta header. We also need to save the previous line.
/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save the previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;
}FastaHandler,*FastaHandlerPtr;

The function fasta_open opens the fasta file and initialize a new FastaHandlerPtr. This pointer is then persisted into a 'R' variable using R_MakeExternalPtr.
/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}
We also need a function to close the structure. The pointer to the FastaHandlerPtr is retrieved using R_ExternalPtrAddr(R_variable):
/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}
The function fasta_next return the next pair(name,sequence):
/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}
This function calls readLine reading the next non-empty line:
/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}
It also invokes the function make_sequence(name,dna). This function creates a new R pair(name,sequence) :
/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

The 'R' code

The 'C' dynamic library is loaded and each C function is bound to R using ".Call":
dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}

fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

Compiling

Here is my makefile:
run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O $@.gz ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip $@.gz

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

Testing

I've downloaded some fasta sequences from dbSNP/Apis Mellifera. This file is open with fasta.open, we loop over each sequence and we print those sequences having a length lower than 300bp, at the end we close the stream with fasta.close:
f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

Result:
$name
[1] "gnl|dbSNP|rs44106732 rs=44106732|pos=48|len=298|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GAAAATCAAAGACAATTTTTGGAATGGAACTAAATAACTTTATTCTTYCTTTCTTTCTCGTCGAGAATATCGTTATCGTTGCATGGGTTATGGAATAGCGTTCGTTAAAAATGTTATATTTCGAGGAAATATCGAAGATAGGCTTTGCGAAAGTCTGTTTCTCTAGAATTAAGATTTATATGTGTTGCAAGGGGAAGTTCAAAGAGAAATCGTGGCCAGTTCGAACATTATTATGTCTATCAATGATCGAGAAGTGTCATTGATGCAAGAGAAAAGTTTTCTCTTGCATTTAAGTATC"

$name
[1] "gnl|dbSNP|rs44108824 rs=44108824|pos=251|len=268|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GTTAAATGGGAATTTTGGGGATTATTGGAGGAGGATTTACGTTTCGAGGATTGTTGATGATCTTAGGATTGTTTTCAGTTTGGAATTTTTTCTTCTTCTTCAACGTTGTACAACATTCTCCTCGAATTTTGTGTAACGAGGAGGATTAAACCTTTGGAAAATCACGTAAATTAGAGGACGATATATTGGGTTGGCAACTAAGTAATTGCGGATTTTTTTTAGAAAATCAAAGACAATTTTTGGAATGGAAYTAAATAACTTTATTCTT"

$name
[1] "gnl|dbSNP|rs44150807 rs=44150807|pos=251|len=274|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "TTCGATCTTCGTTTCACGCTCACGTTTCACGTTTCTCGTCCGGCGTAACGAACGGTATTCCGCTGATCACGAATAATTTCTTTCTGGAGAGTCCATTAGGGGACCGTCCCCTCCCCCCTCTCGCGCACACAGACACCCATCGCTTTCGACGCCTCGTCCATTCGAGGGAGAAACGAACGACTATTAGAAAAAAATCTTCTTTATATCTCTATAAATTCAATTTGCAGAGAAGCAAAGAGCTTTAAAATATYAACCATTATAACCGAACTTGTTG"

(....)


Full Source code


Makefile


run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O $@.gz ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip $@.gz

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

fasta.c


#include <ctype.h>
#include <errno.h>
#include <R.h>
#include <Rinternals.h>

/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;

}FastaHandler,*FastaHandlerPtr;


/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}

/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}

/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}

/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

fasta.R


dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}
fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

That's it !
Pierre

14 April 2010

Object Oriented Programming with R: My notebook

In the following post, I describe how I've used the OOP features of R to create and use the following class hierarchy:

Your browser does not support the <CANVAS> element !


First the class Person is defined. It contains four fields : firstName, lastName, birthDate and birthPlace.
setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

A kind of 'constructor' function can be called for Person to check that both firtsName and lastName are not empty:
setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)
DeceasedPerson is a subClass of Person, it contains two more fields: deathPlace and deathDate:
setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)
Scientist is another subClass of Person it contains one more field:'knownFor':
setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)
Lastly, DeceasedScientist is a subClass of both Scientist and DeceasedPerson:
setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)
Let's define a 'generic' function 'age' returning the age of an individual from his 'birthdate':
age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}
setGeneric("age")
Polymorphism: for the DeceasedPerson another function will be used, it will calculate the age from both 'deathDate' and 'birthDate':
age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}
setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)
Ok, let's play with our class, we can first create a new instance of Scientist for Craig Venter:
craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)
... and Charles Darwin is a DeceasedScientist:
charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)
Hey , we know where Charles was born!
charlesDarwin@birthPlace="Shrewsbury"
The following statement fails because the firstName is empty:
> try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)
Error in validObject(.Object) : invalid class "Person" object: FALSE
Is Darwin a valid object?:
> validObject(charlesDarwin)
[1] TRUE
Print both individuals:
> charlesDarwin
An object of class “DeceasedScientist”
Slot "knownFor":
[1] "Natural Selection" "The Voyage of the Beagle"

Slot "firstName":
[1] "Charles"

Slot "lastName":
[1] "Darwin"

Slot "birthDate":
[1] "1809-02-12"

Slot "birthPlace":
[1] "Shrewsbury"

Slot "deathDate":
[1] "1882-04-19"

Slot "deathPlace":
character(0)

> craigVenter
An object of class “Scientist”
Slot "knownFor":
[1] "The Institute for Genomic Research" "J. Craig Venter Institute"

Slot "firstName":
[1] "Craig"

Slot "lastName":
[1] "Venter"

Slot "birthDate":
[1] "1946-10-14"

Slot "birthPlace":
[1] "Salt Lake City"
Let's use the 'is' operator:
> is(craigVenter,"Person")
[1] TRUE
> is(craigVenter,"DeceasedScientist")
[1] FALSE
> is(charlesDarwin,"DeceasedScientist")
[1] TRUE
Finally let's invoke the polymorhic function 'age' for both individuals:
> age(charlesDarwin)
[1] 73 #age.of.death was called
> age(craigVenter)
[1] 63 #generic "age'


Full source code

setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)

setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)

setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)

age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}

setGeneric("age")

age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}


setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)

setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)

craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)

charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)

try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)

charlesDarwin@birthPlace="Shrewsbury"

validObject(charlesDarwin)

charlesDarwin
craigVenter


is(craigVenter,"Person")
is(craigVenter,"DeceasedScientist")
is(charlesDarwin,"DeceasedScientist")
age(charlesDarwin)
age(craigVenter)


That's it !
Pierre

08 April 2010

BAM, BWA and SAMTOOLS: my notebook

Here are my notes about MAQ,BWA and SAMTOOLS.

MAQ


Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. Here I used chrM.fa, the reference sequence for the mitochondrial genome
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTA.....
fasta2bfa converts the reference sequence to the binary fasta format
$(MAQ) fasta2bfa chrM.fa _reference.bfa


I also generated a set of random fastq sequences for the chrM. Each read contains a mutation A/G every 10 bases and an insertion of 3 bases every 50 bases.
(...)
@chrM|8|2
GTATATCACCCTATTAACCACTAACGGGAG
+
kzzst|{efokkdyirlqkflupvofumxe
@chrM|20|3
ATTAACCACTAACGGGAGCTATCCATGCAT
+
zplixjsdlmeemjrxxkt|nrd?qwqwym
@chrM|25|4
CCACTAACGGGAGCTATCCATGCATAAATGGT
+
|uklu}+rqepiuoi|qjyrhpyzlllrq|oz
@chrM|32|5
CGGGAGCTATCCATGCATAAATGGTATTTTAG
+
qyvylqiqkhddve4rddxh|frmznluouxp
(...)

fastq2bfq converts the short reads to the binary fasta format:
$(MAQ) fastq2bfq chrM.fastq _reads.bfq

We can now align the reads to the chrM:
$(MAQ) map -u _poor.txt -H _multiple.data _align.map _reference.bfa _reads.bfq
'_multiple.data' contains the multiple/all 01-mismatch hits , and '_poor.txt' contains the unmapped and poorly aligned reads:
chrM|0|1 99 AAAATCACAGGTATATCACCCTATTAACCACT ````````T``````)````````````````
chrM|25|4 99 CCACTAACGGGAGCTATCCATGCATAAATGGT ``````+`````````````````````````
chrM|32|5 99 CGGGAGCTATCCATGCATAAATGGTATTTTAG ``````````````4`````````````````
chrM|35|6 99 GAGCTATCCATGCATAAATGGTATTTTAGTCT ``````````````````````2`````````
chrM|49|7 99 TAAATGGTATTTTAGTCTGGGGGATGTGCACG ````````````````````````````````
chrM|76|10 99 ACGCAATAGCATTGAGAGACGCTGAAAAGCCG `````````````&``````````````````
chrM|80|11 99 AATAGCATTGAGAGACGCTGAAAAGCCGGAGC ``````````````B```Q``````2``````
chrM|89|12 99 GAGAGACGCTGAAAAGCCGGAGCACCCTATGT ````````````````:```````````````
chrM|90|13 99 AGAGACGCTGAAAAGCCGGAGCACCCTATGTC ````````````````````````````````
chrM|92|14 99 AGACGCTGAAAAGCCGGAGCACCCTATGTCAC ``````````````````?`````````````

View the mapping alignment with mapview:
$(MAQ) mapview _align.map > _mapview.txt

##result 'verticalized'

$1 read.name : chrM|71|26834
$2 chromosome : chrM
$3 position : 72
$4 strand : +
$5 insert.size : 0
$6 paired.flag : 0
$7 mapping.quality : 59
$8 single-end.mapping.quality : 59
$9 alternative.mapping.quality: 59
$10 number.mismatches.of.best.hit : 2
$11 sum.qualities.mismatched.bases.best.hit : 60
$12 number of 0-mismatch hits of the first 24bp : 0
$13 number of 1-mismatch hits of the first 24bp on ref : 1
$14 length of the read : 32
$15 sequence : TGTGCACGCGATAGCATTGGGAGACGCTGGGG
$16 quality : ```````````````````````X````````
(...)


mapstat : get statistics from the alignment:
$(MAQ) mapstat _align.map > _mapstat.txt


-- Total number of reads: 308
-- Sum of read length: 9856
-- Error rate: 0.058746
-- Average read length: 32.00

-- Mismatch statistics:

-- MM 0 1
-- MM 1 37
-- MM 2 268
-- MM 3 2

-- Mapping quality statistics:

-- MQ 20-29 1 1
-- MQ 40-49 138 138
-- MQ 50-59 107 107
-- MQ 70-79 62 62

-- Flag statistics:

-- FG 0 308


mapcheck: Read quality check
$(MAQ) mapcheck _reference.bfa _align.map > _mapcheck.txt

Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 16571
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 3846
Length of 24bp unique regions of the reference: 3846
Reference nucleotide composition: A: 30.86%, C: 31.33%, G: 13.16%, T: 24.66%
Reads nucleotide composition: A: 41.35%, C: 24.81%, G: 13.91%, T: 19.93%
Average depth across all non-gap regions: 0.591
Average depth across 24bp unique regions: 0.591

A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? 1? 2? 3? 4? 5? 6? : 0? 1? 2? 3? 4? 5? 6?
1 52.1 18.5 16.8 12.5 : 0 33 25 253 0 0 125 0 0 288 0 182 : 3 204 0 0 254 340 198 : 15 998 167 167 0 0 0
2 56.4 16.2 12.9 14.5 : 0 41 0 164 33 0 171 0 0 220 0 34 : 3 36 0 3 412 346 198 : 15 991 167 15 200 29 0
3 52.1 19.5 12.5 15.8 : 0 29 7 114 43 0 222 0 0 138 0 52 : 3 7 3 7 438 340 201 : 15 8 15 8 255 10 0
4 45.9 24.1 10.2 19.8 : 0 15 15 52 0 0 97 0 0 48 0 16 : 10 3 3 3 438 336 204 : 5 15 15 15 105 10 0
5 39.9 25.7 15.2 19.1 : 0 34 0 13 0 0 68 0 0 48 0 16 : 0 3 3 3 442 353 195 : 167 15 15 15 90 0 0
6 35.6 25.7 13.5 25.1 : 0 10 0 37 0 0 48 0 0 26 0 0 : 3 10 0 7 445 343 191 : 15 5 167 8 59 0 0
7 38.3 31.4 12.5 17.8 : 0 0 0 10 0 0 73 0 0 0 0 0 : 0 7 0 7 445 340 201 : 167 8 167 8 30 0 0
8 38.0 26.7 14.2 21.1 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 7 3 3 435 346 201 : 15 8 15 15 0 0 0
9 39.9 20.1 18.5 21.5 : 0 0 0 116 0 0 18 0 0 58 0 0 : 0 17 7 0 425 353 198 : 167 3 8 167 8 112 0
10 48.5 15.2 17.2 19.1 : 0 0 0 238 32 0 82 0 0 113 0 70 : 0 0 3 7 435 353 201 : 167 167 15 8 46 261 0
11 36.6 27.4 17.5 18.5 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 13 7 10 425 353 188 : 15 4 8 5 0 0 0
(...)


pileup displays the alignment in a ‘pileup’ text format.
$(MAQ) pileup _reference.bfa _align.map > _pileup.txt

#chrom pos. ref depth base.on.read(@=same)
chrM 1 G 0 @
chrM 2 A 0 @
chrM 3 T 0 @
(...)
chrM 72 T 1 @,
chrM 73 G 1 @,
chrM 74 T 1 @,
chrM 75 G 1 @,
(...)
chrM 110 C 2 @,,
chrM 111 A 2 @GG
chrM 112 C 2 @,,
(..)


assemble creates a consensus sequences from read mapping.
$(MAQ) assemble _consensus.cns _reference.bfa _align.map 2> /dev/null


cns2snp extracts SNP sites from the consensus.
$(MAQ) cns2snp _consensus.cns > _snp.txt

##verticalized
$1 chromosome : chrM
$2 position : 91
$3 ref.base : C
$4 consensus.base : G
$5 phred.qual : 30
$6 read.depth : 1
$7 average number of hits of reads : 1.00
$8 highest mapping quality of the reads covering the position : 59
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 30
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 29
$12 third best call : N
(...)


cns2view shows detailed information at all sites from the consensus.
$(MAQ) cns2view _consensus.cns _viewsnp.txt

##verticalized
$1 chromosome : chrM
$2 position : 1
$3 ref.base : G
$4 consensus.base : N
$5 phred.qual : 0
$6 read.depth : 0
$7 average number of hits of reads : 0.00
$8 highest mapping quality of the reads covering the position : 0
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 0
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 0
$12 third best call : N
(...)


cns2fq extracts the consensus sequences in FASTQ format. "Bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called"
$(MAQ) cns2fq _consensus.cns > _cns2fq.txt


@chrM
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnntgtgcacgcgatagcattgggagacgcTggrGccggagcgccctatgtc
gcagtatctgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntcaATATT
ACAGGCGAACATACCTACTAAAAAGtgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncagACATCATAACAAAAAATTTCCACCA
AAAACCccccnnnnnnnnnnnntggccacaacacttaaacacatctctgcaaaannnnnn
(...)
+
(...)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!???????????????????????????2EE8EBBBBBBBBBBBBBBBBB
BBBBBBBBBB!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!BHHHKKKK
KKKKKKKKKKKKKKKKKKKKKKKKE???!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(...)

BWA


Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

index indexes the chrM in the FASTA format:
$(BWA) index -a is -p _bwa.db chrM.fa


aln finds the suffix array coordinates of the input reads.
$(BWA) aln _bwa.db chrM.fastq > _bam.aln.sai


samse generates alignments in the SAM format given single-end reads.
$(BWA) samse _bwa.db _bam.aln.sai chrM.fastq > _bam.samse.sam

#_bam.samse.sam verticalized
(...)
$1 query : chrM|12602|40780
$2 flag : 0
$3 ref: chrM
$4 position : 12603
$5 quality : 37
$6 cigar.string : 30M
$7 mate.ref.sequence : *
$8 mate.pos : 0
$9 insert.size : 0
$10 query.seq : TCATCCCTGTAGCATTGTGCGTTACATGGT
$11 query.qual : kdfz=|srknznewhnkftyowhtmevnhz
$12 ? : XT:A:U ##cf. manual
$13 ? : NM:i:1
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:1
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:18T11

(...)


SAM tools

The SAM format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. Sometimes this tool needs a tab-delimited file 'chrM.ref' containing the length of each chromosome:
chrM 16571


import converts the alignment into BAM (a binary/compact version of SAM)
$(SAM) import chrM.ref _bam.samse.sam _unsorted.bam


sort BAM alignments by leftmost coordinates
$(SAM) sort _unsorted.bam _sorted


sort prints the alignments:
$(SAM) view -h _sorted.bam _sam.view.txt

#verticalized result (BAM format):

$1 ? : chrM|8|2
$2 ? : 0
$3 ? : chrM
$4 ? : 9
$5 ? : 25
$6 ? : 30M
$7 ? : *
$8 ? : 0
$9 ? : 0
$10 ? : GTATATCACCCTATTAACCACTAACGGGAG
$11 ? : kzzst|{efokkdyirlqkflupvofumxe
$12 ? : XT:A:U
$13 ? : NM:i:2
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:2
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:2C19C7


pileup prints the alignment in the pileup format.
$(SAM) pileup -c -f chrM.fa _sorted.bam > _sam.pileup.txt

#verticalized...
(...)
$1 chromosome name : chrM
$2 ref coordinate : 11
$3 reference base : C
$4 read base : A
$5 consensus quality : 60
$6 SNP quality : 60
$7 root mean square mapping quality : 25
$8 reads covering the position : 11
$9 read bases at a SNP line : AAAAAAAAA^:A^:A
$10 read quality : yj
(...)


That's it
Pierre

05 April 2010

The Alchemist: CSS pseudo 3D.

In the following post, I use a two-and-a-half-dimensional (2.5) technique to fake a three-dimensional (3D) visualization of Joseph Wright of Derby's painting The Alchemist in Search of the Philosopher's Stone This post was mostly inspired by Román Cortès' 3D Menimas and a post on Experiment Garden.

The Alchemist in Search of the Philosopher's Stone


The following image/script was successfully tested under Firefox 3.6. Move your mouse over the image

How it works

Each component of the collage is part of a larger transparent picture:
An array defines each sprite. One sprite is defined by its rectangular clip (x,y,width,height) in the original image, its position (x,y) in the collage and a value (ratio) reflecting the amount of pixels it should scroll
var picts=[
{
/* background */
img:null,
clip:[0,0,image_width,image_height],
ratio:0.0,
x:0,y:0
},
{
/* people */
img:null,
clip:[227,1,111,147],
ratio:0.1,
x:5,y:154
},
(...)
];
The HTML is defined as follow:
<div id='id9879709_main' style='overflow:hidden;position:relative;background-color:red;width:226px;height:320px;border:1px solid black;'>
<span style="position:relative;padding:3px;background-color:white;top:15px;left:15px;border:2px solid black;z-index:100; font-weight: bold;"><a target="_wp" href="http://commons.wikimedia.org/wiki/File:JosephWright-Alchemist.jpg">The Alchemist</a></span>
</div>
.When this html document is loaded, a <div> clipping each sprite is created an inserted into the document:
var main=document.getElementById('main');
main.onmousemove = callback;
for(var i in picts)
{
var pict=picts[i];
pict.img=document.createElement("div"); pict.img.setAttribute("style","position:absolute;top:"+pict.y+"px;left:"+pict.x+"px;width:"+pict.clip[2]+"px;height:"+pict.clip[3]+"px;z-index:"+i+";background:url(alchemy02.png) no-repeat scroll -"+pict.clip[0]+"px -"+pict.clip[1]+"px;");
main.appendChild(pict.img);
}
Lastly a javascript callback is called each time the mouse moves over the image. This method shifts each sprites according to the value of its 'ratio':
var v=document.getElementById('id9879709_main');
var left=v.offsetLeft;
var width=226.0;
var midX= left+width/2.0;
var maxshift=30;
var mouseX=e.clientX;
if(mouseX &gt; left+width) mouseX=left+width;
if(mouseX &lt; left) mouseX=left;
var shift= ((e.clientX-midX)/(width/2.0))*maxshift;
for(var i in picts)
{
var pict=picts[i];
pict.img.style.left=(pict.x+shift*pict.ratio)+"px";
}


That's it
Pierre