24 October 2010

Where are the alternative reading frames in the Human Genome?

This post was inspired by a question recently asked on Biostar: "Do exons ever have different reading frames in spliced variants?".

To find those alternative reading frames, I used the knownGene table available from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz. This table contains the exon positions of each transcript in the human genome (hg18). The same table can be queried on the UCSC public MySQL server:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select * from knownGene limit 10\G'

(...)
*************************** 7. row ***************************
name: uc009vis.1
chrom: chr1
strand: -
txStart: 4268
txEnd: 6628
cdsStart: 4268
cdsEnd: 4268
exonCount: 4
exonStarts: 4268,4832,5658,6469,
exonEnds: 4692,4901,5805,6628,
proteinID:
alignID: uc009vis.1
*************************** 8. row ***************************
name: uc009vit.1
chrom: chr1
strand: -
txStart: 4268
txEnd: 9622
cdsStart: 4268
cdsEnd: 4268
exonCount: 9
exonStarts: 4268,4832,5658,6469,6720,7095,7777,8130,8775,
exonEnds: 4692,4901,5810,6628,6918,7605,7924,8229,9622,
proteinID:
alignID: uc009vit.1
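
Each row stores the exon boundaries as two comma-separated lists that end with a trailing comma, and UCSC coordinates are 0-based, half-open. As a minimal sketch (this is not the code used for this post), here is one way to parse a tab-delimited line of knownGene.txt in Java. The class name KnownGeneLineSketch and the helper parseCoords are hypothetical, and the column order is assumed to match the mysql output above.

```java
class KnownGeneLineSketch {
    /** Parses a UCSC coordinate list such as "4268,4832,5658,6469," into an int array. */
    static int[] parseCoords(String csv) {
        if (csv.isEmpty()) return new int[0];
        // String.split drops trailing empty strings, so the final comma is harmless
        String[] tokens = csv.split(",");
        int[] out = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = Integer.parseInt(tokens[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Row 7 of the mysql output above, rebuilt as a tab-delimited line (assumed layout)
        String line = String.join("\t",
                "uc009vis.1", "chr1", "-", "4268", "6628", "4268", "4268", "4",
                "4268,4832,5658,6469,", "4692,4901,5805,6628,", "", "uc009vis.1");
        // limit -1 keeps empty fields such as proteinID
        String[] f = line.split("\t", -1);
        int[] exonStarts = parseCoords(f[8]);
        int[] exonEnds = parseCoords(f[9]);
        System.out.println(f[0] + " " + f[1] + " (" + f[2] + ")");
        for (int i = 0; i < exonStarts.length; i++) {
            System.out.println("exon " + (i + 1) + ": " + exonStarts[i] + "-" + exonEnds[i]);
        }
    }
}
```

Note that both rows shown above have cdsStart equal to cdsEnd, so they describe non-coding transcripts and carry no reading frame at all.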


The following Java program creates an array of bytes longer than human chromosome chr1, the longest chromosome, so that the same array can be reused for every chromosome. The array is initialized with the constant 'NIL'. Then, for each chromosome and each transcript, we loop over the exons and record the reading frame (0, 1 or 2) at each position. If a position was already flagged with a different frame, a warning is printed to stdout.
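
The core of that idea can be illustrated on a toy example. What follows is a minimal sketch under stated assumptions, not the BioStar3034 program itself: the class FrameSketch, the method recordFrames and the two toy transcripts are all hypothetical, the array is tiny, and a single array is shared by both strands, which the original program may or may not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FrameSketch {
    static final byte NIL = -1; // marks a position with no frame recorded yet

    /**
     * Records the frame (0, 1 or 2) of every coding base of one transcript and
     * returns the positions already flagged with a different frame.
     * Coordinates are 0-based, half-open, as in knownGene.
     */
    static List<Integer> recordFrames(byte[] frames, int[] exonStarts, int[] exonEnds,
                                      int cdsStart, int cdsEnd, char strand) {
        List<Integer> conflicts = new ArrayList<>();
        int n = exonStarts.length;
        int codingIndex = 0; // coding bases seen so far, in the direction of transcription
        for (int i = 0; i < n; i++) {
            // on the minus strand, transcription starts at the last exon
            int e = (strand == '+') ? i : n - 1 - i;
            // keep only the coding part of the exon
            int start = Math.max(exonStarts[e], cdsStart);
            int end = Math.min(exonEnds[e], cdsEnd);
            for (int k = 0; k < end - start; k++) {
                int pos = (strand == '+') ? start + k : end - 1 - k;
                byte frame = (byte) (codingIndex % 3);
                if (frames[pos] == NIL) {
                    frames[pos] = frame;
                } else if (frames[pos] != frame) {
                    conflicts.add(pos);
                }
                codingIndex++;
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        byte[] frames = new byte[100]; // toy "chromosome"
        Arrays.fill(frames, NIL);
        // Toy transcript A: coding exons [10,20) and [30,40)
        recordFrames(frames, new int[]{10, 30}, new int[]{20, 40}, 10, 40, '+');
        // Toy transcript B: first exon one base shorter, which shifts the frame of the second exon
        List<Integer> conflicts =
                recordFrames(frames, new int[]{10, 30}, new int[]{19, 40}, 10, 40, '+');
        for (int pos : conflicts) {
            System.out.println("chr1:" + pos + "-" + (pos + 1) + " (+)");
        }
    }
}
```

In this sketch a transcript with cdsStart equal to cdsEnd records nothing, because the clipped coding interval of every exon is empty.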

Compilation & Execution


javac BioStar3034.java
java BioStar3034

Result


(...)
chr1:53286155-53286156 (+)
chr1:53286156-53286157 (+)
chr1:53286157-53286158 (+)
chr1:53286158-53286159 (+)
chr1:53286159-53286160 (+)
chr1:53286160-53286161 (+)
chr1:53286161-53286162 (+)
chr1:53286162-53286163 (+)
(...)
java BioStar3034 | sort | uniq | wc -l
300696



(Image from UCSC/OpenWetWare)


That's it
Pierre
