12 May 2010

BigJoin: joining large files

This page was copied from http://code.google.com/p/code915/wiki/BigJoin.

BigJoin is a Java tool I wrote for my lab (primarily for joining VCF files). It joins large files on one or more columns.

Synopsis

java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin [options] {files(gz)|url(gz)}

Usage

This tool merges two or more large files, using one or more columns as the common key for each file. It temporarily stores and sorts its data using BerkeleyDB.
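To make the idea of a key-based join concrete, here is a minimal in-memory sketch. It is not the BigJoin source: a `TreeMap` stands in for the BerkeleyDB store, the class and method names (`JoinSketch`, `keyOf`, `join`) are made up for illustration, and the handling of duplicate keys (one output row per combination) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class JoinSketch {
    /** Builds a composite key from 1-based column indexes, e.g. {3, 5, 6}. */
    public static String keyOf(String[] tokens, int[] columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(tokens[columns[i] - 1]);
        }
        return sb.toString();
    }

    /**
     * Inner-joins the tab-delimited lines of several "files" on the given columns.
     * The TreeMap keeps keys sorted; it is an in-memory stand-in for the
     * on-disk store (assumption, for illustration only).
     */
    public static List<String> join(List<List<String>> files, int[] columns) {
        // key -> one list of rows per input file
        TreeMap<String, List<List<String>>> index = new TreeMap<String, List<List<String>>>();
        for (int f = 0; f < files.size(); f++) {
            for (String line : files.get(f)) {
                String key = keyOf(line.split("\t", -1), columns);
                List<List<String>> slots = index.get(key);
                if (slots == null) {
                    slots = new ArrayList<List<String>>();
                    for (int i = 0; i < files.size(); i++) slots.add(new ArrayList<String>());
                    index.put(key, slots);
                }
                slots.get(f).add(line);
            }
        }
        List<String> out = new ArrayList<String>();
        for (List<List<String>> slots : index.values()) {
            // Combine the rows of every file; a key missing from any file
            // yields no combination, which gives inner-join behaviour.
            List<String> partial = new ArrayList<String>(slots.get(0));
            for (int f = 1; f < slots.size(); f++) {
                List<String> next = new ArrayList<String>();
                for (String p : partial) {
                    for (String r : slots.get(f)) next.add(p + "\t" + r);
                }
                partial = next;
            }
            out.addAll(partial);
        }
        return out;
    }
}
```

Calling `JoinSketch.join(files, new int[]{3, 5, 6})` mirrors the `-c 3,5,6` option used in the examples further down, with each "file" given as a list of lines.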

Requirements

Java, and the Berkeley DB Java Edition jar (je-4.0.92.jar in the command above).


Download

Download bigjoin.jar at https://code.google.com/p/code915/downloads/list

Source code

https://code.google.com/p/code915/source/browse/trunk/tools/src/java/fr/inserm/umr915/tools/BigJoin.java

Options

  • -h help; prints this screen.
  • -i case-insensitive keys
  • -u expect unique keys per file (faster)
  • -d regex-delim (default:tab)
  • -c key columns, separated by commas (first column is '1')
  • -g ignore keys that are empty after trim(key)
  • -bdb bdb directory default:${HOME}
  • -t delimiter default:tab
  • -s smart sorting (slower) default:true
  • -z zip values (slower, less space) default:false
  • -L do a 'left join' if data is missing
  • -null [string] value if data is missing default:NULL
  • -p print key(s) as the very first columns. default:false
  • --log-level [level] optional. One of the java.util.logging.Level names
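
As a rough illustration of how the -i and -g options could act on a key before it is stored, here is a small sketch. The class `KeySketch` and its `normalize` method are hypothetical, and the exact behaviour of BigJoin (for instance whether it lower-cases or upper-cases keys) is an assumption, not taken from the source code.

```java
import java.util.Locale;

public class KeySketch {
    /**
     * Returns the key to store, or null when the line should be skipped.
     * ignoreCase  ~ the -i option (assumed here to lower-case the key)
     * ignoreEmpty ~ the -g option (skip keys that are empty after trimming)
     */
    public static String normalize(String key, boolean ignoreCase, boolean ignoreEmpty) {
        if (ignoreEmpty && key.trim().isEmpty()) {
            return null; // nothing left after trim: ignore this line
        }
        return ignoreCase ? key.toLowerCase(Locale.ROOT) : key;
    }
}
```

With this sketch, "Chr1" and "chr1" map to the same key when `ignoreCase` is true, and a blank key is dropped when `ignoreEmpty` is true.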

Examples


The following example joins ensGene.txt.gz and refGene.txt.gz using the chromosome/start/end as the common key. If a key is missing from one of the files, the missing values are replaced by "NULL". The output is ordered by chrom/start/end:
java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -s \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz |\
grep -v NULL | head -n 20 | tr "\t" ";"
(...)
594;ENST00000323275;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;ENSG00000127054;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,;594;NM_017871;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;CPSF3L;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,
594;ENST00000343938;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000215792;cmpl;cmpl;-1,0,2,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000360706;chr1;+;1250005;1254139;1253433;1254099;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000187488;cmpl;cmpl;-1,-1,0,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000338338;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300396,;1299145,1299688,1300033,1300443,;0;ENSG00000175756;cmpl;cmpl;0,1,0,-1,;594;NM_001127229;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300239,;1299145,1299688,1300033,1300443,;0;AURKAIP1;cmpl;cmpl;0,1,0,-1,
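
The 'left join' filling can be pictured with the sketch below. It is only an illustration: `LeftJoinFill`, `filler` and `fill` are hypothetical names, and writing one filler value per missing column is an assumption (it is at least consistent with the word counts reported by wc in the runs that follow, which come to 32 words per line).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LeftJoinFill {
    /** Returns `width` copies of the filler value, joined by tabs. */
    public static String filler(int width, String nullValue) {
        return String.join("\t", Collections.nCopies(width, nullValue));
    }

    /**
     * Concatenates the rows found for one key, one entry per input file.
     * A null entry means the key was absent from that file; it is replaced
     * by a filler row as wide as that file (widths[i] columns).
     */
    public static String fill(String[] rowsPerFile, int[] widths, String nullValue) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < rowsPerFile.length; i++) {
            parts.add(rowsPerFile[i] != null ? rowsPerFile[i] : filler(widths[i], nullValue));
        }
        return String.join("\t", parts);
    }
}
```

Here `nullValue` plays the role of the -null option, whose default is NULL.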

Same command, but count the output lines, use the default ordering, and fill in the missing values:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

93252 2984064 36481577

real 0m10.229s
user 0m12.697s
sys 0m0.548s


Count the lines, use default ordering, do not join the missing values:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

11519 368608 7775937

real 0m8.966s
user 0m11.925s
sys 0m0.456s


Count the lines, use default ordering, do not join the missing values, zip the data (slower but uses less space):
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -z \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

11519 368608 7775937

real 0m16.631s
user 0m19.181s
sys 0m0.500s


Count the lines, use default ordering, join the missing values, assume keys are unique:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -u \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

76958 2462656 27531602

real 0m8.466s
user 0m11.233s
sys 0m0.348s


Count the lines, join the missing values, use smart sorting:
time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
-c 3,5,6 -L -s \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

93252 2984064 36481577

real 0m10.018s
user 0m12.353s
sys 0m0.496s


TODO


Allow different column indexes for each file.


That's it
Pierre
