Gregory Todd Williams: Small Suffix Trees

Small Suffix Trees

November 28^th, 2006 7:03 PM

I’ve been working on a bioinformatics project involving lots of suffix trees. The project involves generating suffix trees for each genome in a set, and using them to generate small profiles of the genome signatures. Based on the nature of the genomes I’m working with (prokaryotic, in the 1—2Mbp range), I’ve been constraining the suffix trees to work with sequences up to 2097151 (2²¹ - 1) base pairs in length. This is allowing me to keep a packed representation of the tree in ~18.6 bytes per base pair (based on the suffix trees generated for several of the genomes).

Recent work places the size of suffix trees (admittedly able to handle much larger sequences) in the 10—20 bytes per base pair range (Kurtz (1999), Reducing the Space Requirement of Suffix Trees). Although the application-specific size constraints give me an easy advantage, I’m very happy to have working code (in Perl!) that falls within this range, and even happier that it’s based on a just a few days of coding. I think attempting to squeeze a few more bytes out of the packed tree by following Kurtz’ approach would be beyond the scope of this project. However, even at 18 bytes I’ll be able to pack roughly 35 genomes into a gigabyte of RAM and that’s better than I was expecting.

#Perl

Next: RDF::Query 1.041
Previous: Farm 255

Gregory Todd Williams

Small Suffix Trees

Comments