Small Suffix Trees

November 28th, 2006 7:03 PM

I’ve been working on a bioinformatics project involving lots of suffix trees. The project involves generating suffix trees for each genome in a set, and using them to generate small profiles of the genome signatures. Based on the nature of the genomes I’m working with (prokaryotic, in the 1—2Mbp range), I’ve been constraining the suffix trees to work with sequences up to 2097151 (221 - 1) base pairs in length. This is allowing me to keep a packed representation of the tree in ~18.6 bytes per base pair (based on the suffix trees generated for several of the genomes).

Recent work places the size of suffix trees (admittedly able to handle much larger sequences) in the 10—20 bytes per base pair range (Kurtz (1999), Reducing the Space Requirement of Suffix Trees). Although the application-specific size constraints give me an easy advantage, I’m very happy to have working code (in Perl!) that falls within this range, and even happier that it’s based on a just a few days of coding. I think attempting to squeeze a few more bytes out of the packed tree by following Kurtz’ approach would be beyond the scope of this project. However, even at 18 bytes I’ll be able to pack roughly 35 genomes into a gigabyte of RAM and that’s better than I was expecting.

Comments

… and this relates to PhotoStuff? :)

Posted by: gromgull on November 29th, 2006 5:56 AM

No. Unless you want to annotate genomes. :)

Posted by: kasei on November 30th, 2006 10:50 AM

You said it best. Funny how life comes full-circle sometimes. You’re working on suffix trees and I’m working on map data. I think you made it further than I ever did with those suffix trees too! Nice job.

Erik

Posted by: Erik Osterman on December 12th, 2006 4:00 AM