Thursday 15 November 2012

Running GeneWise with HMMs #2

In a previous post about GeneWise, I talked about how to run GeneWise using HMMs of gene families, for example, HMMs of animal gene families from TreeFam.

I mentioned my perl script run_genewisedb.pl that runs GeneWise, by comparing each HMM in an input file of multiple HMMs, to each DNA sequence in a fasta file of multiple sequences.

A problem with this script is that it can be quite slow to run on a whole animal genome of several hundred megabases, using a large number of HMMs (eg. all ~16,000 HMMs for TreeFam families).

I've now written a second script, run_genewisedb_after_blast.pl, which lets you run GeneWise using HMMs, but only on the regions of DNA sequences that have tblastn matches for the proteins used to create the HMMs.

As one of the inputs for this second script, you need to give it a fasta file of the protein sequences used to create the HMMs. If you are using TreeFam HMMs, you can create this fasta file of protein sequences using my perl script get_treefam_family_seqs.pl.

The perl script run_genewisedb_after_blast.pl cuts down the run time of running GeneWise a lot, as it only runs GeneWise on regions of the DNA sequences that have good tblastn matches to proteins in the families used to create the HMMs. You can specify an e-value cutoff for the tblastn matches to control how strong a tblastn match must be, and also specify how many bases on either side of the tblastn match to include in the chunk of sequence given to GeneWise.

No comments: