Repeat Identification and Annotation
We have created a pipeline to identify and annotate DNA repeats in mammalian genomes using two pre-existing tools, PALS/PILER and RepeatScout, neither of which had previously been used on an entire mammalian genome.
The pipeline breaks the genome into manageable chunks so that PALS can be run in parallel on a computer cluster. The chunks are then concatenated at the chromosome level and used as input for PILER, which generates clustered consensus sequences for the repeats on each chromosome. RepeatScout is run on individual chromosomes, and its output is converted to a format compatible with PILER output. To identify redundancy across chromosomes, the consensus sequences and the RepeatScout output are aligned to each other using WUBLAST. Redundancy is then minimized by clustering the consensus sequences together with the RepeatScout output on the basis of the WUBLAST results, generating globally alignable, non-redundant consensus sequences. In this fashion we have identified many previously known repeats and a number of previously unknown repeats, present at both low and high copy number.
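The chunking and redundancy-clustering steps described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names, the chunk size and overlap parameters, and the use of single-linkage clustering (any two sequences joined by an alignment hit end up in the same cluster) are all assumptions made for illustration, and the hit pairs are assumed to have already been parsed and filtered from the alignment output.

```python
from collections import defaultdict


def chunk_sequence(seq, chunk_size, overlap=0):
    """Split a sequence into (start, subsequence) chunks.

    Consecutive chunks share `overlap` bases (hypothetical parameter),
    so that a repeat spanning a chunk boundary is not lost.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # the last chunk already reaches the end of the sequence
    return chunks


def cluster_by_hits(ids, hits):
    """Group sequence ids into clusters using union-find.

    `ids` lists every consensus/RepeatScout sequence id; `hits` is an
    iterable of (id_a, id_b) pairs that passed some alignment threshold.
    Every id in `hits` is assumed to appear in `ids`.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    print(chunk_sequence("ACGTACGTAC", chunk_size=4, overlap=1))
    # Hypothetical ids: per-chromosome consensus families and a RepeatScout family.
    ids = ["chr1_fam1", "chr2_fam7", "rs_12", "chr3_fam2"]
    hits = [("chr1_fam1", "rs_12"), ("rs_12", "chr2_fam7")]
    print(cluster_by_hits(ids, hits))
```

In this sketch each resulting cluster would then be reduced to a single representative or consensus sequence; how that representative is chosen is not specified here.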
By analysing interspersed repeat data, we have found underlying correlations with respect to repeat numbers and insertions in mammalian genomes. An example of this type of correlation analysis is shown in the figure on the right.
Honours research projects are available. Contact Prof. David Adelson if you are interested.