Philippe Chouvarine, 2008

ALGORITHM & SOFTWARE DEVELOPMENT

We develop novel data mining and data analysis tools for use by genome scientists around the world.

Sequence Read Classification Pipeline (SRCP) - Philippe Chouvarine has led development of MGEL's "Sequence Read Classification Pipeline" (SRCP). The SRCP is a genome characterization tool in which novel Perl and DTS scripts have been integrated with standard bioinformatics tools (e.g., BLAST). A manuscript on the SRCP was published in Analytical Biochemistry (click here to request a reprint). The SRCP scripts, source code, and sequence databases are available through our Bioinformatics Tools page.

CotQuest - For many years, we have used the program of Pearson et al. (1977) to analyze Cot data. However, the program is difficult to use and is prone to errors caused by local maxima/minima. Moreover, it does not generate an error estimate for each output value. In collaboration with statistician Dr. John Bunge (Cornell University), MGEL's Philippe Chouvarine and Daniel Peterson have developed a new SAS-based program that makes Cot analysis more straightforward and statistically robust. The program is available on the Bioinformatics Tools page, and a manuscript describing it has been accepted for publication in Analytical Biochemistry.

Targeted Data Mining of Spatial Proximity Relationships in Genomes - This project is being led by Surya Saha as part of his Ph.D. dissertation. Mr. Saha is co-advised by Dr. Susan Bridges and MGEL's Daniel Peterson. Dispersed repetitive DNA sequences have played a prominent role in the evolutionary histories of eukaryotic genomes, and their persistence in eukaryotic DNA indicates that they have, on the whole, been evolutionarily advantageous. While an increasing number of algorithms have been developed for discovering novel dispersed repeats, significant analysis of the repeats and their relationships to other genome features will be required before we can truly understand the complex ways in which dispersed repeat sequences contribute to evolutionary fitness. To this end, we have developed an approach that mines the coordinates of repetitive regions on chromosome-length DNA sequences and describes proximity relationships both between repeat families and between repeats within a family. Association rule data mining is used to elucidate relationships among repetitive elements, and regions containing clusters of repeats are explored using graph theory to provide insight into aggregate proximity relations. The approaches described will be extended to address proximity relationships among all annotated elements within a genome, such as genes and regulatory elements.
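
The association-rule step described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the Repeat record, the proximity_rules function, the family names, the coordinates, and the max_gap and min_confidence thresholds are all hypothetical, and the sketch only computes pairwise rules of the form "an instance of family A is followed by an instance of family B within max_gap bp", rather than running a full association rule miner.

```python
from collections import Counter, namedtuple

# Hypothetical record for one annotated repeat on a single chromosome.
Repeat = namedtuple("Repeat", ["family", "start", "end"])


def proximity_rules(repeats, max_gap=500, min_confidence=0.5):
    """Return pairwise rules (A, B, support, confidence).

    support    = number of A instances with a B instance starting no more
                 than max_gap bp past the A instance's end (overlaps count)
    confidence = support / total number of A instances
    """
    ordered = sorted(repeats, key=lambda r: r.start)
    family_counts = Counter(r.family for r in ordered)
    pair_counts = Counter()

    for i, a in enumerate(ordered):
        nearby = set()  # count each downstream family once per A instance
        for b in ordered[i + 1:]:
            if b.start - a.end > max_gap:
                break  # sorted by start, so later repeats are farther away
            nearby.add(b.family)
        for family in nearby:
            pair_counts[(a.family, family)] += 1

    rules = []
    for (fam_a, fam_b), support in sorted(pair_counts.items()):
        confidence = support / family_counts[fam_a]
        if confidence >= min_confidence:
            rules.append((fam_a, fam_b, support, confidence))
    return rules


# Made-up coordinates, for illustration only.
example = [
    Repeat("LTR1", 100, 400),
    Repeat("SINE2", 450, 600),
    Repeat("LTR1", 5000, 5300),
    Repeat("SINE2", 5400, 5550),
    Repeat("LTR1", 20000, 20300),
    Repeat("MITE3", 40000, 40200),
]

for fam_a, fam_b, support, confidence in proximity_rules(example):
    print(f"{fam_a} -> {fam_b}: support={support}, confidence={confidence:.2f}")
```

In this toy data set, two of the three LTR1 instances have a SINE2 instance close downstream, so the only rule reported is LTR1 -> SINE2. Because the inner loop also considers repeats of the same family, the same function would report within-family proximity (A -> A) when such pairs occur.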
