Algorithms for High Throughput Sequencing
High throughput sequencing technologies are generating a wealth of sequence data. These technologies are often used to obtain sequences from DNA or RNA samples and perform computation and analysis of digital gene expression values. However, these short sequence “reads” must be aligned to an available reference genome before generating such values. An ongoing research project in the Perkins lab is to study sequence mapping and design improved algorithms for mapping transcribed sequences.
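To make the mapping step concrete, the following is a minimal sketch of a seed-and-verify read mapper. It is an illustration under stated assumptions, not the Perkins lab's algorithm: the function names, the k-mer size, the mismatch threshold, and the toy reference are all hypothetical, and real mappers additionally handle gaps, splicing, and base quality scores.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    candidates = set()
    # Seed step: every k-mer of the read proposes the read start it implies.
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    # Verify step: count mismatches at each candidate start, keep the best.
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# Hypothetical toy reference and a read carrying one substitution.
reference = "ACGTTGCAAGGCTTACGGATCCATGCA"
index = build_kmer_index(reference, k=4)
hit = map_read("GGCTTACGCATC", reference, index, k=4)
```

In this toy case the read is placed at reference position 9 with a single mismatch. A read that shares no k-mer with the reference yields no candidates and is reported as unmapped.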
Understanding Duplications in Vertebrate Gene Families
The Hoffmann Lab is focused on understanding biological diversity from a genomic standpoint, with occasional forays into the molecular evolution of viruses, and on the diversification of mammalian species. Most of the work is focused on understanding the relative contribution of gene and whole genome duplications to the expansion of vertebrate gene families and the origin of biological innovations, and on assessing the potential role of natural selection in the acquisition of novel biological functions. In particular, we are very keen on mammalian gene families, which provide an outstanding model for these types of questions. Because vertebrate globins are among the most studied gene families from biochemical, structural, functional, and molecular standpoints, we can use the tools of comparative genomics, phylogenetic analysis and molecular evolution to great advantage in this system. Accordingly, students in the Hoffmann Lab will learn to analyze genomic sequence data to understand copy number variation and to assess the evolutionary forces underlying the observed patterns of variation.
Interactive Visualization of Biological Data
REU projects for bioinformatics will challenge students to work together with computer scientists and biology experts to solve complex problems via interactive computer graphics. While two examples of such projects are given below, actual projects will be determined in collaboration with the application scientist and the student.
Extend MSAVis: MSAVis (Figure 1) has several feasible extensions that can be tackled in parallel by dedicated students; two are presented here. First, as it stands, MSAVis does not allow editing of protein sequences to test different alignment hypotheses; this is a feature of interest to its users. A student would add this functionality, which would involve modifying MSAVis' interaction mechanisms and integrating it with sequence alignment software. Second, there are additional protein features that could be integrated, such as binding sites or information about secondary structure. Such a project would involve designing the visual metaphors for the added information and designing the interface to query the biological databases to extract them.
Gene Atlas: In this project, a web-based gene atlas (Figure 2) will be refined. The gene atlas allows the efficient comparison of multiple gene expression samples (usually from species at different times in their life cycle). Additional interaction methods and visual metaphors could be explored to make this a tool with genuine impact on biological studies.
Functional Genomics in Developmental Biology
Dr. Memili’s research areas include functional genomics of mammalian gamete and embryo development, and epigenetics of stem cells, stemming from his graduate research at the University of Wisconsin-Madison and his postdoctoral research at Harvard Medical School, respectively. The REU students will conduct original, hypothesis-driven research to ascertain developmental mechanisms in the gametes and embryos, and maintenance of stemness in adult stem cells, specifically adipose derived stem cells (ASC). The research involves multidisciplinary approaches such as developmental biology, computational biology, and epigenetics. The students will apply diverse approaches including in vitro fertilization of bovine and mouse oocytes and culture of embryos and ASC. Specific projects will include identification and manipulation of specific sperm borne microRNAs regulating embryo development, and microRNAs of ASC controlling stemness.
Transcriptome Analysis Using RNA-Seq
Transcriptome analysis methods using RNA sequencing (RNA-Seq) enable monitoring of changes in gene expression with high accuracy. Since this method of studying transcriptomes offers an unbiased snapshot of gene expression with increased sensitivity, dynamic range and discriminatory power compared to microarrays, it has immense potential for host-pathogen systems biology. The so-called “dual RNA-Seq” can, in principle, determine the temporal gene expression changes in the host and pathogen during infection. The promise of this next generation sequencing advance can only be fulfilled with parallel advances in computational strategies for addressing the data analysis challenge involved in mapping the sequencing reads from a single run to multiple species, i.e., host and pathogen. The data analysis issues are exacerbated when one looks at a multifactorial syndrome such as Bovine Respiratory Disease (BRD). BRD in cattle is caused by a number of viral and bacterial pathogens. Conducting RNA-Seq from the terminal site of infection, i.e., infected bovine lung tissue, is expected to capture gene expression changes in the bovine host and bacterial pathogens such as M. haemolytica, P. multocida and H. somnus, to name a few (Figure 3). Thus BRD RNA-Seq extends the dual RNA-Seq concept to a single host and multiple pathogens, increasing the complexity of the informatic challenges posed by the data. Mapping RNA-Seq reads in this scenario could possibly require a metatranscriptomic approach to identifying common bacterial functions and pathways. There is a need for evaluating the suitability of existing algorithms such as Maq, Bowtie, or others for mapping RNA-Seq reads to multiple genomes. It is conceivable that the algorithms would require ‘scaling up’. It is also possible that we would need to build on these methods or develop completely novel methods.
The undergraduate participants in this project will take part in all of the aspects described above of developing a computational framework for mapping RNA-Seq reads to multiple species. We will use a synthetic dataset to develop these methods. The available bovine RNA-Seq data from public resources as well as our RNA-Seq data for M. haemolytica, P. multocida and H. somnus will be combined to generate a dataset for all species involved in the infection.
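One way to picture the synthetic dataset is as a shuffled pool of reads whose species of origin is known, so that any multi-genome mapping strategy can be scored against the truth. The sketch below is an assumption-laden illustration: the function name, the read length, the read counts, and the random toy "genomes" are hypothetical stand-ins, and the project itself would combine real bovine and bacterial RNA-Seq data rather than simulated sequence.

```python
import random


def simulate_mixed_reads(genomes, reads_per_species, read_length, seed=0):
    """Sample fixed-length reads from each genome and shuffle them together.

    Each read is returned as (species, start, sequence) so that a mapper's
    output can later be checked against the known origin.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    reads = []
    for species, sequence in genomes.items():
        for _ in range(reads_per_species[species]):
            start = rng.randrange(len(sequence) - read_length + 1)
            reads.append((species, start, sequence[start:start + read_length]))
    rng.shuffle(reads)
    return reads


# Hypothetical toy genomes: random sequences standing in for real references.
toy_rng = random.Random(1)
genomes = {
    name: "".join(toy_rng.choice("ACGT") for _ in range(length))
    for name, length in [("host", 500), ("pathogen_1", 200), ("pathogen_2", 200)]
}
reads = simulate_mixed_reads(
    genomes,
    {"host": 30, "pathogen_1": 10, "pathogen_2": 10},
    read_length=25,
)
```

Because every read carries its true species and start position, the fraction of reads assigned to the wrong genome can be computed directly for each candidate mapper.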
Understanding Emergence and Adaptation of Oseltamivir and Amantadine Drug-Resistance in Influenza A Viruses Using Machine Learning
Influenza A viruses may cause pandemics that affect multiple continents as well as seasonal influenza epidemics across single or multiple nations. Four documented influenza pandemics occurred in 1918, 1957, 1968, and 2009. More than 40 million people were killed in the 1918 pandemic. The peak influenza season in the northern hemisphere is from January to April every year. More than 200,000 hospitalizations and up to 49,000 deaths are caused by influenza in the United States each year. In addition, influenza A viruses can cause infections in birds and other animals, and lead to large economic losses and financial burdens. For instance, the ongoing epidemics of H5N1 highly pathogenic avian influenza viruses in Asia, Europe, and Africa have caused the culling of at least 220 million birds since the first detection of this virus by Wan.
The two main classes of anti-influenza drugs are neuraminidase inhibitors, such as zanamivir and oseltamivir, and viral M2 protein inhibitors, such as amantadine and rimantadine. They are used clinically to reduce disease symptoms in patients. However, the emergence of drug resistance has been a continuous challenge to influenza treatment. As of 2010, human seasonal influenza A viruses, H1N1 and H3N2, have been found to be nearly 100 percent resistant to oseltamivir and amantadine, respectively. Molecular characterization of influenza A viruses recovered from influenza surveillance in Asia demonstrated that up to 95% of H5N1 highly pathogenic avian influenza viruses in Vietnam and Thailand contain M2 inhibitor resistance markers, compared with about 8.9% in China. Drug resistance can result either from inappropriate usage of drugs or from hitch-hiking mutations, a common factor resulting in adamantane resistance in human H3N2 influenza viruses.
Previous research showed that oseltamivir resistance (H1N1) is linked to an H275Y amino acid mutation on the neuraminidase (NA) viral surface glycoprotein, while amantadine resistance (H3N2) is associated with an S31N amino acid mutation on the viral M2 matrix protein. However, the mutations associated with H275Y in H1N1 virus and S31N in H3N2 virus and their roles in the emergence and adaptation of drug resistant strains are still unknown. The objective of this study is to develop and apply novel machine learning techniques to identify the molecular markers which are highly correlated with the mutation linked to drug resistance. The undergraduate recruited into this REU program will be trained to develop novel machine learning algorithms based on previously developed methods in the Wan Lab. These findings will potentially facilitate a greater understanding of the role of correlated mutations in the emergence and adaptation of drug resistance in influenza A viruses.
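As a simple baseline for what "highly correlated with the mutation" could mean, the sketch below ranks alignment columns by their mutual information with a chosen marker column. This is a plainly swapped-in statistic used for illustration, not the Wan Lab's machine learning method; the function names and the toy four-column alignment are hypothetical, and a real analysis would also need to correct for phylogenetic relatedness among sequences.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two aligned columns of residues."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), joint in count_ab.items():
        p_ab = joint / n
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def rank_sites(alignment, marker_site):
    """Rank all other sites by mutual information with the marker site."""
    columns = list(zip(*alignment))
    marker = columns[marker_site]
    scores = [
        (site, mutual_information(column, marker))
        for site, column in enumerate(columns)
        if site != marker_site
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment; column 1 plays the role of the resistance
# marker (H = sensitive, Y = resistant), and column 3 co-varies with it.
alignment = ["AHKS", "AYKT", "GHKS", "GYRT", "AHRS", "GYKT"]
ranking = rank_sites(alignment, marker_site=1)
```

In the toy alignment, column 3 changes in lockstep with the marker column and is ranked first, while a column that varies independently of the marker scores zero.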
Indel Detection in Genome-Wide Association Studies
The advent of high throughput genotyping by sequencing (GBS) gives geneticists a new tool with which to answer questions regarding gene identification, genetic diversity and relationships, and increased gain from selection in practical breeding programs of crop plants. It also allows for the generation of data sets of unprecedented size. However, bioinformatics tools to store, query, and analyze these data sets in a meaningful fashion are still needed, and user friendly tools in particular will make the data more useful to the biologists who will use and interpret them. One such data set has been generated by the USDA ARS Corn Host Plant Resistance Research Unit on the Mississippi State University campus. This data set consists of sequencing data from 292 maize inbred lines. A preliminary analysis using the Cornell GBS bioinformatics pipeline (http://www.maizegenetics.net/gbs-bioinformatics) has allowed the alignment of the 292 genomic sequences against the maize B73 genomic reference sequence (www.maizesequence.org). From there, we can begin to investigate the differences in the genetic sequence between the 292 maize lines. We hope to determine how the lines are related to each other, and to associate the differences in the genetic sequences with differences in the response of each line to infection with the plant pathogen Aspergillus flavus, and the associated production by the pathogen of aflatoxin, a toxic and carcinogenic metabolite of the fungus. This will allow us to identify genes associated with resistance, and to breed resistant new maize cultivars in the future.
A query tool has been created that will identify and extract Single Nucleotide Polymorphisms (SNPs) from the sequences. SNPs are differences in the genetic sequence at a single base pair. Other differences include insertion/deletion polymorphisms (InDels), which correspond to insertions of one up to several thousand base pairs of DNA into the sequence of one line compared to that of another (or, equivalently, the deletion of the sequence in one line compared to another). InDel polymorphisms are ideal for studying gene function from DNA sequences, because they are large, easily identifiable within a small number of samples, and can lead to differential functioning of a gene. However, in a large sample or in a very long sequence, the identification of InDels has been extremely problematic due to very large computational requirements. Therefore, this project will determine a method for identifying InDels using an InDel alignment tool built on BLAST (Basic Local Alignment Search Tool; http://www.maizegenetics.net/gbs-bioinformatics) on a small sample, and expand the size of the sample (both number of lines and length of sequence) that can be efficiently analyzed. Determining this threshold and then seeking methods to surpass it will lead to exceedingly useful tools to better use the data available to us.
Population Genetics of Helianthus annuus
Research projects for undergraduates will be designed both to generate usable data and to serve as complete introductions to hypothesis-driven research. Projects will focus on one of two areas: the role of microsatellites as agents of adaptive evolution in sunflowers, and the population genetics of iguanas. For example, students involved in screening microsatellites for amplification and variability in sunflowers will be testing the hypothesis that EST-derived microsatellites are under greater evolutionary constraint. The prediction that follows from this hypothesis is that anonymous loci should harbor more variation than EST-derived loci. Students will learn how to perform fragment analysis, and basic computational biology associated with population genetics. Students collecting data on seed set and seed mass could test the hypothesis that variance in reproductive success varies across multiple populations. In this way, students are involved in meaningful research, and they are introduced to the entire process of science from hypothesis development to reporting. Students studying iguanas are using heterozygote excess and fitness correlations to improve our understanding of the role that mutational load plays in dictating small population dynamics.
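The distinction between SNPs and InDels can be illustrated by walking a pairwise gapped alignment and classifying each difference. The sketch below assumes the alignment has already been produced by some aligner and is given as two equal-length strings with `-` marking gaps; the function name, the output tuple layout, and the toy sequences are hypothetical, and this is not the project's query tool or its BLAST-based InDel tool.

```python
def call_variants(ref_aligned, sample_aligned):
    """Classify differences in a pairwise gapped alignment.

    Returns tuples of (kind, reference_position, detail), where positions are
    0-based coordinates on the ungapped reference. Assumes no alignment column
    contains a gap in both sequences.
    """
    if len(ref_aligned) != len(sample_aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(ref_aligned)
    while i < n:
        if ref_aligned[i] == "-":
            # Gap in the reference: the sample carries extra (inserted) bases.
            j = i
            while j < n and ref_aligned[j] == "-":
                j += 1
            variants.append(("insertion", ref_pos, sample_aligned[i:j]))
            i = j
        elif sample_aligned[i] == "-":
            # Gap in the sample: reference bases are missing (deleted).
            j = i
            while j < n and sample_aligned[j] == "-":
                j += 1
            variants.append(("deletion", ref_pos, ref_aligned[i:j]))
            ref_pos += j - i
            i = j
        else:
            if ref_aligned[i] != sample_aligned[i]:
                variants.append(("SNP", ref_pos, ref_aligned[i] + ">" + sample_aligned[i]))
            ref_pos += 1
            i += 1
    return variants


# Hypothetical toy alignment of a reference line against a sample line.
variants = call_variants("ACGT--ACGTTA", "ACCTGGAC--TA")
```

For the toy pair this reports one SNP (G>C at position 2), one 2 bp insertion before reference position 4, and one 2 bp deletion starting at reference position 6. The walk itself is linear in alignment length, which suggests that the computational cost described above lies mainly in producing the gapped alignments.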
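The prediction about EST-derived versus anonymous loci can be checked with a basic diversity statistic once fragment sizes have been scored. The sketch below uses expected heterozygosity (one minus the sum of squared allele frequencies, without small-sample correction) as one possible measure of variation; the function names, locus names, and fragment sizes are hypothetical illustration data, not results.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    n = len(alleles)
    return 1.0 - sum((count / n) ** 2 for count in Counter(alleles).values())


def mean_heterozygosity(loci):
    """Average expected heterozygosity across a dictionary of loci."""
    values = [expected_heterozygosity(alleles) for alleles in loci.values()]
    return sum(values) / len(values)


# Hypothetical fragment sizes (bp) scored for eight allele copies per locus.
est_loci = {
    "EST_1": [150, 150, 150, 152, 150, 150, 150, 150],
    "EST_2": [201, 201, 201, 201, 201, 201, 201, 201],
}
anonymous_loci = {
    "ANON_1": [120, 122, 124, 126, 120, 122, 124, 126],
    "ANON_2": [98, 98, 100, 100, 102, 102, 104, 98],
}

est_mean = mean_heterozygosity(est_loci)
anonymous_mean = mean_heterozygosity(anonymous_loci)
```

Under the hypothesis of greater constraint on EST-derived microsatellites, the anonymous loci would be expected to show the higher mean, as they do in this made-up example; with real data the difference would need a formal test across many loci.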