{"@context":"http://iiif.io/api/presentation/2/context.json","@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/manifest.json","@type":"sc:Manifest","label":"Graphical and machine learning algorithms for large-scale genomics data","metadata":[{"label":"dc.description.sponsorship","value":"This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree"},{"label":"dc.format","value":"Monograph"},{"label":"dc.format.medium","value":"Electronic Resource"},{"label":"dc.format.mimetype","value":"Application/PDF"},{"label":"dc.identifier.uri","value":"http://hdl.handle.net/11401/78335"},{"label":"dc.language.iso","value":"en_US"},{"label":"dcterms.abstract","value":"One fundamental question in computational genomics is to understand the relationship between genotype and phenotype. In this dissertation, I developed graphical and machine learning algorithms for large-scale genomics data, allowing accurate genotyping and molecular phenotype quantification. This work has helped to shed new light on the genetic contributions to autism spectrum disorders, intellectual disability, and other psychiatric disorders, as well as enabled detailed analysis of the molecular biology of several model organisms. The first major theme of my research has been in the study of genomic variations, in particular insertion and deletion (indel) mutations. As the second most common type of variations in the human genome, indels have been linked to many diseases, but indels of more than a few bases are still challenging to discover from short-read sequencing data. We present an open-source algorithm, Scalpel, which combines mapping and assembly for sensitive and specific discovery of indels. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. 
We characterized various types of sequencing data to investigate the sources of indel errors. We also developed a classification scheme to rank high- and low-quality calls. In a second major theme of research, I present new methods for analyzing ribosome profiling (Riboseq) data, a powerful technique for monitoring protein translation in vivo. This, combined with detailed genomic variation data, allows researchers to study how the genome influences transcription, translation, and ultimately the overall phenotype of an organism. However, there are prevalent sampling and biological biases in Riboseq data, limiting our ability to understand translation control. To tackle these issues, I developed Scikit-ribo, the first open-source software for accurate genome-wide inference of translation efficiency (TE) and A-site prediction. Scikit-ribo accurately identifies ribosome A-site locations even with different mRNA digestion protocols and nearly perfectly reproduces the codon elongation rates in several datasets (r=0.99). Next, we show that the commonly used RPKM-derived TE is very sensitive to sampling errors and biological biases, skewing the TE estimates in all previous studies. To address this, I developed a codon-level generalized linear model with a ridge penalty to correctly estimate TE while inferring codon elongation rates and mRNA secondary structure. We performed a large-scale validation using mass spectrometry data of 1200 genes and showed very high correlation. Scikit-ribo is particularly robust to low-abundance genes that are most commonly distorted by lesser approaches and successfully corrected the TE biases for more than 2000 genes in S. cerevisiae. These improvements allow us to discover the Kozak-like consensus sequence in S. cerevisiae and a previously undiscovered biological significance in the Dhh1p study. 
Together, these results show that Scikit-ribo substantially improves Riboseq analysis and deepens the understanding of translation control."},{"label":"dcterms.available","value":"2018-07-09T13:34:18Z"},{"label":"dcterms.contributor","value":"Wu, Song"},{"label":"dcterms.creator","value":"Fang, Han"},{"label":"dcterms.dateAccepted","value":"2018-07-09T13:34:18Z"},{"label":"dcterms.dateSubmitted","value":"2018-07-09T13:34:18Z"},{"label":"dcterms.description","value":"Department of Applied Mathematics and Statistics."},{"label":"dcterms.extent","value":"130 pg."},{"label":"dcterms.format","value":"Monograph"},{"label":"dcterms.identifier","value":"http://hdl.handle.net/11401/78335"},{"label":"dcterms.issued","value":"2017-08-01"},{"label":"dcterms.language","value":"en_US"},{"label":"dcterms.provenance","value":"Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2018-07-09T13:34:18Z\nNo. of bitstreams: 1\nFang_grad.sunysb_0771E_13370.pdf: 9403941 bytes, checksum: bf64b3fc8ccc2520115bf919dfaef501 (MD5)"},{"label":"dcterms.subject","value":"Genetics"},{"label":"dcterms.title","value":"Graphical and machine learning algorithms for large-scale genomics data"},{"label":"dcterms.type","value":"Dissertation"},{"label":"dc.type","value":"Dissertation"}],"description":"This manifest was generated dynamically","viewingDirection":"left-to-right","sequences":[{"@type":"sc:Sequence","canvases":[{"@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/canvas/page-1.json","@type":"sc:Canvas","label":"Page 
1","height":1650,"width":1275,"images":[{"@type":"oa:Annotation","motivation":"sc:painting","resource":{"@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/14%2F97%2F56%2F149756642023239687469029458236741592708/full/full/0/default.jpg","@type":"dctypes:Image","format":"image/jpeg","height":1650,"width":1275,"service":{"@context":"http://iiif.io/api/image/2/context.json","@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/14%2F97%2F56%2F149756642023239687469029458236741592708","profile":"http://iiif.io/api/image/2/level2.json"}},"on":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/canvas/page-1.json"}]}]}]}