Sugars, the collection of all naturally occurring monosaccharides and disaccharides, are small molecules belonging to the class of carbohydrates. Examples of sugars are glucose, fructose, and sucrose. Many sugars are isomers: they share a chemical formula but differ in molecular structure (glucose and fructose, for example, are both C6H12O6). Virtually all metabolites found in bodily fluids (primarily blood, urine, and saliva) can be identified and quantified using gas chromatography-mass spectrometry (GC-MS; see Figure 1). Sugars are the exception: in practice they must be identified using other techniques. This is because, even under ideal conditions, the mass spectra of sugars are very similar, though not identical (see Figure 2). In a real-world laboratory setting, insufficient sample and interference from other molecules in the bodily fluid create noise and uncertainty in the mass spectra, which makes identifying sugars in real samples difficult. The ability to identify sugars in bodily fluids by GC-MS would be a valuable advance in clinical analysis, because it would avoid the need to perform completely separate experiments to determine sugar content.
GC-MS identifies components of complex mixtures such as bodily fluids by vaporizing the sample and forcing the vapor through a capillary column with an absorptive inner lining. As the sample passes through the column, different molecules elute at different times because of differences in each molecule’s gas/liquid partition coefficient. The eluted molecules then enter the mass spectrometer, where they are ionized by electron impact. The high-energy electron beam breaks the molecules into fragments, and the characteristic pattern of fragment masses and intensities is measured. A mass spectrum is thus a collection of (mass, intensity) pairs, which can be plotted. This plot (the spectrum, as in Figure 1) is then compared to a database of existing spectra and the compound is identified. Identifying sugars, however, is a notoriously difficult problem.
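To make the comparison step concrete, here is a minimal sketch in Python. It assumes a spectrum is stored as a dictionary from mass (m/z) to intensity and scores each library entry by cosine similarity, which is one common choice and not necessarily the measure used by any particular database. The function names and the two library entries are hypothetical, and the peak values are made up for illustration; they are not measured sugar spectra.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra given as {m/z: intensity} dicts."""
    # Only masses present in both spectra contribute to the dot product.
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(unknown, library):
    """Return (name, score) of the library spectrum most similar to `unknown`."""
    scores = {name: cosine_similarity(unknown, ref) for name, ref in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up spectra for illustration only (assumption, not real data).
library = {
    "compound_A": {73: 100.0, 147: 30.0, 204: 60.0, 217: 20.0},
    "compound_B": {73: 100.0, 147: 28.0, 204: 20.0, 217: 65.0},
}
unknown = {73: 95.0, 147: 31.0, 204: 55.0, 217: 24.0}

best_name, best_score = identify(unknown, library)
print(best_name, round(best_score, 4))
```

When two library spectra are nearly parallel as vectors, their scores against any unknown will be close, which is one way to see why similar spectra are hard to tell apart.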
Given the spectrum of an unknown sugar, can a search function be constructed that successfully identifies the sugar from a collection of known sugar spectra? Given a known sugar with known chemical structure, and a collection of mass spectra acquired on different instruments with varying degrees of accuracy, can one model the noise or error associated with an instrument? Can one derive necessary or sufficient conditions for an unknown sugar to be identifiable (or not identifiable)? Finally, can one select instrument settings that produce spectra that, when compared to library spectra, minimize the maximum probability of incorrectly identifying an unknown?
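One way to start exploring these questions numerically is a small Monte Carlo experiment. The sketch below assumes the simplest possible noise model, independent Gaussian noise of standard deviation sigma added to each intensity and clipped at zero, and estimates how often a cosine-similarity search picks the wrong library entry. The noise model, the function names, and the two made-up library spectra are all assumptions for illustration; real instrument noise may behave quite differently.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra given as {m/z: intensity} dicts."""
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def add_noise(spectrum, sigma, rng):
    """Add Gaussian noise (std `sigma`) to each intensity, clipping at zero."""
    return {mz: max(0.0, y + rng.gauss(0.0, sigma)) for mz, y in spectrum.items()}


def misidentification_rate(library, sigma, trials=2000, seed=0):
    """Estimate the fraction of noisy spectra assigned to the wrong library entry."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    names = sorted(library)
    errors = 0
    for _ in range(trials):
        truth = rng.choice(names)
        observed = add_noise(library[truth], sigma, rng)
        guess = max(names, key=lambda n: cosine_similarity(observed, library[n]))
        if guess != truth:
            errors += 1
    return errors / trials


# Made-up spectra for illustration only (assumption, not real data).
library = {
    "compound_A": {73: 100.0, 147: 30.0, 204: 60.0, 217: 20.0},
    "compound_B": {73: 100.0, 147: 28.0, 204: 20.0, 217: 65.0},
}

for sigma in (0.0, 10.0, 40.0):
    print(sigma, misidentification_rate(library, sigma))
```

Repeating the estimate over a grid of noise levels, or over alternative similarity measures, gives a rough empirical picture of when an unknown stops being identifiable under the assumed model.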
This project will investigate mathematical and statistical techniques for estimating the distance between chemical spectra of varying degrees of accuracy, together with search functions designed to identify sugars. Other classes of substances may also be considered (pesticides, pollutants, methamphetamines).
Interest in computing (Matlab or C/C++). Background in optimization or statistics; some linear algebra. A special interest in topics such as regression, machine learning, or filtering methods would be beneficial but is not necessary.
W. Demuth, M. Karlovits, K. Varmuza. "Spectral similarity versus structural similarity: mass spectrometry", Analytica Chimica Acta, 2004, pp. 75-85.
NIST. Mass Spectral Database 2011, National Institute of Standards and Technology, Gaithersburg, MD, 1998. http://www.nist.gov/srd/nist1a.htm
S. Stein. "Chemical substructure identification by mass spectral library searching", Journal of the American Society for Mass Spectrometry, 1995, pp. 644-655.