[+] Team 1: Mathematical Models for Adaptive Multi-modal Sensing
- Mentor Aaron Luttman, National Security Technologies, LLC
- Mentor Jared Catenacci, National Security Technologies, LLC
- Ariel Bowman, University of Texas at Arlington
- Shawn Burkett, University of Colorado
- Hayley Guy, North Carolina State University
- Laura Iosip, University of Maryland
- Yufei Yu, University of Kansas
- Sheng Zhang, Purdue University
Scientific experiments are a natural source of data, usually collected by diagnostic systems fielded within the experiments themselves, but there has been a recent trend toward collecting data around big science experiments to understand whether we can detect and characterize the behaviors associated with the experiments. The question is whether it is possible to determine what experiments are being conducted by analyzing human patterns, so-called "patterns of life," in and around the experimental facilities. In order to measure patterns of life, we analyze many different types of data, from power grid load profiles to internet activity to sound and pressure signals from cars.
There are two primary challenges that must be addressed:
1. Mathematical Models for Adaptive Sensing – When should a sensor system turn on its sensors and transmit its data, given that both activities consume significant power? (A toy illustration appears after this list.)
2. Physics-based Multi-modal Feature Selection and Detection – How can one incorporate physics models for sensing into machine learning approaches to data analysis?
Real multi-sensor data will be provided for testing and validation.
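As a point of reference for the first challenge, here is a minimal toy simulation of an energy-aware sensing policy in Python. The signal model, energy costs, and detection threshold are invented for illustration; they are not drawn from the project's data or sensors.

```python
import numpy as np

rng = np.random.default_rng(0)

BATTERY = 100.0       # initial energy budget (arbitrary units)
SENSE_COST = 0.1      # energy consumed per measurement
TRANSMIT_COST = 2.0   # energy consumed per transmission
THRESHOLD = 2.5       # detection statistic required to justify transmitting

battery, transmissions = BATTERY, 0
for t in range(1000):
    # Keep a reserve: never sense if we could not afford to report a detection.
    if battery < SENSE_COST + TRANSMIT_COST:
        break
    # Toy signal: background noise plus a brief burst of activity.
    x = rng.normal() + (3.0 if 400 <= t < 420 else 0.0)
    battery -= SENSE_COST
    # Transmit only when the sample looks anomalous enough to be worth the power.
    if abs(x) > THRESHOLD:
        battery -= TRANSMIT_COST
        transmissions += 1

print(f"transmissions={transmissions}, battery remaining={battery:.1f}")
```

A real policy would replace the fixed threshold with a model of expected information gain versus energy spent, which is exactly the modeling question posed above.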
[+] Team 2: Quantum Computation and QUBO Slicing
- Mentor Jesse Berwald, D-Wave Systems
- Olivia Cannon, University of Minnesota, Twin Cities
- Tanushree Roy, University of Central Florida
- Chang Shu, University of California, Davis
- Dallas Smith, Brigham Young University
- Elizabeth Wicks, University of Washington
Background
Quantum annealing computers have begun to enter the business and academic worlds. Over the past five years they have been used for a wide variety of (prototypical) applications, with evidence of differentiated performance in some cases.
A first step in utilizing these computers is to reformulate the problem in an energy minimization framework. This is typically cast as a Hamiltonian, or alternatively as a quadratic unconstrained binary optimization (QUBO), which can be represented as a matrix. These formulations are translated to the physical qubits on the quantum processing unit (QPU) through a process termed "embedding". Embedding a given problem onto the QPU is handled by a number of different heuristics and is an active area of research in its own right; one proposed approach is described below.
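To make the QUBO form concrete, the following sketch minimizes the energy x^T Q x over binary vectors x by brute-force enumeration. The matrix Q is illustrative; real problems are far too large to enumerate, which is where the annealer comes in.

```python
import itertools
import numpy as np

# A tiny illustrative QUBO matrix (upper triangular by convention).
Q = np.array([[-1.0,  2.0,  0.0],
              [ 0.0, -1.0,  2.0],
              [ 0.0,  0.0, -1.0]])

# Enumerate all binary vectors x and keep the one minimizing x^T Q x.
best = min(itertools.product([0, 1], repeat=3),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(best, np.array(best) @ Q @ np.array(best))  # -> (1, 0, 1) with energy -2.0
```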
Problem statement
In this project we will investigate one proposed solution to the embedding problem:
The goal is to make the most efficient use of the available hardware by developing a parameterized linear transformation from the space spanned by the physical qubits, "qubit space", to the space spanned by the problem variables, the "problem search space".
Since the problem space is (in general) much larger than the qubit space, a fixed parameterization will map the qubit space into a proper subspace of the problem space. We term these subspaces "slices". The reduced problem on a slice can then be solved with an optimal use of the available hardware, and by using different parameterizations we can define a series of linear transformations onto orthogonal subspaces of the problem space.
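Here is a minimal sketch of the slicing idea under the simplifying assumption that each slice is a disjoint block of coordinates. Each selector matrix S embeds the k-dimensional qubit space into a subspace of the n-dimensional problem space, and the restricted QUBO on that slice is S^T Q S; the dimensions and the random Q are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 12, 4                     # n problem variables, k available qubits (k < n)
Q = rng.normal(size=(n, n))      # stand-in QUBO matrix for the full problem

# Partition the problem variables into disjoint coordinate slices of size k.
for start in range(0, n, k):
    idx = np.arange(start, start + k)
    S = np.zeros((n, k))
    S[idx, np.arange(k)] = 1.0   # columns of S are standard basis vectors e_j, j in idx
    Q_reduced = S.T @ Q @ S      # k x k sub-QUBO, small enough for the hardware
    print(idx, Q_reduced.shape)
```

More general parameterizations would replace the 0/1 selector columns with arbitrary orthonormal columns, which is where the research questions below begin.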
There are many parameterizations to choose from, each of which raises a number of research questions. We will prioritize our investigation roughly as follows:
1. Given a QUBO matrix defining the problem search space, is there an algorithm that produces the most efficient set of transformations (parameterizations) from qubit space to problem space?
2. Is there a greedy algorithm that is best in practice, i.e., choose a slice that maximizes the use of the chip, and then choose successively smaller slices to query the entire search space? (See the sketch after this list.)
3. What is the role of sparsity in the choice of transformations?
4. The QPU itself has a unique architecture. How does this architecture affect the choice of transformations?
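One naive reading of question 2 is sketched below: rank the problem variables by their total coupling weight in Q and carve off the heaviest block first, so that the first slice puts the most strongly interacting variables on the chip. This is an assumption about what "maximizing the use of the chip" might mean, not a settled answer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 4
Q = rng.normal(size=(n, n))

# Greedy ordering: total (absolute) interaction weight of each variable.
weight = np.abs(Q).sum(axis=0) + np.abs(Q).sum(axis=1)
order = np.argsort(-weight)                      # heaviest variables first
greedy_slices = [order[i:i + k] for i in range(0, n, k)]
print([s.tolist() for s in greedy_slices])
```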
References
- Traffic flow optimization using a quantum annealer: https://arxiv.org/pdf/1708.01625.pdf
- A NASA Perspective on Quantum Computing: Opportunities and Challenges: https://arxiv.org/pdf/1704.04836.pdf
[+] Team 3: Time Series Analysis of Gas Mixture Data
- Mentor Nicholas Asendorf, 3M
- Kate Brubaker, Purdue University
- Ruihao Huang, Michigan Technological University
- Philku Lee, Mississippi State University
- Elpiniki Nikolopoulou, Arizona State University
- Michelle Pinharry, University of Minnesota, Twin Cities
Motivation
Sensor networks are ubiquitous in today's Internet of Things, capable of collecting high-frequency data in a cost-efficient way. This results in mountains of time series data that hopefully contain signals of interest buried in noise. As the number of deployed sensors grows, so does the dimensionality of the observed data, further increasing the complexity of the problem. 3M is interested in such large-scale time series analyses because many of our datasets can be framed this way: manufacturing, sales, and chemical experiments, to name a few.
Dataset
This publicly available dataset contains time series readings from chemical sensors collected over a 12-hour period. The inputs to these sensors are known concentrations of various gases. The dataset contains timestamped measurements from 16 gas sensors along with the input concentrations of the gases, making it a labeled time series dataset. There are two gas mixture measurement files, one for Ethylene and CO and one for Ethylene and Methane. At 3M, we may have similar types of experimental data (perhaps using different sensors) where we would like to determine the interactions between materials or understand fundamental properties of materials. Being able to intelligently and efficiently mine these rich datasets for insights about material characteristics is critical.
The Challenge
Some interesting problems to consider:
• Develop an algorithm to estimate the concentration of each gas given sensor measurements. You might approach this problem using classical machine learning, splitting the data into training, validation, and testing sets while treating the time series measurements as independent points (a minimal baseline along these lines is sketched after this list).
• Develop algorithms to estimate the concentrations of each gas using time series based methods like windowing, tsfresh, or RNNs. In this approach, we don’t want to treat each measurement as independent. How do these algorithms compare to classical machine learning techniques?
• Can you use the fact that we have 4 replicates of each sensor at each time point to improve your algorithms? Can you use any clever data fusion techniques or outlier detection strategies?
• What can you tell about the importance or accuracy of the 4 types of sensors used?
• What happens when we purposely introduce missing data? Can we use the replicates of each sensor to overcome this? How robust are your algorithms to missing data?
• Since each dataset has measurements for Ethylene, can we use both datasets to develop a more robust estimation scheme for that gas?
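As a starting point for the first bullet, here is a minimal classical-ML baseline that treats each time point as independent. Since the real files cannot be bundled here, the sketch uses synthetic stand-in data whose shapes mimic the dataset (16 sensor channels, two gas concentrations); the sensor response model is invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 16 sensors responding noisily and nonlinearly to two gases.
n = 5000
conc = rng.uniform(0, 1, size=(n, 2))                 # [gas_1, gas_2] concentrations
W = rng.normal(size=(2, 16))
sensors = np.tanh(conc @ W) + 0.05 * rng.normal(size=(n, 16))

# Classical baseline: random split, ignoring temporal structure entirely.
X_tr, X_te, y_tr, y_te = train_test_split(sensors, conc, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("MAE per gas:", mean_absolute_error(y_te, model.predict(X_te),
                                          multioutput="raw_values"))
```

The time-series methods in the second bullet should be compared against exactly this kind of baseline, with the split done by time rather than at random to avoid leakage.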
[+] Team 4: Structured Variational Auto Encoders
- Mentor Irfan Bulu, Schlumberger-Doll Research
- Hua Chen, University of Delaware
- Aaron Cohen, Indiana University
- Mingchang Ding, University of Delaware
- Melanie Jensen, Tulane University
- Christopher Miller, University of California, Berkeley
- Michael Ramsey, University of Colorado
Generative models such as Variational Auto Encoders (VAEs) and Generative Adversarial Networks (GANs) have been very successful in unsupervised learning settings. In a VAE setting, we would like to learn a set of latent variables that explain our data. Although the VAE has been very successful as a generative model, the interpretation of its latent variables is still a challenge. Ideally, we would like unsupervised learning to identify a number of classes (not specified in advance). Once a set of classes has been identified, we can label each class once instead of having to label the entire data set. Imagine you have a sample of handwritten digits without labels: if we can structure the VAE so that it identifies 10 classes, we can then label those classes as the relevant digits. This would be very helpful, as most of our data is unlabeled or poorly labeled.
Concepts that may be helpful to know: neural network, generative models, graphical models, stochastic variational inference
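For orientation, here is a minimal (unstructured) VAE in PyTorch; the structured variant discussed above would replace the single Gaussian prior with, for example, a mixture over a discrete class variable. The dimensions assume flattened 28x28 digit images and are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Plain VAE: encoder -> Gaussian latent -> decoder."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(logits, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

x = torch.rand(8, 784)               # stand-in batch of flattened images
logits, mu, logvar = VAE()(x)
print(elbo_loss(logits, x, mu, logvar).item())
```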
[+] Team 5: Tailored Discovery in Stock Portfolios
- Mentor Christopher Bemis, Whitebox Advisors
- Chirasree Chatterjee, Saint Louis University
- Zhen Gao, Vanderbilt University
- Cristian Minoccheri, State University of New York, Stony Brook (SUNY)
- Shannon Negaard-Paper, University of Minnesota, Twin Cities
- Shiqiang Xia, University of Minnesota, Twin Cities
Modern portfolio theory has provided tools to identify systematic and idiosyncratic risks via models like Markowitz's Mean-Variance Optimization. In addition, a taxonomy of equities has emerged through feature identification, one of the earliest and most impactful examples being Fama and French's three-factor model.
In this project, we will leverage technical and fundamental data, such as return series and earnings information, along with well-understood equity features, such as exposure to the so-called size, value, and market portfolios, to develop tools that suggest supplements (e.g., technology stocks when looking at Apple) and complements (e.g., energy stocks when looking at Delta Airlines) for individual equities and portfolios. These tools may be used for tailored discovery and research by analysts looking either to construct a portfolio around a theme or to diversify one. The work will ideally evolve from point estimates using simple norms in a predetermined feature space to applying machine learning techniques.
Data will be supplied from Quandl, and the preferred language for development will be Python.
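A minimal sketch of the "simple norms in a predetermined feature space" starting point: rank equities by Euclidean distance in a small factor-exposure space, reading near neighbors as thematic supplements and distant ones as diversifying complements. The tickers and exposures below are made up for illustration.

```python
import numpy as np

# Illustrative feature space: rows are equities, columns are standardized
# exposures to the market, size, and value factors (all numbers invented).
tickers = ["AAPL", "MSFT", "XOM", "DAL", "JPM"]
F = np.array([[ 1.1, -0.5, -0.8],
              [ 1.0, -0.6, -0.7],
              [ 0.7,  0.1,  0.9],
              [ 1.3,  0.4,  0.6],
              [ 1.1,  0.2,  0.3]])

def rank_by_distance(name):
    """Sort peers by distance in factor space: nearest first (supplements),
    farthest last (candidate complements/diversifiers)."""
    i = tickers.index(name)
    d = np.linalg.norm(F - F[i], axis=1)
    order = np.argsort(d)
    return [(tickers[j], round(float(d[j]), 2)) for j in order if j != i]

print(rank_by_distance("AAPL"))
```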
[+] Team 6: Sequence-to-sequence Modeling for the Business of Baseball
- Mentor Keith Rush, Milwaukee Brewers
- Maria Gommel, The University of Iowa
- Ekaterina Kryuchkova, Cornell University
- SangJoon Lee, University of Connecticut
- Iurii Posukhovskyi, University of Kansas
- Eric Roberts, University of California, Merced
Each fan has a unique relationship with his or her favorite sports teams, and each has a different ideal experience every time they step into the stadium. When a team makes a big free-agent signing in February, the fan who follows the competition closely will be ecstatic, while the fan who primarily enjoys the communal aspects will only see the effect in the buzz generated in his or her social circles. To serve their fans to the utmost, teams must have a global view of their business and be able to structure data from all sources and across all levels of granularity, creating one universe into which all inputs feed and from which all outputs flow.
This project is fundamentally a first step in that direction. The problem we are focusing on is roughly the following: conditioned on a vector representing a fan's history with the Club and the attributes of a particular game, how well can we ingest information in time and map it forward one time step? For this purpose, we will test the standard recurrent and convolutional network architectures, as well as experiment with variants, discussing the reasons for applying each and their limitations. Data will be provided by the Brewers, and development will take place in Python, utilizing cloud infrastructure for computing power.
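As a concrete starting point, here is a minimal sketch of one candidate architecture in PyTorch, assuming the fan-history vector conditions the initial hidden state of a recurrent network that maps a sequence of game features one step forward. All dimensions are placeholders, not the actual schema.

```python
import torch
import torch.nn as nn

class NextStepGRU(nn.Module):
    """One-step-ahead sequence model conditioned on a per-fan history vector."""
    def __init__(self, event_dim=8, fan_dim=16, hidden=32):
        super().__init__()
        self.init_h = nn.Linear(fan_dim, hidden)    # fan history sets the initial state
        self.gru = nn.GRU(event_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, event_dim)    # predict the next event vector

    def forward(self, events, fan):
        h0 = torch.tanh(self.init_h(fan)).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.gru(events, h0)                   # (batch, T, hidden)
        return self.head(out[:, -1])                    # map the last step forward

model = NextStepGRU()
events = torch.randn(4, 10, 8)   # 4 fans, 10 time steps of game/interaction features
fan = torch.randn(4, 16)         # per-fan history embeddings
print(model(events, fan).shape)  # torch.Size([4, 8])
```

A convolutional variant would replace the GRU with temporal convolutions over the same inputs, which is one of the comparisons the project proposes.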