Below are some possible topics for individual projects. This is by no means
intended to be a comprehensive list; these are merely suggestions. Any topic you
propose, if related to the course, will be fine.
0. Data science and the pandemic.
Many data science challenges have come up since the beginning of the pandemic. We will be glad
to hear about possible problems that students propose to study.
1. Kolmogorov complexity
Learn about Kolmogorov complexity, including some ideas by Vitanyi and his coauthors
about how to bring it closer to practice. Apply these ideas to real data.
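Vitanyi and his coauthors have argued that an off-the-shelf compressor can stand in for the (uncomputable) Kolmogorov complexity; one practical outcome is the normalized compression distance (NCD). Below is a minimal sketch using Python's zlib as the stand-in compressor; the strings and parameter choices are purely illustrative:

```python
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for related objects, near 1 for unrelated ones."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = b"4f8a01c9d2e7b3a6" * 55
print(ncd(a, b))  # related strings: small distance
print(ncd(a, c))  # unrelated strings: larger distance
```

A project could apply the same distance to real files (texts, genomes, time series) and cluster them.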
2. Gaussian mixtures
Gaussian mixture distributions feature a superposition of several Gaussian components.
The signal can be generated by one of several possible classes, and each class corresponds
to a different Gaussian component. These models appear in many applications. A project along
these lines would consider data where Gaussian mixtures have potential to be useful, and then
apply an algorithm for learning the mixture model as part of processing the data. Here the
term "processing" is used loosely, and could entail classification, estimation, prediction,
or other data science types of approaches. A well known paper that describes how to learn
Gaussian mixture models is M. Figueiredo and A. K. Jain, "Unsupervised learning of finite
mixture models", IEEE Trans. Pattern Analysis Machine Intelligence, vol. 24, no. 3, pp.
381-396, 2002.
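To give a sense of what "learning the mixture" involves, here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture on synthetic data. This is not the Figueiredo-Jain algorithm (which additionally selects the number of components); the initialization and iteration count are arbitrary choices, and no numerical safeguards are included:

```python
import math
import random

def em_gmm_1d(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    mu = [min(xs), max(xs)]        # crude initialization at the data extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var

random.seed(0)
data = ([random.gauss(-3, 1.0) for _ in range(300)]
        + [random.gauss(4, 1.5) for _ in range(300)])
w, mu, var = em_gmm_1d(data)
print(w, mu, var)
```

On real data, the learned responsibilities can then drive classification or estimation.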
3. Kolmogorov sampler
In 2002, David Donoho published an interesting technical report, The Kolmogorov Sampler.
This report shows how Kolmogorov complexity can be used for denoising: cleaning up
a signal x in R^N observed through noisy measurements y = x + n. The main idea is
to search for the estimated signal x' with smallest complexity that is close to the (known)
measurements y. Donoho provided some interesting theoretical results about this approach,
and one of the highlights is that the algorithm can denoise *universally* without knowing
the input distribution governing x.
This project could take (at least) two forms. One would be theoretical: discuss the results
and possibly see how they might be extended in other directions. Another possible route for the
project involves implementing the approach and running it on real data. I have an implementation of a related
algorithm that could be modified and turned into a denoising algorithm per the approach proposed
by Donoho.
4. Two-part codes
We have briefly touched on two-part codes in class. This project will look at them in
greater detail. In data compression, two-part codes first encode the representation level
that best fits the empirical distribution of the data, and in the second part the data is
compressed based on this representation level. The second part can be implemented using
arithmetic codes, whose implementations appear online. The first part relies heavily
on the details of the two-part code. This project will involve designing quantizers for
the empirical parameter value.
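To make the two parts concrete, here is a sketch of the total code length for an i.i.d. binary sequence, assuming a uniform quantization grid; designing a better grid is exactly where the project's work would go:

```python
import math
import random

def two_part_length(bits, grid_size):
    """Two-part code length in bits for an i.i.d. binary model: the first part
    names one of grid_size quantized parameter values, the second part is the
    ideal code length under that quantized parameter."""
    n = len(bits)
    ones = sum(bits)
    # uniform grid of interior points, so no quantized value hits 0 or 1
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    q = min(grid, key=lambda g: abs(g - ones / n))
    first = math.log2(grid_size)
    second = -ones * math.log2(q) - (n - ones) * math.log2(1 - q)
    return first + second

random.seed(1)
seq = [1 if random.random() < 0.3 else 0 for _ in range(1024)]
print(two_part_length(seq, 32))   # 32 = sqrt(1024) quantization levels
```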
5. Dictionary learning
Homework 8 shows how dictionary learning or clustering can be used to design a set of
patches that represent an image well. There are many possible extensions of the project,
and some of them are discussed at the end of the project. You can certainly come up
with other ideas for possible improvements; feel free to chat with the instructor about
such ideas.
6. Fast sorting algorithms
Homework 3 evaluated how merge sort can be improved by mixing and matching various sorting
algorithms, in particular as part of the "base case" of the algorithm. In this project,
you will explore these and other directions, learn about other sorting algorithms, and
put together an improved sorting implementation. This project is more on the scientific
programming side than some other projects.
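As a starting point, here is a sketch of merge sort with an insertion-sort base case; the cutoff of 32 is an arbitrary choice that you would tune experimentally:

```python
import random

def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; fast for short runs despite O(n^2) worst case."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_merge_sort(a, lo=0, hi=None, cutoff=32):
    """Merge sort that hands short subarrays to insertion sort."""
    if hi is None:
        hi = len(a)
    if hi - lo <= cutoff:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, lo, mid, cutoff)
    hybrid_merge_sort(a, mid, hi, cutoff)
    # merge the two sorted halves a[lo:mid] and a[mid:hi]
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged

random.seed(2)
data = [random.randrange(1000) for _ in range(500)]
expected = sorted(data)
hybrid_merge_sort(data)
print(data == expected)  # → True
```

Timing this against plain merge sort across cutoff values would be a natural first experiment.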
7. LDA, QDA, and more
We learned in class about how linear discriminant analysis (LDA) and its quadratic variant
(QDA) can be used to classify data. In this project, you will consider some data set where
classification should be performed, and see how to use these and/or other classifiers on
the data.
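To see the difference between the two, here is a one-dimensional sketch on synthetic data: LDA pools a single variance across classes, while QDA keeps one per class (equal class priors are assumed; the data and function names are illustrative):

```python
import math
import random

def gauss_loglik(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit_1d(xs0, xs1, shared_variance):
    """Estimate per-class means; pool the variance for LDA, keep it per class for QDA."""
    mu0 = sum(xs0) / len(xs0)
    mu1 = sum(xs1) / len(xs1)
    v0 = sum((x - mu0) ** 2 for x in xs0) / len(xs0)
    v1 = sum((x - mu1) ** 2 for x in xs1) / len(xs1)
    if shared_variance:
        v0 = v1 = (len(xs0) * v0 + len(xs1) * v1) / (len(xs0) + len(xs1))
    return (mu0, v0), (mu1, v1)

def classify(x, c0, c1):
    """Pick the class with the larger Gaussian log-likelihood (equal priors)."""
    return 0 if gauss_loglik(x, *c0) >= gauss_loglik(x, *c1) else 1

random.seed(3)
train0 = [random.gauss(0, 1) for _ in range(500)]
train1 = [random.gauss(3, 1) for _ in range(500)]
lda = fit_1d(train0, train1, shared_variance=True)
qda = fit_1d(train0, train1, shared_variance=False)
held_out = ([(random.gauss(0, 1), 0) for _ in range(200)]
            + [(random.gauss(3, 1), 1) for _ in range(200)])
for name, model in [("LDA", lda), ("QDA", qda)]:
    acc = sum(classify(x, *model) == y for x, y in held_out) / len(held_out)
    print(name, acc)
```

On real multi-dimensional data the means become vectors and the variances covariance matrices, but the structure is the same.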
8. Matrix completion
A well known data science problem from the last few years is known as the Netflix problem.
A huge matrix is only partly observed. Columns correspond to different movies, rows are
users, and each entry of the matrix records how well the user liked that movie. Typical
users have watched a few dozen movies and rated them. However, there are many thousands
of movies, and data about a user's interest in the other movies is unknown. You can easily
see that Netflix (and similar companies offering such services) might be interested in
predicting how well a user may like other movies, and suggesting those movies. To do so,
a technique called matrix completion is involved; unknown matrix entries are estimated
using different models for the matrix. This project will involve identifying a data set
where matrix completion may be useful, studying about matrix completion, and applying one
of the algorithms to the data.
9. Financial prediction
This project involves various possible ways to predict future price changes of financial assets.
One example might be to put together a compression algorithm that compresses past stock returns.
If we can indeed compress past returns, it implies that we have (a bit of) predictive ability
for future returns. Instead of data compression, an approach based on linear regression might
also be useful.
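A minimal regression sketch along those lines: fit r_t = a + b * r_(t-1) by least squares on synthetic returns with mild momentum. Real returns are far noisier, with b typically close to zero; the synthetic coefficient here is an assumption for illustration:

```python
import random

def ar1_fit(returns):
    """Closed-form least squares for r_t = a + b * r_(t-1)."""
    x, y = returns[:-1], returns[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

random.seed(5)
r = [0.0]
for _ in range(5000):   # synthetic returns with mild momentum (true b = 0.2)
    r.append(0.2 * r[-1] + random.gauss(0, 0.01))
a, b = ar1_fit(r)
print(a, b)   # b should land near 0.2
```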
10. Hurricane prediction
Ahead of Hurricane Florence in September 2018, we chatted in class about some issues related to
the hurricane, including weather prediction. In this project, your role will be to learn about
meteorology, find data related to hurricanes, pose some interesting questions, and try to address
them from the data. Possible questions could be: how reliable are forecasts a few days before a
hurricane; can we predict the number of casualties caused by a hurricane; can we predict wind
speeds, flooding levels, or other important parameters?
11. Arithmetic codes
We learned that a symbol with probability p can be encoded using -log2(p) bits, where log2(.) denotes the
base-two logarithm. A more complicated situation arises when encoding sequences of symbols that
contain time dependencies. In this case, the probability mass function (PMF) that governs the
distribution of each individual symbol often differs from the PMFs governing other symbols.
A well known technique to encode such sequences is arithmetic coding. Practically speaking, it
encodes an entire sequence of symbols to within 2-3 bits of the ideal coding length. This is
particularly impressive, because these ideal lengths are often non-integer. In this project,
your first goal is to learn about arithmetic codes, for example from a paper by Cleary, Bell,
and Witten. Next, you can implement your own arithmetic encoder and decoder; or you could download
an existing implementation and use it within your own coding scheme.
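Before implementing a full encoder, it helps to compute the ideal code length an arithmetic coder would approach. The sketch below uses a Laplace-smoothed adaptive model, one simple choice of sequential probability assignment (a real coder would add a few bits of overhead on top of this total):

```python
import math

def adaptive_code_length(symbols, alphabet_size):
    """Ideal code length in bits of an arithmetic coder driven by a
    Laplace-smoothed adaptive model: each symbol costs -log2 of the
    probability the model currently assigns it."""
    counts = [1] * alphabet_size   # Laplace smoothing: every count starts at 1
    total = alphabet_size
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits

seq = [0] * 900 + [1] * 100
adaptive = adaptive_code_length(seq, 2)
ideal = -(900 * math.log2(0.9) + 100 * math.log2(0.1))   # knowing p in advance
print(adaptive, ideal)   # the adaptive model pays only a small overhead
```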
12. Mixtures vs MDL
We briefly discussed in class that mixtures are an alternative modeling approach to MDL. Instead of
choosing the one "best" model in a model class that describes the data most succinctly, mixtures
simultaneously describe the data using all the models, and the probability assigned to the data is a
weighted sum of probabilities assigned by all the models. In this project, you will learn about context
tree weighting (CTW), which is a specific mixture-based algorithm for encoding and decoding by Willems
et al.
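A tiny numerical illustration of the mixture idea (not CTW itself, which mixes over context trees): with a uniform prior over a handful of i.i.d. models, the mixture code length provably stays within log2(number of models) bits of the best single model, without knowing in advance which model is best:

```python
import math

def iid_prob(bits, p):
    """Probability an i.i.d. Bernoulli(p) model assigns to the whole sequence."""
    ones = sum(bits)
    return p ** ones * (1 - p) ** (len(bits) - ones)

seq = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 10
models = [0.1, 0.3, 0.5, 0.7, 0.9]           # candidate Bernoulli parameters
weights = [1 / len(models)] * len(models)    # uniform prior over the models

best = min(-math.log2(iid_prob(seq, p)) for p in models)
mix = -math.log2(sum(w * iid_prob(seq, p) for w, p in zip(weights, models)))
print(best, mix)   # the mixture is within log2(5) bits of the best model
```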
13. Optimal model classes
We discussed in class an example of encoding a length-N i.i.d. binary sequence, where we quantize the
empirical probability of ones to a grid of roughly sqrt(N) possible levels. In this project, you'll learn
a bit more about the optimal quantization grid. Different works that considered optimal quantization
grids, including a precise characterization of the coding lengths involved, include Rissanen in 1996 and
Clarke and Barron (Andrew Barron; not Dror Baron) in the nineties.
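A quick numerical way to see why roughly sqrt(N) levels is sensible: sweep the grid size and watch the first part (log2 of the grid size) trade off against the quantization penalty in the second part. A uniform grid is assumed for simplicity; the works above study better grids:

```python
import math
import random

def two_part_length(bits, grid_size):
    """Two-part code length: log2(grid_size) bits to name the quantized
    parameter, plus the ideal code length under that parameter."""
    n = len(bits)
    ones = sum(bits)
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    q = min(grid, key=lambda g: abs(g - ones / n))
    return (math.log2(grid_size)
            - ones * math.log2(q) - (n - ones) * math.log2(1 - q))

random.seed(6)
n = 4096
seq = [1 if random.random() < 0.3 else 0 for _ in range(n)]
# too coarse a grid wastes bits on quantization error; too fine a grid
# wastes bits naming the parameter -- sqrt(n) = 64 sits near the sweet spot
for g in [2, 8, 64, 512, 4096]:
    print(g, round(two_part_length(seq, g), 1))
```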
14. Resource consumption in circuits
The computer science community has mainly focused on computational complexity when evaluating how the
resources consumed by algorithms scale, with lesser focus on memory use. In contrast, implementing an
algorithm on a dedicated electronic circuit features numerous trade-offs, including power consumption,
clock speed, area of silicon consumed, and the number of transistors. In this project, you'll identify
some algorithm that has interesting trade-offs in a hardware implementation.
A similar project could consider reducing consumption in wireless systems, where communication costs
may far exceed computational costs.
15. Minimizing cache misses
The random access machine model for computation assumes that each basic instruction and memory access
takes one unit of time. But computers often take quite different amounts of time for different instructions.
For example, floating point operations such as exponents and sines may require dozens of clock cycles, while
incrementing an integer may be quicker. This problem of different instructions requiring varying amounts
of time is especially severe in the memory system, where accessing the registers often takes a single clock
cycle, whereas a cache miss to main memory may require dozens or even hundreds of cycles. In this project,
you will identify two (or more) algorithms for computing the same overall function, where one algorithm
performs fewer total instructions but is limited by cache misses (and must access main memory) while the other
algorithm may perform more total instructions but has few cache misses and is thus faster. An interesting
part of this project could be taking some existing algorithm and modifying it in a way that reduces the
number of cache misses, thus offering faster real-world performance.
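Below is a sketch of the classic row-major versus column-major traversal experiment. Note that in CPython, interpreter overhead masks much of the cache effect, so the timing gap will be far smaller than in a compiled language; the code only illustrates the two access patterns you would compare:

```python
import time

N = 1024
a = list(range(N * N))   # a 2-D array stored row-major in one flat list

def sum_rows():
    """Row-major traversal: consecutive addresses, cache-friendly."""
    s = 0
    for i in range(N):
        base = i * N
        for j in range(N):
            s += a[base + j]
    return s

def sum_cols():
    """Column-major traversal: stride-N access, poor cache locality."""
    s = 0
    for j in range(N):
        for i in range(N):
            s += a[i * N + j]
    return s

for f in (sum_rows, sum_cols):
    t0 = time.perf_counter()
    total = f()
    print(f.__name__, total, round(time.perf_counter() - t0, 3))
```

Rewriting the same experiment in C and measuring with hardware performance counters would be a natural extension.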
16. Sparse superposition codes.
This project should be especially appealing to students with a communication engineering background.
Sparse superposition codes (SPARCs) achieve the capacity of the additive white Gaussian noise (AWGN)
channel by transmitting a superposition of individual columns of a matrix, where the matrix is partitioned
into sections, and in each section we choose exactly one column to transmit. With M columns per section,
the column chosen provides log2(M) bits of information. It has been proved that by carefully allocating
how much power each section of the matrix receives, and using an efficient decoding algorithm, SPARCs
achieve the capacity of the AWGN channel despite only requiring moderate computation. There are many
possible directions for improvement, and a group interested in this direction should contact Prof. Baron and
chat with him about it in person.
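A sketch of the encoder side only, with an i.i.d. Gaussian matrix; the power allocation across sections and the decoder, where the real work lies, are omitted, and all the dimensions below are illustrative:

```python
import math
import random

def sparc_encode(bits, A, M, L):
    """Map L*log2(M) message bits to a superposition of L columns of A,
    choosing exactly one column in each of the L sections of M columns."""
    logM = int(math.log2(M))
    n = len(A)                       # number of rows (channel uses)
    x = [0.0] * n
    for sec in range(L):
        chunk = bits[sec * logM:(sec + 1) * logM]
        idx = int("".join(map(str, chunk)), 2)   # column index within the section
        col = sec * M + idx
        for row in range(n):
            x[row] += A[row][col]
    return x

random.seed(7)
n, M, L = 64, 16, 8      # 8 sections of 16 columns -> 8 * log2(16) = 32 bits
A = [[random.gauss(0, 1 / math.sqrt(n)) for _ in range(M * L)] for _ in range(n)]
msg = [random.randrange(2) for _ in range(L * int(math.log2(M)))]
x = sparc_encode(msg, A, M, L)
print(len(msg), "message bits ->", len(x), "channel symbols")
```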
17. Finding the median among a set of numbers.
We discussed sorting algorithms in class and through a project. However, in some applications we are given
a set of numbers and are only interested in the smallest number in the set, or the largest number, or the
median, or perhaps the K'th largest number (this is called an order statistic). Identifying the minimum or
maximum is easily performed in linear time, but what about the median? You could sort N numbers in N*log(N)
time and then select the median, but it turns out that order statistics can be computed in linear time(!).
As part of this project, you will learn more about algorithms for computing order statistics, and implement
an algorithm that computes the median. In addition to conventional algorithms for computing the median,
Dr. Baron has a different idea, and you may want to look into that (chat with him for more details).
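As a baseline before the worst-case linear-time algorithms (such as median-of-medians), here is a quickselect sketch, which runs in expected linear time with a random pivot:

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element (0-indexed) in expected linear time."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        pivot = a[random.randint(lo, hi)]
        # three-way partition of a[lo:hi+1] around the pivot
        lt = [x for x in a[lo:hi + 1] if x < pivot]
        eq = [x for x in a[lo:hi + 1] if x == pivot]
        gt = [x for x in a[lo:hi + 1] if x > pivot]
        a[lo:hi + 1] = lt + eq + gt
        if k < lo + len(lt):
            hi = lo + len(lt) - 1          # answer lies among the smaller values
        elif k < lo + len(lt) + len(eq):
            return pivot                   # answer equals the pivot
        else:
            lo = lo + len(lt) + len(eq)    # answer lies among the larger values

def median(a):
    return quickselect(a, (len(a) - 1) // 2)

random.seed(8)
data = [random.randrange(10000) for _ in range(999)]
print(median(data))
```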
18. Adaptive nearest neighbors
In the K nearest neighbors approach to binary classification, we perform a majority vote. But what if the
majority vote is not conclusive, for example a 10-7 vote? In that case, we may want to increase K. We look
at more and more nearest neighbors until we have a statistically significant majority. Your role will be to
work through the details of this approach, and compare it to other classification approaches.
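One possible way to work out the details, sketched in one dimension; the growth rule and margin threshold below are ad hoc choices that the project would want to justify statistically:

```python
import random

def adaptive_knn(train, x, k0=5, margin=3, kmax=51):
    """Majority vote over the k nearest neighbors, growing k until the vote
    margin reaches `margin` (or kmax is hit); train holds (point, label) pairs."""
    ranked = sorted(train, key=lambda pl: abs(pl[0] - x))   # 1-D distances
    k = k0
    while k <= min(kmax, len(ranked)):
        votes = sum(lbl for _, lbl in ranked[:k])           # labels are 0 or 1
        if abs(2 * votes - k) >= margin:                    # |#ones - #zeros|
            break
        k += 2                                              # keep k odd
    votes = sum(lbl for _, lbl in ranked[:k])
    return 1 if 2 * votes > k else 0

random.seed(9)
train = ([(random.gauss(0, 1), 0) for _ in range(200)]
         + [(random.gauss(2, 1), 1) for _ in range(200)])
held_out = ([(random.gauss(0, 1), 0) for _ in range(100)]
            + [(random.gauss(2, 1), 1) for _ in range(100)])
acc = sum(adaptive_knn(train, x) == y for x, y in held_out) / len(held_out)
print(acc)
```

Comparing this against fixed-K classifiers on the same data would be a natural experiment.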
19. Quantum machine learning
You will learn about quantum computing and look at some quantum algorithm in the data science space.
Quantum machine learning algorithms would be prime candidates to look at; make sure to discuss this with
Dr. Baron. Your role in such a project will be to educate yourself and then other students about these topics.
You are less likely to be able to simulate a quantum algorithm, and from your perspective such a project
will be more of a learning experience.
20. Prediction in the real estate market
In the US, financial assets (mostly stocks and bonds) represent roughly 45% of wealth, while real estate
represents roughly 35%. In contrast, in some other major economies, real estate represents a larger share
of wealth. The US financial markets are the largest in the world, which has likely motivated
many students in past years to work on projects in topics such as financial prediction or optimization.
Still, real estate is also a substantial part of the US economy. In this project, you will learn
about the real estate market and try to see whether you can predict... something... anything. Maybe you
can predict whether renters default on their contracts. Maybe you can predict future changes in housing
prices. Maybe you can predict housing prices in a zip code based on people's incomes and other variables.
Dr. Baron can put you in touch with people who are knowledgeable in this area, and they will likely be
glad to give you some pointers.