Below are some possible topics for individual projects. This is by no means
intended to be a comprehensive list; these are merely suggestions. Any topic you
propose, if related to the course, will be fine.
0. Data science and the pandemic.
Many data science challenges have come up since the beginning of the pandemic. We will be glad
to hear about possible problems that students propose to study.
1. Kolmogorov complexity
Learn about Kolmogorov complexity, including some ideas by Vitanyi and his coauthors
about how to bring it closer to practice. Apply these ideas to real data.
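Vitanyi and his coauthors have argued that an off-the-shelf compressor can stand in for the (uncomputable) Kolmogorov complexity; one practical outcome is the normalized compression distance (NCD). Below is a minimal sketch using Python's zlib as the stand-in compressor; the strings and parameter choices are purely illustrative:

```python
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for related objects, near 1 for unrelated ones."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = b"4f8a01c9d2e7b3a6" * 55
print(ncd(a, b))  # related strings: small distance
print(ncd(a, c))  # unrelated strings: larger distance
```

A project could apply the same distance to real files (texts, genomes, time series) and cluster them.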
2. Gaussian mixtures
Gaussian mixture distributions feature a superposition of several Gaussian components.
The signal can be generated by one of several possible classes, and each class corresponds
to a different Gaussian component. These models appear in many applications. A project along
these lines would consider data where Gaussian mixtures have potential to be useful, and then
apply an algorithm for learning the mixture model as part of processing the data. Here the
term "processing" is used loosely, and could entail classification, estimation, prediction,
or other data science types of approaches. A well known paper that describes how to learn
Gaussian mixture models is M. Figueiredo and A. K. Jain, "Unsupervised learning of finite
mixture models", IEEE Trans. Pattern Analysis Machine Intelligence, vol. 24, no. 3, pp.
381-396, 2002.
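To give a sense of what "learning the mixture" involves, here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture on synthetic data. This is not the Figueiredo-Jain algorithm (which additionally selects the number of components); the initialization and iteration count are arbitrary choices, and no numerical safeguards are included:

```python
import math
import random

def em_gmm_1d(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    mu = [min(xs), max(xs)]        # crude initialization at the data extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var

random.seed(0)
data = ([random.gauss(-3, 1.0) for _ in range(300)]
        + [random.gauss(4, 1.5) for _ in range(300)])
w, mu, var = em_gmm_1d(data)
print(w, mu, var)
```

On real data, the learned responsibilities can then drive classification or estimation.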
3. Kolmogorov sampler
In 2002, David Donoho published an interesting technical report, The Kolmogorov Sampler.
This report shows how Kolmogorov complexity can be used for denoising: cleaning up
a signal x in R^N observed through noisy measurements y = x + n. The main idea is
to search for the estimated signal x' with smallest complexity that is close to the (known)
measurements y. Donoho provided some interesting theoretical results about this approach,
and one of the highlights is that the algorithm can denoise *universally* without knowing
the input distribution governing x.
This project could take (at least) two forms. One would be theoretical: discuss the results
and possibly see how they might be extended in other directions. Another possible route for the
project involves implementing the approach and running it on real data. I have an implementation of a related
algorithm that could be modified and turned into a denoising algorithm per the approach proposed
by Donoho.
4. Two-part codes
We have briefly touched on two-part codes in class. This project will look at them in
greater detail. In data compression, two-part codes first encode the representation level
that best fits the empirical distribution of the data, and in the second part the data is
compressed based on this representation level. The second part can be implemented using
arithmetic codes, whose implementations appear online. The first part relies heavily
on the details of the two-part code. This project will involve designing quantizers for
the empirical parameter value.
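To make the two parts concrete, here is a sketch of the total code length for an i.i.d. binary sequence, assuming a uniform quantization grid; designing a better grid is exactly where the project's work would go:

```python
import math
import random

def two_part_length(bits, grid_size):
    """Two-part code length in bits for an i.i.d. binary model: the first part
    names one of grid_size quantized parameter values, the second part is the
    ideal code length under that quantized parameter."""
    n = len(bits)
    ones = sum(bits)
    # uniform grid of interior points, so no quantized value hits 0 or 1
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    q = min(grid, key=lambda g: abs(g - ones / n))
    first = math.log2(grid_size)
    second = -ones * math.log2(q) - (n - ones) * math.log2(1 - q)
    return first + second

random.seed(1)
seq = [1 if random.random() < 0.3 else 0 for _ in range(1024)]
print(two_part_length(seq, 32))   # 32 = sqrt(1024) quantization levels
```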
5. Dictionary learning
Homework 8 shows how dictionary learning or clustering can be used to design a set of
patches that represent an image well. There are many possible extensions of the project,
and some of them are discussed at the end of the project. You can certainly come up
with other ideas for possible improvements; feel free to chat with the instructor about
such ideas.
6. Fast sorting algorithms
Homework 3 evaluated how merge sort can be improved by mixing and matching various sorting
algorithms, in particular as part of the "base case" of the algorithm. In this project,
you will explore these and other directions, learn about other sorting algorithms, and
put together an improved sorting implementation. This project is more on the scientific
programming side than some other projects.
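As a starting point, here is a sketch of merge sort with an insertion-sort base case; the cutoff of 32 is an arbitrary choice that you would tune experimentally:

```python
import random

def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; fast for short runs despite O(n^2) worst case."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_merge_sort(a, lo=0, hi=None, cutoff=32):
    """Merge sort that hands short subarrays to insertion sort."""
    if hi is None:
        hi = len(a)
    if hi - lo <= cutoff:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, lo, mid, cutoff)
    hybrid_merge_sort(a, mid, hi, cutoff)
    # merge the two sorted halves a[lo:mid] and a[mid:hi]
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged

random.seed(2)
data = [random.randrange(1000) for _ in range(500)]
expected = sorted(data)
hybrid_merge_sort(data)
print(data == expected)  # → True
```

Timing this against plain merge sort across cutoff values would be a natural first experiment.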
7. LDA, QDA, and more
We learned in class about how linear discriminant analysis (LDA) and its quadratic variant
(QDA) can be used to classify data. In this project, you will consider some data set where
classification should be performed, and see how to use these and/or other classifiers on
the data.
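To see the difference between the two, here is a one-dimensional sketch on synthetic data: LDA pools a single variance across classes, while QDA keeps one per class (equal class priors are assumed; the data and function names are illustrative):

```python
import math
import random

def gauss_loglik(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit_1d(xs0, xs1, shared_variance):
    """Estimate per-class means; pool the variance for LDA, keep it per class for QDA."""
    mu0 = sum(xs0) / len(xs0)
    mu1 = sum(xs1) / len(xs1)
    v0 = sum((x - mu0) ** 2 for x in xs0) / len(xs0)
    v1 = sum((x - mu1) ** 2 for x in xs1) / len(xs1)
    if shared_variance:
        v0 = v1 = (len(xs0) * v0 + len(xs1) * v1) / (len(xs0) + len(xs1))
    return (mu0, v0), (mu1, v1)

def classify(x, c0, c1):
    """Pick the class with the larger Gaussian log-likelihood (equal priors)."""
    return 0 if gauss_loglik(x, *c0) >= gauss_loglik(x, *c1) else 1

random.seed(3)
train0 = [random.gauss(0, 1) for _ in range(500)]
train1 = [random.gauss(3, 1) for _ in range(500)]
lda = fit_1d(train0, train1, shared_variance=True)
qda = fit_1d(train0, train1, shared_variance=False)
held_out = ([(random.gauss(0, 1), 0) for _ in range(200)]
            + [(random.gauss(3, 1), 1) for _ in range(200)])
for name, model in [("LDA", lda), ("QDA", qda)]:
    acc = sum(classify(x, *model) == y for x, y in held_out) / len(held_out)
    print(name, acc)
```

On real multi-dimensional data the means become vectors and the variances covariance matrices, but the structure is the same.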
8. Matrix completion
A well known data science problem from the last few years is known as the Netflix problem.
A huge matrix is only partly observed. Columns correspond to different movies, rows are
users, and each entry of the matrix records how well the user liked that movie. Typical
users have watched a few dozen movies and rated them. However, there are many thousands
of movies, and data about a user's interest in the other movies is unknown. You can easily
see that Netflix (and similar companies offering such services) might be interested in
predicting how well a user may like other movies, and suggesting those movies. To do so,
a technique called matrix completion is involved; unknown matrix entries are estimated
using different models for the matrix. This project will involve identifying a data set
where matrix completion may be useful, studying about matrix completion, and applying one
of the algorithms to the data.
9. Financial prediction
This project involves various possible ways to predict future price changes of financial assets.
One example might be to put together a compression algorithm that compresses past stock returns.
If we can indeed compress past returns, it implies that we have (a bit of) predictive ability
for future returns. Instead of data compression, an approach based on linear regression might
also be useful.
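A minimal regression sketch along those lines: fit r_t = a + b * r_(t-1) by least squares on synthetic returns with mild momentum. Real returns are far noisier, with b typically close to zero; the synthetic coefficient here is an assumption for illustration:

```python
import random

def ar1_fit(returns):
    """Closed-form least squares for r_t = a + b * r_(t-1)."""
    x, y = returns[:-1], returns[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

random.seed(5)
r = [0.0]
for _ in range(5000):   # synthetic returns with mild momentum (true b = 0.2)
    r.append(0.2 * r[-1] + random.gauss(0, 0.01))
a, b = ar1_fit(r)
print(a, b)   # b should land near 0.2
```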
10. Hurricane prediction
Ahead of Hurricane Florence in September 2018, we chatted in class about some issues related to
the hurricane, including weather prediction. In this project, your role will be to learn about
meteorology, find data related to hurricanes, pose some interesting questions, and try to address
them from the data. Possible questions could be: how reliable are forecasts a few days before a
hurricane; can we predict the number of casualties caused by a hurricane; can we predict wind
speeds, flooding levels, or other important parameters?
11. Arithmetic codes
We learned that a symbol with probability p can be encoded using -log2(p) bits, where log2(.) denotes the
base-two logarithm. A more complicated situation arises when encoding sequences of symbols that
contain time dependencies. In this case, the probability mass function (PMF) that governs the
distribution of each individual symbol often differs from the PMFs governing other symbols.
A well known technique to encode such sequences is arithmetic coding. Practically speaking, it
encodes an entire sequence of symbols to within 2-3 bits of the ideal coding length. This is
particularly impressive, because these ideal lengths are often non-integer. In this project,
your first goal is to learn about arithmetic codes, for example from a paper by Cleary, Bell,
and Witten. Next, you can implement your own arithmetic encoder and decoder; or you could download
an existing implementation and use it within your own coding scheme.
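Before implementing a full encoder, it helps to compute the ideal code length an arithmetic coder would approach. The sketch below uses a Laplace-smoothed adaptive model, one simple choice of sequential probability assignment (a real coder would add a few bits of overhead on top of this total):

```python
import math

def adaptive_code_length(symbols, alphabet_size):
    """Ideal code length in bits of an arithmetic coder driven by a
    Laplace-smoothed adaptive model: each symbol costs -log2 of the
    probability the model currently assigns it."""
    counts = [1] * alphabet_size   # Laplace smoothing: every count starts at 1
    total = alphabet_size
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits

seq = [0] * 900 + [1] * 100
adaptive = adaptive_code_length(seq, 2)
ideal = -(900 * math.log2(0.9) + 100 * math.log2(0.1))   # knowing p in advance
print(adaptive, ideal)   # the adaptive model pays only a small overhead
```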
12. Mixtures vs MDL
We briefly discussed in class that mixtures are an alternative modeling approach to MDL. Instead of
choosing the one "best" model in a model class that describes the data most succinctly, mixtures
simultaneously describe the data using all the models, and the probability assigned to the data is a
weighted sum of probabilities assigned by all the models. In this project, you will learn about context
tree weighting (CTW), which is a specific mixture-based algorithm for encoding and decoding by Willems
et al.
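A tiny numerical illustration of the mixture idea (not CTW itself, which mixes over context trees): with a uniform prior over a handful of i.i.d. models, the mixture code length provably stays within log2(number of models) bits of the best single model, without knowing in advance which model is best:

```python
import math

def iid_prob(bits, p):
    """Probability an i.i.d. Bernoulli(p) model assigns to the whole sequence."""
    ones = sum(bits)
    return p ** ones * (1 - p) ** (len(bits) - ones)

seq = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 10
models = [0.1, 0.3, 0.5, 0.7, 0.9]           # candidate Bernoulli parameters
weights = [1 / len(models)] * len(models)    # uniform prior over the models

best = min(-math.log2(iid_prob(seq, p)) for p in models)
mix = -math.log2(sum(w * iid_prob(seq, p) for w, p in zip(weights, models)))
print(best, mix)   # the mixture is within log2(5) bits of the best model
```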
13. Optimal model classes
We discussed in class an example of encoding a length-N i.i.d. binary sequence, where we quantize the
empirical probability of ones to a grid of roughly sqrt(N) possible levels. In this project, you'll learn
a bit more about the optimal quantization grid. Different works that considered optimal quantization
grids, including a precise characterization of the coding lengths involved, include Rissanen in 1996 and
Clarke and Barron (Andrew Barron; not Dror Baron) in the nineties.
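A quick numerical way to see why roughly sqrt(N) levels is sensible: sweep the grid size and watch the first part (log2 of the grid size) trade off against the quantization penalty in the second part. A uniform grid is assumed for simplicity; the works above study better grids:

```python
import math
import random

def two_part_length(bits, grid_size):
    """Two-part code length: log2(grid_size) bits to name the quantized
    parameter, plus the ideal code length under that parameter."""
    n = len(bits)
    ones = sum(bits)
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    q = min(grid, key=lambda g: abs(g - ones / n))
    return (math.log2(grid_size)
            - ones * math.log2(q) - (n - ones) * math.log2(1 - q))

random.seed(6)
n = 4096
seq = [1 if random.random() < 0.3 else 0 for _ in range(n)]
# too coarse a grid wastes bits on quantization error; too fine a grid
# wastes bits naming the parameter -- sqrt(n) = 64 sits near the sweet spot
for g in [2, 8, 64, 512, 4096]:
    print(g, round(two_part_length(seq, g), 1))
```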
14. Resource consumption in circuits
The computer science community has mainly focused on computational complexity when evaluating how the
resources consumed by algorithms scale, with lesser focus on memory use. In contrast, implementing an
algorithm on a dedicated electronic circuit features numerous trade-offs, including power consumption,
clock speed, area of silicon consumed, and the number of transistors. In this project, you'll identify
some algorithm that has interesting trade-offs in a hardware implementation.
A similar project could consider reducing consumption in wireless systems, where communication costs
may far exceed computational costs.
15. Minimizing cache misses
The random access machine model for computation assumes that each basic instruction and memory access
takes one unit of time. But computers often take quite different amounts of time for different instructions.
For example, floating point operations such as exponents and sines may require dozens of clock cycles, while
incrementing an integer may be quicker. This problem of different instructions requiring varying amounts
of time is especially severe in the memory system, where accessing the registers often takes a single clock
cycle, whereas a cache miss to main memory may require dozens or even hundreds of cycles. In this project,
you will identify two (or more) algorithms for computing the same overall function, where one algorithm
performs fewer total instructions but is limited by cache misses (and must access main memory) while the other
algorithm may perform more total instructions but has few cache misses and is thus faster. An interesting
part of this project could be taking some existing algorithm and modifying it in a way that reduces the
number of cache misses, thus offering faster real-world performance.
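Below is a sketch of the classic row-major versus column-major traversal experiment. Note that in CPython, interpreter overhead masks much of the cache effect, so the timing gap will be far smaller than in a compiled language; the code only illustrates the two access patterns you would compare:

```python
import time

N = 1024
a = list(range(N * N))   # a 2-D array stored row-major in one flat list

def sum_rows():
    """Row-major traversal: consecutive addresses, cache-friendly."""
    s = 0
    for i in range(N):
        base = i * N
        for j in range(N):
            s += a[base + j]
    return s

def sum_cols():
    """Column-major traversal: stride-N access, poor cache locality."""
    s = 0
    for j in range(N):
        for i in range(N):
            s += a[i * N + j]
    return s

for f in (sum_rows, sum_cols):
    t0 = time.perf_counter()
    total = f()
    print(f.__name__, total, round(time.perf_counter() - t0, 3))
```

Rewriting the same experiment in C and measuring with hardware performance counters would be a natural extension.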
16. Sparse superposition codes.
This project should be especially appealing to students with a communication engineering background.
Sparse superposition codes (SPARCs) achieve the capacity of the additive white Gaussian noise (AWGN)
channel by transmitting a superposition of individual columns of a matrix, where the matrix is partitioned
into sections, and in each section we choose exactly one column to transmit. With M columns per section,
the column chosen provides log2(M) bits of information. It has been proved that by carefully allocating
how much power each section of the matrix receives, and using an efficient decoding algorithm, SPARCs
achieve the capacity of the AWGN channel despite only requiring moderate computation. There are many
possible directions for improvement, and a group interested in this direction should contact Prof. Baron and
chat with him about it in person.
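A sketch of the encoder side only, with an i.i.d. Gaussian matrix; the power allocation across sections and the decoder, where the real work lies, are omitted, and all the dimensions below are illustrative:

```python
import math
import random

def sparc_encode(bits, A, M, L):
    """Map L*log2(M) message bits to a superposition of L columns of A,
    choosing exactly one column in each of the L sections of M columns."""
    logM = int(math.log2(M))
    n = len(A)                       # number of rows (channel uses)
    x = [0.0] * n
    for sec in range(L):
        chunk = bits[sec * logM:(sec + 1) * logM]
        idx = int("".join(map(str, chunk)), 2)   # column index within the section
        col = sec * M + idx
        for row in range(n):
            x[row] += A[row][col]
    return x

random.seed(7)
n, M, L = 64, 16, 8      # 8 sections of 16 columns -> 8 * log2(16) = 32 bits
A = [[random.gauss(0, 1 / math.sqrt(n)) for _ in range(M * L)] for _ in range(n)]
msg = [random.randrange(2) for _ in range(L * int(math.log2(M)))]
x = sparc_encode(msg, A, M, L)
print(len(msg), "message bits ->", len(x), "channel symbols")
```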
17. Finding the median among a set of numbers.
We discussed sorting algorithms in class and through a project. However, in some applications we are given
a set of numbers and are only interested in the smallest number in the set, or the largest number, or the
median, or perhaps the K'th largest number (this is called an order statistic). Identifying the minimum or
maximum is easily performed in linear time, but what about the median? You could sort N numbers in N*log(N)
time and then select the median, but it turns out that order statistics can be computed in linear time(!).
As part of this project, you will learn more about algorithms for computing order statistics, and implement
an algorithm that computes the median. In addition to conventional algorithms for computing the median,
Dr. Baron has a different idea, and you may want to look into that (chat with him for more details).
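As a baseline before the worst-case linear-time algorithms (such as median-of-medians), here is a quickselect sketch, which runs in expected linear time with a random pivot:

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element (0-indexed) in expected linear time."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        pivot = a[random.randint(lo, hi)]
        # three-way partition of a[lo:hi+1] around the pivot
        lt = [x for x in a[lo:hi + 1] if x < pivot]
        eq = [x for x in a[lo:hi + 1] if x == pivot]
        gt = [x for x in a[lo:hi + 1] if x > pivot]
        a[lo:hi + 1] = lt + eq + gt
        if k < lo + len(lt):
            hi = lo + len(lt) - 1          # answer lies among the smaller values
        elif k < lo + len(lt) + len(eq):
            return pivot                   # answer equals the pivot
        else:
            lo = lo + len(lt) + len(eq)    # answer lies among the larger values

def median(a):
    return quickselect(a, (len(a) - 1) // 2)

random.seed(8)
data = [random.randrange(10000) for _ in range(999)]
print(median(data))
```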
18. Adaptive nearest neighbors
In the K nearest neighbors approach to binary classification, we perform a majority vote. But what if the
majority vote is not conclusive, for example a 10-7 vote? In that case, we may want to increase K. We look
at more and more nearest neighbors until we have a statistically significant majority. Your role will be to
work through the details of this approach, and compare it to other classification approaches.
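One possible way to work out the details, sketched in one dimension; the growth rule and margin threshold below are ad hoc choices that the project would want to justify statistically:

```python
import random

def adaptive_knn(train, x, k0=5, margin=3, kmax=51):
    """Majority vote over the k nearest neighbors, growing k until the vote
    margin reaches `margin` (or kmax is hit); train holds (point, label) pairs."""
    ranked = sorted(train, key=lambda pl: abs(pl[0] - x))   # 1-D distances
    k = k0
    while k <= min(kmax, len(ranked)):
        votes = sum(lbl for _, lbl in ranked[:k])           # labels are 0 or 1
        if abs(2 * votes - k) >= margin:                    # |#ones - #zeros|
            break
        k += 2                                              # keep k odd
    votes = sum(lbl for _, lbl in ranked[:k])
    return 1 if 2 * votes > k else 0

random.seed(9)
train = ([(random.gauss(0, 1), 0) for _ in range(200)]
         + [(random.gauss(2, 1), 1) for _ in range(200)])
held_out = ([(random.gauss(0, 1), 0) for _ in range(100)]
            + [(random.gauss(2, 1), 1) for _ in range(100)])
acc = sum(adaptive_knn(train, x) == y for x, y in held_out) / len(held_out)
print(acc)
```

Comparing this against fixed-K classifiers on the same data would be a natural experiment.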
19. Quantum machine learning
You will learn about quantum computing and look at some quantum algorithm in the data science space.
Quantum machine learning algorithms would be prime candidates to look at; make sure to discuss this with
Dr. Baron. Your role in such a project will be to educate yourself and then other students about these topics.
You are less likely to be able to simulate a quantum algorithm, and from your perspective such a project
will be more of a learning experience.
20. Prediction in the real estate market
In the US, financial assets (mostly stocks and bonds) represent roughly 45% of wealth, while real estate
represents roughly 35%. In contrast, in some other major economies, real estate represents a larger share
of wealth. The US financial markets are the largest in the world, which has likely motivated
many students in past years to work on projects in topics such as financial prediction or optimization.
Still, real estate is also a substantial part of the US economy. In this project, you will learn
about the real estate market and try to see whether you can predict... something... anything. Maybe you
can predict whether renters default on their contracts. Maybe you can predict future changes in housing
prices. Maybe you can predict housing prices in a zip code based on people's incomes and other variables.
Dr. Baron can put you in touch with people who are knowledgeable in this area, and they will likely be
glad to give you some pointers.