The goal of this project is to use machine learning to identify topics covered in a lecture from the lecture content itself. It can then be used to locate lectures that cover any particular topic of interest.
The program has two parts. First, it learns to identify different subject categories from Wikipedia pages. It does so by training a Naive Bayes classifier on the documents, which are converted into a bag of words and transformed by a TF-IDF vectorizer. To keep things simple, only glossary pages on a few subjects are used. The classifier is then used to identify the subjects of lecture texts fetched from the MIT OCW website.
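The first part can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, and the training texts, labels, and the `lecture_text` string are hypothetical stand-ins for the Wikipedia glossary pages and the MIT OCW lecture texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for glossary pages, one short "document" per subject.
train_docs = [
    "derivative integral limit continuity series convergence",
    "matrix vector eigenvalue determinant linear transformation",
    "force velocity acceleration momentum energy mass",
    "cell protein enzyme dna gene membrane",
]
train_labels = ["calculus", "linear algebra", "physics", "biology"]

# TfidfVectorizer builds the bag of words and applies TF-IDF weighting;
# MultinomialNB is the Naive Bayes classifier trained on those features.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
classifier.fit(train_docs, train_labels)

# Hypothetical lecture text; in the project this would come from MIT OCW.
lecture_text = "Today we compute the eigenvalue and determinant of a matrix."
predicted_subject = classifier.predict([lecture_text])[0]
print(predicted_subject)
```

Wrapping the vectorizer and classifier in one pipeline means the lecture text is transformed with the same vocabulary and weights that were learned from the training pages.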
In the second part, Non-Negative Matrix Factorization (NMF), a popular method for topic modelling, is used. It groups together phrases that are likely to form a particular topic. The most frequent phrases from the dominant topic in each lecture are used to make a list of keywords for that lecture.