How It Works

The goal of this project is to use machine learning to identify topics covered in a lecture from the lecture content itself. It can then be used to locate lectures that cover any particular topic of interest.

The program has two parts. First, it learns to identify different subject categories from Wikipedia pages. It does so by training a Naive Bayes classifier on the documents, which are converted into a bag of words and transformed by a Tf-idf vectorizer. To keep things simple, only glossary pages on some subjects are used. The classifier is then used to identify the subjects of lecture texts fetched from the MIT OCW website.

In the second part, Non-Negative Matrix Factorization (NMF), a popular method for topic modelling, is used. It groups together phrases that are likely to form a particular topic. The most frequent phrases from the dominant topic in each lecture are used to make a list of keywords for that lecture.

The app is developed in Python using the scikit-learn machine learning library and the Django web framework. It is hosted on Amazon Elastic Beanstalk.

Learn from Wikipedia

The program is trained to identify subjects using subject glossary pages from Wikipedia. The following pages are included by default; pages can be added or removed. Click GO to open a page in a separate window. Note that too many pages will slow down the calculation, while too few will reduce accuracy.

Click LEARN to train.

1 physics
GO
2 chemistry terms
GO
3 biology
GO
4 probability and statistics
GO
5 elementary quantum mechanics
GO
6 artificial intelligence
GO
7 astronomy
GO
8 game theory
GO
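
As a rough illustration of the training step, the sketch below fits a Tf-idf vectorizer and a Naive Bayes classifier with scikit-learn. The short strings in `glossaries` are hypothetical stand-ins for the text of the Wikipedia glossary pages, and the use of `MultinomialNB` with English stop-word removal is an assumption; the app's actual settings are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the text of three glossary pages.
glossaries = {
    "physics": "force mass energy momentum velocity acceleration gravity field",
    "chemistry terms": "atom molecule reaction acid base bond electron compound",
    "biology": "cell gene protein organism species enzyme membrane evolution",
}

subjects = list(glossaries)
documents = [glossaries[name] for name in subjects]

# Bag of words -> Tf-idf weights -> Naive Bayes, chained in one pipeline.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(documents, subjects)

# Classify a made-up sentence standing in for a lecture transcript.
lecture = "the reaction between an acid and a base forms a new compound"
predicted = model.predict([lecture])[0]
print(predicted)
```

With one document per subject, as here, each class has the same prior, so the prediction depends only on which glossary shares the most heavily weighted words with the lecture.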

Get Lectures

Before the classifier can be used to make predictions, the lectures need to be downloaded. Select a lecture from the dropdown below and a script will download its transcript into memory directly from the MIT website. The transcripts are found under the "Transcript" tab on each video page.

Identify Subjects

The downloaded lectures are tokenized, transformed into a Tf-Idf matrix, and fed to the classifier to make predictions. The output is a list of subjects closely matching the text of each lecture. For accurate predictions, the actual subject of a lecture must be among the Wikipedia pages used to train the model.

Click to identify subjects.

Results

1 chemistry terms
2 chemistry terms
3 elementary quantum mechanics
4 elementary quantum mechanics
5 elementary quantum mechanics
6 elementary quantum mechanics
7 elementary quantum mechanics
8 elementary quantum mechanics
9 elementary quantum mechanics
10 elementary quantum mechanics
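
The prediction step described in this section can be sketched as follows. The point it illustrates is that a lecture is transformed with the vectorizer already fitted on the training pages (not refitted), and that subjects can be ranked by the classifier's class probabilities. The training strings, the helper `rank_subjects`, and its `top_n` parameter are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for glossary page text and their subject labels.
train_texts = [
    "force mass energy momentum velocity gravity acceleration",
    "wave function operator eigenfunction superposition uncertainty state",
    "star planet galaxy orbit telescope nebula comet",
]
train_labels = ["physics", "elementary quantum mechanics", "astronomy"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)


def rank_subjects(lecture_text, top_n=2):
    """Return the top_n (subject, probability) pairs for one lecture."""
    # Reuse the fitted vocabulary; words unseen in training are ignored.
    X = vectorizer.transform([lecture_text])
    probs = clf.predict_proba(X)[0]
    order = np.argsort(probs)[::-1][:top_n]
    return [(str(clf.classes_[i]), float(probs[i])) for i in order]


ranking = rank_subjects("the wave function is an eigenfunction of the momentum operator")
print(ranking)
```

Because only the training subjects appear in `clf.classes_`, a lecture on a subject outside that list can only be mapped to its nearest trained subject.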

Matrix Factorization

Non-negative matrix factorization is an unsupervised machine learning method for modelling the topics of a given set of documents. It works by grouping together keywords that are likely to form a topic. Here, NMF is applied to the already tokenized Tf-Idf matrix generated from the lecture texts.

Results: 7 topics found.

1 function, phi, wave function, operator, equation
2 color, hardness, white, electrons, electron
3 momentum, function, squared, wave function, value
4 squared, equation, plus, psi, nodes
5 operator, hat, operators, function, x0
6 electrons, light, experiment, electron, slit
7 dagger, operator, adjoint, bar, phi

The matrix product of the NMF components with the Tf-Idf array offers a useful visualization of the topics in each lecture. The bar chart plots this product, which shows the distribution of topics across the lectures. Each color represents a topic and each bar represents a lecture. Lectures that cover similar topics have the same dominant color.
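
The product behind the chart can be sketched with NumPy. The two small arrays are made-up numbers standing in for the Tf-Idf array and the NMF components, and normalising each row into shares is an assumption added here to make the bars comparable; the chart itself may plot the raw product.

```python
import numpy as np

# Hypothetical Tf-Idf array: 3 lectures x 4 terms.
tfidf = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.6, 0.6, 0.4, 0.0],
    [0.0, 0.3, 0.9, 0.3],
])

# Hypothetical NMF components: 2 topics x 4 terms.
components = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Lectures x topics: how strongly each lecture loads on each topic.
scores = tfidf @ components.T

# Turn each row into shares that sum to 1 (assumes no all-zero row).
shares = scores / scores.sum(axis=1, keepdims=True)

# The dominant topic of a lecture is the column with the largest share.
dominant = shares.argmax(axis=1)
print(dominant)
```

Each row of `shares` corresponds to one bar, and each column to one color in a stacked bar chart.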

Final Predictions

In the final step, the result of the supervised classifier is combined with the subtopics extracted using NMF. For each lecture, the top key phrases of its dominant topic are taken, and the most frequent of these, being the most likely representatives, are selected.

Topic  Subject                       Keywords
1      chemistry terms               electron, white, experiment, soft, send, path
5      chemistry terms               electron, wave, experiment, light, low, slit
2      elementary quantum mechanics  wave function, momentum, psi, wavelength, likely, uncertainty
2      elementary quantum mechanics  momentum, value, squared, derivative, state, average
0      elementary quantum mechanics  wave function, operator, time, acting, equation, value
0      elementary quantum mechanics  wave function, time, operator, equal, zero, acting
0      elementary quantum mechanics  wave function, phi, eigenfunctions, state, time, value
3      elementary quantum mechanics  squared, equation, psi, plus, minus, infinity
6      elementary quantum mechanics  operator, acting, state, dagger, minus, time
0      elementary quantum mechanics  state, value, operator, phi, eigenfunctions, time
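
One possible reading of the selection step is sketched below: take the top phrases of a lecture's dominant topic, count how often each occurs in that lecture, and keep the most frequent ones. The helper `lecture_keywords`, its `max_keywords` parameter, and the sample text are hypothetical, and plain substring counting is a simplification (it would also count "operators" as "operator").

```python
def lecture_keywords(lecture_text, topic_terms, max_keywords=3):
    """Rank a topic's phrases by how often they occur in one lecture."""
    text = lecture_text.lower()
    # Substring counts, so multi-word phrases such as "wave function" work too.
    counts = {term: text.count(term) for term in topic_terms}
    present = [term for term in topic_terms if counts[term] > 0]
    # Stable sort: ties keep the topic's original ordering.
    ranked = sorted(present, key=lambda term: counts[term], reverse=True)
    return ranked[:max_keywords]


# Top phrases of one topic, and a made-up lecture excerpt.
topic_terms = ["operator", "dagger", "adjoint", "bar", "phi"]
lecture = (
    "The operator a dagger is the adjoint of the operator a. "
    "Acting with the operator on phi gives the state."
)

keywords = lecture_keywords(lecture, topic_terms)
print(keywords)
```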

To start over with the default calculation, click Restore.