The goal of this project is to use machine learning to identify topics covered in a lecture from the lecture content itself. It can then be used to locate lectures that cover any particular topic of interest.

The program has two parts. First, it learns to identify different subject categories from Wikipedia pages. It does so by training a Naive Bayes classifier on the documents, which are converted into a bag of words and transformed by a Tf-idf vectorizer. To keep things simple, only glossary pages on some subjects are used. Then, the classifier is used to identify the subjects of lecture texts fetched from the MIT OCW website.

In the second part, Non-Negative Matrix Factorization (NMF), a popular method for topic modelling, is used. It groups together phrases that are likely to form a particular topic. The most frequent phrases from the dominant topic in each lecture are used to make a list of keywords for that lecture.

The app is developed in Python using the scikit-learn machine learning library and the Django web framework. It is hosted on Amazon Elastic Beanstalk.

Start

The program is trained to identify subjects using subject glossary pages from Wikipedia. The following pages are included by default; pages can be added or removed. Click GO to open a page in a separate window. Note that too many pages will slow down the calculation, while too few will reduce accuracy.

Click LEARN to train.

1 | physics

2 | chemistry terms

3 | biology

4 | probability and statistics

5 | elementary quantum mechanics

6 | artificial intelligence

7 | astronomy

8 | game theory
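The training step described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the `glossary_pages` texts are short made-up stand-ins for the Wikipedia glossary pages, and the function name `train_subject_classifier` and the vectorizer and classifier settings (`stop_words`, `ngram_range`, `alpha`) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the glossary pages: subject name -> page text.
# In the app these texts would come from the Wikipedia glossary pages.
glossary_pages = {
    "physics": "force mass acceleration energy momentum velocity gravity",
    "chemistry terms": "acid base molecule reaction bond electron valence",
    "biology": "cell gene protein organism enzyme membrane species",
}


def train_subject_classifier(pages):
    """Fit a Tf-idf + Naive Bayes pipeline on a {subject: text} mapping."""
    subjects = list(pages)
    texts = [pages[subject] for subject in subjects]
    model = make_pipeline(
        # Bag of words (unigrams and bigrams) weighted by Tf-idf; assumed settings.
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        # Light smoothing so unseen words do not zero out a subject; assumed value.
        MultinomialNB(alpha=0.1),
    )
    model.fit(texts, subjects)
    return model


model = train_subject_classifier(glossary_pages)
prediction = model.predict(["the gene encodes a protein inside the cell"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary learned during training attached to the model, so new text is vectorized the same way at prediction time.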

Before the classifier can be used to make predictions, the lectures need to be downloaded. Select a lecture from the dropdown below and a script will download its transcript into memory directly from the MIT website. The transcripts are found under the "Transcript" tab on each video page.
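A download step of this kind could look roughly like the sketch below, which uses only the Python standard library. The helper names `html_to_text` and `fetch_transcript` are hypothetical, and the sketch simply keeps all visible text on a page; how the app actually locates the transcript inside the page is not described here, so that part is left out.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_transcript(url, timeout=30):
    """Download a page and return its text; nothing is written to disk."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


# Offline demonstration on an inline HTML snippet (no network access needed).
sample = (
    "<html><style>p{}</style><p>The wave function</p>"
    "<script>x=1</script><p>is normalized.</p></html>"
)
text = html_to_text(sample)
print(text)
```

Calling `fetch_transcript` with the address of a lecture page would return that page's text as a string held in memory, ready to be passed to the vectorizer.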

The downloaded lectures are tokenized, transformed into a Tf-idf matrix, and fed to the classifier to make predictions. The output is a list of subjects closely matching the text of each lecture. For accurate predictions, the actual subject must be among the Wikipedia pages used to train the model.
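As a rough sketch of the prediction step, the closest subjects for each lecture can be ranked by the classifier's class probabilities. The training texts, the lecture texts, and the function name `rank_subjects` below are made up for illustration; only the general approach (reuse the fitted vectorizer, then classify) follows the description above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the glossary pages (assumed values).
train_texts = [
    "wave function operator eigenvalue momentum uncertainty superposition",
    "acid base molecule reaction bond electron valence",
    "star galaxy planet orbit telescope nebula comet",
]
train_subjects = ["elementary quantum mechanics", "chemistry terms", "astronomy"]

vectorizer = TfidfVectorizer(stop_words="english")
classifier = MultinomialNB(alpha=0.1)
classifier.fit(vectorizer.fit_transform(train_texts), train_subjects)


def rank_subjects(lecture_texts, top_n=2):
    """Return the top_n most probable subjects for each lecture text."""
    # transform (not fit_transform): lectures must use the training vocabulary.
    features = vectorizer.transform(lecture_texts)
    probabilities = classifier.predict_proba(features)
    # Sort each row by descending probability and keep the first top_n columns.
    order = np.argsort(probabilities, axis=1)[:, ::-1][:, :top_n]
    return [[str(classifier.classes_[j]) for j in row] for row in order]


lectures = [
    "the wave function and the momentum operator",
    "a comet orbit observed through a telescope",
]
ranked = rank_subjects(lectures)
print(ranked)
```

Words in a lecture that never appeared in the training pages are ignored by `transform`, which is one reason the true subject has to be present among the training pages.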

Click to identify subjects.

Results

1 | chemistry terms |

2 | chemistry terms |

3 | elementary quantum mechanics |

4 | elementary quantum mechanics |

5 | elementary quantum mechanics |

6 | elementary quantum mechanics |

7 | elementary quantum mechanics |

8 | elementary quantum mechanics |

9 | elementary quantum mechanics |

10 | elementary quantum mechanics |

Non-negative matrix factorization is an unsupervised machine learning method for modelling the topics of a given set of documents. It works by grouping together keywords that are likely to form a topic. Here, NMF is applied to the already tokenized Tf-idf matrix generated from the lecture texts.

Results: 7 topics found.

1 | function, phi, wave function, operator, equation |

2 | color, hardness, white, electrons, electron |

3 | momentum, function, squared, wave function, value |

4 | squared, equation, plus, psi, nodes |

5 | operator, hat, operators, function, x0 |

6 | electrons, light, experiment, electron, slit |

7 | dagger, operator, adjoint, bar, phi |

The matrix product of the NMF components with the Tf-idf array offers a useful visualization of the topics in each lecture. The bar chart plots this product, which shows the distribution of topics across the lectures. Each color represents a topic and each bar represents a lecture. Lectures that cover similar topics have the same dominant color.
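The product described above can be illustrated with a few lines of NumPy. The two small matrices are hypothetical values chosen by hand (three lectures, four terms, two topics), and normalizing each row to shares is an added assumption that makes the bars comparable.

```python
import numpy as np

# Hypothetical Tf-idf array: 3 lectures x 4 terms.
tfidf = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.1, 0.7, 0.7],
    [0.6, 0.8, 0.0, 0.1],
])

# Hypothetical NMF components: 2 topics x 4 terms.
components = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8],
])

# Lectures x topics: how strongly each lecture's terms overlap each topic.
topic_weights = tfidf @ components.T

# Normalize rows so each lecture's topic shares sum to 1 (one stacked bar each).
shares = topic_weights / topic_weights.sum(axis=1, keepdims=True)

# The dominant topic is the largest segment of each bar.
dominant = shares.argmax(axis=1)
print(np.round(shares, 2))
print(dominant)
```

Plotting each row of `shares` as one stacked bar, with one color per column, gives a chart of the kind described: lectures 1 and 3 here share a dominant topic, so their bars would have the same dominant color.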

In the final step, the result of the supervised classifier is combined with the subtopics extracted using NMF. For each lecture, the top key phrases of its dominant topic are taken as candidates, and the ones that occur most frequently are selected as the most likely representatives.

1 | chemistry terms | electron, white, experiment, soft, send, path

5 | chemistry terms | electron, wave, experiment, light, low, slit

2 | elementary quantum mechanics | wave function, momentum, psi, wavelength, likely, uncertainty

2 | elementary quantum mechanics | momentum, value, squared, derivative, state, average

0 | elementary quantum mechanics | wave function, operator, time, acting, equation, value

0 | elementary quantum mechanics | wave function, time, operator, equal, zero, acting

0 | elementary quantum mechanics | wave function, phi, eigenfunctions, state, time, value

3 | elementary quantum mechanics | squared, equation, psi, plus, minus, infinity

6 | elementary quantum mechanics | operator, acting, state, dagger, minus, time

0 | elementary quantum mechanics | state, value, operator, phi, eigenfunctions, time
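One way to select keywords like those listed above is sketched below. All values are made up, the function name `lecture_keywords` is hypothetical, and the Tf-idf weight of a phrase in a lecture is used as a stand-in for how frequent it is, which is an assumption about the selection rule.

```python
import numpy as np

terms = np.array(["wave function", "operator", "momentum", "electron", "slit", "light"])

# Hypothetical NMF components: 2 topics x 6 terms.
components = np.array([
    [0.9, 0.8, 0.6, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.9, 0.7, 0.6],
])

# Hypothetical Tf-idf rows: 2 lectures x 6 terms.
tfidf = np.array([
    [0.7, 0.1, 0.6, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.5, 0.8, 0.3],
])


def lecture_keywords(lecture_row, n_candidates=3, n_keywords=2):
    """Return (dominant topic index, keywords) for one lecture's Tf-idf row."""
    # Dominant topic: the topic whose terms overlap most with this lecture.
    dominant = int(np.argmax(components @ lecture_row))
    # Candidates: the strongest terms of that topic.
    candidates = np.argsort(components[dominant])[::-1][:n_candidates]
    # Keep the candidates that carry the most weight in this lecture.
    ranked = sorted(candidates, key=lambda i: lecture_row[i], reverse=True)
    return dominant, [str(terms[i]) for i in ranked[:n_keywords]]


results = [lecture_keywords(row) for row in tfidf]
for topic, keywords in results:
    print(topic, "|", ", ".join(keywords))
```

Pairing each result with the subject predicted by the classifier for the same lecture would give rows of the form topic, subject, keywords.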

To start over with the default calculation, click Restore.