A text prediction application for movie reviews with sentiment analysis in real-time
The Large Movie Review Dataset (from Stanford University) consists of 50,000 movie reviews (50% negative and 50% positive). The set is divided into training and validation datasets, each with 25,000 movie reviews and an equal number of positive and negative reviews.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and was used in a 2011 paper ("Learning Word Vectors for Sentiment Analysis" by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts), in which a 50/50 split of the data was used for training and testing and an accuracy of 88.89% was achieved. The data was also used as the basis for a Kaggle competition titled "Bag of Words Meets Bags of Popcorn" from late 2014 to early 2015. Entries reached accuracy above 97%, with the winners achieving 99%.
The project consists of two parts: the sentiment classification task and the next word prediction task. For the sentiment classification task, several machine learning algorithms, covering both supervised and unsupervised learning, were implemented to build and analyze models. For the text prediction task, n-gram models were built to predict the next word. The models are evaluated, and the best sentiment model is selected and combined with the language model to build a web application as an AI solution.
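To illustrate the next word prediction idea, here is a minimal sketch of a count-based n-gram predictor. It is not the project's actual code: the function names (`tokenize`, `build_ngram_counts`, `predict_next`), the whitespace tokenizer, the toy corpus, and the choice to fall back from a trigram context to a bigram context are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def build_ngram_counts(corpus, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for review in corpus:
        tokens = tokenize(review)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts


def predict_next(text, trigrams, bigrams, k=3):
    """Return up to k candidate next words.

    The two-word (trigram) context is tried first; if it was never seen,
    the one-word (bigram) context is used instead (assumed fallback).
    """
    tokens = tokenize(text)
    for model, size in ((trigrams, 2), (bigrams, 1)):
        if len(tokens) < size:
            continue
        context = tuple(tokens[-size:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []


# Toy corpus standing in for the review text (hypothetical data).
corpus = [
    "this movie was great",
    "this movie was great fun",
    "this movie was boring",
    "the movie was great",
]
bigrams = build_ngram_counts(corpus, 2)
trigrams = build_ngram_counts(corpus, 3)

print(predict_next("I thought this movie was", trigrams, bigrams))
print(predict_next("that film was", trigrams, bigrams))
```

The first call finds the context "movie was" in the trigram counts, while the second has an unseen two-word context and falls back to the bigram counts for "was". A fuller model would typically add smoothing and a more careful tokenizer.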
The detailed report can be found here
The application is built mainly in Python. The web application interface was built with Anvil. The Python program currently runs on a DigitalOcean cloud server. Because the instance is limited to 3 GB of memory and 60 GB of disk, the currently running version of the application uses the neural network model for the sentiment classification task and n-gram models (a bigram model and a trigram model) for the next word prediction task.
The application is a prototype. There is still room for improvement.
The project is owned by Jason Maloney, Bing-Je Wu, Maya Mileva and Antonio Llorens. This project is inspired by Kevin Markham from Data School. An adaptive version of the application was built based on the concept of this application.