Sentiment Analysis of Restaurant Reviews using NLP

Arman Khan
4 min read · Jun 16, 2021

Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback.

With the recent advances in deep learning, the ability of algorithms to analyze text has improved considerably. Creative use of advanced techniques can be a good tool for doing in-depth research. We believe it is important to classify incoming customer conversations about a brand based on the following:
1. Key aspects of a brand’s product and service that customers care about.
2. Users’ underlying intentions and reactions concerning those aspects.

Also, sentiment analysis is the most common text classification tool: it analyses incoming messages, social media posts, forum comments, etc. Closely related tasks are known as intent analysis and profanity analysis.

What does it do?

The sentiment analysis model detects the polarity within a text (positive or negative). Understanding people’s emotions is incredibly important for any business, since users can express themselves in reviews more freely than ever.
For example, the owner of a business used sentiment analysis on the reviews given by customers and found that the majority of them were happy with his product, as you can see in the image below.

Dataset: We will download the dataset from the Superdatascience website. For our project we’ll be using a “.tsv” file instead of a “.csv” file. The reason is that in a .csv file the columns are separated by a “,” i.e. a comma, and many reviews contain one or more commas, so there would be a lot of problems in differentiating the columns.

Therefore, we’ll be using a .tsv file, where the delimiter is “\t” i.e. a tab, as shown in the image below:
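As a quick illustration of the problem (the review text here is made up, and the label column name “Liked” is an assumption based on the dataset’s two-column layout), a comma inside a review would be mistaken for a column separator in a .csv file, while a tab keeps the two columns unambiguous:

import io
import pandas as pd

# In a CSV, the comma inside the review text would split this line
# into three fields instead of two:
#   "Great food, terrible service,0"

# The same row as TSV: the tab cleanly separates review from label.
tsv_data = "Review\tLiked\nGreat food, terrible service\t0\n"
df = pd.read_csv(io.StringIO(tsv_data), delimiter='\t')
print(df)  # one 'Review' column and one 'Liked' column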

Code:

We start off by importing some of the important libraries.

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

We then import the dataset “Restaurant_Reviews.tsv” with the tab delimiter ‘\t’, so pandas knows that our two columns are separated by a tab. The quoting = 3 argument (csv.QUOTE_NONE) tells pandas to ignore quote characters inside the reviews.

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)  # quoting = 3 ignores quote characters
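As a quick sanity check (a sketch, assuming the file loaded correctly; the dataset contains 1,000 reviews with the text in the first column and a 0/1 label in the second):

print(dataset.shape)   # expected: (1000, 2)
print(dataset.head())  # the first few reviews and their 0/1 labels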

Now, the reviews we have are not necessarily in a form we can feed directly into sentiment analysis, so the texts need to be cleaned first.

Here we perform the text cleaning: each review is stripped of everything except letters, lower-cased, and split into words. Each word is stemmed down to its root form, words that appear in the stopword list are discarded, and the remaining words are joined back into a single string that is appended to the corpus.

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))
for i in range(0, 1000):
    # Keep letters only, replacing everything else with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Stem each word and drop the stopwords
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    corpus.append(review)
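To make the cleaning concrete, here is a sketch of what a single review might look like before and after (the sample text is made up; the exact output depends on NLTK’s stopword list and the Porter stemmer):

sample = "Wow... Loved this place, the crust is not good."

cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
cleaned = ' '.join(ps.stem(w) for w in cleaned if w not in all_stopwords)
print(cleaned)  # "wow love place crust good"

Note that “not” is part of NLTK’s default English stopword list, so a phrase like “not good” loses its “not”, which can flip the apparent sentiment of a review.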

Now we’re going to create the bag-of-words model: we take all the different words of the 1,000 reviews, without duplicates or triplicates, so we are left with only the unique words.

Since there are a lot of different words, we will have a lot of columns. We put all these columns in a table where the rows are nothing other than the 1,000 reviews, and the columns correspond to each of the unique words found across the reviews in this corpus.

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)  # keep only the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()     # one row per review, one column per word
y = dataset.iloc[:, 1].values              # the 0/1 sentiment labels
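A quick look at what the vectorizer produced (get_feature_names_out is available in recent scikit-learn versions; older releases use get_feature_names instead):

print(X.shape)                          # expected: (1000, 1500)
print(cv.get_feature_names_out()[:10])  # the first few words in the vocabulary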

We then split the dataset into an 80% training set and a 20% test set.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

We then fit the Naive Bayes classifier to the training set, and then we predict the results using classifier.predict() on the test data.

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

We finally compute the confusion matrix in order to find the accuracy of our model.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
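The accuracy can be read off the confusion matrix, or computed directly with scikit-learn; the counts below are the ones reported in the next paragraph:

from sklearn.metrics import accuracy_score

print(cm)  # rows: actual class, columns: predicted class
# Accuracy = correct predictions / all predictions
print(accuracy_score(y_test, y_pred))  # (55 + 91) / 200 = 0.73 here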

As we can see from the confusion matrix, out of 67 negative reviews our model got 55 predictions right, and out of 133 positive reviews it got 91 predictions right.

That means our accuracy is (55 + 91) / 200 = 73%.

Since we only had eight hundred reviews to train the model, that’s actually not bad. If we had a million reviews, we would get far fewer incorrect predictions, simply because our Naive Bayes model would find more and stronger correlations between the cleaned text reviews and the outcome. But for a model trained on only 800 reviews, this result is actually not bad.
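As a final sketch, here is how the trained model could be used to classify a brand-new review (the review text is hypothetical, and it must go through exactly the same cleaning and vectorization as the training data):

new_review = "The food was absolutely wonderful!"

# Apply the same cleaning steps used on the training corpus
new_review = re.sub('[^a-zA-Z]', ' ', new_review).lower().split()
new_review = ' '.join(ps.stem(w) for w in new_review if w not in all_stopwords)

# Vectorize with the already-fitted CountVectorizer, then predict
new_X = cv.transform([new_review]).toarray()
print(classifier.predict(new_X))  # 1 = positive, 0 = negative (assuming the label column follows this convention)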
