3.2.1 Text Mining
First, we mined the reviews for weather-related terms (e.g., "rainy", "humid"). Only 0.7% of reviews
contain such terms. A few reviews that mention the weather positively are shown in the examples below.
Example 5: Positive weather mentions in Yelp reviews
"What an afternoon of perfection. Great location, great food and awesome weather! I know the
Back Porch isn't responsible for the cool breeze, but the ample..."
"What do you do when it's a rainy Saturday afternoon on the beach? You park it at a local beach
bar of course! Live music, relaxed feel and only a few people..."
"Came here on a rainy Friday afternoon with my husband. What a lovely spot! First of all the decor
and architecture of the space was intriguing and vibrant..."
In Example 6, good weather is mentioned, but in a negative context: it draws crowds to that particular restaurant.
Example 6: Negative weather mentions in Yelp reviews
"Ugh! I have tried to eat at this place three times, forget about it. I was excited to go today the weather
is beautiful and they have outdoor seating, but I..."
Although only a few reviewers explicitly mention the weather in their Yelp reviews, weather could still have an implicit effect.
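The term matching described above can be sketched as follows. This is a minimal illustration: the review snippets and the term list here are stand-ins, not the study's actual lexicon or data.

```python
import re

# Hypothetical mini-corpus standing in for the Yelp review text column.
reviews = [
    "Great location, great food and awesome weather!",
    "Came here on a rainy Friday afternoon with my husband.",
    "The pasta was overcooked and the service was slow.",
]

# Assumed term list; the study's full weather lexicon may differ.
WEATHER_TERMS = ["rainy", "sunny", "humid", "snow", "breeze", "weather"]
pattern = re.compile(r"\b(" + "|".join(WEATHER_TERMS) + r")\b", re.IGNORECASE)

# Keep reviews containing at least one weather term and report the share.
matches = [r for r in reviews if pattern.search(r)]
share = len(matches) / len(reviews)
print(f"{share:.1%} of reviews mention the weather")
```

Word-boundary matching (`\b`) avoids false positives such as "brainy" matching "rainy".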
3.2.2 Correlation of Temperature with Yelp Review Ratings and Sentiment
We do not find strong correlations between temperature and either the review ratings or the sentiment; both
correlation coefficients are near zero. Even when we isolate individual regions, as shown below in Table 3.2.2,
the correlations remain near zero, suggesting that temperature has little effect on Yelp ratings.
Table 3.2.2: Correlation Coefficients
Region | TAVG & review_rating | TAVG & sentiment_polarity
Mountain | 0.06 | -0.06
SouthCentral | 0.04 | -0.01
MidAtlantic | 0.03 | 0.02
Southeast | 0.02 | 0.02
NorthCentral | 0.01 | 0.02
Midwest | 0.00 | 0.03
NewEngland | 0.00 | -0.03
Pacific | 0.00 | -0.01
Northwest | -0.06 | 0.06
All Regions | 0.02 | 0.01
Additional visualizations of these correlations are available separately.
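The per-region correlations in Table 3.2.2 can be computed with a grouped Pearson correlation. The sketch below uses the column names from the report (TAVG, review_rating, sentiment_polarity) on a toy stand-in table, not the study's data.

```python
import pandas as pd

# Toy stand-in for the merged review/weather table.
df = pd.DataFrame({
    "region": ["Mountain", "Mountain", "Mountain", "Pacific", "Pacific", "Pacific"],
    "TAVG": [55.0, 70.0, 62.0, 60.0, 58.0, 65.0],
    "review_rating": [4, 5, 3, 4, 2, 5],
    "sentiment_polarity": [0.3, 0.6, 0.1, 0.4, -0.2, 0.5],
})

# Pearson correlation of temperature with rating and with sentiment, per region.
per_region = df.groupby("region").apply(
    lambda g: pd.Series({
        "TAVG_vs_rating": g["TAVG"].corr(g["review_rating"]),
        "TAVG_vs_sentiment": g["TAVG"].corr(g["sentiment_polarity"]),
    })
)
print(per_region)
```

`Series.corr` defaults to the Pearson coefficient, matching the coefficients reported in the table.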
3.2.3 Clustering Analysis
Next, we used clustering analysis to explore the relationship among temperature, sentiment, and restaurant
location (using rest_lat, rest_long, TAVG, and sentiment_polarity).
We used both KMeans (with k = 3) and DBSCAN. The results, projected to two dimensions with PCA, are
visualized below in Figures 3.2.3a and 3.2.3b. Although DBSCAN better identifies the outliers, no clear
clusters formed. Again, this suggests that there is no strong relationship between temperature and Yelp reviews.
Figure 3.2.3a: KMeans Clustering
Figure 3.2.3b: DBSCAN Clustering
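The clustering pipeline can be sketched as below. The data is a synthetic stand-in for the four features named above, and the DBSCAN parameters are illustrative assumptions, not the values used in the study.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for [rest_lat, rest_long, TAVG, sentiment_polarity].
X = rng.normal(size=(200, 4))

# Standardize so coordinates, temperature, and sentiment share a scale.
Xs = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(Xs)  # label -1 marks outliers

# Project to 2D with PCA for plotting the cluster assignments.
X2 = PCA(n_components=2).fit_transform(Xs)
print("PCA projection shape:", X2.shape)
```

Standardizing before clustering matters here because latitude/longitude and temperature live on very different numeric scales; without it, distance-based methods would be dominated by the largest-scale feature.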
3.2.4 Multivariate Linear Regression
We conducted a multivariate linear regression analysis to explore whether precipitation or temperature
influences customers' sentiment. We split the data into train and test sets, then fit a multivariate linear
regression model on the training set. The X attributes are PRCP and TAVG; the Y attribute is
sentiment_polarity. We then used the test set to evaluate the model's performance.
The performance of this model is poor: the test R² is -0.001. An R² at or below zero means the model predicts
no better than simply using the mean, so the weather attributes (X) carry essentially no linear signal about
sentiment_polarity (Y). The 3D plot below in Figure 3.2.4 illustrates the same point: the blue points represent
the actual data, and the red plane represents the regression model's predictions. We conclude that there is no
significant relationship between the X and Y attributes.
In addition, we built state-level multivariate linear regression models for five states (CA, VA, NE, NY, and AL);
the results were the same.
Figure 3.2.4: Multivariate Linear Regression
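A sketch of this train/test regression workflow is below. The synthetic data is deliberately constructed so that PRCP and TAVG carry no signal about sentiment, mirroring the near-zero R² reported above; it is not the study's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
# Synthetic, uninformative weather features (assumed units: inches, deg F).
PRCP = rng.exponential(scale=0.1, size=n)
TAVG = rng.normal(loc=60, scale=15, size=n)
sentiment = rng.normal(loc=0.2, scale=0.3, size=n)  # independent of X

X = np.column_stack([PRCP, TAVG])
X_train, X_test, y_train, y_test = train_test_split(
    X, sentiment, test_size=0.25, random_state=0
)

# Fit on the training set, evaluate on the held-out test set.
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"test R^2 = {r2:.3f}")
```

Because the features are independent of the target by construction, the held-out R² lands near zero (and can be slightly negative), which is exactly the signature of a model with no explanatory power.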
3.2.5 Machine Learning: Classifying Reviews based on Weather and Location
Lastly, we tested whether Yelp ratings could be predicted based on location and weather alone. We used five
different machine learning techniques: k-Nearest Neighbor (kNN), Decision Tree (CART), Naive Bayes (NB), Support Vector Machines (SVM), and Random Forest.
First, we grouped the rating into two classes: High (4-5 stars) and Low (0-3 stars). Next, we applied each
machine learning algorithm. The results for cross-validation are below in Table 3.2.5.
Naive Bayes and SVC performed the best in cross-validation, with ~80% accuracy.
Table 3.2.5: Cross-Validation Results
Method | Mean | Std. Dev.
NB | 0.80 | 0.01
SVC | 0.80 | 0.01
KNN | 0.77 | 0.01
RF | 0.76 | 0.01
CART | 0.75 | 0.01
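The comparison of the five classifiers can be sketched as below. The features and labels are random stand-ins for the weather/location attributes and the High/Low rating classes, and the models use scikit-learn defaults rather than the study's (unreported) hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 400
# Synthetic stand-in for weather/location features.
X = rng.normal(size=(n, 4))
stars = rng.integers(1, 6, size=n)   # 1-5 star ratings
y = (stars >= 4).astype(int)         # High (4-5 stars) vs Low

models = {
    "NB": GaussianNB(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "CART": DecisionTreeClassifier(random_state=0),
}
# 5-fold cross-validation accuracy, as summarized in Table 3.2.5.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean={scores.mean():.2f} std={scores.std():.2f}")
```

With genuinely uninformative features like these, the cross-validation accuracy tends to track the majority-class rate rather than any real signal.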
The confusion matrix results for each method are provided separately.
The ROC curves for each method are shown below in Figure 3.2.5. The dotted diagonal indicates the performance
expected from random guessing. Our classifiers only barely outperform this baseline, indicating that weather
and location alone predict the rating scarcely better than chance; the ~80% cross-validation accuracy above
likely reflects the class distribution rather than genuine predictive power.
Figure 3.2.5: ROC Curve
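The ROC comparison against the chance diagonal can be sketched as below, using Naive Bayes (which exposes class probabilities) on the same kind of synthetic stand-in features; the data and split are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))     # stand-in weather/location features
y = rng.integers(0, 2, size=400)  # stand-in High/Low rating labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# Predicted probability of the positive (High) class on the test set.
proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC curve and area under it; AUC = 0.5 is the chance diagonal.
fpr, tpr, _ = roc_curve(y_te, proba)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")  # roughly 0.5 for these uninformative features
```

An AUC hugging 0.5 is the quantitative version of a curve that sits on the dotted diagonal: the classifier's ranking of reviews is no better than shuffling them.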