Game Review Rating

Term Project – Data mining

Hengchao Wang 1001778272

Download my files

Link to my website

Link to GitHub repository

Link to video on YouTube

Link to kernel on Kaggle

References

BoardGameGeek Reviews Baseline Model https://www.kaggle.com/ellpeeaxe/boardgamegeek-reviews-baseline-model

Word2vec In Supervised NLP Tasks. Shortcut https://www.kaggle.com/vladislavkisin/word2vec-in-supervised-nlp-tasks-shortcut/comments

Because the dataset is very large, a one-hot representation of words and sentences is impractical: it leads to the curse of dimensionality, producing a matrix that is too big and too sparse to compute with. So I decided to use the Word2Vec word-embedding model to reduce the dimensionality of the matrix. I used two references; the links are shown above.

The underlying task is a regression problem. The input is a 300-dimensional word vector for each review, and the output is the predicted rating for that review.

import numpy as np 
import pandas as pd 
import nltk
import re,string,unicodedata
import seaborn as sns
import gensim
import sklearn

from pandas import Series
from wordcloud import WordCloud,STOPWORDS
from bs4 import BeautifulSoup
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from gensim.models import word2vec, Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, BayesianRidge
import joblib
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import ensemble

Get data from CSV

# get review and rating columns
review_path = 'bgg-13m-reviews.csv'

data = pd.read_csv(review_path, usecols=[2,3])
data.head()

   rating  comment
0    10.0  NaN
1    10.0  NaN
2    10.0  Currently, this sits on my list as my favorite...
3    10.0  I know it says how many plays, but many, many ...
4    10.0  NaN
# remove null comment
def remove_nan(data):
    data['comment']=data['comment'].fillna('null')
    data = data[~data['comment'].isin(['null'])]
    data = data.reset_index(drop=True)
    return data
data = remove_nan(data)
data.head()

   rate  comment
0  10.0  currently , thi sit list favorit game .
1  10.0  know say mani plays , many , mani uncounted. l...
2  10.0  never tire thi game .. awesom
3  10.0  thi probabl best game ever played. requir thin...
4  10.0  fantast game. got hook game .

This is the data description. The number of reviews is 2,637,756.

data.describe()

             rating
count  2.637756e+06
mean   6.852070e+00
std    1.775769e+00
min    1.401300e-45
25%    6.000000e+00
50%    7.000000e+00
75%    8.000000e+00
max    1.000000e+01

Data preprocessing

For data preprocessing I use ToktokTokenizer() from the NLTK library to tokenize the words, load the English stopword list from NLTK, and strip HTML tags with BeautifulSoup. Regular expressions are used to remove square brackets and other special characters.

#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
data['comment']=data['comment'].apply(remove_between_square_brackets)
#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text
#Apply function on review column
data['comment']=data['comment'].apply(remove_special_characters)
#Stemming the text
def simple_stemmer(text):
    ps=nltk.porter.PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
data['comment']=data['comment'].apply(simple_stemmer)
#set stopwords to english
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
#Apply function on review column
data['comment']=data['comment'].apply(remove_stopwords)
{"you'd", "that'll", 'other', 'any', "won't", "you're", 'have', 'yourselves', 'about', 'm', 'were', 'our', 'than', 'their', 'haven', 'being', 'over', 't', 'been', 'against', 'again', 'we', 'most', 'doesn', 'so', 'yourself', "aren't", 'mustn', 'under', 'just', 'down', 'ma', 'with', 'until', 'isn', 'don', 'shan', "shouldn't", 'myself', "you've", 'having', 'has', 'between', 'because', 'was', 'yours', 'nor', 'am', 'through', 'his', 'as', 'few', 'but', 'and', 'before', 'itself', 'hers', 'during', "mustn't", 'y', 'doing', 'an', "you'll", 'they', 'hasn', 'did', 'each', "couldn't", 'ours', 'weren', 'hadn', 'there', 'then', "doesn't", 'that', 'this', 'needn', 'no', 'i', 'aren', 'too', 'once', 'you', 'themselves', 'her', 'these', 'll', 'won', 'out', 'how', "it's", 'herself', 'to', 'when', 'o', 'my', 'of', 'into', "didn't", "hadn't", 'very', 'him', 'what', 'now', 'who', 'are', 'if', 'in', 'above', 'why', 'all', 'off', 'where', 'd', 'didn', 'couldn', 'while', 'does', 'she', 'wasn', 'theirs', 'the', 'your', "should've", 'by', 'up', 'whom', 'a', "weren't", 'same', "hasn't", 'mightn', "shan't", 'some', 'from', 'below', 're', 'which', 'those', "don't", "mightn't", 'will', 'its', 'only', "needn't", 'himself', 's', 'more', 'such', 'not', 'he', 'on', 'own', "she's", 'is', "haven't", 'be', 've', 'further', 'do', 'should', 'them', 'had', "wasn't", 'me', 'both', 'shouldn', 'or', 'can', 'for', 'ain', 'it', 'ourselves', "isn't", "wouldn't", 'here', 'at', 'after', 'wouldn'}

After removing the stopwords we need to remove empty reviews again, because some short reviews become empty once their stopwords are removed.

data = remove_nan(data)
data.to_csv('data_after_remove_st.csv', header=False, index=False, encoding = 'utf-8')
columns = ['rate', 'comment']
data = pd.read_csv('data_after_remove_st.csv',names = columns)
data['comment'] = data.comment.str.lower()
data['document_sentences'] = data.comment.str.split('.') 
# data['tokenized_sentences'] = data['document_sentences']
data['tokenized_sentences'] = list(map(lambda sentences:list(map(nltk.word_tokenize, sentences)),data.document_sentences))  
data['tokenized_sentences'] = list(map(lambda sentences: list(filter(lambda lst: lst, sentences)), data.tokenized_sentences))
data.head()

         rate  comment                                             document_sentences                                  tokenized_sentences
993001    7.0  good deduct game , go wrong question answer wr...  [good deduct game , go wrong question answer w...  [[good, deduct, game, ,, go, wrong, question, ...
1965460   6.0  thi reason simpl area control game nice mechan...  [thi reason simpl area control game nice mecha...  [[thi, reason, simpl, area, control, game, nic...
273330    7.8  awesom game. sleeved .                              [awesom game, sleeved , ]                           [[awesom, game], [sleeved]]
579587    7.2  thi everyth want bang ! , except much streamli...  [thi everyth want bang ! , except much streaml...  [[thi, everyth, want, bang, !, ,, except, much...
740450    6.0  fun parti trivia game .                             [fun parti trivia game , ]                          [[fun, parti, trivia, game]]

Challenge 1

Here is a hint: a list of strings (String[]) cannot be saved to CSV. After saving to CSV, the tokenized_sentences column is converted to a plain string and cannot be loaded back in its original form. This was one of the challenges I met. In the first few rounds of training the Word2Vec model, the final accuracy was very low. I checked the vector for each word, and the values output by Word2Vec were smaller than 0.0001, which would mean these words almost never appear in the dataset. That didn't make sense, so I inspected the model: model.wv.vocab.keys() was small too, and the vocabulary consisted of single letters rather than words. So it had to be a splitting or format problem. I then checked the type of each variable and found that the type of "tokenized_sentences" had changed. After googling the issue, I found the point: you cannot save a list of strings in a CSV file.

I left the wrong code as comments in the next two cells.

# data.to_csv("data_after_pre.csv",sep=',',index=False, encoding = 'utf-8')
# data = pd.read_csv('data_after_pre.csv')
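A minimal sketch of one way this could be avoided, assuming a pickle file is acceptable (the file name 'data_after_pre.pkl' is just an example, not a file from the project): pandas' to_pickle/read_pickle keep list-valued columns intact, so tokenized_sentences survives the save/load round trip.

# Sketch: pickle preserves the list-of-lists structure of tokenized_sentences,
# while CSV flattens it into a plain string.
data.to_pickle('data_after_pre.pkl')
data = pd.read_pickle('data_after_pre.pkl')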

The next cell is not run when I train the Word2Vec model: Word2Vec is trained on the whole dataset. The cell is run when I train the regression models, because with the computation available I can only use around 50k reviews to train a regression model, so I use 10k and 50k reviews and compare them.

# Take 100k reviews after random ordering
data = data.reindex(np.random.permutation(data.index))[:100000]
# split the data into training data and test data.
train, test, y_train, y_test = train_test_split(data, data['rate'], test_size=.2)
type(train.tokenized_sentences[993001])
list
#Collecting a vocabulary
voc = []
for sentence in train.tokenized_sentences:
    voc.extend(sentence)
#     print(sentence)

print("Number of sentences: {}.".format(len(voc)))
print("Number of rows: {}.".format(len(train)))
Number of sentences: 237600.
Number of rows: 80000.
voc[:10]
[['(', 'vanilla', 'game', 'only'],
 ['play',
  'beyond',
  'black',
  ',',
  'mechan',
  'chang',
  'present',
  'expans',
  'might',

Word2Vec model: train, save and load

The number of features in my Word2Vec model is 300. A one-hot representation would give a matrix of about 150k (vocabulary) * 2.6M (reviews); with the embedding, the curse of dimensionality is gone.
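To make the comparison concrete: a dense one-hot document-term matrix would need roughly 150,000 * 2,600,000, about 3.9e11 cells, while 300-dimensional vectors for the same 2.6M reviews need only 300 * 2,600,000, about 7.8e8 values, roughly a 500x reduction.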

# Word2Vec hyperparameters
num_features = 300     # dimensionality of the word vectors
min_word_count = 3     # words appearing fewer than 3 times are ignored
num_workers = 16       # number of worker threads
context = 8            # context window size
downsampling = 1e-3    # downsample setting for very frequent words

# Initialize and train the model
W2Vmodel = Word2Vec(sentences=voc, sg=1, hs=0, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                    sample=downsampling, negative=5, iter=6)
model_voc = set(W2Vmodel.wv.vocab.keys()) 
print(len(model_voc))
151488
# model save
W2Vmodel.save("Word2Vec2")
# model load
W2Vmodel = Word2Vec.load('Word2Vec2')

Challenge 2

Training the model sentence by sentence is more accurate than using the whole review, because sentence lengths are similar, so the features of each input are similar. That is why I did not remove '.' when removing noisy characters. This conclusion comes from comparison.

def sentence_vectors(model, sentence):
    #Collecting all words in the text
#     print(sentence)
    sent_vector = np.zeros(model.vector_size, dtype="float32")
    if sentence == [[]] or sentence == []  :
        return sent_vector
    words=np.concatenate(sentence)
#     words = sentence
    #Collecting words that are known to the model
    model_voc = set(model.wv.vocab.keys()) 
#     print(len(model_voc))

    # Use a counter variable for number of words in a text
    nwords = 0
    # Sum up all words vectors that are know to the model
    for word in words:
        if word in model_voc: 
            sent_vector += model[word]
            nwords += 1.

    # Now get the average
    if nwords > 0:
        sent_vector /= nwords
    return sent_vector
train['sentence_vectors'] = list(map(lambda sen_group:
                                      sentence_vectors(W2Vmodel, sen_group),
                                      train.tokenized_sentences))
test['sentence_vectors'] = list(map(lambda sen_group:
                                    sentence_vectors(W2Vmodel, sen_group), 
                                    test.tokenized_sentences))
/home/sxy/anaconda3/envs/ML/lib/python3.6/site-packages/ipykernel_launcher.py:19: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
def vectors_to_feats(df, ndim):
    index=[]
    for i in range(ndim):
        df[f'w2v_{i}'] = df['sentence_vectors'].apply(lambda x: x[i])
        index.append(f'w2v_{i}')
    return df[index]
X_train = vectors_to_feats(train, 300)
X_test = vectors_to_feats(test, 300)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
train.to_csv('train_w2v_100k.csv')
test.to_csv('test_w2v_100k.csv')
train = pd.read_csv('train_w2v_100k.csv').drop(columns = 'Unnamed: 0')
test = pd.read_csv('test_w2v_100k.csv').drop(columns = 'Unnamed: 0')
X_train = train.drop(columns = 'rate')
X_test = test.drop(columns = 'rate')
y_train = train.rate
y_test = test.rate
X_test

(Output truncated: X_test is a DataFrame of 20000 rows × 300 columns, the Word2Vec features w2v_0 … w2v_299 for each test review.)

Implementing different regression models

I implement four regression models and compare them using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).

RMSE: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line, and RMSE measures how spread out these residuals are: it tells you how concentrated the data is around the line of best fit.

MAE: Mean Absolute Error (MAE) is a measure of the errors between paired observations expressing the same phenomenon. It is the arithmetic average of the absolute errors |e_i| = |y_i - x_i|, where y_i is the prediction and x_i the true value.
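For reference, both metrics are easy to compute directly; here is a minimal NumPy sketch (the helper names are mine), which should agree with sklearn's mean_squared_error and mean_absolute_error used below:

def rmse_manual(y_true, y_pred):
    # square root of the mean of the squared residuals
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae_manual(y_true, y_pred):
    # mean of the absolute residuals
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))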

Linear regression model

Linear regression is a basic and commonly used type of predictive analysis. The parameters of the linear equation are calculated with the least squares method.

Linear regression introduction
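As a side note (not part of the original notebook), ordinary least squares has the closed-form solution beta = (X^T X)^(-1) X^T y. A tiny self-contained check on made-up toy data:

# Toy illustration of the normal equation; the data here is synthetic.
np.random.seed(0)
X_toy = np.column_stack([np.ones(100), np.random.normal(size=100)])  # intercept + one feature
y_toy = X_toy @ np.array([1.0, 2.0]) + np.random.normal(scale=0.1, size=100)
beta_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)
print(beta_hat)  # close to [1.0, 2.0], the same fit LinearRegression would find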

model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
lr_y_predict=model_lr.predict(X_test)
y_test = np.array(y_test)
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,lr_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, lr_y_predict)

print('linear_regression_rmse = ', rmse)
print('linear_regression_mae = ', mae)
linear_regression_rmse =  1.5985795417920345
linear_regression_mae =  1.2179818164120766
joblib.dump(model_lr, 'save/model_lr.pkl')

# model_lr = joblib.load('save/model_lr_1000k.pkl')
['save/model_lr.pkl']

SVR model

Support vector regression (SVR) is the application of the support vector machine (SVM) to regression problems.

Regression looks for the internal relationship within a set of data: regardless of how many categories the data contains, we obtain a formula that fits the data, so that when a new input is given, a predicted value can be computed. For SVR, the goal is to find a surface or function that fits all the data points, i.e. one for which the distance from every data point to the surface or function is as small as possible.

SVR introduction
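Concretely, in its basic linear form epsilon-SVR fits f(x) = w^T x + b by minimizing ||w||^2 / 2 + C * sum_i max(0, |y_i - f(x_i)| - epsilon): residuals smaller than epsilon are ignored (scikit-learn's default epsilon is 0.1, visible in the fitted model printed below), and C controls the trade-off between flatness and tolerance of larger errors.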

model_svm = SVR()
model_svm.fit(X_train, y_train)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svm_y_predict=model_svm.predict(X_test)
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,svm_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, svm_y_predict)

print('svm_rmse = ', rmse)
print('svm_mae = ', mae)
svm_rmse =  1.4967321667740556
svm_mae =  1.1245787830283758
joblib.dump(model_svm, 'save/model_svm.pkl')

['save/model_svm.pkl']

Bayesian Ridge model

In the Bayesian viewpoint, we formulate linear regression using probability distributions rather than point estimates. The response y is not estimated as a single value but is assumed to be drawn from a probability distribution.

The output y is generated from a normal (Gaussian) distribution characterized by a mean and a variance. The mean for linear regression is the transpose of the weight matrix multiplied by the predictor matrix. The variance is the square of the standard deviation σ (multiplied by the identity matrix, because this is a multi-dimensional formulation of the model).

Bayesian Ridge introduction
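In scikit-learn's BayesianRidge this takes the form y ~ N(Xw, alpha^-1) with a spherical Gaussian prior w ~ N(0, lambda^-1 * I); the precisions alpha and lambda themselves get Gamma hyperpriors, controlled by the alpha_1, alpha_2, lambda_1 and lambda_2 parameters shown in the fitted model printed below.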

model_bayes_ridge = BayesianRidge()
model_bayes_ridge.fit(X_train, y_train)
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
              compute_score=False, copy_X=True, fit_intercept=True,
              lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
              normalize=False, tol=0.001, verbose=False)
bayes_y_predict = model_bayes_ridge.predict(X_test)
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,bayes_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, bayes_y_predict)

print('BayesianRidge_rmse = ', rmse)
print('BayesianRidge_mae = ', mae)
BayesianRidge_rmse =  1.5980023290295695
BayesianRidge_mae =  1.2175385536747287
joblib.dump(model_bayes_ridge, 'save/model_bayes.pkl')

['save/model_bayes.pkl']

Random Forest Regression model

Random forest is a bagging technique and not a boosting technique. The trees in random forests are run in parallel. There is no interaction between these trees while building the trees.

The idea of Random Forest Regression is to apply bagging and ensembling to decision trees, as mentioned in the lecture.

Random Forest Regression introduction
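Put as a formula: with n_estimators = 20 trees T_1, ..., T_20, each fitted on a bootstrap sample of the training data, the forest predicts by averaging, y_hat(x) = (1/20) * sum_b T_b(x).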

model_random_forest_regressor = ensemble.RandomForestRegressor(n_estimators=20)
model_random_forest_regressor.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=20, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
random_forest_y_predict = model_random_forest_regressor.predict(X_test)
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,random_forest_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, random_forest_y_predict)

print('random_forest_rmse = ', rmse)
print('random_forest_mae = ', mae)
random_forest_rmse =  1.6054778376150676
random_forest_mae =  1.2233214206573564
joblib.dump(model_random_forest_regressor, 'save/model_random_forest.pkl')

['save/model_random_forest.pkl']

Prediction function for a single review using the four models

def predict(text):
    model_lr = joblib.load('save/model_lr.pkl')
    model_svm = joblib.load('save/model_svm.pkl')
    model_random_forest_regressor = joblib.load('save/model_random_forest.pkl')
    model_bayes_ridge = joblib.load('save/model_bayes.pkl')
    data = {'comment': Series(text)}
    data = pd.DataFrame(data)
    print(data)
    data['comment'] = data['comment'].apply(remove_between_square_brackets)
    data['comment'] = data['comment'].apply(remove_special_characters)
    data['comment'] = data['comment'].apply(simple_stemmer)
    data['comment'] = data['comment'].apply(remove_stopwords)

    data['comment'] = data.comment.str.lower()
    data['document_sentences'] = data.comment.str.split('.')
    data['tokenized_sentences'] = data['document_sentences']
    data['tokenized_sentences'] = list(
        map(lambda sentences: list(map(nltk.word_tokenize, sentences)), data.document_sentences))
    data['tokenized_sentences'] = list(
        map(lambda sentences: list(filter(lambda lst: lst, sentences)), data.tokenized_sentences))
    print(data)
    # sentence = data['tokenized_sentences'][0]
    W2Vmodel = Word2Vec.load("Word2Vec2")

    data['sentence_vectors'] = list(map(lambda sen_group:
                                        sentence_vectors(W2Vmodel, sen_group),
                                        data.tokenized_sentences))
    text = vectors_to_feats(data, 300)
    print(text)
    lr_y_predict = model_lr.predict(text)
    svm_y_predict = model_svm.predict(text)
    bayes_y_predict = model_bayes_ridge.predict(text)
    random_forest_y_predict = model_random_forest_regressor.predict(text)

    return lr_y_predict, svm_y_predict, random_forest_y_predict, bayes_y_predict

print(predict(["This is a great game.  I've even got a number of non game players enjoying it.  Fast to learn and always changing.",
        "This is a great game.  I've even got a number of non game players enjoying it.  Fast to learn and always changing."]))
                                             comment
0  This is a great game.  I've even got a number ...
1  This is a great game.  I've even got a number ...
                                             comment  \
0  thi great game ive even got number non game pl...   
1  thi great game ive even got number non game pl...   

                                  document_sentences  \
0  [thi great game ive even got number non game p...   
1  [thi great game ive even got number non game p...   

                                 tokenized_sentences  
0  [[thi, great, game, ive, even, got, number, no...  
1  [[thi, great, game, ive, even, got, number, no...  


/home/sxy/anaconda3/envs/ML/lib/python3.6/site-packages/ipykernel_launcher.py:19: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).


      w2v_0     w2v_1     w2v_2     w2v_3     w2v_4     w2v_5     w2v_6  \
0 -0.052897 -0.077122 -0.441616  0.210372  0.019172 -0.060663  0.048674   
1 -0.052897 -0.077122 -0.441616  0.210372  0.019172 -0.060663  0.048674   

      w2v_7     w2v_8     w2v_9  ...   w2v_290  w2v_291  w2v_292   w2v_293  \
0 -0.169603  0.132948 -0.137659  ... -0.135482   0.0026 -0.05121 -0.148072   
1 -0.169603  0.132948 -0.137659  ... -0.135482   0.0026 -0.05121 -0.148072   

    w2v_294  w2v_295   w2v_296   w2v_297   w2v_298  w2v_299  
0 -0.029361  0.08649 -0.070255 -0.040144  0.108867 -0.01677  
1 -0.029361  0.08649 -0.070255 -0.040144  0.108867 -0.01677  

[2 rows x 300 columns]
(array([8.09318704, 8.09318704]), array([8.09318704, 8.09318704]), array([8.41475, 8.41475]), array([8.06230953, 8.06230953]))