Ecommerce Review Dataset Analysis
Vo Thi Quynh Yen
Natural language processing is an important field in machine learning that is used in many commercial and scientific applications. From conversational text, a computer can extract useful information such as the sentiment the user was feeling or the topic of the conversation, and it can detect spam. In this project, we analyse reviews posted by consumers of women’s clothing and explore the relationship between the review texts and ratings and the popularity of the related item.
We will use the Women’s E-Commerce Clothing Review dataset on Kaggle to carry out the analysis. The dataset can be downloaded from this URL:
The dataset contains customer reviews for clothing items. It has a total of 23486 rows and 10 columns representing 10 features:
Clothing ID: the unique ID of the item which is an Integer Categorical variable
Age: the age of the reviewer which is a Positive Integer variable
Title: the title of the review (String variable)
Review Text: review body (String variable).
Rating: the product score given by the customer from 1 to 5 (Ordinal Integer variable)
Recommended IND: 1 is recommended, 0 is not recommended (Binary variable)
Positive Feedback Count: number of other customers who found this review positive (Positive Integer)
Division Name: product high level division (Categorical)
Department Name: product department name (Categorical)
Class Name: product class name (Categorical)
Data Preprocessing in Excel
First we need to extract further information from the given data. We want to identify trending products from current reviews. Generally, when a product is selling well and customers love it, they post positive reviews and give high ratings. A trending product is therefore one that has many reviews, most of which are positive and carry high ratings.
We will create a number of columns to measure how trendy a product is.
Sum_reviews: the total number of reviews for a given product
Average rating: the average rating for a given product
Stdev of rating: the standard deviation of the ratings for a given product around its average rating
Number of recommendations: the number of recommendations for a given product
Recommendation percentage: the percentage of people who recommended the product
Percentage of reviews: the percentage of reviews for a given product out of all reviews
Rating strength: how strong is the rating for a product
Hot item: determine if an item is hot or not depending on the rating strength. There is a cutoff threshold that is used to determine if a product is hot or not. The threshold used was 0.00036.
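The derived columns above were built in Excel, but the same per-product aggregation can be sketched in Python. This is only an illustration under stated assumptions: the text does not give the formula for rating strength, so the product used below (share of reviews × normalised average rating × recommendation share) is a hypothetical stand-in, and the population standard deviation is chosen so that products with a single review do not raise an error. The function name `product_features` and the dictionary keys are also made up for this sketch; only the 0.00036 cutoff comes from the text.

```python
from collections import defaultdict
from statistics import mean, pstdev

HOT_THRESHOLD = 0.00036  # cutoff quoted in the text


def product_features(reviews):
    """Aggregate per-product features from (clothing_id, rating, recommended_ind) tuples."""
    by_product = defaultdict(list)
    for clothing_id, rating, recommended in reviews:
        by_product[clothing_id].append((rating, recommended))

    total_reviews = len(reviews)
    features = {}
    for clothing_id, rows in by_product.items():
        ratings = [rating for rating, _ in rows]
        sum_reviews = len(rows)
        avg_rating = mean(ratings)
        n_recommended = sum(rec for _, rec in rows)
        rec_pct = n_recommended / sum_reviews
        review_pct = sum_reviews / total_reviews
        # ASSUMPTION: the source does not define "rating strength".
        # This formula is a hypothetical placeholder, not the author's.
        rating_strength = review_pct * (avg_rating / 5) * rec_pct
        features[clothing_id] = {
            "sum_reviews": sum_reviews,
            "avg_rating": avg_rating,
            # ASSUMPTION: population stdev (works for single-review products)
            "stdev_rating": pstdev(ratings),
            "n_recommended": n_recommended,
            "rec_pct": rec_pct,
            "review_pct": review_pct,
            "rating_strength": rating_strength,
            "hot_item": rating_strength > HOT_THRESHOLD,
        }
    return features


# Toy data: (Clothing ID, Rating, Recommended IND)
toy_reviews = [(1, 5, 1), (1, 4, 1), (1, 3, 0), (2, 2, 0)]
toy_features = product_features(toy_reviews)
print(toy_features[1]["hot_item"], toy_features[2]["hot_item"])
```

With a different definition of rating strength, only the single marked line needs to change; the rest of the aggregation follows the column descriptions directly.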
After preparing the data and extracting the information we want, we proceed to predictive modelling using RapidMiner.
Predictive Modelling in RapidMiner
We will be doing language processing, which will extract information from a review and then predict if an item is a hot item. We will split the data into training data and test data, so that after being trained, the model could work on new data and make correct predictions.
The workflow consists of the following operators:
- Retrieve: First we import the processed Excel file into RapidMiner and then retrieve it.
- Sample: The Excel data contains 19653 rows, which is too heavy for a laptop computer with limited memory and would make the process run very slowly. We therefore take a smaller sample of 2500 rows for building and testing the model.
- Set role: We set the feature that we are interested in predicting, which is the Hot Item feature.
- Select Attributes: We choose the attributes that are most important for training the model and discard the rest. In particular, we do not use the intermediate columns created to calculate the Hot Item feature. The attributes used for training are: Age, Class Name, Rating, Recommended IND, Title, Review Text, Division Name, Department Name, and Hot Item.
- Nominal to Text: We convert the Review Text to text data type.
- Process Documents: This nested operator is used for text classification. It has the following operators within it:
These are the steps needed to process a typical text. First the text is broken into tokens (words) wherever one of a provided list of characters occurs. The tokens are stemmed to reduce them to their root form, transformed to lowercase, and then run through a filter that removes tokens that do not satisfy a condition. Finally, n-grams are generated from the tokens that pass the filter. An n-gram is a sequence of n consecutive tokens, obtained by sliding a window of user-specified size across the text. The resulting terms are added to a bag of words, which is used to analyse the relationship between the words used and the final label.
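The text-processing steps described here can be sketched in plain Python. This is a rough illustration, not the RapidMiner operators themselves: the splitting rule (any non-letter character), the suffix-stripping `crude_stem` function (a stand-in for a real stemmer such as Porter), the minimum token length of 3, and the maximum n-gram size of 2 are all assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text, split_pattern=r"[^A-Za-z]+"):
    """Break the text into tokens wherever a non-letter character occurs."""
    return [tok for tok in re.split(split_pattern, text) if tok]


def crude_stem(token):
    """Strip a few common suffixes. ASSUMPTION: toy stand-in for a real stemmer."""
    lowered = token.lower()
    if lowered.endswith("ss"):
        return token
    for suffix in ("ing", "ed", "s"):
        if lowered.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process_document(text, min_len=3, max_n=2):
    """Tokenize, stem, lowercase, filter by length, then add n-grams."""
    tokens = tokenize(text)
    tokens = [crude_stem(tok) for tok in tokens]
    tokens = [tok.lower() for tok in tokens]
    # ASSUMPTION: the filter condition is a minimum token length
    tokens = [tok for tok in tokens if len(tok) >= min_len]
    terms = list(tokens)
    # Slide a window of size n over the filtered tokens
    for n in range(2, max_n + 1):
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(terms)  # bag of words: term -> count


bag = process_document("Loved it! Fits great, looks great.")
print(bag["great"], bag["fit_great"])  # prints: 2 1
```

Each review produces one such bag of words; stacking the counts for all reviews gives the word-count table that a classifier can learn from.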
- Store: We store the generated word list for later analysis
- Validation: After generating the word list, it is time to prepare the data for training. The Validation operator is a nested operator that contains the steps needed to train our model. First we split the data into training and test data using a 70:30 ratio: 70% is training data and 30% is test data, which is used afterwards to test the performance of our trained model.
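As a small illustration of the sampling and 70:30 split described in the operator list, the sketch below draws 2500 row indices out of 19653 and divides them into training and test sets. It assumes a simple shuffled split with a fixed seed; the text does not say which sampling type the RapidMiner operators were configured with, and the function name `sample_and_split` is hypothetical.

```python
import random


def sample_and_split(n_rows, sample_size=2500, train_ratio=0.7, seed=42):
    """Draw a random sample of row indices, then split it into train and test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = rng.sample(range(n_rows), sample_size)  # sampling without replacement
    cut = round(sample_size * train_ratio)
    return sampled[:cut], sampled[cut:]


train_idx, test_idx = sample_and_split(19653)
print(len(train_idx), len(test_idx))  # prints: 1750 750
```

With these numbers the model is trained on 1750 reviews and evaluated on the remaining 750.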
In this example, we use Gradient Boosted Trees as our model. The model is fitted on the training data, and the output is stored in a file for easy reference. We then use the Apply Model operator to run the trained model on the test data (the 30% set aside by the split).
Finally, to gauge the performance of the model, the Performance operator generates a contingency table (confusion matrix) with the counts of true positives, false positives, true negatives, and false negatives, that is, the number of times the model correctly or incorrectly predicted each label. This table is used to calculate the accuracy of the model.
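To make the link between the contingency table and accuracy concrete, here is a minimal sketch that counts the four outcomes for a binary label and computes accuracy as (TP + TN) / (TP + FP + TN + FN). The labels below are invented toy values, not results from this project, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary label."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Toy labels: 1 = hot item, 0 = not a hot item
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
table = confusion_counts(y_true, y_pred)
print(table, accuracy(table))
```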
We train four machine learning algorithms on our data and compare the results to find the best one: Gradient Boosted Trees, k-NN, Deep Learning, and Naive Bayes.
Gradient Boosted Trees Result
For the Gradient Boosted Trees algorithm, we obtain an accuracy of 71.07%.
k-NN Result
With a simple algorithm like k-NN, we get an accuracy of 64%.
Deep Learning Result
Even though deep learning is a complex model that can tackle many complicated machine learning problems, our first run only obtains an accuracy of 59.33%. Perhaps with more analysis and parameter tuning, we will be able to increase the accuracy.
Naive Bayes Result
The Naive Bayes model yields an accuracy of only 52.8%.
From the above results, we conclude that Gradient Boosted Trees give the best accuracy of the four models. With further parameter tuning, we may be able to improve it even more.