First, I had to collect a sizeable dataset of critical texts, along with all the relevant metadata. The easiest solution I found was to use metacritic.com, which already collects references to critical articles and normalizes their scores to a scale from 1 to 100.
To collect the data, I wrote a web scraper in Python. The only problem I ran into was that the actual articles were not stored on Metacritic, but were instead accessible through a link to each specialized journal's website.
I did not have the time, nor really the desire, to write a specific procedure for each website, so I opted for a more generalist approach: simply capturing all the text from the page and then trimming what was obviously not part of the article, such as headers, footers and sidebars.
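The generalist idea can be sketched with the standard library alone: walk the HTML and keep all visible text, except what sits inside tags that are obviously boilerplate. The tag list below is an illustrative assumption, not the exact rules used in the project.

```python
from html.parser import HTMLParser

# Tags assumed to contain boilerplate rather than article text.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class ArticleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped region.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In practice the output of such a blunt filter still contains stray navigation text, which is why the hand-cleaning step below was needed.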
The raw text was then written to XML files in order to store the metadata alongside the text. Of course, data extracted with such a method needed to be cleaned by hand afterwards, which limited the number of articles I was able to include in my analysis.
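Storing text plus metadata in XML is straightforward with `xml.etree.ElementTree`. The element and attribute names below (`review`, `journal`, `game`, `score`, `year`) are my guesses at a reasonable schema, not the one actually used in the project.

```python
import xml.etree.ElementTree as ET

def review_to_xml(text, journal, game, score, year):
    """Serialize one review and its metadata as an XML string."""
    review = ET.Element("review", journal=journal, game=game,
                        score=str(score), year=str(year))
    body = ET.SubElement(review, "text")
    body.text = text
    return ET.tostring(review, encoding="unicode")
```

Keeping the metadata as attributes on the root element makes it trivial to filter reviews later (by journal, year, or score) without touching the text node.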
My final dataset was composed of 382 articles in English, coming from 98 different journals and covering 36 games published between 2000 and 2020. The games were selected according to the number of available reviews and their diversity.
I needed to retrieve reviews from many more "bad" games than "good" games in order to achieve a balanced dataset, for the simple reason that good games get much more coverage.
The first thing I had to do was to recode the raw score (ranging from 1 to 100) into defined categories. This separation into categories is necessary to get meaningful predictions further down the pipeline.
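The recoding step amounts to a simple binning function. The exact cut-offs are not stated above, so the thresholds below are an assumption borrowed from Metacritic's own convention (0-49 negative, 50-74 mixed, 75-100 positive):

```python
def score_category(score: int) -> str:
    """Map a raw 1-100 score to one of three categories.

    Thresholds are assumed, following Metacritic's colour coding.
    """
    if score < 50:
        return "bad"
    if score < 75:
        return "mixed"
    return "good"
```

Three categories of roughly equal size also give the clean 33.3% random baseline used to judge the models later on.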
The text is then passed through TreeTagger, a part-of-speech tagger accessible through a Textable widget, which allows us to isolate all the adjectives in the corpus. From that output, I hand-picked a second list: adjectives that can express the quality of a game, such as "good" or "bad", to be used as predictors.
The intersection of these two lists gives us the list of predictors actually present in the texts.
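In plain Python the intersection is a one-liner over two sets; both word lists here are toy examples, not the actual corpus vocabulary.

```python
# Adjectives found in the corpus by the tagger (toy example).
corpus_adjectives = {"good", "long", "tedious", "bad", "colourful"}

# Hand-made list of quality-expressing adjectives (toy example).
quality_adjectives = {"good", "bad", "great", "terrible", "tedious"}

# Predictors are the quality adjectives that actually occur in the corpus.
predictors = corpus_adjectives & quality_adjectives
```

Restricting the predictors to words that occur in the corpus keeps the feature table free of empty columns.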
An important step for the prediction process is to use Textable's "Categorize" widget to annotate each text with its score category, which becomes the target variable we want to predict.
The same process must be applied to any test data (reviews without a score), minus the annotation step. The training and test data MUST have exactly the same structure.
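Since a structure mismatch silently breaks the prediction step, a small sanity check is worth the few lines. The column names here are illustrative, not the project's actual table layout:

```python
def check_structure(train_columns, test_columns, target="category"):
    """Verify that test data has the training columns minus the target."""
    expected = [c for c in train_columns if c != target]
    if list(test_columns) != expected:
        raise ValueError(
            f"mismatched columns: {list(test_columns)} != {expected}"
        )

# Hypothetical feature tables: one column per predictor adjective,
# plus the target column on the training side.
train_columns = ["good", "bad", "tedious", "category"]
test_columns = ["good", "bad", "tedious"]
check_structure(train_columns, test_columns)
```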
Predictions and Models
After preprocessing the textual data, I wrote the resulting tables to files and imported them into a second workflow dedicated to the prediction task, as doing everything in a single workflow caused major performance issues when processing the whole dataset.
One thing I did not know before attempting this project was how to use Orange's prediction widgets with textual data.
Using the "Test and Score" widget, we can assess how accurately each model determines a review's score. The "Predictions" widget then lets us make actual predictions on unscored reviews.
Our best model, SVM, displays a Classification Accuracy (CA) of 63.6%.
Given that we have roughly the same number of reviews in each score category, the trivial baseline (picking a category at random) would give an accuracy of 33.3%.
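To make the accuracy comparison concrete without reproducing the Orange workflow, here is a toy stand-in for the evaluation: a naive classifier that lets the predictor adjectives in a review vote for a category, scored the same way "Test and Score" scores the real SVM. The word-to-category mapping and the reviews are made up for the example; this is not the project's actual model.

```python
from collections import Counter

# Hypothetical mapping from predictor adjectives to score categories.
PREDICTORS = {
    "good": "good", "great": "good",
    "uneven": "mixed", "average": "mixed",
    "bad": "bad", "tedious": "bad",
}

def predict(tokens):
    """Predict a category from the adjectives present in a review."""
    votes = Counter(PREDICTORS[t] for t in tokens if t in PREDICTORS)
    # Fall back to "mixed" when no predictor adjective appears.
    return votes.most_common(1)[0][0] if votes else "mixed"

def accuracy(labelled_reviews):
    """Fraction of reviews whose predicted category matches the label."""
    hits = sum(predict(tokens) == label for tokens, label in labelled_reviews)
    return hits / len(labelled_reviews)
```

Comparing such an accuracy figure against the 33.3% random baseline is exactly the reasoning applied to the SVM's 63.6% below.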
We can conclude that the models are about 30 percentage points better than random at this task.
My test dataset contained some “real” unscored reviews, where the author decided not to grade the game, and a few scored reviews for which I stripped the score, just to see what would happen.
For the latter, the predictions seem to hold up, but for the genuinely unscored reviews, the model most often predicts "mixed".
I would venture the hypothesis that these reviews were left unscored precisely because the author was undecided, which would show in the use of more typically "mixed" adjectives. Note that this hypothesis rests on very little data, and further exploration of the matter could be interesting.
This little experiment seems to have been relatively successful. A classification accuracy above 60% is quite satisfying for an analysis of natural language, where we could expect lower rates, and we can reasonably predict the rating a review's author would have given. There is probably still room for improvement, for instance by refining the list of predictors.
Is it useful? I don't know, but it sure was fun to try.