Definition of our movie corpus

How do we define the selection criteria?


The idea for this analysis came up in the very first phase of our work. We had to choose a parameter for ranking the movies, so that we could build a list of films that were both relevant to the audience and pertinent to the topic.
We considered several criteria for measuring how people reach a movie, either on the web or through more traditional channels (such as cinema and TV). Box office data, the views of YouTube trailers and the movie lists available on specialized platforms (such as IMDb) were among our options.
In this analysis we compared two datasets: box office figures and YouTube views. The first serves as an index of the traditional ways of enjoying movies, the second as a reference for what potentially happens on the web.


1. We used IMDb to search for box office data. There we found figures for only about half of our movies.

2. We continued the search on Themdb and Google. Much of the data simply did not exist.

3. We decided to work only on the top movies with available data. Box office data from IMDb was crawled with Kimono and stored in an Excel file.

4. We searched YouTube for every movie using the query “*movie title* trailer” and sorted the results by “most viewed”. For movies without a trailer, we used the first clip at the top of the results instead.

5. The first result for each movie was taken as our reference, and its number of views was crawled with Kimono.

6. A bar chart was drawn in Illustrator to let us compare the results. The movies are displayed along the X axis: above it are the box office figures, sorted from highest to lowest, and below it are the YouTube views.
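
The data preparation behind steps 3 to 6 can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the Kimono and Excel workflow we actually used: the names `MovieRow`, `trailer_query` and `build_chart_rows` are hypothetical, and the two inputs are assumed to be plain dictionaries mapping a movie title to its box office figure and to the view count of its top YouTube result, with `None` marking missing data.

```python
from typing import Dict, List, NamedTuple, Optional


class MovieRow(NamedTuple):
    """One column of the chart (hypothetical structure, for illustration)."""
    title: str
    box_office: float    # gross revenue, in whatever unit the source reports
    youtube_views: int   # views of the top result for the trailer query


def trailer_query(title: str) -> str:
    """Build the search string described in step 4."""
    return f"{title} trailer"


def build_chart_rows(
    box_office: Dict[str, Optional[float]],
    youtube_views: Dict[str, Optional[int]],
) -> List[MovieRow]:
    """Keep only movies with both figures, sorted by box office (highest first)."""
    rows = []
    for title, gross in box_office.items():
        views = youtube_views.get(title)
        # Movies missing either figure are dropped, as in step 3.
        if gross is None or views is None:
            continue
        rows.append(MovieRow(title, gross, views))
    # Step 6: the X axis follows the box office ranking.
    return sorted(rows, key=lambda row: row.box_office, reverse=True)
```

Each resulting row corresponds to one position on the X axis, with `box_office` drawn upwards and `youtube_views` drawn downwards.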

How to read it

This visualization is a bar chart. The top ten movies by box office revenue are highlighted in the top left part of the chart, while the top ten movies by YouTube views are scattered below the line. In the box office section, the taller the bar heading up, the greater the income.
The lower side of the graph works the same way, but mirrored: the taller the bar heading down, the greater the number of YouTube views.


We realized that most of the movies whose data seemed to be unavailable had a specific reason for it: they were old movies, TV-only productions, or small, foreign and indie productions that never reached the mainstream market. The visual output suggests that there is no real correlation between the two types of data.
This is due to many factors. For instance, the YouTube videos that appeared in the top positions did not have the same origin for all the movies. Many of them were official teasers, but a large share had been uploaded as unofficial footage or user-sourced material. This points to an inconsistency in the YouTube catalogue itself and shows that it is not entirely appropriate or useful for our aims.
The box office data has the flaws discussed above: it is incomplete by nature, because the movies come from very different backgrounds (mainstream, TV, home-video-only documentaries) and markets (indie or small foreign productions).
We can say that both databases provided data with intrinsic flaws, which would probably have prevented us from reaching really interesting results.
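
Our reading of the chart was purely visual. One way to put a number on the apparent lack of correlation would be a Spearman rank correlation between the two series, which compares the rankings rather than the raw figures. The sketch below is our own assumption of how this could be done, not part of the analysis described above; the function names are hypothetical, and both lists are assumed to contain one value per movie in the same order.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks (smallest value gets rank 1); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def spearman(box_office: Sequence[float], youtube_views: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(box_office) != len(youtube_views) or len(box_office) < 2:
        raise ValueError("need two series of equal length with at least two movies")
    return pearson(average_ranks(box_office), average_ranks(youtube_views))
```

A value close to 0 would support the visual impression, while values close to 1 or -1 would indicate that the two rankings largely agree or largely oppose each other.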