Related topics

How are these topics linked to each other?


With this visualization we wanted to explore the connections between specific words, to help us understand the keywords in context. We imagined it as something that could help us define the main topics of our area of interest and show us how these topics interact with each other.
These keywords are written by IMDb users, so in a way they are a hybrid between the perception of a person watching a movie and what the director wants to communicate.
The aim was to see how the clusters of tagged keywords interact with each other, and whether there were peculiar behaviors or exceptions and, if so, which ones.


1. All IMDb keywords were collected and tagged (see previous chapter).
2. A contingency table was created with R, starting from an Excel table with two columns: “Movie title” and “Keywords”. For each pair of keywords, this table records the number of movies in which both keywords appear.
3. Once this matrix was complete, it was converted back into a list using Python. This list contains 831,825 lines, one pair per line. The pairs that never appear together (value = 0) were deleted.
4. The occurrences of the pairs formed by words related to the following tags were imported into Gephi: job, arts, immigration, family, criminality, cultural differences, sex, violence and security forces.
5. In the “nodes” section of the data laboratory we imported a spreadsheet including the name of each keyword, its weight (given by its occurrences in all movies) and its degree (the number of links with other keywords). A further column contained the tags related to each keyword.
6. In the “edges” section of the data laboratory we imported the spreadsheet containing the links between keywords. Each pair of keywords was entered as a “source” and a “target” (the two nodes involved in the relationship), together with other parameters. We set the “Type” parameter to “undirected” and we also made use of the “Weight” parameter, which allowed us to manage the visual representation in a more meaningful way: the greater the edge weight, the closer the two keywords are drawn to each other.
7. We used the “Force Atlas 2” algorithm as the layout.
8. We adjusted the layout settings to limit overlapping in the areas of the graph that are made up of very small circles.
9. The size of the nodes is given by the “Degree”.
10. The colors of the nodes are given by the tag we assigned by theme.
11. We exported our visualization with the Sigma.js Gephi plugin, which writes an animated webpage with JavaScript code, for an interactive and more complete visualization of the graph.
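
Steps 2, 3 and 6 can be sketched in a few lines of Python. This is only an illustration, not the code we used: the original contingency table was built in R, and the movie titles, keywords and the function name cooccurrence_edges below are hypothetical. The sketch counts, for every unordered pair of keywords, the number of movies that contain both, and then arranges the result in the Source / Target / Type / Weight columns of a Gephi edge table.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample standing in for the "Movie title" / "Keywords" table.
movies = {
    "Movie A": ["immigration", "family", "police"],
    "Movie B": ["immigration", "police"],
    "Movie C": ["family", "painting"],
}

def cooccurrence_edges(movies):
    """Count, for each unordered keyword pair, the movies containing both."""
    counts = Counter()
    for keywords in movies.values():
        # sorted(set(...)) drops duplicates and gives each pair one fixed order,
        # so ("family", "police") and ("police", "family") are the same pair.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

edges = cooccurrence_edges(movies)

# One row per pair, laid out like a Gephi edge table:
# Source, Target, Type, Weight.
edge_rows = [(a, b, "Undirected", w) for (a, b), w in sorted(edges.items())]

for row in edge_rows:
    print(row)
```

Because only pairs that actually occur in some movie are counted, pairs with a value of 0 never enter the list, which has the same effect as deleting them afterwards.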

How to read it

The nodes represent the single keywords and the edges are the relationships between them. The size of each circle represents the “Degree” of the keyword: the bigger a circle is, the more connections it has with other keywords.
The biggest nodes, representing the most common keywords, are in the center of the visualization. They are in the center of the graph because they have many connections, which prevent them from being pushed to the sides.
We can also see that there are two macro-clusters, one mostly red and the other mostly blue. The red one contains the themes that we defined as violent, to which we therefore gave similar red colors. The blue one has more to do with more positive themes, like the arts, family and jobs. The user can move around the interactive graph, seeing the most common keywords at first glance and then focusing on particular relationships and interesting clusters. To do so, we suggest opening the visualization in full screen to make better use of the included tools.
Clicking on a keyword shows its whole network, its weight (on the panel appearing on the right) and the category it belongs to.
It is also possible to display a single group of keywords based on the topics, using the grouping tool on the left panel.
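
As a small illustration of what the “Degree” behind the circle sizes measures, the sketch below counts the number of links of each keyword in an edge list. The edge list and the function name node_degrees are hypothetical; in our workflow the degree was a column of the nodes spreadsheet imported into Gephi.

```python
from collections import Counter

# Hypothetical edge list: (source, target, number of movies sharing both keywords).
edge_rows = [
    ("family", "immigration", 1),
    ("family", "painting", 1),
    ("family", "police", 1),
    ("immigration", "police", 2),
]

def node_degrees(edge_rows):
    """Degree = number of distinct links a keyword has, ignoring edge weight."""
    degree = Counter()
    for source, target, _weight in edge_rows:
        degree[source] += 1
        degree[target] += 1
    return degree

degree = node_degrees(edge_rows)

# Keywords with the most links come first; these would be the biggest circles.
for keyword, links in degree.most_common():
    print(keyword, links)
```

Note that the degree ignores how many movies share a pair: “immigration” and “police” co-occur twice in this sample, but the link still counts once for each of them.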

Click HERE to see the visualization full screen on a new tab


The themes in the two macro-clusters (the red ones and the blue ones) rarely mix their keywords. There are, however, some exceptions that can be studied by navigating the graph. Themes with a neutral or fairly neutral connotation (like Cultural differences or Security forces) appear more spread out across the whole visualization.
The graph also shows that some movies tend to appear as clusters of their own, because their keywords are unique or shared with few other films. A clear example is the reddest cluster on the left, which represents the movie “Machete”.
More popular movies tend to have more detailed descriptions, partly because of the high precision with which IMDb users write them, and observing the keywords in detail confirms this.
Furthermore, keywords related to violence, criminality or the sexual sphere appear to be more detailed. For instance, where “shot” would be a sufficient descriptor for a movie, the keywords that actually appear are far more precise.
All possible variations are listed: “shot in the chest”, “shot in the back”, “shot in the head”, “shot in the forehead”, “shot in the arm”, “shot in the shoulder”, “shot in the leg”, “shot in the throat”, “shot in the eye”, “child shot in the head”, “shot through the mouth”, “shot through the eye”, “shot in the knee” and many others.
So movies have a higher number of keywords not only because they are more popular but also because, when it comes to these topics, they are obsessively described by users who, once they have seen the movie, spend their time writing extremely precise keywords on IMDb.
On the other hand, it is also true that IMDb is used as a tool by professionals, like actors and directors, who want to be able to search for specific actions or scenes to help with their professional activity.