This page presents datasets, examples, and extensions to the analysis described in the following article, presented at the ISMIR 2008 conference in Philadelphia (U.S.A.):
My work is based on the observation of the co-occurrences of 4,000 musical artists in a dataset of playlists compiled by the members of MusicStrands.
The dataset is derived from 1,030,068 user-compiled playlists, collected from the MusicStrands Web site on July 31st, 2007. It indicates, for 4,000 popular artists, the number of times they co-occur in a playlist (separated by fewer than 3 songs) and the tags applied by MusicStrands users to each artist.
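As a rough illustration of how such co-occurrence counts could be derived from raw playlists, here is a minimal R sketch. The toy playlists, the function name count_cooccurrences, and the parameter max_dist (the largest allowed difference in playlist position) are all hypothetical; this is one possible reading of "separated by fewer than 3 songs", not the procedure used to build the dataset.

```r
# Toy playlists (hypothetical data): each vector is the ordered list of
# artists appearing in one playlist.
playlists <- list(
  c("A", "B", "C", "D", "E"),
  c("B", "A", "E"),
  c("C", "D")
)

# Count, for every unordered pair of distinct artists, how many times they
# appear at most `max_dist` positions apart in the same playlist.
# `max_dist` is an assumed parameter and must be at least 1.
count_cooccurrences <- function(playlists, max_dist = 2) {
  artists <- sort(unique(unlist(playlists)))
  counts <- matrix(0, nrow = length(artists), ncol = length(artists),
                   dimnames = list(artists, artists))
  for (p in playlists) {
    n <- length(p)
    if (n < 2) next
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):min(n, i + max_dist)) {
        a <- p[i]
        b <- p[j]
        if (a != b) {
          # Keep the matrix symmetric: co-occurrence has no direction here.
          counts[a, b] <- counts[a, b] + 1
          counts[b, a] <- counts[b, a] + 1
        }
      }
    }
  }
  counts
}

cooc <- count_cooccurrences(playlists, max_dist = 2)
print(cooc)
```

Changing max_dist tightens or loosens what counts as occurring "closely".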
My assumption is that when two artists or genres co-occur closely in many playlists, they have some form of shared cultural affinity. I use both existence (occurring together in a playlist) and distance (occurring within one or two songs of each other) to produce a single “affinity” metric between artists and genres.
For example, since a large number of songs by Madonna co-occur often and closely with songs whose artists are classified in the Latin genre, I argue that Madonna has a certain affinity with Latin, even though the only genre label attached to Madonna in the dataset is Rock/Pop.
This graph shows the genre-affinity degree of Madonna to five different genres.
Note that her affinity to R&B is higher than her affinity to Rock/Pop (her original genre in the dataset), revealing that people tend to associate Madonna in playlists more with R&B than with Rock/Pop artists.
The affinity degrees are normalised, so the fact that there are more Rock/Pop than R&B artists in the dataset does not overemphasize associations in any one genre.
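To make the role of normalisation concrete, here is a minimal R sketch that turns co-occurrence counts into artist-to-genre scores by averaging over the artists of each genre, so that a genre with many artists does not dominate simply because of its size. The data, the variable names, and the averaging rule itself are assumptions made for illustration; the exact formula is the one given in the article, which this sketch does not claim to reproduce.

```r
# Hypothetical symmetric co-occurrence counts between five artists.
artists <- c("a1", "a2", "a3", "a4", "a5")
cooc <- matrix(c(0, 4, 2, 1, 0,
                 4, 0, 3, 0, 1,
                 2, 3, 0, 0, 0,
                 1, 0, 0, 0, 6,
                 0, 1, 0, 6, 0),
               nrow = 5, byrow = TRUE, dimnames = list(artists, artists))

# Hypothetical single genre label per artist (three Rock/Pop, two R&B).
genre <- c(a1 = "Rock/Pop", a2 = "Rock/Pop", a3 = "Rock/Pop",
           a4 = "R&B", a5 = "R&B")
genres <- sort(unique(genre))

# Assumed normalisation: mean co-occurrence of an artist with the *other*
# artists labelled g. A genre with a single artist would need special
# handling, since that artist would have no other members to average over.
affinity <- sapply(genres, function(g) {
  is_member <- genre[artists] == g
  total <- rowSums(cooc[, is_member, drop = FALSE])
  others <- sum(is_member) - is_member  # do not count the artist itself
  total / others
})

print(round(affinity, 2))
```

In this toy example a4 scores higher on R&B than on Rock/Pop even though Rock/Pop has more artists, because each score is an average rather than a raw sum.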
The genre-centrality of Madonna to a genre g is the percentage of artists whose genre-affinity degree to g is lower than or equal to that of Madonna.
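The definition above is a percentile and can be sketched in a few lines of R. The affinity values and the function name genre_centrality are hypothetical; the sketch counts the artist itself among the artists compared, which is one reading of the definition.

```r
# Hypothetical affinity degrees of six artists to a single genre g.
affinity_g <- c(a1 = 0.10, a2 = 0.40, a3 = 0.40,
                a4 = 0.75, a5 = 0.90, a6 = 0.20)

# Percentage of artists whose affinity to g is lower than or equal to
# the affinity of `artist` (the artist itself is included in the count).
genre_centrality <- function(affinity_g, artist) {
  100 * mean(affinity_g <= affinity_g[[artist]])
}

print(genre_centrality(affinity_g, "a4"))  # five of the six artists qualify
print(genre_centrality(affinity_g, "a5"))  # the highest affinity gives 100
```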
This graph shows the genre-centrality of Madonna to eight different genres. Note that her genre-centrality to Latin is almost 86%, showing that Madonna tends to occur very often with Latin artists; while her genre-centrality to Jazz is only 30%, showing a small association with Jazz artists in playlists.
Artists with the highest genre-centrality values are called core artists, and are good ‘social’ representatives of a genre. For instance, the core artists for Latin are: Ricardo Arjona, Diego Torres, Marc Anthony, Chayanne, and Luis Miguel.
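Selecting core artists then amounts to ranking artists by genre-centrality and keeping the top few. The R sketch below uses a hypothetical affinity matrix; the helper core_artists and the cut-off k are assumptions, since the text does not say how many artists are kept.

```r
# Hypothetical affinity degrees of five artists to two genres.
affinity <- matrix(c(0.9, 0.1,
                     0.2, 0.8,
                     0.7, 0.3,
                     0.1, 0.6,
                     0.5, 0.2),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(paste0("a", 1:5), c("Latin", "Jazz")))

# Genre-centrality for every artist and genre: rank() with ties.method = "max"
# gives the number of artists with an affinity lower than or equal to each one.
centrality <- apply(affinity, 2, function(col) {
  100 * rank(col, ties.method = "max") / length(col)
})

# The k artists with the highest centrality to genre g (k is an assumed cut-off).
core_artists <- function(centrality, g, k = 2) {
  names(sort(centrality[, g], decreasing = TRUE))[seq_len(k)]
}

print(centrality)
print(core_artists(centrality, "Latin"))
print(core_artists(centrality, "Jazz"))
```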
This graph compares the genre-centrality of Madonna and of The Jacksons to four different genres.
Although the two artists have different genre labels in the dataset (Rock/Pop and R&B), they present similar affinities to the same genres:
both are strongly associated with R&B and Rap artists, but not as much with Rock/Pop or Country artists.
Determining a “Ground Truth” for artist or genre similarity is a contentious topic in conventional comparative musicology. The genre affinity method follows a strictly social approach towards understanding relationships between music. It distinguishes artists and genres using the aggregation of multiple user preferences, in the form of human-compiled playlists, rather than an acoustic or symbolic heuristic pertaining to the musical content.
I state that two genres g and h are correlated when artists with a high genre affinity degree to g tend to have a high genre affinity degree to h; and two genres are independent when their affinity degrees do not show any sign of correlation.
This graph shows, for each artist in the dataset, its affinity degree to Rap (x axis), to R&B (y axis), and its original genre label (point type). Note that artists with high affinity to Rap (in the rightmost section) tend to have a high affinity to R&B as well (in the uppermost section), suggesting that the two genres are correlated: artists associated (in playlists) with R&B artists are often associated with Rap artists as well.
To quantify the correlation between any two genres g and h, I employ Pearson's coefficient, which assesses the linear correlation between the artists' affinity degrees to g and their affinity degrees to h.
In the case of Rap and R&B, the coefficient equals 0.6, supporting the hypothesis that the two genres are ‘socially related’. For Rap and Country, on the other hand, the coefficient is close to zero (0.1), supporting the hypothesis that the two genres are independent, as suggested by this graph, showing for each artist in the dataset its affinity degree to Rap (x axis), to Country (y axis), and its original genre label.
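In R, Pearson's coefficient is available through the built-in cor() function. The sketch below applies it to hypothetical affinity degrees (the numbers are invented and do not reproduce the 0.6 and 0.1 reported above) and checks it against the textbook formula.

```r
# Hypothetical affinity degrees of six artists to three genres.
affinity <- cbind(
  Rap     = c(0.9, 0.8, 0.1, 0.2, 0.6, 0.3),
  RnB     = c(0.8, 0.9, 0.2, 0.1, 0.5, 0.4),
  Country = c(0.3, 0.5, 0.4, 0.6, 0.2, 0.5)
)

# Pearson's coefficient between two genres, using the built-in function.
r_builtin <- cor(affinity[, "Rap"], affinity[, "RnB"], method = "pearson")

# The same coefficient written out: covariance of the two affinity columns
# divided by the product of their standard deviations.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}
r_manual <- pearson(affinity[, "Rap"], affinity[, "RnB"])

# Passing the whole matrix yields a genre-by-genre correlation table.
genre_cor <- cor(affinity)
print(round(genre_cor, 2))
```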
This analysis exposes strong cultural relations between genres that are often not present in acoustic or symbolic analysis.
Artists are represented as vectors of genre-affinity degrees, which allows any pair of artists to be compared by calculating the distance between their genre-affinity vectors.
For example, the vectors for Madonna and The Jacksons are close, which highlights that people “use” these artists in a similar way, associating them with artists from the same genres in their playlists.
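A minimal R sketch of this comparison, using the built-in dist() function with Euclidean distance. The three artists and their affinity values are hypothetical.

```r
# Hypothetical genre-affinity vectors: one row per artist, one column per genre.
vectors <- rbind(
  artist_a = c(RnB = 0.8, Rap = 0.7, RockPop = 0.3, Country = 0.1),
  artist_b = c(0.7, 0.8, 0.2, 0.1),
  artist_c = c(0.1, 0.1, 0.6, 0.9)
)

# Pairwise Euclidean distances between the rows, as a full square matrix.
d <- as.matrix(dist(vectors, method = "euclidean"))
print(round(d, 3))

# artist_a and artist_b are associated with the same genres, so their vectors
# lie closer together than those of artist_a and artist_c.
print(d["artist_a", "artist_b"] < d["artist_a", "artist_c"])
```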
The graph shows a 2D visualisation of all the artists in the dataset, based on a dimensionality reduction of their genre-affinity vectors. This technique algorithmically positions the artists in two dimensions, preserving the Euclidean distances between them with a minimum of error.
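The text does not name the dimensionality-reduction technique. One standard method that places points in two dimensions while approximately preserving Euclidean distances is classical multidimensional scaling, available in R as cmdscale(). The sketch below uses it purely as an assumption, on hypothetical data.

```r
# Hypothetical genre-affinity vectors for five artists over four genres.
vectors <- matrix(c(0.8, 0.7, 0.3, 0.1,
                    0.7, 0.8, 0.2, 0.1,
                    0.1, 0.1, 0.6, 0.9,
                    0.2, 0.1, 0.9, 0.5,
                    0.5, 0.4, 0.5, 0.3),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(paste0("artist_", 1:5),
                                  c("RnB", "Rap", "RockPop", "Country")))

# Classical multidimensional scaling (an assumed choice of technique):
# find 2D coordinates whose pairwise distances approximate the original ones.
original <- dist(vectors)
coords <- cmdscale(original, k = 2)

# How much the 2D distances deviate from the original distances.
max_error <- max(abs(as.vector(dist(coords)) - as.vector(original)))
print(coords)
print(max_error)

# To draw the map:
# plot(coords, type = "n"); text(coords, labels = rownames(coords))
```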
The dataset also contains valuable information about tags that MusicStrands users applied to artists. The techniques illustrated above can therefore be applied to uncover fuzzy affinities, centrality values and correlations between artists and tags.
For instance, I discovered that the core artists for the tag Heavy Metal are 3 Inches Of Blood, In Flames, Opeth, Diecast, and A Life Once Lost, and those for the tag Jazz Instrumental are Bob James, Peter White, Donald Byrd, Keiko Matsui, and Najee. Once again, this is valuable knowledge, obtained by aggregating the preferences of a large audience, without any a priori information about the meaning of a specific tag.
The social-based analysis presented provides new tools and a new ontology to describe richer relationships between musical artists and genres or tags.
The uncovered knowledge does not come from experts or from automatic, content-based algorithms, but from the analysis of the collective behaviour of listeners.
The importance of the proposed analysis is that it can be applied to any domain that presents objects (e.g., artists), categories (e.g., genres), and social behaviour data (e.g., multiple human-compiled playlists), in order to uncover a more detailed and “social-aware” knowledge of that domain.
As an example, I applied the same analysis to the domain of movies. First, I retrieved from IMDB a list of popular movies, together with their genre labels (note that a movie can be labelled with multiple genres). Next, I retrieved the Netflix Prize dataset from Netflix, which lists the movies watched by a subset of Netflix users, together with the ratings assigned. Then, for each pair of movies, I counted the number of Netflix users who loved both movies (that is, users who watched them both and assigned them the highest possible rating). Finally, I uncovered movie-to-movie and movie-to-genre affinities with the same methods described in my paper.
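The pair-counting step can be sketched in R as a matrix product over a binary "loved" matrix. The ratings below are invented, the layout (users in rows, movies in columns, NA for unwatched) is an assumption, and top_rating is set to 5 on the assumption of a five-star scale.

```r
# Hypothetical ratings: one row per user, one column per movie,
# NA where the user has not watched the movie.
ratings <- matrix(c( 5, 5, NA,
                     5, 4,  5,
                     5, 5,  5,
                    NA, 5,  1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("u", 1:4), paste0("m", 1:3)))

top_rating <- 5  # assumed highest possible rating

# TRUE where a user watched the movie and gave it the highest rating.
loved <- !is.na(ratings) & ratings == top_rating

# Entry (i, j) is the number of users who loved both movie i and movie j;
# the diagonal holds the number of users who loved each single movie.
both_loved <- crossprod(loved * 1)
print(both_loved)
```

These counts play the same role for movies as the playlist co-occurrence counts do for artists.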
This graph shows the genre-centrality of Breakfast at Tiffany's (Comedy, Drama, Romance) to six different genres.
Using Euclidean distance among genre-affinity vectors, I discovered that the nearest neighbours for this movie are: A Little Romance (Comedy, Romance), Ethan Frome (Drama), Auntie Mame (Comedy, Drama), Valmont (Drama, Romance), and Moonstruck (Comedy, Drama, Romance).
I also investigated core movies for the different genres; for instance, core movies for Action are: Knock Off (Action, Thriller), Darkman II: The Return of Durant (Action, Horror, Sci-Fi), Street Fighter (Action, Thriller, Adventure, Sci-Fi), The Triangle (Drama, Mystery, Action, Thriller, Adventure), and Death Warrant (Drama, Mystery, Action).
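A nearest-neighbour lookup of this kind can be sketched in R by sorting one row of the distance matrix. The movie names, the vectors, and the helper nearest_neighbours are hypothetical.

```r
# Hypothetical genre-affinity vectors for five movies over three genres.
vectors <- rbind(
  target = c(Comedy = 0.8, Drama = 0.7, Romance = 0.9),
  m1     = c(0.7, 0.7, 0.8),
  m2     = c(0.1, 0.2, 0.1),
  m3     = c(0.8, 0.5, 0.9),
  m4     = c(0.2, 0.9, 0.3)
)

# Names of the k rows closest to `target` in Euclidean distance,
# excluding the target itself, ordered from nearest to farthest.
nearest_neighbours <- function(vectors, target, k = 2) {
  d <- as.matrix(dist(vectors))[target, ]
  d <- d[names(d) != target]
  names(sort(d))[seq_len(k)]
}

print(nearest_neighbours(vectors, "target", k = 2))
```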
This table shows the correlation of seven different genres in terms of Pearson's coefficient:
Although the method I propose shares similarities with network analysis techniques, my focus is different. For instance, I do not identify as core objects of a category g those objects that are more “interconnected”, or that have the highest “in-degree” with objects of category g. Here, core objects of g are those objects that, whenever they are “used” (e.g., artists in playlists, movies watched and rated), are used together with objects of category g. This also explains why core objects are generally not the most popular ones, since popular objects tend to occur with objects of any category. For example, The Beatles are not a core artist for Rock/Pop, because The Beatles are so famous that they co-occur in playlists with artists of almost any other genre.
In order for others to repeat the experiments, examples, and graphs described so far, I have produced the affinity R package. This package contains all the functions described, as well as the MusicStrands dataset.
$ R
> install.packages("XML")
> install.packages("scatterplot3d")
> quit()
$ R CMD INSTALL /path/to/downloaded/affinity.tar.gz
To run some sample analyses and graphs about musical artists and genre-affinity, type:
$ R
> library(affinity)
> data(mystrandsGenres)
> runExamples()
> plotExamples()
To analyse tag-affinity, rather than genre-affinity, change the third instruction to:
In order to obtain the actual names of the artists and genres, an OpenStrands ID is required, which can be freely obtained from the OpenStrands Web page.
The affinity R package is released freely under the GNU General Public License (GPL).