tagtraum genre annotations
for the Million Song Dataset

The Million Song Dataset (MSD) is a collection of one million songs annotated with features from The Echo Nest (now part of Spotify). Additional annotations to the MSD are provided by datasets like The Last.fm Dataset, musiXmatch, or the Million Song Dataset Benchmarks by Schindler et al. Amongst other features, the latter also contains song-level genre annotations derived from the All Music Guide.

To increase the accuracy and granularity of MSD genre annotations, and thus facilitate music genre recognition research, the tagtraum genre annotations are based on multiple source datasets and allow for ambiguity. Details can be found in this publication.
The slides for the oral presentation are available here.

A similar method was also used to learn genre ontologies from crowd-sourced genre labels.

Genre Ground Truth

These three ground truths were generated based on the Last.fm dataset (with inferred genre annotations: LFMGD), the Top-MAGD dataset, and the beaTunes Genre Dataset (BGD).

Name | Labels | File | Description
CD1 | 133,676 | msd_tagtraum_cd1.cls.zip | Constructed from BGD, LFMGD, and Top-MAGD, same labels as Top-MAGD, contains minority votes.
CD2 | 280,831 | msd_tagtraum_cd2.cls.zip | Based on modified BGD and LFMGD. Additional labels Metal and Punk, International = World, removed Vocal. Some labels ambiguous.
CD2C | 191,401 | msd_tagtraum_cd2c.cls.zip | Same as CD2 without ambiguous annotations.
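
As a rough illustration of how CD2C relates to CD2, the following sketch parses annotation lines and drops tracks that carry more than one genre label. The file layout is an assumption, not something this page specifies: one track per line, tab-separated, track ID first, one or more genre labels after it, and comment lines starting with '#'. The function names and the sample track IDs are made up for the example; check them against the actual .cls files before relying on this.

```python
import io
from collections import Counter


def parse_cls(lines):
    """Parse annotation lines into {track_id: [genre, ...]}.

    Assumed layout (not confirmed by this page): tab-separated fields,
    track ID first, then one or more genre labels; '#' starts a comment.
    """
    annotations = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        track_id, *genres = line.split("\t")
        annotations[track_id] = [g for g in genres if g]
    return annotations


def drop_ambiguous(annotations):
    """Keep only tracks with exactly one genre label (assumed meaning of 'unambiguous')."""
    return {t: g for t, g in annotations.items() if len(g) == 1}


# Hypothetical sample data, standing in for an unzipped .cls file.
sample = io.StringIO(
    "# hypothetical sample\n"
    "TRACK0001\tRock\n"
    "TRACK0002\tPop\tRock\n"
    "TRACK0003\tJazz\n"
)
annotations = parse_cls(sample)
unambiguous = drop_ambiguous(annotations)
genre_counts = Counter(genres[0] for genres in unambiguous.values())
```

For a real file, `parse_cls(open(path, encoding="utf-8"))` would take the place of the in-memory sample, where `path` points to the unzipped file.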

Classification Tasks

These tasks are constructed similarly to the ones published by Schindler et al. However, there is no correspondence on the identifier level, i.e. these are independent tasks.

Non-stratified splits
90% training data CD1 CD2 CD2C
80% training data CD1 CD2 CD2C
66% training data CD1 CD2 CD2C
55% training data CD1 CD2 CD2C
Stratified splits
90% training data CD1 CD2 CD2C
80% training data CD1 CD2 CD2C
66% training data CD1 CD2 CD2C
55% training data CD1 CD2 CD2C
Splits with fixed size per genre
1,000 samples training data / genre set CD1 CD2 CD2C
2,000 samples training data / genre set CD1 CD2 CD2C
3,000 samples training data / genre set - CD2 -
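
The difference between the stratified and the fixed-size splits can be sketched as follows. This is a minimal illustration, not the procedure used to produce the published split files: the function names, the seed handling, and the rule of rejecting genres with too few tracks are all assumptions, and the toy labels are invented.

```python
import random
from collections import defaultdict


def group_by_genre(labels):
    """Group track IDs by genre; sorted so results are reproducible."""
    by_genre = defaultdict(list)
    for track_id, genre in sorted(labels.items()):
        by_genre[genre].append(track_id)
    return by_genre


def stratified_split(labels, train_fraction, seed=0):
    """Split so that each genre contributes the same fraction to training."""
    rng = random.Random(seed)
    train, test = [], []
    for tracks in group_by_genre(labels).values():
        rng.shuffle(tracks)
        cut = round(len(tracks) * train_fraction)
        train.extend(tracks[:cut])
        test.extend(tracks[cut:])
    return train, test


def fixed_size_split(labels, per_genre, seed=0):
    """Put exactly `per_genre` tracks of every genre into the training set."""
    rng = random.Random(seed)
    train, test = [], []
    for genre, tracks in group_by_genre(labels).items():
        if len(tracks) < per_genre:
            # Assumption: a genre that is too small makes the split invalid.
            raise ValueError(f"genre {genre!r} has fewer than {per_genre} tracks")
        rng.shuffle(tracks)
        train.extend(tracks[:per_genre])
        test.extend(tracks[per_genre:])
    return train, test


# Toy data: 80 Rock tracks and 20 Jazz tracks with invented IDs.
labels = {f"T{i:03d}": ("Rock" if i < 80 else "Jazz") for i in range(100)}
train, test = stratified_split(labels, 0.9)
fixed_train, fixed_test = fixed_size_split(labels, 10)
```

A non-stratified split would shuffle all tracks together and cut once, so the genre proportions in the training set could drift from those of the full set.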

Co-occurrences & Trees

BGD and LFMGD are generated based on co-occurrences and derived genre trees (taxonomies). These files contain both the relative co-occurrences (values below 0.0001 were dropped) and the generated genre trees in JSON format.
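
To make the thresholding concrete, here is a small sketch that computes relative co-occurrences from per-song label sets and drops values below 0.0001. The normalization is an assumption (how often label b appears on songs labeled a, divided by the number of songs labeled a); the publication defines the actual measure. The nested dictionary layout and all names are likewise hypothetical and need not match the JSON files.

```python
import json
from collections import Counter
from itertools import permutations

MIN_RELATIVE = 0.0001  # threshold mentioned in the text above


def relative_cooccurrences(label_sets, min_value=MIN_RELATIVE):
    """Return {a: {b: count(a and b) / count(a)}}, dropping small values."""
    totals = Counter()  # number of songs carrying each label
    pairs = Counter()   # number of songs carrying both labels of an ordered pair
    for labels in label_sets:
        unique = set(labels)
        totals.update(unique)
        pairs.update(permutations(sorted(unique), 2))
    result = {}
    for (a, b), n in pairs.items():
        value = n / totals[a]
        if value >= min_value:
            result.setdefault(a, {})[b] = value
    return result


# Invented label sets, one list per song.
songs = [["rock", "indie"], ["rock"], ["rock", "indie"], ["jazz"]]
cooc = relative_cooccurrences(songs)
strict = relative_cooccurrences(songs, min_value=0.7)
as_json = json.dumps(cooc, indent=2, sort_keys=True)
```

With this normalization the relation is asymmetric: in the toy data every "indie" song is also a "rock" song, but only two of three "rock" songs are "indie".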

Note that by far the most user submissions came from English-speaking users, followed by German, French, and Spanish. In the publication, only the labels submitted by English speakers were used.

Source | User Language | File | Description
Last.fm | Unspecified | lastfm.json.zip | Used for CD1, CD2, CD2C.
beaTunes | English | beatunes_eng.json.zip | Based on 521,070,246 submissions. Used for CD1, CD2, CD2C.
beaTunes | German | beatunes_deu.json.zip | Based on 97,876,937 submissions. Informative only.
beaTunes | French | beatunes_fra.json.zip | Based on 43,316,474 submissions. Informative only.
beaTunes | Spanish | beatunes_spa.json.zip | Based on 27,142,179 submissions. Informative only.
beaTunes | Dutch | beatunes_nld.json.zip | Based on 21,164,860 submissions. Informative only.
beaTunes | Italian | beatunes_ita.json.zip | Based on 14,012,314 submissions. Informative only.
beaTunes | Japanese | beatunes_jpn.json.zip | Based on 11,034,788 submissions. Informative only.
beaTunes | Portuguese | beatunes_por.json.zip | Based on 8,440,576 submissions. Informative only.
beaTunes | Danish | beatunes_dan.json.zip | Based on 4,997,361 submissions. Informative only.
beaTunes | Russian | beatunes_rus.json.zip | Based on 4,521,323 submissions. Informative only.
beaTunes | Swedish | beatunes_swe.json.zip | Based on 4,569,080 submissions. Informative only.
beaTunes | Chinese | beatunes_zho.json.zip | Based on 3,311,139 submissions. Informative only.
beaTunes | Polish | beatunes_pol.json.zip | Based on 1,099,730 submissions. Informative only.
beaTunes | Korean | beatunes_kor.json.zip | Based on 805,969 submissions. Informative only.

Inferred Genre Annotations

Using co-occurrences and derived trees, we annotated both the Last.fm dataset and the matched beaTunes songs with seed-level genres.

Source | Name | File | Description
Last.fm | LFMGD | msd_lastfm_map.cls.zip | Last.fm dataset with additional inferred genre annotations.
beaTunes | BGD | msd_beatunes_map.cls.zip | beaTunes database matched with MSD. Original genre labels and inferred genre annotations.
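
One way to picture the mapping from a fine-grained label to a seed-level genre is a walk up the genre tree until a seed genre is reached. The sketch below is only an illustration of that idea under strong assumptions: the tree is given as a child-to-parent dictionary, the seed set is known in advance, and the genres shown are invented. The actual inference, which also uses co-occurrences, is described in the publication.

```python
def seed_genre(label, parent, seeds):
    """Walk up the tree from `label`; return the first seed genre found, else None."""
    seen = set()  # guards against cycles in a malformed tree
    current = label
    while current is not None and current not in seen:
        if current in seeds:
            return current
        seen.add(current)
        current = parent.get(current)
    return None


# Hypothetical tree fragment (child -> parent) and seed genres.
parent = {"hard bop": "bebop", "bebop": "jazz", "shoegaze": "rock"}
seeds = {"jazz", "rock"}

inferred = {
    label: seed_genre(label, parent, seeds)
    for label in ["hard bop", "shoegaze", "jazz", "polka"]
}
```

Labels that are not connected to any seed genre, such as "polka" in this toy tree, map to `None` and would need separate handling.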


What is the licensing?

Research only, strictly non-commercial.

How to cite the dataset?

Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. [slides]

Other research.