Analyzing My Music Library with Elasticsearch

I listen to music a lot. Looking at my iTunes music library the other day, I started making "smart" playlists to classify my collection by album release year. It was a tedious process, and I only went back from 2015 to 1991. Then I realized that classifying and analyzing my music collection could actually be an interesting activity, and I started wondering how to do it efficiently. Having only used Elasticsearch at work for trivial log indexing and searching so far, I figured I could use it in a more "advanced" way to help me dig through my gigabytes of music tracks.

Indexing the iTunes library into Elasticsearch

In order to search through my music library, I first had to extract the track metadata from iTunes and index it in Elasticsearch.

The first step was quite simple: export the library from iTunes to a file, library.xml, via the iTunes menu File > Library > Export Library. The result is an XML property list file that I had to convert into JSON documents that Elasticsearch can index. A raw library track entry looks like this:

<key>1566</key>
<dict>
    <key>Track ID</key><integer>1566</integer>
    <key>Name</key><string>Gravity</string>
    <key>Artist</key><string>A Perfect Circle</string>
    <key>Composer</key><string>Maynard James Keenan, Billy Howerdel, Josh Freese, Troy Van Leeuwen, Paz Lenchantin</string>
    <key>Album</key><string>Thirteenth Step</string>
    <key>Genre</key><string>Rock</string>
    <key>Kind</key><string>MPEG audio file</string>
    <key>Size</key><integer>12286195</integer>
    <key>Total Time</key><integer>306128</integer>
    <key>Disc Number</key><integer>1</integer>
    <key>Disc Count</key><integer>1</integer>
    <key>Track Number</key><integer>12</integer>
    <key>Track Count</key><integer>12</integer>
    <key>Year</key><integer>2003</integer>
    <key>Date Modified</key><date>2009-11-14T11:46:42Z</date>
    <key>Date Added</key><date>2013-04-15T21:12:53Z</date>
    <key>Bit Rate</key><integer>320</integer>
    <key>Sample Rate</key><integer>44100</integer>
    <key>Play Count</key><integer>4</integer>
    <key>Play Date</key><integer>3488975157</integer>
    <key>Play Date UTC</key><date>2014-07-23T13:45:57Z</date>
    <key>Normalization</key><integer>2325</integer>
    <key>Artwork Count</key><integer>1</integer>
    <key>Sort Artist</key><string>Perfect Circle</string>
    <key>Persistent ID</key><string>D9E761709EA898D1</string>
    <key>Track Type</key><string>File</string>
    <key>Location</key><string>file:///Users/marc/Music/A%20Perfect%20Circle/Thirteenth%20Step/12%20Gravity.mp3</string>
    <key>File Folder Count</key><integer>4</integer>
    <key>Library Folder Count</key><integer>1</integer>
</dict>

I've written a quick-and-dirty Python script that parses this file in a trivial and sub-optimal way, which shows especially when, as in my case, the exported library file is several megabytes long:

As you can see, it only keeps a few fields from the library tracks metadata – the ones I've found relevant to my analysis.
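The script itself is not reproduced above. As a rough stand-in, here is a minimal sketch of how such a conversion could be done with Python's standard plistlib module. It is not the original itunes2json.py: the FIELDS mapping and the function names are assumptions, chosen only to reproduce the field names visible in the library.json sample below (total_time in seconds, year as a string).

```python
import json
import plistlib

# Hypothetical mapping from iTunes plist keys to JSON field names.
FIELDS = {
    "Name": "name",
    "Artist": "artist",
    "Album": "album",
    "Genre": "genre",
    "Kind": "kind",
    "Size": "size",
    "Bit Rate": "bit_rate",
    "Play Count": "play_count",
}


def track_to_doc(track):
    """Keep only a few fields of one iTunes track entry."""
    doc = {out: track[key] for key, out in FIELDS.items() if key in track}
    if "Total Time" in track:
        # iTunes stores durations in milliseconds; keep whole seconds.
        doc["total_time"] = track["Total Time"] // 1000
    if "Year" in track:
        doc["year"] = str(track["Year"])
    return doc


def convert(xml_bytes):
    """Return one JSON-formatted line per track found in the exported library."""
    library = plistlib.loads(xml_bytes)
    tracks = library.get("Tracks", {})
    return [json.dumps(track_to_doc(track)) for track in tracks.values()]


# Usage (assumed file name):
#   with open("library.xml", "rb") as fh:
#       print("\n".join(convert(fh.read())))
```

Letting plistlib do the parsing avoids hand-rolled XML handling, at the cost of loading the whole library in memory, which is fine for a file of a few megabytes.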

Running the script with the exported library file as its argument generates a library.json file containing one JSON-formatted line per track in the library:

$ ./itunes2json.py library.xml > library.json

$ wc -l library.json
8408 library.json

$ head -n 1 library.json
{"album": "Fight Club", "total_time": 303, "kind": "MPEG audio file", "name": "Who Is Tyler Durden ?", "artist": "The Dust Brothers", "play_count": 2, "bit_rate": 320, "year": "1999", "genre": "Soundtrack", "size": 12239773}

The most efficient way I've found to index the JSON documents into Elasticsearch is its Bulk API. This method requires a bit of shell scripting to interleave each record with a bulk action line:

$ while read track; do
  echo '{"index":{"_index":"library","_type":"track"}}'
  echo $track
done < library.json > bulk
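The same bulk file can also be built without going through the shell, which sidesteps shell quoting and backslash handling altogether. The sketch below is only an illustration: build_bulk_body is a hypothetical helper, and the index and type names are the ones used in this post. It relies on the Bulk API's format of one action line followed by one source line, with a trailing newline at the end of the body.

```python
import json


def build_bulk_body(json_lines, index="library", doc_type="track"):
    """Interleave an 'index' action line with each JSON document line."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    out = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        json.loads(line)  # raises ValueError early if the line is not valid JSON
        out.append(action)
        out.append(line)
    # Every line, including the last one, must end with a newline.
    return "".join(entry + "\n" for entry in out)


# Usage (assumed file names):
#   with open("library.json") as src, open("bulk", "w") as dst:
#       dst.write(build_bulk_body(src))
```

Validating each line before writing it means a malformed document is caught locally instead of being rejected later by Elasticsearch.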

The settings of the library Elasticsearch index and the mapping for the "track" document type look like this:

$ cat library.index
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    }
  },
  "mappings": {
    "track": {
      "properties": {
        "year": { "format": "year", "type": "date" },
        "album": { "index": "not_analyzed", "type": "string" },
        "artist": { "index": "not_analyzed", "type": "string" },
        "genre": { "index": "not_analyzed", "type": "string" },
        "kind": { "index": "not_analyzed", "type": "string" },
        "name": { "index": "not_analyzed", "type": "string" },
        "play_count": { "type": "long" },
        "total_time": { "type": "long" },
        "bit_rate": { "type": "long" },
        "size": { "type": "long" }
      }
    }
  }
}

# Create the index
$ curl -X PUT -d @library.index localhost:9200/library
{"acknowledged":true}

# Bulk load the documents
$ curl --data-binary @bulk 'localhost:9200/library/track/_bulk?pretty'
...
    "create" : {
      "_index" : "library",
      "_type" : "track",
      "_id" : "AVAqUo18SEui9dB0w9wD",
      "_version" : 1,
      "status" : 201
    }
  } ]
}

$ curl 'localhost:9200/library/_stats/docs/?pretty'
{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 8392,
        "deleted" : 0
      }
    },
    "total" : {
      "docs" : {
        "count" : 8392,
        "deleted" : 0
      }
    }
  },
  "indices" : {
    "library" : {
      "primaries" : {
        "docs" : {
          "count" : 8392,
          "deleted" : 0
        }
      },
      "total" : {
        "docs" : {
          "count" : 8392,
          "deleted" : 0
        }
      }
    }
  }
}

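You might have noticed that I ended up with fewer documents indexed than expected (8392 instead of 8408): during the bulk indexing some parsing errors occurred, causing Elasticsearch to discard some documents. A likely culprit is the shell loop above rather than the data itself: read without the -r option strips backslashes, so an escaped quote (\") inside a track name reaches Elasticsearch as a bare quote and breaks the JSON document, which is consistent with the error reported. For instance: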

...
    "create" : {
      "_index" : "library",
      "_type" : "track",
      "_id" : "AVAqUo1zSEui9dB0w9g8",
      "status" : 400,
      "error" : "MapperParsingException[failed to parse]; nested: JsonParseException[Unexpected character ('D' (code 68)): was expecting comma to separate OBJECT entries\n at [Source: [B@afbf6; line: 1, column: 100]]; "
    }
...
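Rejected documents are easy to miss in a long bulk response. As a small sketch, assuming the response has been saved and parsed as JSON, the hypothetical helper below lists the items whose status is 400 or above, based on the "items" structure visible in the excerpts above.

```python
import json


def failed_items(bulk_response):
    """Return (action, _id, error) for every item rejected by a bulk request."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a one-key object such as {"create": {...}}.
        for action, result in item.items():
            if result.get("status", 0) >= 400:
                failures.append((action, result.get("_id"), result.get("error")))
    return failures


# Usage (assumed file name, holding the saved output of the bulk request):
#   with open("bulk_response.json") as fh:
#       for action, doc_id, error in failed_items(json.load(fh)):
#           print(doc_id, error)
```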

Oh well... Let's move to the fun part ;)

Analyzing my music collection

First, and this goes without saying: the results obtained are only as good as my music files' metadata, e.g. the ID3 tags for MP3 files. I tried my best to keep them clean and exact, but there might be some inconsistencies here and there.

I used Kibana for the first part of my analysis. Looking at the indexed documents without any specific query, here is what we find:

I started with a few visualization widgets summarizing trivial stats on my collection:

Top 10 artists/bands

That one was quite a surprise to me: although I used to listen to a lot of Final Fantasy soundtracks, I didn't expect Nobuo Uematsu to be the most represented artist in my collection.

Distribution per year

Not much to say here: I tend to listen to fairly recent music.

Top 10 musical genres

For what it's worth when trying to classify music into strict genres, the trend is quite clear here: I listen mostly to soundtracks (movies and video games), Rock and Metal.

File kinds

Nothing special to say about this either: my music collection is essentially composed of MP3s ripped from original CDs, plus a few tracks bought in the Apple iTunes store.

OK, time to level up a bit. The next queries were made directly against Elasticsearch's search API, usually leveraging aggregations, as I couldn't find a way to build them in Kibana.

10 Most played tracks

$ curl 'localhost:9200/library/track/_search?q=*&sort=play_count:desc&fields=artist,name,album,play_count&size=10&pretty'

(Raw query result)

Artist/band | Track name | Album | # played
Slipknot | Nomadic | .5: The Gray Chapter | 143
Slipknot | Goodbye | .5: The Gray Chapter | 122
Slipknot | The One That Kills The Least | .5: The Gray Chapter | 116
Slipknot | Lech | .5: The Gray Chapter | 99
Slipknot | Killpop | .5: The Gray Chapter | 93
Slipknot | AOV | .5: The Gray Chapter | 90
Slipknot | The Devil In I | .5: The Gray Chapter | 90
Slipknot | Skeptic | .5: The Gray Chapter | 81
Slipknot | Sarcastrophe | .5: The Gray Chapter | 80
Asking Alexandria | Dear Insanity | Reckless And Relentless | 68

Well, I love this Slipknot album...

Top 10 artists/bands by number of albums

This query ranks artists/bands by the number of albums – that I own, of course:

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },
  "aggregations": {
    "per_artist": {
      "terms": {
        "field": "artist",
        "size": 10,
        "order": { "artist_albums.value": "desc" }
      },
      "aggregations": {
        "artist_albums": {
          "cardinality": {
            "field": "album"
          }
        }
      }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Artist/band | # albums
Thrice | 14
Muse | 12
Nine Inch Nails | 9
The Used | 9
Avenged Sevenfold | 8
EZ3kiel | 8
Nobuo Uematsu | 8
Disturbed | 7
Feeder | 7
Foo Fighters | 7
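Since the cardinality aggregation returns approximate counts in general, it can be reassuring to cross-check the ranking locally. The following sketch recomputes the number of distinct albums per artist directly from the lines of library.json; albums_per_artist is a hypothetical helper, and ties are broken alphabetically, which is an arbitrary choice.

```python
import json
from collections import defaultdict


def albums_per_artist(json_lines, top=10):
    """Rank artists by their number of distinct album names."""
    albums = defaultdict(set)
    for line in json_lines:
        doc = json.loads(line)
        if "artist" in doc and "album" in doc:
            albums[doc["artist"]].add(doc["album"])
    # Most albums first; alphabetical order between artists with equal counts.
    ranked = sorted(albums.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(artist, len(names)) for artist, names in ranked[:top]]


# Usage (assumed file name):
#   with open("library.json") as fh:
#       for artist, count in albums_per_artist(fh):
#           print(artist, count)
```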

Shortest/Longest track duration

Shortest track:

$ curl 'localhost:9200/library/track/_search?q=*&sort=total_time:asc&fields=artist,name,album,total_time&size=1&pretty'

(Raw query result)

Artist/band | Track name | Album | Track duration
Enhancer | Glock II | Street Trash | 5s

Longest track:

$ curl 'localhost:9200/library/track/_search?q=*&sort=total_time:desc&fields=artist,name,album,total_time&size=1&pretty'

(Raw query result)

Artist/band | Track name | Album | Track duration
Hans Zimmer | Alabama | Crimson Tide Movie Soundtrack | 23m50s

Longest single album duration

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },
  "aggregations": {
    "per_artist": {
      "terms": { "field": "artist", "size": 1 },
      "aggregations": {
        "per_album": {
          "terms": { "field": "album", "size": 1 },
          "aggregations": {
            "album_duration": {
              "sum": { "field": "total_time" }
            }
          }
        }
      }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Artist/band | Album | Album duration
Nobuo Uematsu | Final Fantasy VIII Soundtrack | 4h8m52s

Finding this one has been a bit problematic: I've found the correct result, but the method is flawed.

The method should have been:

  1. Aggregate tracks per artist
  2. Sub-aggregate the aggregated tracks per artist per album
  3. Sub-aggregate the aggregated tracks per artist per album by summing their total_time field
  4. Sort the results by the total_time summed value of each artist album

I managed to do all of that except the final sorting, because Elasticsearch can only sort on a metric in a nested sub-aggregation ("deep" metrics sorting) when every aggregation on the path to that metric is single-bucket, and that is not the case here. The only reason I got the correct result is unrelated to the computed durations: by default Elasticsearch sorts terms buckets by their number of documents, and it happens that the longest album in duration is by the artist/band that also has the most indexed documents in my collection. Likewise, the per_album sub-aggregation returns each artist's album with the most tracks, not the longest one. The problem becomes visible when looking beyond the first result, for instance with a top 10 of the longest albums:

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },
  "aggregations": {
    "per_artist": {
      "terms": { "field": "artist", "size": 10 },
      "aggregations": {
        "per_album": {
          "terms": { "field": "album", "size": 1 },
          "aggregations": {
            "album_duration": {
              "sum": { "field": "total_time" }
            }
          }
        }
      }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Artist/band | Album | Album duration | # total artist/band tracks indexed
Nobuo Uematsu | Final Fantasy VIII Original Soundtrack | 4h8m52s | 166
Nine Inch Nails | Ghosts I-IV | 1h50m9s | 156
Muse | B-sides (grouping of all B-sides) | 1h34m23s | 147
Thrice | Anthology | 1h43m7s | 143
Slipknot | Slipknot | 1h18m46s | 113
The Used | Shallow Believer | 1h7m28s | 113
Disturbed | The Sickness | 1h9m19s | 102
Foo Fighters | In Your Honor | 1h23m16s | 95
Korn | Issues | 53m9s | 95
EZ3kiel | LUX | 1h10m18s | 94
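One possible workaround for the missing "deep" sort, sketched here as an assumption and not something I ran against the index, is to do the final sorting on the client side: request the nested aggregation with sizes large enough to cover every artist and album, then flatten and sort the buckets locally. The hypothetical helper below expects a parsed search response using the aggregation names from the query above.

```python
def longest_albums(response, top=10):
    """Flatten per_artist > per_album buckets and sort them by album duration."""
    rows = []
    for artist in response["aggregations"]["per_artist"]["buckets"]:
        for album in artist["per_album"]["buckets"]:
            duration = album["album_duration"]["value"]
            rows.append((artist["key"], album["key"], duration))
    # Longest album first, whatever the artist's total number of tracks.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top]
```

This keeps the artist/album relation intact, but the result is only correct if the response really contains all buckets: any album cut off by a too-small "size" can never appear in the ranking.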

Then I figured out a simpler way, which incidentally provides correct results, but I had to let go of the "artist/band:album" relation (albums by different artists that share the same title end up in the same bucket):

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },

  "aggregations": {
    "per_album": {
      "terms": {
        "field": "album",
        "size": 10,
        "order": { "album_duration.value": "desc" }
      },
      "aggregations": {
        "album_duration": {
          "sum": { "field": "total_time" }
        }
      }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Album | Album duration
Final Fantasy VIII Original Soundtrack | 4h8m52s
The Girl With The Dragon Tattoo (Movie Soundtrack) | 2h53m34s
Chopin: Essential Classic | 2h38m30s
The Dark Knight (Movie Soundtrack) | 2h26m16s
Mad Max: Fury Road (Movie Soundtrack) | 2h5m8s
The Amazing Spider-Man 2 (Movie Soundtrack) | 1h55m6s
The Incredible Hulk (Movie Soundtrack) | 1h51m
Ghosts I-IV | 1h50m9s
Chopin - 19 Nocturnes | 1h46m58s
The Fragile | 1h43m39s
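The aggregation returns each duration as a raw sum of total_time values, i.e. a number of seconds. A small helper along these lines can turn such a sum into the "4h8m52s" notation used in the tables; format_duration is a hypothetical name, and the rule of dropping a trailing "0s" is an assumption made to resemble the figures shown here.

```python
def format_duration(seconds):
    """Format a number of seconds as e.g. '4h8m52s', '23m50s' or '5s'."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    # Drop a trailing "0s", unless it is the only component.
    if secs or not parts:
        parts.append(f"{secs}s")
    return "".join(parts)
```

For example, a sum of 14932 seconds is 4 × 3600 + 8 × 60 + 52, which the helper renders as 4h8m52s.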

Largest single album by number of tracks

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },
  "aggregations": {
    "per_album": {
      "terms": { "field": "album", "size": 1 }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Artist/band | Album | # tracks
Nobuo Uematsu | Final Fantasy VIII Soundtrack | 74

Top 10 artists/bands by combined music duration

$ curl -d '{
  "query": {
    "query_string": { "query": "*", "analyze_wildcard": true }
  },
  "aggregations": {
    "per_artist": {
      "terms": {
        "field": "artist",
        "size": 10,
        "order": { "artist_duration.value": "desc" }
      },
      "aggregations": {
        "artist_duration": {
          "sum": { "field": "total_time" }
        }
      }
    }
  }
}' 'localhost:9200/library/track/_search?search_type=count&pretty'

(Raw query result)

Artist/band | Total duration
Nine Inch Nails | 10h47m38s
Muse | 10h32m13s
Nobuo Uematsu | 10h0m1s
Thrice | 9h12m59s
EZ3kiel | 8h2m56s
Slipknot | 7h51m44s
The Used | 6h57m28s
Disturbed | 6h52m19s
Korn | 6h44m36s

Conclusion

This exercise allowed me to extract interesting facts and trends about my tastes from my own music collection, and got me to know Elasticsearch a little better in the process. It's been a fun ride :)