Authorship Attribution Applied to Music

Completed at the University of Manchester, 2022

Authorship attribution is a multi-class classification problem: given a piece of text, identify its author from a predefined list of candidates. This project applied authorship attribution techniques to song lyrics using recurrent neural networks (LSTM and GRU) and Siamese networks. The goal was to classify lyrics by artist and to investigate whether a music recommendation system could be built on lyrical content alone.

Project Overview

  • Duration: May 2022 (1 month)
  • Technologies and Tools: Python, PyTorch, GloVe embeddings, GRU, LSTM, Siamese networks, k-means clustering, k-nearest neighbors (k-NN)
  • Skills Developed: Natural Language Processing (NLP), neural network architecture design, data preprocessing, model evaluation, and hyperparameter tuning

Key Objectives

  1. Authorship Attribution on IMDB62 Dataset:
    • Implemented and compared LSTM, GRU, and Siamese networks for authorship attribution on the IMDB62 dataset, achieving accuracies of 85.45%, 88.60%, and 86.81%, respectively.
    • Outperformed baseline methods (e.g., TF-IDF) by a significant margin, demonstrating the effectiveness of deep learning models for stylometric feature extraction.
  2. Lyric-Author Classification:
    • Applied the same architectures to a dataset of song lyrics from 1,000 artists, achieving a maximum accuracy of 25.65% with the GRU model.
    • Addressed challenges such as data imbalance, limited samples per artist, and noisy data (e.g., unknown tokens and punctuation).
  3. Music Recommendation System:
    • Explored the use of vector representations from the trained models to create a music recommendation system.
    • Used k-means clustering and k-NN to group similar songs based on lyrical content, with promising results for artists the model was trained on.
    • Identified limitations for unseen artists and proposed future improvements, such as incorporating genre classification and user modeling.
  4. Data Imbalance:
    • Resampled the dataset so that each artist contributed 250 samples.
  5. Noisy Data:
    • Retained punctuation and capitalization to capture stylistic features, but out-of-vocabulary words remained a problem.
    • Proposed fastText as a source of better word vector representations.
  6. Underfitting:
    • Due to the small number of samples per artist, the models showed signs of underfitting. Future work could involve tuning hyperparameters and increasing the dataset size.
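
The resampling step for data imbalance can be sketched as follows. This is a minimal illustration, not the project's actual code: the write-up does not say how the 250 samples per artist were drawn, so the sketch assumes downsampling without replacement for large artists and oversampling with replacement for small ones. The function name `resample_per_artist` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def resample_per_artist(songs, n_per_artist=250, seed=0):
    """Return a list with exactly n_per_artist (artist, lyrics) pairs per artist.

    Assumption: artists with more songs are downsampled without replacement;
    artists with fewer are topped up by sampling their own songs with replacement.
    """
    rng = random.Random(seed)
    by_artist = defaultdict(list)
    for artist, lyrics in songs:
        by_artist[artist].append(lyrics)

    balanced = []
    for artist, lyrics_list in by_artist.items():
        if len(lyrics_list) >= n_per_artist:
            chosen = rng.sample(lyrics_list, n_per_artist)
        else:
            extra = rng.choices(lyrics_list, k=n_per_artist - len(lyrics_list))
            chosen = lyrics_list + extra
        balanced.extend((artist, lyrics) for lyrics in chosen)
    return balanced


# Toy data: one over-represented and one under-represented artist.
toy = [("artist_a", f"song {i}") for i in range(400)]
toy += [("artist_b", f"song {i}") for i in range(100)]
balanced = resample_per_artist(toy, n_per_artist=250)
```

Oversampling with replacement duplicates songs, so in a setup like this it would normally be applied to the training split only; otherwise copies of the same song could land in both training and test data.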

Results and Insights

  • The models demonstrated strong performance on the IMDB62 dataset, validating their ability to capture stylometric features.
  • For lyric-author classification, the GRU model outperformed the LSTM, likely because it has fewer parameters and is therefore less prone to overfitting.
  • The music recommendation system showed potential for artists the model was trained on, but struggled with unseen artists. Manual evaluation suggested that lyrical similarity could be a useful feature for recommendations, though further testing is required.
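
The nearest-neighbour side of the recommendation system can be illustrated with a small sketch. It assumes each song has already been mapped to a fixed-length vector by a trained model and that cosine similarity is the comparison measure; the write-up does not state which distance was used, and the vectors, titles, and the `recommend` function below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_title, song_vectors, k=2):
    """Return the k titles whose vectors are most similar to the query song."""
    query = song_vectors[query_title]
    scored = [
        (cosine_similarity(query, vec), title)
        for title, vec in song_vectors.items()
        if title != query_title
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


# Hypothetical 3-dimensional song vectors; real ones would come from the
# trained network and be much larger.
song_vectors = {
    "song_a": [0.9, 0.1, 0.0],
    "song_b": [0.8, 0.2, 0.1],
    "song_c": [0.0, 0.1, 0.9],
    "song_d": [0.1, 0.0, 0.8],
}
neighbours = recommend("song_a", song_vectors, k=2)
```

A lookup like this only works well when the vectors place similar songs close together, which is consistent with the weaker results reported for artists the model never saw during training.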

Future Work

  • Expand the dataset to include more samples per artist and explore genre-based classification.
  • Incorporate fastText embeddings to handle out-of-vocabulary words.
  • Combine lyrical analysis with user modeling to improve recommendation accuracy.
  • Conduct a comprehensive manual evaluation of the recommendation system to assess its effectiveness.
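
The appeal of fastText for out-of-vocabulary words is that it builds word vectors from character n-grams, so a word missing from the vocabulary can still be represented through its pieces. The sketch below illustrates the idea in plain Python rather than calling the fastText library: the n-gram range of 3 to 6, the averaging step, and the tiny `ngram_vectors` table are assumptions made for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams; zeros if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Hypothetical 2-dimensional n-gram vectors, for illustration only.
ngram_vectors = {"<lo": [1.0, 0.0], "ove": [0.0, 1.0]}
trigrams = char_ngrams("where", n_min=3, n_max=3)
vector = oov_vector("looove", ngram_vectors, dim=2)
```

Here the stretched spelling "looove", typical of lyrics, shares the n-grams "<lo" and "ove" with "love", so it still receives a non-zero vector instead of collapsing to a single unknown token.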

Conclusion

This project successfully applied authorship attribution techniques to song lyrics, demonstrating the potential for using lyrical content in music recommendation systems. While challenges such as data imbalance and underfitting were encountered, the results provide a strong foundation for future work in this area. The project also highlighted the importance of combining multiple data sources (e.g., lyrics, user preferences) to build robust recommendation systems.