Auxiliary self-supervision to metric learning for music similarity-based retrieval and auto-tagging

Taketo Akama; Hiroaki Kitano; Katsuhiro Takematsu; Yasushi Miyajima; Natalia Polouliakh

doi:10.1371/journal.pone.0294643

PLoS One. 2023 Nov 30;18(11):e0294643. doi: 10.1371/journal.pone.0294643. eCollection 2023.

Authors

Taketo Akama¹, Hiroaki Kitano¹, Katsuhiro Takematsu², Yasushi Miyajima², Natalia Polouliakh¹

Affiliations

¹ Sony Computer Science Laboratories, Inc, Tokyo, Japan.
² Koozyt, Inc, Tokyo, Japan.

Abstract

In the realm of music information retrieval, similarity-based retrieval and auto-tagging serve as essential components. Similarity-based retrieval involves automatically analyzing a music track and fetching analogous tracks from a database. Auto-tagging, on the other hand, assesses a music track to deduce associated tags, such as genre and mood. Given the limitations and non-scalability of human supervision signals, it becomes crucial for models to learn from alternative sources to enhance their performance. Contrastive learning-based self-supervised learning, which exclusively relies on learning signals derived from music audio data, has demonstrated its efficacy in the context of auto-tagging. In this work, we propose a model that builds on the self-supervised learning approach to address the similarity-based retrieval challenge by introducing our method of metric learning with a self-supervised auxiliary loss. Furthermore, diverging from conventional self-supervised learning methodologies, we discovered the advantages of concurrently training the model with both self-supervision and supervision signals, without freezing pre-trained models. We also found that refraining from employing augmentation during the fine-tuning phase yields better results. Our experimental results confirm that the proposed methodology enhances retrieval and tagging performance metrics in two distinct scenarios: one where human-annotated tags are consistently available for all music tracks, and another where such tags are accessible only for a subset of music tracks.
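The core idea in the abstract, a supervised metric-learning loss trained jointly with a self-supervised auxiliary loss, can be illustrated with a minimal sketch. The abstract does not name the specific losses, so everything below is an assumption made for illustration: a triplet margin loss over cosine distance stands in for the metric-learning term, an InfoNCE-style contrastive loss over two augmented views stands in for the self-supervised term, and the names `triplet_loss`, `contrastive_loss`, `combined_loss`, `margin`, `temperature`, and `aux_weight` are hypothetical. Embeddings are plain lists of floats that an encoder (not shown) would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Supervised term (assumed form): pull tracks that share tags together
    and push tracks with different tags at least `margin` further away."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def contrastive_loss(views_a, views_b, temperature=0.1):
    """Self-supervised term (assumed form): views_a[i] and views_b[i] embed
    two augmented views of track i; the other tracks act as negatives."""
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]  # cross-entropy with target index i
    return total / n


def combined_loss(triplets, views_a, views_b, aux_weight=0.5):
    """Joint objective: metric loss plus a weighted auxiliary loss.
    `triplets` may be empty when a batch contains no tagged tracks."""
    if triplets:
        metric = sum(triplet_loss(a, p, n) for a, p, n in triplets) / len(triplets)
    else:
        metric = 0.0
    return metric + aux_weight * contrastive_loss(views_a, views_b)
```

In this sketch both terms would backpropagate into the same encoder at once, which mirrors the concurrent training the abstract describes. Letting `triplets` be empty reflects the second scenario, where tags exist for only a subset of tracks and untagged tracks contribute through the auxiliary term alone.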

Copyright: © 2023 Akama et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Affect
Benchmarking
Databases, Factual
Humans
Information Storage and Retrieval
Music*

Grants and funding

The authors received no specific funding for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.