When: Wednesday, February 11th, 2015 @ 5:00 PM – 6:00 PM
Where: LT-A, Alison House (Reid School of Music), Nicolson Square, University of Edinburgh
Out of the barn and into the yard, and other colourful results from my recent paroxysm about the practice of evaluation in machine music listening
I draw attention to what I call the “crisis of evaluation” in music information retrieval (MIR) research. Among other things, MIR seeks to address the varied music information needs of listeners, music recording archives, and music companies. A large portion of MIR research has thus been devoted to the automated description of music by genre, mood, and other meaningful terms.
However, my recent work reveals four things: 1) many published results unknowingly use datasets with faults that render them meaningless; 2) state-of-the-art (“high classification accuracy”) systems are fooled by irrelevant factors; 3) most published results are based upon an invalid evaluation design; and 4) much work has unknowingly built, tuned, tested, compared and advertised “horses” instead of solutions. (The true story of the horse Clever Hans provides the most appropriate illustration.) I explain why these problems have occurred, and how we can address them by adopting the formal design and evaluation of experiments, along with other best practices.