Overview
The Multi-target Challenge assesses how well current machine learning approaches can determine whether a recorded utterance was spoken by one of a large number of "blacklisted" speakers [1]. It is a form of multi-target speaker detection based on real-world telephone conversations: the recordings come from call-center customer-agent conversations, and each conversation is represented by a single i-vector [2]. Given a pool of training and development data from blacklist and non-blacklist speakers, the task is to measure how accurately one can detect 1) whether a test recording is spoken by a blacklist speaker, and 2) which specific blacklist speaker is talking.
Task and Baseline
Although the original data comes from the acoustic signal, no prior knowledge of speech processing is needed: each acoustic waveform is represented by a 600-dimensional vector called an i-vector. We used a Kaldi recipe (egs/sre10/v1) and 13,000 hours of unlabeled speech to train the i-vector extractor. We provide i-vectors for 41,845 utterances as the challenge training set and 8,631 utterances as the development set. For evaluation, we will release test-set i-vectors for 16,017 utterances; the challenge will measure each system's performance on this test set. The same blacklist speakers appear in all three sets, but the background speakers differ across the train, development, and test sets.
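As an illustration of how little speech-specific tooling is required to get started, here is a minimal loader sketch. It assumes the i-vectors are distributed as CSV lines with an ID field followed by the 600 components; the exact column layout is defined by the data release, so treat the parsing below as an assumption:

```python
import numpy as np

def load_ivectors(csv_path):
    """Load i-vectors from a CSV file where each line holds an ID field
    followed by 600 floating-point components (assumed layout; check the
    challenge data release for the actual format)."""
    ids, vectors = [], []
    with open(csv_path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            fields = line.strip().split(',')
            ids.append(fields[0])
            vectors.append(np.array(fields[1:], dtype=np.float32))
    return ids, np.vstack(vectors)  # vectors: (n_utterances, 600)
```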
Baseline multi-target detector system for MCE 2018
The baseline system follows the multi-target detector in [1]. For each input, we rank the multi-target detector scores and accept the top-k hypotheses if the rank-1 score is above a detection threshold. If k equals the size of the blacklist (S), the system only decides whether the input comes from anyone on the blacklist (top-S detector). If k is 1, the system must further determine who on the blacklist is speaking (top-1 detector). Additionally, multi-target score normalization (M-Norm) is applied to reduce the variability of decision scores across the multiple targets: M-Norm shifts and scales the distribution of scores between blacklist speakers and blacklist utterances toward a standard normal distribution. Performance is measured in terms of Equal Error Rate (EER) for both the top-S and top-1 detector cases. More details, including how performance is measured, can be found in the MCE 2018 Plan.
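To make the top-S / top-1 decision rule and M-Norm concrete, here is a minimal sketch. It assumes cosine scoring of length-normalized i-vectors against per-speaker mean models, with M-Norm statistics estimated from blacklist training utterances; names like `blacklist_models` are illustrative, and the linked repository below is the authoritative baseline:

```python
import numpy as np

def length_norm(x):
    """Project vectors onto the unit sphere (L2 normalization along the last axis)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mnorm_stats(blacklist_models, blacklist_ivecs):
    """Estimate per-speaker M-Norm statistics by scoring blacklist
    training utterances against the blacklist speaker models."""
    s = length_norm(blacklist_models) @ length_norm(blacklist_ivecs).T  # (S, N)
    return s.mean(axis=1), s.std(axis=1)

def detect(utt_ivec, blacklist_models, mnorm_mean, mnorm_std, threshold):
    """Return (is_blacklist, top1_speaker_index) for one test i-vector."""
    scores = length_norm(blacklist_models) @ length_norm(utt_ivec)  # (S,)
    scores = (scores - mnorm_mean) / mnorm_std  # M-Norm: standardize scores
    top1 = int(np.argmax(scores))
    is_blacklist = scores[top1] > threshold  # top-S decision
    return is_blacklist, top1                # top-1 decision when accepted
```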
An implementation of the baseline can be found here: https://github.com/swshon/multi-speakerID
In the baseline, we train the detectors on the train set and test on the development set. The Top-S detector EER is 2.00% and the Top-1 detector EER is 13.41% (492 confusion errors in total).
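For reference, EER is the operating point where the false-accept and false-reject rates cross, and it can be approximated from a set of trial scores and binary labels. A common scikit-learn-based sketch follows; the official scoring script accompanying the challenge is authoritative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Approximate the Equal Error Rate.
    labels: 1 for target trials, 0 for non-target trials.
    scores: detection scores (higher means more target-like)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR and FRR are closest
    return (fpr[idx] + fnr[idx]) / 2
```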