Monday, May 16, 2011

ALIZE vs. Patrick Kenny-JFA

NOTE:
1) ALIZE is a factor-analysis-based system, but it is not JFA.


A Study of Interspeaker Variability in Speaker Verification




G. Note on Baum–Welch Statistics
The results we have obtained using speaker factors are clearly
much better than those obtained using d alone, but the reader
may have noticed that the figures presented in the fourth rows
of Tables I and II are not quite as good as the best results that
have been reported with comparable stand-alone GMM/UBM
systems as in [30], [5], and [31]. These systems are comparable
because they use relevance MAP for speaker enrollment and
channel factors to compensate for intersession variability. As
we mentioned in Section II-C1, relevance MAP is essentially a
special type of diagonal factor analysis model.
The reason for the discrepancy in performance is that we use
the UBM to extract Baum–Welch statistics in our system rather
than speaker-dependent GMMs. It turns out that, in the case
of a diagonal factor analysis model, using speaker-dependent
GMMs does indeed produce better results. For example, on the
English language trials in the core condition, a diagonal model
with 100 channel factors produces an EER of 2.8% for male
speakers, which is similar to the results presented in [30], [5],
and [31] (but not as good as the result in the first line of Table V).
However, for a factor analysis model with 300 speaker factors,
using speaker-dependent GMMs (estimated with speaker factors)
to extract Baum–Welch statistics turns out to be harmful.
For example, on the English language trials in the core condition,
a factor analysis model with 300 speaker factors and 100
channel factors produces an EER of 4.2% for male speakers if
the Baum–Welch statistics are extracted with speaker-dependent
GMMs; on the other hand, an EER of 1.4% is obtained if the UBM
is used for this purpose. This is the reason why we have always
used the UBM to extract Baum–Welch statistics in our work
on factor analysis. (The extraordinarily low error rate of 1.4%
is attributable to using 100 channel factors rather than 50 as in
Table V.)
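
To make the quoted passage concrete, here is a minimal NumPy sketch of extracting zeroth- and first-order Baum-Welch statistics from a diagonal-covariance GMM. It is not the code used in the paper or in ALIZE; the function name `baum_welch_stats` and its argument names are hypothetical. The choice discussed above comes down to which parameters are passed in: the UBM's, or those of a speaker-dependent GMM.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and first-order Baum-Welch statistics of `frames` under a
    diagonal-covariance GMM (hypothetical helper, for illustration only).

    frames:    (T, D) feature vectors
    weights:   (C,)   mixture weights
    means:     (C, D) component means
    variances: (C, D) diagonal covariances
    """
    # Log-density of every frame under every Gaussian component: shape (T, C)
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_joint = np.log(weights) + log_gauss

    # Posterior occupation probabilities, normalised in the log domain
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    post = np.exp(log_joint - log_norm)

    n = post.sum(axis=0)   # zeroth-order statistics, shape (C,)
    f = post.T @ frames    # first-order statistics, shape (C, D)
    return n, f
```

Calling it with the UBM parameters gives the statistics the authors say they always use; calling it with a speaker-adapted model's means gives the alternative they report as harmful when 300 speaker factors are used.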



[31] B. G. B. Fauve, D. Matrouf, N. Scheffer, J.-F. Bonastre, and J. S. D.
Mason, “State-of-the-art performance in text-independent speaker verification
through open-source software,” IEEE Trans. Audio, Speech,
Lang. Process., vol. 15, no. 7, pp. 1960–1968, 2007.




http://eegalilee.swan.ac.uk/publications/B.G.Fauve/Fauve07-TASLP.pdf


Notice that in the case where U = 0, the speaker-session model corresponds to classical MAP adaptation.
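
For reference, classical (relevance) MAP adaptation of the GMM means can be sketched as follows. This is the textbook update, not code from ALIZE or from either paper; the function name `relevance_map_means` is hypothetical and the relevance factor of 16 is only an assumed default. The inputs `n` and `f` are the zeroth- and first-order Baum-Welch statistics of the enrollment data.

```python
import numpy as np

def relevance_map_means(n, f, ubm_means, relevance=16.0):
    """Relevance-MAP adapted means (hypothetical helper, for illustration).

    n:         (C,)   zeroth-order statistics per component
    f:         (C, D) first-order statistics per component
    ubm_means: (C, D) UBM component means
    relevance: relevance factor r (16.0 is an assumed default)
    """
    n = n[:, None]
    # Equivalent to alpha * (f / n) + (1 - alpha) * ubm_means
    # with alpha = n / (n + r), written so that n = 0 is safe.
    return ubm_means + (f - n * ubm_means) / (n + relevance)
```

A component with no data (n = 0) keeps its UBM mean, while a component with many frames moves toward the mean of the frames assigned to it.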
