Publication Details

Development of ABC systems for the 2021 edition of NIST Speaker Recognition evaluation

ALAM Jahangir, BURGET Lukáš, GLEMBEK Ondřej, MATĚJKA Pavel, MOŠNER Ladislav, PLCHOT Oldřich, ROHDIN Johan A., SILNOVA Anna and STAFYLAKIS Themos et al. Development of ABC systems for the 2021 edition of NIST Speaker Recognition evaluation. In: Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022). Beijing: International Speech Communication Association, 2022, pp. 346-353. Available from: https://www.isca-speech.org/archive/pdfs/odyssey_2022/alam22_odyssey.pdf

Czech title

Vývoj ABC systémů pro ročník 2021 NIST evaluace systémů pro rozpoznávání mluvčího

Type

conference paper

Language

English

Authors

Alam Jahangir (CRIM)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
Glembek Ondřej, Ing., Ph.D. (DCGM FIT BUT)
Matějka Pavel, Ing., Ph.D. (DCGM FIT BUT)
Mošner Ladislav, Ing. (DCGM FIT BUT)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Rohdin Johan A., Dr. (DCGM FIT BUT)
Silnova Anna, MSc., Ph.D. (DCGM FIT BUT)
and others

Keywords

speaker verification, recognition, evaluation

Abstract

In this contribution, we describe the ABC team's collaborative efforts toward the development of speaker verification systems for the NIST Speaker Recognition Evaluation 2021 (NIST-SRE2021). Cross-lingual and cross-dataset trials are the two main challenges introduced in NIST-SRE2021. The ABC team's submissions are the result of active collaboration among researchers from BUT, CRIM, Omilia and Innovatrics. We took part in all three closed-condition tracks: audio-only, audio-visual and visual-only verification. Our audio-only systems follow the paradigm of deep speaker embeddings (e.g., x-vectors) with subsequent PLDA scoring. As embedding extractors, we select variants of residual neural network (ResNet), factored time delay neural network (FTDNN) and Hybrid Neural Network (HNN) architectures. The HNN embedding extractor employs CNN, LSTM and TDNN networks and incorporates a multi-level global-local statistics pooling method to aggregate speaker information within short time-span and utterance-level contexts. Our visual-only systems are based on pretrained embedding extractors employing variants of ResNet, and scoring is based on cosine distance. To build an audio-visual system, we simply fuse the outputs of independent audio and visual systems. Our final submitted systems are obtained by score-level fusion of subsystems followed by score calibration.
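
The cosine scoring and score-level fusion steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weights and the bias are hypothetical, and the abstract does not say which fusion model was used, so a plain weighted sum is assumed here. The sketch computes cosine similarity (higher means more similar), which is one minus the cosine distance.

```python
import numpy as np


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


def fuse_scores(scores, weights, bias=0.0):
    """Linear score-level fusion (assumed model): weighted sum plus offset.

    scores  : array of shape (n_trials, n_systems), one column per subsystem
    weights : array of shape (n_systems,), hypothetical values in this sketch
    bias    : scalar offset, hypothetical value in this sketch
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return scores @ weights + bias


# One trial: a made-up audio subsystem score and a visual cosine score.
visual = cosine_score([1.0, 0.0], [1.0, 1.0])
fused = fuse_scores([[2.5, visual]], weights=[0.5, 0.5], bias=0.1)
```

In a real system the weights and bias would typically be estimated on a development set rather than set by hand, and the fused scores would then be calibrated as the abstract describes.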

Published

2022

Pages

346-353

Proceedings

Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022)

Conference

Odyssey 2022: The Speaker and Language Recognition Workshop, Beijing, CN

Publisher

International Speech Communication Association

Place

Beijing, CN

DOI

10.21437/Odyssey.2022-48

BibTeX

@INPROCEEDINGS{FITPUB12843,
   author = "Jahangir Alam and Luk\'{a}\v{s} Burget and Ond\v{r}ej Glembek and Pavel Mat\v{e}jka and Ladislav Mo\v{s}ner and Old\v{r}ich Plchot and Johan A. Rohdin and Anna Silnova and Themos Stafylakis and others",
   title = "Development of ABC systems for the 2021 edition of NIST Speaker Recognition evaluation",
   pages = "346--353",
   booktitle = "Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022)",
   year = 2022,
   location = "Beijing, CN",
   publisher = "International Speech Communication Association",
   doi = "10.21437/Odyssey.2022-48",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/12843"
}