Bhusan Chettri Gives An Overview Of The Technology Behind Voice Authentication Using Computers

What is Automatic Speaker Recognition?

Automatic Speaker Recognition is the task of recognizing people from their voice using a computer. It generally comprises two tasks: speaker identification and speaker verification. Speaker identification involves finding the correct person from a given pool of known speakers or voices: a speaker identification system usually holds a set of N speakers who are already registered in the system, and only these N speakers may access it. Speaker verification, on the other hand, involves verifying from a voice sample whether a person is who he/she claims to be.
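The two tasks can be made concrete with a minimal sketch. Assuming each voice sample has already been reduced to a fixed-length vector (a hypothetical simplification; real systems use the feature pipelines described later), identification searches for the best match among the N enrolled speakers, while verification compares against a single claimed identity and applies a threshold:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, one common choice of matching score."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_vec, enrolled):
    """Identification: return whichever of the N enrolled speakers
    (a dict mapping names to voice vectors) matches best."""
    return max(enrolled, key=lambda name: cosine(test_vec, enrolled[name]))

def verify(test_vec, claimed_vec, threshold=0.7):
    """Verification: accept the claimed identity only if the match
    score clears a threshold (0.7 here is purely illustrative)."""
    return cosine(test_vec, claimed_vec) >= threshold
```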

These systems are further classified into two categories depending on the level of user cooperation: (1) text dependent and (2) text independent. In a text-dependent application, the system has prior knowledge of the spoken text and therefore expects the same utterance at test time (the deployment phase). For example, a pass-phrase such as “My voice is my password” is used both during speaker enrollment (registration) and during deployment (when the system is running). In text-independent systems, by contrast, there is no prior knowledge of the lexical content, and these systems are therefore much more complex than text-dependent ones.

So how do speaker verification algorithms work? How are they trained and deployed?

Well, in order to build automatic speaker recognition systems, the first thing we need is data: a large amount of speech collected from hundreds or thousands of speakers across varied acoustic conditions. Since a picture speaks louder than a thousand words, the block diagram shown below summarises a typical speaker verification system. It consists of a speaker enrollment phase (Fig. a) and a speaker verification phase (Fig. b); a minimal code sketch of the two phases follows the figure captions below.

The role of the feature extraction module is to transform the raw speech signal into a representation (features) that retains speaker-specific attributes useful to the downstream components that build speaker models. The enrollment phase comprises offline and online modes of building models. In the offline mode, background models are trained on features computed from a large speech collection representing a diverse population of speakers. In the online mode, a target speaker model is built using features computed from the target speaker’s speech. Training the target speaker model from scratch is usually avoided, because learning reliable model parameters requires a sufficiently large amount of speech data, which is rarely available for an individual speaker. Instead, the parameters of a pretrained background model representing the speaker population are adapted using the speaker’s data, yielding a reliable speaker model estimate.

During the speaker verification phase, for a given test utterance, the claimed speaker’s model and the background model (representing the world of all other possible speakers) are used to derive a confidence score. The decision logic module then makes a binary decision: based on a decision threshold, it either accepts the claimed identity as a genuine speaker or rejects it as an impostor.

(a) Speaker enrollment phase. The goal here is to build speaker-specific models by adapting a background model trained on a large speech database.

(b) Speaker verification phase. For a given speech utterance, the system obtains a verification score and decides whether to accept or reject the claimed identity.
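To make the two phases above more tangible, here is a minimal sketch of the pipeline, assuming a GMM as the background model and classic mean-only MAP adaptation for enrollment. The random feature matrices stand in for real acoustic features, and all dimensions, component counts, and thresholds are illustrative rather than a definitive implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Offline mode: train the background model (UBM) on features pooled
# from a large, diverse speaker population (random placeholders here).
background_feats = rng.normal(size=(5000, 20))        # (frames, feature dim)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background_feats)

def enroll(ubm, speaker_feats, relevance=16.0):
    """Online mode: derive a speaker model by MAP-adapting the UBM means.

    Each mean shifts toward the speaker's data in proportion to the soft
    count of frames aligned with that component; weights and covariances
    are inherited from the UBM unchanged."""
    post = ubm.predict_proba(speaker_feats)           # (frames, components)
    n_k = post.sum(axis=0)                            # soft counts per component
    e_k = (post.T @ speaker_feats) / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + relevance))[:, None]        # adaptation coefficient
    model = GaussianMixture(n_components=ubm.n_components,
                            covariance_type="diag")
    model.weights_ = ubm.weights_
    model.means_ = alpha * e_k + (1 - alpha) * ubm.means_
    model.covariances_ = ubm.covariances_
    model.precisions_cholesky_ = ubm.precisions_cholesky_
    return model

def verify(test_feats, speaker_model, ubm, threshold=0.0):
    """Verification: average log-likelihood ratio between the claimed
    speaker's model and the UBM, then a binary accept/reject decision."""
    llr = speaker_model.score(test_feats) - ubm.score(test_feats)
    return llr, llr > threshold

# Enroll a (synthetic) target speaker, then verify a matching test utterance.
speaker_feats = rng.normal(loc=0.5, size=(300, 20))
model = enroll(ubm, speaker_feats)
score, accepted = verify(rng.normal(loc=0.5, size=(100, 20)), model, ubm)
print(f"LLR score: {score:.3f}, accepted: {accepted}")
```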

How has the state of the art changed, driven by big data and AI?

Bhusan Chettri explains that there has been a big paradigm shift in how these systems are built. To clarify this, Mr Bhusan Chettri summarises recent advances in the state of the art under two broad categories: (1) traditional approaches and (2) deep learning (and big data) approaches.

Traditional methods. By traditional methods he refers to approaches driven by the Gaussian mixture model with a universal background model (GMM-UBM), which dominated the ASV literature until deep learning techniques became popular in the field. Mel-frequency cepstral coefficients (MFCCs) were the popular frame-level feature representation used in speaker verification. From short-term MFCC feature vectors, utterance-level features such as i-vectors are often derived, and these have shown state-of-the-art performance in speaker verification. Background models such as the universal background model (UBM) and the total variability (T) matrix are learned in an offline phase from a large collection of speech data; the UBM and T matrix are then used to compute i-vector representations (an i-vector is simply a fixed-length vector representing a variable-length speech utterance). The training process involves learning model (target or background) parameters from training data. As for modelling techniques, vector quantization (VQ) was one of the earliest approaches used to represent a speaker, after which Gaussian mixture models (GMMs), an extension of VQ methods, and support vector machines became popular speaker modelling methods. The traditional approach also includes training an i-vector extractor (GMM-UBM, T-matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA) backend for scoring.
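As a small illustration of this front end, the snippet below computes frame-level MFCCs with librosa on a synthetic tone (standing in for a real recorded utterance). Full i-vector extraction is beyond a short sketch, so simple mean and standard-deviation pooling is used here only as a crude stand-in for the fixed-length utterance representation:

```python
import numpy as np
import librosa

# A one-second synthetic tone stands in for a real recorded utterance.
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

# Frame-level MFCCs: 20 coefficients per short-term analysis window.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # (20, n_frames)
frame_features = mfccs.T                              # (n_frames, 20)

# An i-vector extractor maps this variable-length frame sequence to one
# fixed-length vector using the UBM and T matrix learned offline; as a
# crude stand-in only, mean/std pooling also yields a fixed-length vector.
utt_vector = np.concatenate([frame_features.mean(axis=0),
                             frame_features.std(axis=0)])
print(frame_features.shape, utt_vector.shape)
```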

Deep learning methods. In deep learning based approaches to ASV, features are often learned in a data-driven manner, directly from the raw speech signal or from intermediate representations such as filter bank energies. Handcrafted features, for example MFCCs, are also often used as input for training deep neural network (DNN) based ASV systems, and, conversely, features learned by DNNs are often used to build traditional ASV systems. Researchers have used the output of the penultimate layer of a pretrained DNN as features to train a traditional i-vector PLDA setup (replacing i-vectors with DNN features). Extracting bottleneck features (the output of a hidden layer with a relatively small number of units) from a DNN to train a GMM-UBM system scored with the log-likelihood ratio is also common. Utterance-level discriminative features, so-called embeddings, extracted from pretrained DNNs have recently become popular and demonstrate good results. End-to-end modelling approaches, in which feature learning and model training are jointly optimised from the raw speech input, have also been studied extensively in speaker verification and show promising results. A wide range of neural architectures has been explored, including feed-forward networks (commonly referred to as deep neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural networks, and attention models. Training background models in deep learning approaches can be thought of as a pretraining phase in which network parameters are learned on a large dataset; speaker models are then derived by adapting the pretrained parameters with speaker-specific data, much as a traditional GMM-UBM system operates.
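Below is a hedged sketch of the embedding approach, loosely in the spirit of x-vector-style systems (all dimensions, the pooling choice, and the 0.5 threshold are illustrative): frame-level layers are followed by temporal pooling and an utterance-level embedding, trained with a speaker classification head that is discarded at test time. Verification then scores two utterances by the cosine similarity of their embeddings; a PLDA backend is a common alternative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """A deliberately small embedding extractor: frame-level layers,
    mean pooling over time, then an utterance-level embedding layer."""

    def __init__(self, feat_dim=20, hidden=128, emb_dim=64, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.embedding = nn.Linear(hidden, emb_dim)
        # Speaker classification head used only during training; at test
        # time the embedding is kept and this head is discarded.
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        h = self.frame_layers(x)
        pooled = h.mean(dim=1)            # temporal mean pooling
        emb = self.embedding(pooled)
        return emb, self.classifier(emb)

# Score two utterances by cosine similarity of their embeddings.
net = EmbeddingNet()
utt_a = torch.randn(1, 200, 20)           # placeholders for real features
utt_b = torch.randn(1, 150, 20)
emb_a, _ = net(utt_a)
emb_b, _ = net(utt_b)
score = F.cosine_similarity(emb_a, emb_b).item()
accept = score > 0.5                      # illustrative threshold
print(f"score: {score:.3f}, accept: {accept}")
```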

What are the applications of this technology?

This technology can be used across a wide range of domains, such as (a) access control: voice-based access control systems; (b) banking: authenticating transactions by voice; and (c) personalisation: locking and unlocking mobile devices, or locking/unlocking a vehicle door and starting or stopping the engine for a specific user.

Are they safe and secure? Are they prone to any manipulation when they are deployed?

Bhusan Chettri further explains that although current algorithms, aided by big data, have shown remarkable state-of-the-art results, these systems are not 100% secure. They are prone to spoofing attacks, in which an attacker manipulates voice to sound like a registered user and thereby gains illegitimate access to the system. The ASV community has recently been promoting a significant amount of research in this direction.
