Speaker recognition refers to the process of recognizing a speaker from a spoken phrase (Furui, n.d., p. 1). It is a useful biometric tool with wide applications, for example in audio or video document retrieval. Speaker recognition is dominated by two procedures, namely segmentation and classification. Research and development are ongoing to design new algorithms, and to improve existing ones, for segmentation and classification.
Statistical concepts dominate the field of speaker recognition and are used for developing models. Machines used for speaker recognition purposes are referred to as automatic speaker recognition (ASR) machines. ASR machines are used either to identify a person or to authenticate the person's claimed identity (Softwarepractice, n.d., p. 1). The following is a discussion of various improvements that have been suggested in the field of speaker recognition.
Two processes of importance in speaker recognition are audio classification and segmentation. Both processes are carried out using computer algorithms. In developing an ideal procedure for audio classification, it is important to consider the effect of background noise. For this reason, Chu and Champagne have put forward an auditory model that performs well even against a noisy background.
To achieve such robustness in a noisy background, the model has an inherent self-normalization mechanism. In its original form, the auditory model is a three-stage processing chain through which an audio signal is transformed into an auditory spectrum, an internal neural representation. The shortcomings associated with this model are that it involves nonlinear processing and has high computational requirements.
These shortcomings motivate a simpler version of the model. Chu and Champagne (2006) suggest modifications that produce a simplified version which is linear except for taking the square root of the energy (p. 775). The modifications are made to four of the original processing steps, namely pre-emphasis, nonlinear compression, half-wave rectification, and temporal integration. To reduce computational complexity, the Parseval theorem is applied, which enables the simplified model to be implemented in the frequency domain.
The result of these modifications is a self-normalized FFT-based model that has been applied and tested in speech/music/noise classification, with a support vector machine (SVM) as the classifier. The test results indicate that, compared with a conventional FFT-based spectrum, both the original and the proposed auditory spectrum perform more robustly in noisy environments (p. 775). Additionally, the results suggest that the simplified model achieves performance almost the same as that of the original auditory spectrum while reducing computational complexity (p. 775).
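As a rough illustration of this kind of pipeline, the following is a minimal sketch of speech/music/noise classification with an FFT-based spectral representation and an SVM. The function names, frame sizes, the square-root compression step, and the synthetic training data are assumptions made to keep the sketch self-contained; they are not taken from Chu and Champagne's implementation.

```python
# Sketch: FFT-based spectrum with square-root energy compression, fed to an SVM,
# in the spirit of the simplified auditory model discussed above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def fft_spectrum(frame, n_fft=512):
    """Magnitude spectrum of one frame with square-root compression (assumed form)."""
    mag = np.abs(np.fft.rfft(frame, n=n_fft))
    return np.sqrt(mag + 1e-12)

# Hypothetical training data: rows are audio frames, labels 0=speech, 1=music, 2=noise.
rng = np.random.default_rng(0)
frames = rng.standard_normal((300, 512))          # placeholder audio frames
labels = rng.integers(0, 3, size=300)             # placeholder labels
features = np.stack([fft_spectrum(f) for f in frames])

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = SVC().fit(X_train, y_train)                 # SVM classifier, as used in the paper
print("held-out accuracy:", clf.score(X_test, y_test))
```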
One of the important processes in speaker recognition, and in handling radio recordings, is speech/music discrimination. The discrimination is done using speech/music discriminators. The discriminator proposed by Giannakopoulos et al. involves a segmentation algorithm (V-809). Audio signals exhibit changes in the distribution of energy (RMS), and it is on this property that the segmentation algorithm is founded. The discriminator also involves the use of Bayesian networks (V-809), a choice that is well suited to the classification stage of radio recordings.
Each of the classifiers is trained on a single, distinct feature; thus, nine features are involved in any given classification. By operating in distinct feature spaces, the independence between the classifiers is increased. This quality is desirable because the results of the classifiers have to be combined by the Bayesian network in place. The nine features extracted from an audio segment are spectral centroid, spectral flux, spectral rolloff, zero crossing rate, frame energy and four Mel-frequency cepstral coefficients.
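The sketch below shows one plausible way to extract these nine features from a segment using librosa. The exact frame sizes, windowing, and normalizations of the original discriminator are not given here, so the parameter choices are assumptions.

```python
# Sketch: the nine features listed above, computed for one audio segment.
import numpy as np
import librosa

def nine_features(y, sr):
    S = np.abs(librosa.stft(y))                                      # magnitude spectrogram
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr).mean()  # spectral centroid
    rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr).mean()    # spectral rolloff
    flux = np.mean(np.sum(np.diff(S, axis=1) ** 2, axis=0))          # spectral flux
    zcr = librosa.feature.zero_crossing_rate(y).mean()               # zero crossing rate
    energy = librosa.feature.rms(y=y).mean()                         # frame energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=4).mean(axis=1)   # 4 MFCCs
    return np.hstack([centroid, rolloff, flux, zcr, energy, mfcc])

# Usage: y, sr = librosa.load("segment.wav", sr=None); print(nine_features(y, sr))
```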
The new feature selection scheme integrated into the discriminator is also based on Bayesian networks (Giannakopoulos et al., V-809). Three Bayesian network architectures are considered and the performance of each is determined. The BNC Bayesian network has been found experimentally to be the best of the three owing to its reduced error rate (Giannakopoulos et al., V-812). The proposed discriminator has been applied to real Internet broadcasts of British Broadcasting Corporation (BBC) radio stations (Giannakopoulos et al., V-809).
An important issue that arises in speaker recognition is the ability to determine the number of speakers involved in an audio session. Swamy et al. (2007) have put forward a mechanism that is able to determine this number from multispeaker speech signals (481).
According to Swamy et al., one pair of spatially separated microphones is sufficient to capture the speech signals (481). A feature of this mechanism is the time delay in the arrival of the speech signals at the two microphones, a delay that results from their spatial separation.
The mechanism is based on the fact that different speakers produce different time delays, and it is this variation in time delay that is exploited to determine the number of speakers. To estimate the time delay, a cross-correlation procedure is undertaken: the Hilbert envelopes of the linear prediction residuals of the speech signals are cross-correlated.
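The following sketch illustrates this idea: the time delay between the two microphone channels is estimated by cross-correlating the Hilbert envelopes of their linear prediction (LP) residuals. The LP order, the mean-removal step, and the clustering suggestion in the closing comment are assumptions for illustration, not details taken from Swamy et al.

```python
# Sketch: time-delay estimation from Hilbert envelopes of LP residuals.
import numpy as np
import librosa
from scipy.signal import hilbert, lfilter, correlate

def hilbert_envelope_of_lp_residual(y, lp_order=12):
    a = librosa.lpc(y, order=lp_order)        # LP coefficients [1, -a1, ..., -ap]
    residual = lfilter(a, [1.0], y)           # inverse filtering gives the LP residual
    return np.abs(hilbert(residual))          # Hilbert envelope of the residual

def estimate_delay(ch1, ch2, sr, lp_order=12):
    e1 = hilbert_envelope_of_lp_residual(ch1, lp_order)
    e2 = hilbert_envelope_of_lp_residual(ch2, lp_order)
    xcorr = correlate(e1 - e1.mean(), e2 - e2.mean(), mode="full")
    lag = np.argmax(xcorr) - (len(e2) - 1)    # lag of maximum cross-correlation
    return lag / sr                           # delay in seconds

# Different speakers sit at different positions, so their speech produces different
# delay estimates; grouping the delays suggests how many speakers are present.
```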
According to Zhang and Zhou (2004), audio segmentation is one of the most important processes in multimedia applications (IV-349). One of the typical problems in audio segmentation is accuracy; it is also desirable that the segmentation can be done online. Algorithms that have attempted to deal with these two issues have one thing in common: they are designed to classify features at small-scale levels.
These algorithms additionally produce high false alarm rates. Experimental results reveal that large-scale audio is classified more easily than small-scale audio. It is this fact that has motivated an extended framework that increases robustness in audio segmentation. The proposed segmentation methodology can be described in two steps. In the first step, the segmentation is rough and the classification is large-scale.
This step is taken to ensure the integrality of the content segments: consecutive audio from a single source is not partitioned into different pieces, so homogeneity is preserved. In the second step, the segmentation is termed subtle and is undertaken to find segment points. These segment points correspond to the boundary regions output by the first step.
Experimental results also reveal that a desirable balance between the false alarm rate and the missing rate is achievable, provided both rates are kept at low levels (Zhang & Zhou, IV-349).
According to Dutta and Haubold (2009), the human voice conveys speech and is useful in providing gender, nativity, ethnicity and other demographics about a speaker (422). Additionally, it possesses other non-linguistic features that are unique to a given speaker (422). These facts about the human voice are helpful in audio/video retrieval. In order to classify speaker characteristics, an evaluation is done on features that are categorized as low-, mid- or high-level.
The low-level, signal-based features comprise MFCCs, LPCs, and six spectral features. Mid-level features are statistical in nature and are used to model the low-level features. High-level features are semantic in nature and are based on selected phonemes. This is the methodology put forward by Dutta and Haubold (2009, p. 422).
The data set used to assess the performance of the methodology is made up of about 76.4 hours of annotated audio, of which 2786 speaker-unique segments are used for classification. The experimental results reveal that the methodology yields accuracy rates as high as 98.6% (Dutta & Haubold, 422). However, this accuracy rate is only achievable under certain conditions.
The first condition is that the test data is for male/female classification. The second is that only mid-level features are used. The third is that the support vector machine employs a linear kernel. The results also reveal that mid- and high-level features are the most effective in identifying speaker characteristics.
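A minimal illustration of the reported setup appears below: a linear-kernel SVM trained on mid-level (statistical) features for male/female classification. The particular mid-level features constructed here (means and variances of MFCCs) and the placeholder data are assumptions used only to make the example self-contained; they are not Dutta and Haubold's exact feature set.

```python
# Sketch: linear-kernel SVM on mid-level (statistical) features for gender classification.
import numpy as np
import librosa
from sklearn.svm import SVC

def mid_level_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # low-level features
    return np.hstack([mfcc.mean(axis=1), mfcc.var(axis=1)])   # statistics = mid-level

# X: one row of mid-level features per speaker segment; y: 0 = male, 1 = female.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 26))        # placeholder feature matrix
y = rng.integers(0, 2, size=200)          # placeholder gender labels
clf = SVC(kernel="linear").fit(X, y)      # linear kernel, as in the third condition
```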
To automate the processes of speech recognition and spoken document retrieval, the impact of unsupervised audio classification and segmentation has to be considered thoroughly. Huang and Hansen (2006) propose a new algorithm for audio classification to be used in automatic speech recognition (ASR) procedures (907).
Weighted GMM networks form the core of this new algorithm. Also captured within the algorithm are VSF and VZCR, two extended-time features that are crucial to its performance. VSF and VZCR perform a pre-classification of the audio and attach weights to the output probabilities of the GMM networks, after which the weighted GMM networks carry out the final classification.
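The sketch below conveys the weighted-GMM idea just described: one GMM per audio class, with the per-class scores weighted by a pre-classification stage. The class structure, the additive log-domain weighting rule, and all names are illustrative assumptions; Huang and Hansen's actual formulation may differ.

```python
# Sketch: class-wise GMMs whose scores are weighted by a pre-classification stage.
from sklearn.mixture import GaussianMixture

class WeightedGMMClassifier:
    def __init__(self, n_components=8):
        self.n_components = n_components
        self.models = {}

    def fit(self, features_by_class):
        # features_by_class: dict mapping class label -> (n_frames, n_dims) array
        for label, feats in features_by_class.items():
            self.models[label] = GaussianMixture(self.n_components).fit(feats)
        return self

    def predict(self, feats, preclass_weights):
        # preclass_weights: dict label -> log-domain weight from a VSF/VZCR-style pre-classifier
        scores = {label: preclass_weights[label] + gmm.score(feats)
                  for label, gmm in self.models.items()}   # weight + mean log-likelihood
        return max(scores, key=scores.get)
```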
For the segmentation process in automatic speech recognition (ASR) procedures, Huang and Hansen (2006) propose a compound segmentation algorithm that captures 19 features (p. 907). The figure below presents the proposed features.
Figure 1. Proposed features.
The 14 FBLCs proposed are evaluated in 14 noisy environments to determine the most robust features overall under these conditions. For short segment turns lasting up to 5 seconds, a 2-mean distance metric can be employed, and a false alarm compensation procedure has been found to improve the false alarm rate in a cost-effective manner.
A comparison of Huang and Hansen's proposed classification algorithm against a baseline GMM network classification algorithm reveals a 50% improvement in performance. Similarly, a comparison of their proposed compound segmentation algorithm against a baseline using Mel-frequency cepstral coefficients (MFCC) and the traditional Bayesian information criterion (BIC) reveals improvements of between 10% and 23% in all aspects (Huang & Hansen, 2006, p. 907).
The data set used for the comparison comprises broadcast news evaluation data obtained from DARPA, the Defense Advanced Research Projects Agency. According to Huang and Hansen (2006), the two proposed algorithms also achieve satisfactory results on the National Gallery of the Spoken Word (NGSW) corpus, which is a more diverse and challenging test.
Speaker recognition technology in use today is predominantly based on statistical modeling of short-time features extracted from acoustic speech signals. Two factors come into play in determining recognition performance: the discrimination power of the acoustic features and the effectiveness of the statistical modeling techniques.
The work of Chan et al. is an analysis of speaker discrimination power as it relates to two kinds of vocal features (1884). These features are either vocal source related or conventional vocal tract related, and the analysis draws a comparison between the two. The vocal source related features are called wavelet octave coefficients of residues (WOCOR), and they are extracted from the audio signal. In order to perform the extraction, linear predictive (LP) residual signals have to be derived first.
This is because the LP residual signals are compatible with the pitch-synchronous wavelet transform that performs the actual extraction. To determine which of the WOCOR and conventional MFCC features is less discriminative when the amount of audio data is limited, consideration is given to their degree of sensitivity to spoken content.
Being less sensitive to spoken content and more discriminative with a limited amount of training data are the two advantages that make WOCOR suitable for the task of speaker segmentation in telephone conversations (Chan et al., 1884). Such a task is characterized by building statistical speaker models upon short segments of speech. Additionally, experiments reveal a significant reduction in segmentation errors when WOCORs are used (Chan et al., 1884).
Automatic speaker recognition (ASR) is the process by which a person is recognized from a spoken phrase with the aid of an ASR machine (Campbell, 1997, p. 1437). ASR systems are designed and developed to operate in two modes depending on the nature of the problem to be solved: in one mode they are used for identification purposes, and in the other for verification or authentication purposes.
The former process is known as automatic speaker identification (ASI) and the latter as automatic speaker verification (ASV). In ASV procedures, the person's claimed identity is authenticated by the ASR machine using the person's voice. In ASI procedures, unlike ASV, there is no claimed identity, so it is up to the ASR machine to determine the identity of the individual and the group to which the person belongs. Known sources of error in ASV procedures are shown in the table below.
Table 2. Sources of verification errors.
According to Campbell, a new automatic speaker recognition system is available, and the recognizer is known to achieve 98.9% correct identification (p. 1437). The recognizer is built from four basic units: signal acquisition, feature extraction and selection, pattern matching, and a decision criterion.
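The skeleton below illustrates how these four building blocks fit together in a generic recognizer. It is a structural sketch only; the choice of MFCC features, the model objects with a score method, and the threshold-based decision are assumptions, not Campbell's specific design.

```python
# Sketch: the four basic units of a speaker recognizer.
import librosa

def acquire_signal(path):
    y, sr = librosa.load(path, sr=None)            # 1. signal acquisition
    return y, sr

def extract_features(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr).T      # 2. feature extraction and selection

def match(features, speaker_models):
    # 3. pattern matching: score the features against each enrolled speaker model
    return {name: model.score(features) for name, model in speaker_models.items()}

def decide(scores, threshold):
    best = max(scores, key=scores.get)             # 4. decision criterion
    return best if scores[best] > threshold else None
```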
According to Ben-Harush et al. (2009), speaker diarization systems assign the temporal speech segments in a conversation to the appropriate speaker, and non-speech segments to non-speech (p. 1). The problem that speaker diarization systems attempt to solve is captured in the query "who spoke when?" An inherent shortcoming of most diarization systems in use today is that they are unable to handle overlapped or co-channel speech. To this end, algorithms have been developed in recent times seeking to address this challenge.
However, most of these require unique conditions in order to perform, entail high computational complexity, and require analysis of the audio data in both the time and frequency domains. Ben-Harush et al. (2009) have proposed a methodology that uses frame-based entropy analysis, Gaussian mixture modeling (GMM) and well-known classification algorithms to counter this challenge (p. 1).
To perform overlapped speech detection, the methodology suggests an algorithm centered on a single feature: an entropy analysis of the audio data in the time domain. To identify overlapped speech segments, the methodology then combines Gaussian mixture modeling (GMM) with well-known classification algorithms. The methodology proposed by Ben-Harush et al. detects 60.0% of frames containing overlapped speech (p. 1).
This value is achieved when the segmentation is at baseline level, while the false alarm rate is kept at 5% (p. 1). Overlapped speech degrades the performance of automatic speaker recognition systems, and conversations over the telephone or during meetings contain high quantities of overlapped speech.
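The following sketch shows one way to compute the single feature mentioned above, the entropy of each time-domain frame. The frame length, hop size, and histogram bin count are assumptions chosen only to make the sketch runnable; the GMM and classifier stages that follow in the proposed method are not reproduced here.

```python
# Sketch: frame-level entropy of a time-domain signal, the feature used for
# overlapped speech detection.
import numpy as np

def frame_entropy(y, frame_len=400, hop=160, n_bins=50):
    entropies = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len]
        hist, _ = np.histogram(frame, bins=n_bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        entropies.append(-np.sum(p * np.log2(p)))   # Shannon entropy of the frame
    return np.array(entropies)

# Frames containing overlapped speech tend to show a different entropy distribution;
# a GMM plus a standard classifier can then label each frame as overlapped or not.
```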
Du et al. (2007) identify audio segmentation as a problem in TV series, movies and other forms of practical media (I-205). Practical media exhibits audio segments of varying lengths, of which the short ones are notable for their number. Through audio segmentation, an audio stream is broken down into parts that are homogeneous with respect to speaker identity, acoustic class and environmental conditions. Du et al. (2007) have formulated an unsupervised audio segmentation approach to be used in all forms of practical media.
Included in this approach is a segmentation stage at which potential acoustic changes are detected, and a refinement stage during which the detected acoustic changes are refined by a tri-model Bayesian information criterion (BIC). Experimental results suggest that the approach has a high capability for detecting short segments, and that the tri-model BIC is effective in improving overall segmentation performance (Du et al., I-205).
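For context, the sketch below implements plain delta-BIC change detection, the standard building block that the tri-model BIC refines. The penalty term and the regularization constant are conventional assumptions; the tri-model extension itself is not reproduced.

```python
# Sketch: delta-BIC test for an acoustic change point inside a feature window.
import numpy as np

def delta_bic(X, i, penalty=1.0):
    """delta-BIC for splitting feature window X (n_frames x dim) at frame i."""
    n, d = X.shape
    def logdet(cov):
        return np.linalg.slogdet(cov + 1e-6 * np.eye(d))[1]
    full = n * logdet(np.cov(X, rowvar=False))
    left = i * logdet(np.cov(X[:i], rowvar=False))
    right = (n - i) * logdet(np.cov(X[i:], rowvar=False))
    complexity = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (full - left - right) - complexity

# Scanning i across the window and keeping maxima of delta_bic above zero yields
# candidate segment boundaries, which a refinement stage can then confirm or reject.
```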
According to Hosseinzadeh and Krishnan (2007), the proposed speaker recognition approach processes seven spectral features. The first of these is the spectral centroid (SC). Hosseinzadeh and Krishnan (2007, p. 205) state that "the second spectral feature is Spectral bandwidth (SBW), the third is spectral band energy (SBE), the fourth is spectral crest factor (SCF), the fifth is Spectral flatness measure (SFM), the sixth is Shannon entropy (SE) and the seventh is Renyi entropy (RE)".
The seven features are used to quantify vocal source information, which is important in speaker recognition because vocal source information and the vocal tract function complement each other. The vocal tract function is determined using two sets of coefficients, the MFCCs and LPCCs, where MFCC stands for Mel-frequency cepstral coefficients and LPCC stands for linear prediction cepstral coefficients.
Central to an experiment done to analyze the performance of these features is the use of a speaker identification system (SIS). A text-independent cohort Gaussian mixture model is the speaker identification method used in the experiment. The results reveal that these features achieve an identification accuracy of 99.33%, but only when they are combined with MFCC-based features and when undistorted speech is used.
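The sketch below computes the seven spectral features listed above from a single magnitude spectrum. The exact definitions (for example, the order of the Renyi entropy and the normalization conventions) are common textbook forms and may differ in detail from those used by Hosseinzadeh and Krishnan.

```python
# Sketch: seven spectral features from one magnitude spectrum.
import numpy as np

def spectral_features(mag, freqs, alpha=3):
    p = mag / (mag.sum() + 1e-12)                            # normalized spectrum
    sc = np.sum(freqs * p)                                   # spectral centroid (SC)
    sbw = np.sqrt(np.sum(((freqs - sc) ** 2) * p))           # spectral bandwidth (SBW)
    sbe = np.sum(mag ** 2)                                   # spectral band energy (SBE)
    scf = mag.max() / (mag.mean() + 1e-12)                   # spectral crest factor (SCF)
    sfm = np.exp(np.mean(np.log(mag + 1e-12))) / (mag.mean() + 1e-12)  # flatness (SFM)
    se = -np.sum(p * np.log2(p + 1e-12))                     # Shannon entropy (SE)
    re = np.log2(np.sum(p ** alpha) + 1e-12) / (1 - alpha)   # Renyi entropy (RE), order 3 assumed
    return np.array([sc, sbw, sbe, scf, sfm, se, re])

# Usage: mag = np.abs(np.fft.rfft(frame)); freqs = np.fft.rfftfreq(len(frame), 1 / sr)
```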
In conclusion, the methodologies reviewed above have been tried and tested, and the results obtained from these experiments show that they do indeed improve the practice of speaker recognition.
Research proposal
Summary
One of the main challenges in today's world is security. Fraudsters are everywhere and can disguise themselves as anyone, making people vulnerable to their frauds. People are willing to embrace new technology that comes to their aid against such characters. The field of speaker recognition, if exploited well, can form the basis of new technology that significantly reduces the rate of crime.
In addition to personal identification numbers (PINs), passwords and other forms of identification, voice signatures can further enhance security in restricted areas. When a crime such as a bank robbery is committed, simply retrieving audio footage of the crime in progress allows the identity of the assailants to be determined with the aid of a voice database. To elaborate, the voice database holds audio data of everyone, both potential and known criminals, so one only needs to find a match between what is in the database and what has been retrieved.
Project summary
The project is large scale and will be undertaken after approval by the relevant authorities. It seeks to create a voice database to be used distinctively for the purpose of minimizing crime. The methodology adopted for speaker recognition is that of Chu and Champagne, that is, a methodology whose performance is consistent even in noisy backgrounds. The database will hold audio data for all adults.
Statement of the problem
The project seeks to determine whether crime can be reduced significantly by using speaker recognition as a forensic tool. Clearly, security is a primary concern to people in today’s world.
Introduction and background
Relevant literature review
According to Dutta and Haubold (2009), the human voice conveys speech and is useful in providing gender, nativity, ethnicity and other demographics about a speaker. Additionally, it possesses other non-linguistic features that are unique to a given speaker. Hosseinzadeh and Krishnan advance these two facts by arguing that speaker recognition provides a cost-effective and more practical technology to fight crime (p. 365).
Preliminary data
The data to be collected for this project is both audio and non-audio. Non-audio data includes a person's identification number, names, gender, age and physical address. The audio data is recorded using microphones: speakers are asked to say certain phrases into the microphone, which are then recorded and stored. These data will be used in the construction of the database.
Conceptual or empirical model
As stated above, the speaker recognition methodology that forms the basis of this research is that of Chu and Champagne.
Justification of approach or novel methods
The above methodology is adopted because it provides a speaker recognition system that boasts robustness in noisy environments, thus injecting much needed flexibility into the project.
Research Plan
Once the database is in place and functioning properly, the effectiveness of this tool will be determined over a one-year period. The number of crimes successfully solved with the aid of this tool within that period will be determined, and the value p will be computed, where p = (number of crimes solved with the aid of the tool) / (total number of crimes).
The hypotheses
H0: p = 0 (the tool does not aid in reducing crime)
vs.
H1: p > 0 (the tool aids in reducing crime)
will be tested at the 99% confidence level.
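As a rough illustration of how such a test could be carried out, the sketch below performs a one-sided binomial test at the 99% level. Note that H0: p = 0 would be rejected by even a single solved crime, so for illustration the sketch tests against a small reference proportion p0 = 0.01; the counts, like p0 itself, are hypothetical placeholders and not project data.

```python
# Sketch: one-sided test of the crime-reduction hypothesis at 99% confidence.
from scipy.stats import binomtest

solved_with_tool = 12     # hypothetical: crimes solved with the aid of the tool
total_crimes = 400        # hypothetical: total crimes in the one-year period

p_hat = solved_with_tool / total_crimes
result = binomtest(solved_with_tool, total_crimes, p=0.01, alternative="greater")
print(f"p-hat = {p_hat:.3f}, p-value = {result.pvalue:.4f}")
# Reject H0 at the 99% confidence level if the p-value is below 0.01.
```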
Considering crime scenes and the nature of most crimes, it is expected that this tool will reduce the level of crime. For instance, telephone conversations, which are the mode of choice for kidnappers and abductors, can be exploited to reveal the assailants' identities. Thus, this type of crime is discouraged, and in the event it happens, investigators are better positioned to identify the suspects.
Research Timetable
The research should start and end in less than 16 months. The timetable captures the five stages of the lifecycle of the research process. In the first stage (preparation), it is important that the research be approved by the authorities, because it makes use of sensitive data collected by a government for planning purposes.
The research team should comprise qualified individuals from the fields of speaker recognition and statistics. Each member of the team should be aware of his or her duties and the period allotted to finish each task. On successful completion of the first four stages, it is important that in the fifth stage (conclusion) the findings be accompanied by recommendations based on the research.
References
Ben-Harush, O., Guterman H., & Lapidot I. (2009) Frame level entropy based overlapped speech detection as a pre-processing stage for speaker diarization, pp. 1-6.
Campbell, J. P. (1997) Speaker recognition: a tutorial, pp. 1437-1462.
Chan, W. N., Zheng, N., & Lee, T. (2007) Discrimination power of vocal source and vocal tract related features for speaker segmentation, pp. 1884-1892.
Chu, W. & Champagne B. (2006) A simplified early auditory model with application in speech/music classification, pp. 775 – 778.
Du, Y., Hu, W., Yan, Y., Wang, T., & Zhang, Y. (2007) Audio segmentation via tri-model Bayesian information criterion, pp. I-205 – I-208.
Dutta, P. & Haubold A. (2009) Audio-based classification of speaker characteristics, pp. 422 – 425.
Furui S. (n.d.) Speaker recognition. Web.
Giannakopoulos, T., Pikrakis, A. & Theodoridis, S. (2006) A speech/music discriminator for radio recordings using Bayesian networks, pp. V-809 – V-812.
Hosseinzadeh, D. & Krishnan, S. (2007) Combining vocal source and MFCC features for enhanced speaker recognition performance using GMMs, p. 365.
Huang, R. & Hansen, J. H. L. (2006) Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora, pp. 907-919.
Softwarepractic. (n.d.) Speaker recognition. Web.
Swamy, K. R., Murti, S. K. & Yegnanarayana, B. (2007) Determining number of speakers from multispeaker speech signals using excitation source information, pp. 481-484.
Zhang, Y. & Zhou J. (2004) Audio segmentation based on multi-scale audio classification, pp. IV-349 – IV-352.