Team D Speaker Recognition
From SoftwarePractice.org
Group D Members
Mohammad Kamrul Hasan (10110498) Simon Sisavanh (02045351) Shruti Tripathi (10212874) Emir Hodzic (02075479)
Introduction
This project covers the implementation of a speaker recognition program in Matlab. Speaker recognition systems can be characterised as text-dependent or text-independent. The system we have developed is the latter, text-independent, meaning the system can identify the speaker regardless of what is being said.
The program has two modes: a training mode and a recognition mode. The training mode allows the user to record a voice and build a feature model of that voice. The recognition mode uses the information that the user provided in the training mode and attempts to isolate and identify the speaker.
Two systems have been developed utilising different principles of signal processing. System A is the fundamental system: it analyses multiple trained voice signals based on their frequency spectrum and compares this information to an unknown speaker’s frequency spectrum. The trained signal which presents the closest likeness to the tested signal is the chosen answer. In contrast, System B has been developed using more advanced speaker recognition principles and concepts. Concepts such as Mel-Frequency Cepstral Analysis and Vector Quantization have been integral to the development of this system.
Literature Review and Discussion
Most of us are aware that the voices of different individuals do not sound alike. This important property of speech, being speaker dependent, is what enables us to recognize a friend over a telephone. Speech is usable for identification because it is a product of the speaker’s individual anatomy and linguistic background. More specifically, the speech signal produced by a given individual is affected by both the organic characteristics of the speaker (in terms of vocal tract geometry) and learned differences due to ethnic or social factors [2]. Taking this concept as a basis, we have tried to build an “Automatic Speaker Recognition System” using the simulation software Matlab.
Automatic speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. By checking the voice characteristics of an utterance, such a system makes it possible to authenticate a speaker's identity and control access to services. This adds an extra level of security to any system.
There are two main forms of speaker recognition:
Text-dependent recognition
- The recognition system knows the text spoken by the person
- Examples: fixed phrase, prompted phrase
- Used for applications with strong control over user input
- Knowledge of the spoken text can improve system performance
Text-independent recognition
- The recognition system does not know the text spoken by the person
- Examples: user-selected phrase, conversational speech
- Used for applications with less control over user input
- A more flexible system, but also a more difficult problem
- Speech recognition can provide knowledge of the spoken text
In this document, we discuss only the text-independent but speaker-dependent speaker recognition system.
All speaker recognition technologies (identification and verification, text-independent and text-dependent) have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching is the actual procedure that identifies the unknown speaker by comparing features extracted from his/her voice input with those from a set of known speakers.
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC) and Mel-Frequency Cepstrum Coefficients (MFCC). LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue [4].
Another popular speech feature representation is known as RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction. PLP was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information [Herm90]. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency subband in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel e.g. from a telephone line [HermM94].
MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency. Filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [3]. The process of computing MFCCs is described in more detail in the Theory of Operation section.
MFCC is perhaps the best known and most popular of these representations. Moreover, [3] gave us some hints on how to write a function that calculates mel-frequency cepstrum coefficients. We therefore decided to use MFCC in our project.
The problem of speaker recognition belongs to a much broader topic in science and engineering called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from input speech using the techniques described above. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching.
The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modelling (HMM), and Vector Quantization (VQ) [3]. Another technique is Gaussian Mixture Modelling (GMM), whose parameters are estimated with an expectation-maximisation algorithm.
In this project, the VQ approach is used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. The Automatic Speaker Recognition System compares the data of the tested speaker with the codebooks of the trained speakers. The best match is taken as the desired speaker.
Our Aim
In this project, our aim was to build a text-independent "Speaker Recognition System". As we did not have any previous experience with speaker recognition, we first tried to implement a simple system based on the FFT algorithm. We named that system "System A". After that, we tried to implement a more efficient system based on the MFCC algorithm and the vector quantization method. We named that system "System B". The rest of the document discusses both systems.
Theory of Operation
System A
The core function of this system is “findfreq”, which extracts simple features from a voice signal. This function is based mainly on the Fast Fourier Transform (FFT). The FFT converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:
$$X_k = \sum_{n=0}^{N-1} x_n e^{-j 2\pi k n / N}, \quad k = 0, 1, \dots, N-1$$
The complex values in the FFT result are then converted to real values by multiplying each one by its complex conjugate, which gives the power spectrum, and only the first 5000 points (the important part) are kept. This is done by the following Matlab code:
r=f.*conj(f);
r=abs(r(1:5000));
After that, the spectrum is divided into n bins, the values in each bin are summed, and the result is standardized by dividing it by its total so that the bins sum to 1. For example, for n=50, the following code implements this:
for i=1:50
t=(i-1)*100+1;
p(i)=sum(r(t:t+99));
end
p=p/sum(p);
The value of “p” will represent the single extracted characteristic of a speaker’s voice signal.
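The steps above can be combined into one short script. This is only a sketch and not the project's findfreq.m: the synthetic 500 Hz tone, the variable names and the use of a 100000-point FFT (the value reported under Encountered Problems) are assumptions made for illustration.

```matlab
% Sketch of a findfreq-style feature extractor (assumed structure).
nfft  = 100000;   % FFT length (assumed from "Encountered Problems")
nkeep = 5000;     % number of low-frequency FFT points kept
nbins = 50;       % number of bins, i.e. 100 points per bin

% Synthetic one-second tone standing in for a recorded voice (assumption)
fs = 44100;
t  = (0:fs-1)'/fs;
s  = sin(2*pi*500*t);

f = fft(s, nfft);            % zero-padded FFT of the signal
r = real(f .* conj(f));      % power spectrum |f|^2
r = r(1:nkeep);

% Sum each group of 100 consecutive points, then normalise to unit sum
p = sum(reshape(r, nkeep/nbins, nbins), 1);
p = p / sum(p);

[~, peakBin] = max(p);       % bin that holds most of the energy
```

Because each column of the reshaped matrix holds 100 consecutive spectrum points, the column sums are the same values the loop above computes one bin at a time.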
System B
Phase 1: Feature Extraction
Mel Frequency Cepstral Coefficients (MFCCs) are coefficients that represent audio based on perception. They are derived using the Fourier Transform and the Discrete Cosine Transform of the audio clip. The basic difference between the FFT/DCT and the MFCC is that in the MFCC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. This allows for better processing of data, for example in audio compression. The main purpose of the MFCC processor is to mimic the behavior of the human ear.
Overall, the MFCC process has 5 steps.
Step 1 – Frame Blocking
Here a continuous speech signal is divided into frames of N samples, with the starts of adjacent frames separated by M samples (M < N). The values used are N = 256 and M = 156. The first frame consists of the first N samples (i.e. 256 samples). The second frame begins M samples (i.e. 156 samples) after the first frame and overlaps it by N-M samples (256 – 156 = 100 samples). The third frame starts 2M samples after the first frame and overlaps the second frame by N-M samples, the fourth frame starts 3M samples after the first, and so on. Because 2M > N for these values, only adjacent frames overlap. The process continues until the whole input signal is accounted for.
In mfcc.m, frame blocking is done using a for loop:
N=256;
M=156;
a(1:256,1:83)=0;
a(:,1)=s(1:256,1);
for j=2:83
a(:,j)=s((j-1)*M+1:(j-1)*M+N,1); % frame j starts (j-1)*M samples after frame 1
end;
The result is then plotted using the Matlab plot command and displayed in Figure 1.
figure(1)
plot(a(:));
title('Frame blocking of the Speech Signal');
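The frame count of 83 is fixed for one particular recording. A minimal sketch of how the count could instead be derived from the signal length is shown below, taking N = 256 and M = 156 as in the description of this step. The random test signal and the variable names nframes and first are assumptions and do not come from mfcc.m.

```matlab
N = 256;                 % samples per frame
M = 156;                 % hop between the starts of adjacent frames
s = randn(8000, 1);      % stand-in for one second of speech at 8 kHz (assumption)

% Number of complete frames that fit in the signal
nframes = 1 + floor((length(s) - N) / M);

a = zeros(N, nframes);
for j = 1:nframes
    first  = (j-1)*M + 1;            % frame j starts (j-1)*M samples after frame 1
    a(:,j) = s(first:first+N-1);
end
```

With this indexing the last N-M samples of each frame are the first N-M samples of the next one, which is the 100-sample overlap described above.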
Step 2 – Windowing
Here each individual frame from Step 1 is windowed in order to minimize signal discontinuities and spectral distortion. The Hamming window is used to taper the signal towards zero at the beginning and end of each frame.
The Hamming window used in mfcc.m has the following form:
$$h(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
The windowed output signal is then:
$$b(n) = a(n)\,h(n), \quad 0 \le n \le N-1$$
Where:
- N is the number of samples in each frame
- h(n) = Hamming window
- b(n) = output signal
- a(n) = input signal
The following Matlab code creates a Hamming window and then uses a for loop to window the frames from Step 1:
h=hamming(256);
for j=1:83
b(:,j)=a(:,j).*h;
end;
figure(2)
plot(h(:));
title('Hamming Window');
figure(3)
plot(b(:));
title('Hamming Window applied to each frame');
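The hamming function belongs to the Signal Processing Toolbox. Where the toolbox is not available, the window can be built directly from the formula given in this step. The sketch below does that for a constant test frame, which is an assumption chosen so that the windowed output is easy to inspect.

```matlab
N = 256;
n = (0:N-1)';
h = 0.54 - 0.46*cos(2*pi*n/(N-1));   % Hamming window from the formula in this step

frame = ones(N, 1);                  % constant test frame (assumption)
b = frame .* h;                      % windowed frame
```

Note that the end points of the window equal 0.08 rather than 0, so the frame edges are strongly attenuated but not forced exactly to zero.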
Step 3 – Fast Fourier Transform (FFT)
The FFT is now used to convert each frame of N samples from the time domain into the frequency domain.
The built-in Matlab fft command is used to obtain the FFT in mfcc.m:
for j=1:83
y(1:256,j)=fft(b(1:256,j));
end
Step 4 – Mel-frequency Wrapping
Here mel-frequency wrapping is used to obtain a mel-scale spectrum of the signal from Step 3. The code in this step was obtained from: http://lcavwww.epfl.ch/~minhdo/asr_project/code/melfb.m
m = melfb(nof,N,fs);
figure(5)
plot(linspace(0, (fs/2), 129), m);
xlabel('frequency (Hz)');
title('Mel-spaced filterbank')
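To give a feel for the mel scale itself, the sketch below places filter centre frequencies at equal steps on the mel axis and maps them back to Hz. It uses the widely quoted conversion mel = 2595 log10(1 + f/700). That formula, the sample rate and the number of filters are assumptions for illustration; melfb.m may construct its filterbank differently.

```matlab
hz2mel = @(f) 2595 * log10(1 + f/700);     % assumed Hz-to-mel conversion
mel2hz = @(m) 700 * (10.^(m/2595) - 1);    % its inverse

fs  = 8000;     % sample rate in Hz (assumed)
nof = 20;       % number of filters (assumed)

% Band edges equally spaced in mel, then mapped back to Hz
melEdges = linspace(hz2mel(0), hz2mel(fs/2), nof + 2);
hzEdges  = mel2hz(melEdges);
centres  = hzEdges(2:end-1);   % one centre frequency per filter
spacing  = diff(centres);      % gap between neighbouring centres, in Hz
```

The gaps in spacing grow with frequency, which is the behaviour described above: closely spaced filters at low frequencies and progressively wider spacing at high frequencies.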
Step 5 – Cepstrum
In this final step we use the Discrete Cosine Transform (DCT) to convert the log mel-scale spectrum back to the time domain. The result of the conversion is the set of MFCCs.
for j=1:83
y(1:256,j)=fft(b(1:256,j));
n2=1+floor(N/2);
ms=m*abs(y(1:n2,j)).^2;
v(:,j)=dct(log(ms+0.1));
end;
figure(4)
plot(abs(y(:))); % y is complex, so plot its magnitude
title('speech signal after fft');
figure(6)
plot(ms(:));
title('speech signal after frequency wrapping');
figure(7)
plot(v(:));
title('mel cepstral coefficients in time domain(after dct)');
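The dct call hides the arithmetic of this step. As a rough illustration, the sketch below applies the same logarithm to a toy mel spectrum and then computes an orthonormal DCT-II explicitly as a matrix product, which to the best of our understanding is the transform dct applies by default. The six-value spectrum is invented for the example.

```matlab
ms = [4.0; 2.5; 1.2; 0.6; 0.3; 0.1];    % toy mel spectrum for one frame (assumption)
K  = numel(ms);
x  = log(ms + 0.1);                     % same offset as in the loop above

% Orthonormal DCT-II written out explicitly
n = 0:K-1;                              % sample index (row vector)
k = (0:K-1)';                           % coefficient index (column vector)
C = sqrt(2/K) * cos(pi * k * (2*n + 1) / (2*K));
C(1,:) = C(1,:) / sqrt(2);              % scale the first (DC) row
c = C * x;                              % cepstral coefficients for this frame
```

The first coefficient is proportional to the sum of the log spectrum, so it mainly reflects overall loudness, while the later coefficients describe the shape of the spectrum.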
Phase 2: Feature Matching
VQ - Vector Quantization
Speaker recognition systems inherently rely on a database, which stores the information used to compare the test speaker against a set of trained speaker voices. Ideally, as much of the data obtained from feature extraction as possible would be stored to ensure a high degree of accuracy, but realistically this cannot be achieved. The number of feature vectors would be so large that storing and accessing this information using current technology would be infeasible and impractical. [5]
Vector Quantization (VQ) is a quantization technique used to compress the information and manipulate the data in such a way as to maintain its most prominent characteristics. VQ is used in many applications such as data compression (i.e. image and voice compression), voice recognition, etc.
In speaker recognition, VQ assists by creating a classification system for each speaker. The feature vectors extracted from each speaker are grouped into clusters, and the centre of each cluster (known as a codeword) is collected into that speaker's codebook. This process is applied to every speaker to be trained into the system. VQ codebook algorithms are inherently difficult to implement. Although numerous VQ algorithms exist, we have chosen the Linde-Buzo-Gray (LBG) VQ algorithm since it is the easiest to implement [6].
VQ-Linde Buzo Gray (LBG) Algorithm
Flow Diagram of VQ-LBG Algorithm [3]
System B's Vector Quantization technique has been developed using the following pseudo code obtained from http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html.
Vector Quantization LBG Pseudo Code
LBG is an iterative clustering procedure [6]. The codebook is obtained using a splitting method. In this method an initial codevector is set to the average of all training vectors and is then split into two vectors. The iterative algorithm is then run with those two vectors: each training vector is assigned to its nearest codevector, and each codevector is recomputed as the centroid of the vectors assigned to it. The two resulting vectors are then each split again into two, giving four vectors, and the process is repeated until the desired number of codevectors is obtained.
At each stage (segment), the analysis compresses the training vector information into a new codebook, which is then used as the starting point for the next stage. The aim of the algorithm is to minimise the distortion between the training vectors and the codebook. The resulting codebook is compact and cheap to search, but it is a locally optimal (sub-optimal) solution rather than a globally optimal one.
The performance of VQ analysis is highly dependent on the length of the voice file which is operated upon.
The above image shows the codebook generated by our Matlab System B. Each colour represents the characteristic vector (centroid) extracted from each segment pass. The codebook contains the cumulative information of the centroids found.
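The splitting procedure described in this section can be sketched in a few lines. This is not the vqlbg function used by System B: the toy two-dimensional data, the splitting parameter e and the stopping tolerance tol are all assumptions chosen for illustration.

```matlab
% Toy training data: feature vectors as columns (assumption; the real
% input would be the MFCC matrix, one column per frame)
d = [randn(2,100), randn(2,100) + 5];
k   = 4;       % desired number of codewords (a power of two)
e   = 0.01;    % splitting parameter (assumed)
tol = 1e-3;    % relative distortion change that ends the inner loop (assumed)

c = mean(d, 2);                               % start with a single codeword
while size(c, 2) < k
    c = [c*(1+e), c*(1-e)];                   % split every codeword in two
    prevD = Inf;
    while true
        % Squared Euclidean distance from every codeword to every vector
        dist = zeros(size(c,2), size(d,2));
        for i = 1:size(c,2)
            diffs = d - repmat(c(:,i), 1, size(d,2));
            dist(i,:) = sum(diffs.^2, 1);
        end
        [dmin, idx] = min(dist, [], 1);       % nearest codeword for each vector
        for i = 1:size(c,2)                   % move codewords to cluster means
            if any(idx == i)
                c(:,i) = mean(d(:, idx == i), 2);
            end
        end
        D = mean(dmin);                       % average distortion
        if prevD - D <= tol * D, break; end   % stop when improvement is small
        prevD = D;
    end
end
```

After the loop, each column of c is one codeword and c as a whole is the codebook for this speaker. The guard on empty clusters simply leaves a codeword where it is when no vector is assigned to it.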
Phase 3: Decision Process
The decision-making logic is handled by a concept known as the threshold. The threshold determines the acceptable boundaries dictating the final answer. The system will only produce a solution if both of the following criteria are met.
- The system has found the lowest Euclidean Distance between the codebook tested and the various trained codebooks.
- The distance calculated falls below a pre-defined threshold of acceptance.
Both requirements must be satisfied in order for the system to produce a result, otherwise the voice signal in test will be rendered as an “unknown speaker”.
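The two criteria can be sketched as follows. The scoring rule used here, the average distance from each test feature vector to its nearest codeword, is one common choice and is an assumption about how the distance is computed. The toy codebooks, the test vectors and the threshold value are likewise invented for the example.

```matlab
% Toy data (assumptions): one codebook per trained speaker, codewords as columns
codebooks = { [0 1; 0 1], [5 6; 5 6] };
v = [0.1 0.9 0.2; 0.0 1.1 0.1];      % test feature vectors as columns
threshold = 1.0;                     % fixed acceptance threshold (assumed value)

score = zeros(1, numel(codebooks));
for s = 1:numel(codebooks)
    cb   = codebooks{s};
    dmin = zeros(1, size(v,2));
    for t = 1:size(v,2)
        diffs   = cb - repmat(v(:,t), 1, size(cb,2));
        dmin(t) = min(sqrt(sum(diffs.^2, 1)));   % distance to nearest codeword
    end
    score(s) = mean(dmin);           % average distortion against speaker s
end

[best, speaker] = min(score);        % criterion 1: lowest distance
if best >= threshold                 % criterion 2: must fall below the threshold
    speaker = 0;                     % 0 marks an "unknown speaker"
end
```

Raising threshold makes the system more willing to name a trained speaker, at the cost of accepting speakers it was never trained on.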
Explanation of project implementation
Structure and Implementation of System A
The structure of System A is shown in the following block diagram.
In the training session, System A runs the “findfreq” function on each of the trained voice files and stores the standardized characteristic value for each voice file in a separate variable, for example a, b, c and so on. In the testing session, the system runs the same “findfreq” function on the new voice file and stores the characteristic value in a different variable, for example z. It then applies the “Sum” function to the “z” variable together with each of the trained variables and observes which trained variable gives the highest result. The speaker corresponding to that trained variable is taken as the desired speaker.
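The text does not spell out how “Sum” combines the tested and trained variables. One reading that is consistent with choosing the highest result is the sum of element-wise products (an inner product) of the two normalised profiles. The sketch below uses that reading purely as an assumption, with invented 5-bin profiles in place of the real 50-bin ones.

```matlab
% Hypothetical normalised 5-bin profiles (the real ones have 50 bins)
a = [0.50 0.30 0.10 0.05 0.05];   % trained speaker 1
b = [0.05 0.10 0.20 0.30 0.35];   % trained speaker 2
z = [0.45 0.35 0.10 0.05 0.05];   % unknown speaker

trained = [a; b];                 % one row per trained speaker
scores  = trained * z';           % sum(z.*a) and sum(z.*b)
[~, speaker] = max(scores);       % the highest score is taken as the match
```

Under this reading a profile whose energy sits in the same bins as z scores highest. Note that there is no rejection step, so some trained speaker is always returned.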
To make our system user friendly, we have also implemented a simple GUI. A screenshot of our GUI is given below:
In the GUI, there are only two inputs for the user.
- number of training voices (integer, e.g. 1, 2, 3 ... n1)
- voice code (integer, e.g. 1, 2, 3 ... n2)
Here
n1 = the number of the last speaker in the train folder
n2 = the number of the last speaker in the test folder
Structure and Implementation of System B
The structure of System B is shown in the following block diagram.
Process description of System B
To implement System B we have used four auxiliary functions from [3]. The functions are:
- melfb()
- disteu()
- train()
- test()
System Training Phase
1. User enters the number of speech signals to be trained
2. User enters a voice reference number to be tested
3. Function train is executed
4. Function train uses the in-built wavread function to read each wav file into Matlab
5. Function mfcc is called and executed on the speech signal (s) and its respective sample rate (fs)
6. Compute the mel spectrum
7. Compute the mel-frequency cepstrum coefficients
8. Return the MFCC output
9. Function vqlbg (Vector Quantization using the Linde-Buzo-Gray algorithm) is executed using the MFCCs as input
10. Compute a nearest-neighbour search using function disteu
11. Find the centroids and update them for each speech signal
12. Codebooks are created
System Testing Phase
13. Function test is executed
14. Function test uses in-built wavread function to input wav file into matlab.
15. Function mfcc is called and executed on the speech signal (s) and its respective sample rate (fs)
16. For each trained codebook, function disteu is executed. This function computes the Euclidean distances between the columns of two matrices.
17. The system identifies which calculation yields the lowest value and checks this value against a constraint threshold. If the value is lower than the threshold, the system outputs an answer. Conversely, if the value is greater than the threshold, the test speaker is flagged as an unknown speaker.
Performance Evaluation
System A
For testing purposes, we recorded voice files from 6 different speakers, each no more than 4 seconds long. The sample rate was 44.1 kHz.
First, we trained System A with one voice file from each of the 6 speakers. They all said the same words, “Testing 1 2 3 4 5”. Among them, Shaila1 and shruti1 are female voices, and Arif1, Shurud1, kokhon1 and Emir1 are male voices.
We also recorded 13 voice files from those 6 speakers with different words spoken. The words were
1) This is my voice
2) I study Engineering.
3) My name is (speaker’s name).
In 13 trials, the system detected the correct speaker 12 times. The only wrong result was Arif4, which was matched to Surud2. When we listened to both voice files, they sounded very similar to us as well. We can therefore say that the correct detection rate of our system is about 90% (12 of 13 trials) and that the system gets confused when the speakers' voices are very similar.
System B
System Static Variables:
- Male 1 = Emir
- Male 2 = Mohamed
- Male 3 = Arif
- Female 1 = Shruti
- Female 2 = Shaila
- Female 3 = Nina
- Test file duration: approx. 40 seconds
- Trained file duration: approx. 40 seconds
M/F Test Case 1: Same Speakers
| Test Case | Trained Speaker1 | Trained Speaker2 | Test Speaker | Actual System Result |
|---|---|---|---|---|
| Male Female Test 1 | Female1 | Male1 | Male1 | Male1 |
| Male Female Test 2 | Female1 | Male1 | Female1 | Female1 |
M/F Test Case 2: Different speakers
| Test Case | Trained Speaker1 | Trained Speaker2 | Test Speaker | Actual System Result |
|---|---|---|---|---|
| Test 1 | Male1 | Male2 | Male1 | Male1 |
| Test 2 | Male1 | Male2 | Male2 | Male2 |
| Test 3 | Female1 | Female2 | Female1 | Female1 |
| Test 4 | Female1 | Female2 | Female1 | Female1 |
Unknown Speaker Test Case 1: Male Speaker Unknown
| Test Case | Trained Speaker1 | Trained Speaker2 | Test Speaker | Actual System Result |
|---|---|---|---|---|
| Test 1 | Male2 | Male3 | Male1 | Male2 |
| Test 2 | Male1 | Male3 | Male2 | Unknown Speaker detected |
| Test 3 | Male1 | Male2 | Male3 | Unknown Speaker detected |
Unknown Speaker Test Case 2: Female Speaker Unknown
| Test Case | Trained Speaker1 | Trained Speaker2 | Test Speaker | Actual System Result |
|---|---|---|---|---|
| Test 1 | Female2 | Female3 | Female1 | Unknown Speaker detected |
| Test 2 | Female1 | Female3 | Female2 | Female1 |
| Test 3 | Female1 | Female2 | Female3 | Unknown Speaker detected |
Performance Results
Results from the analysis show that the system is currently operating at a success rate of 83% (10/12). Additional tests with different speakers are required to assess the performance of System B more reliably. Additional tests have also revealed that increasing the threshold value greatly increases the likelihood that a test speaker is matched to a trained codebook despite not having been trained.
Tests using short wave files (i.e. 2 - 5 seconds) produce results with more errors than their longer (> 40 seconds) counterparts. This is consistent with the theory of Vector Quantization, where a larger set of vectors produces a more accurate representation of a speaker's voice. [6]
Limitations of System A
- The system is not real time. Users have to record their voices separately and store them in the train or test folder manually. Moreover, users have to rename the voice files in the format speaker1, speaker2, and so on, which is time consuming.
- The voice file names must be in order, i.e. speaker1, speaker2 ... speakern. If any number is missing from the middle of the sequence, the system cannot perform the operation.
- The system does not perform speaker verification. It assumes that the speaker is known to the system and only identifies the speaker. As a result, even if a speaker is not in the trained folder, the system will still show the closest match.
- The voice files should not be longer than 4 seconds.
- If the voices of two speakers are extremely similar, the system may not detect the desired speaker. The detection rate is then around 90%.
Limitations of System B
- The system requires the trained voice signal to be recorded with a clear voice and minimal background or interference noise. This ensures that a more accurate codebook is created for each speaker.
- The system does not allow the user to adjust the threshold level of accuracy. The threshold determines the matching range of the trained codebook vectors and the tested codebook vectors. A fixed threshold level has been used in this system.
- The training path and testing path need to be defined before the project function is executed, so the code has to be modified prior to operation.
- The system does not record voices in real time, therefore the use of an external wave recording application is required.
- The system requires the voice files to be renamed as a sequence in the format s1.wav, s2.wav, s3.wav…sn.wav. We acknowledge that this process is cumbersome since it requires the user to keep track of which filename refers to which speaker.
- The accuracy of the system is greatly determined by the length of the voice file. This is an inherent characteristic of the Vector Quantization LBG algorithm used to create codebooks. The longer the recording, the greater the resolution of the codebooks created.
- For optimum recognition performance, please ensure that the *.wav voice files are recorded with the following characteristics:
  - 8000 Hz sample frequency
  - 8-bit
  - Mono
- Sound file durations ranging from 20 – 30 seconds produce satisfactory results.
Encountered Problems
• We had difficulty picking the right constants in the findfreq function for System A. We tried different constants and observed the performance of the function. We found that 100000 points for the “fft” function and 50 bins gave us the most correct results.
• When speakers were uttering the same words, System B recognized the desired speaker. But when we tried voice files with different words spoken, System B could not recognize the correct speaker. We therefore recorded 4 new voice files of 40 seconds each (our previous voice files were about 4 seconds long), trained the system with those files and used the previous test voice files for testing. This time, the system was able to detect the correct speaker. The point is that longer voice files help to build a well-modelled individual codebook for each speaker.
Possible Future Developments
• An ability to record the voice files in real time using the Matlab wavrecord.m function. The in-built wavrecord function in Matlab provides a means to record the voice signal directly into the Matlab workspace. This would remove the need to use an external application to record the voices.
• Integration of background noise cancellation into both systems. This should increase the identification accuracy of both systems.
• Improvements to the GUI for improved ease of use, especially for System B.
• Implementation of a system that combines System A and System B. This would increase the user's confidence that the correct speaker has been detected.
How to use the two systems
To get the detailed information on how to use both systems, please click on the following links.
Real Time Application
Though our systems are not real time applications at present, they could be built into a real time product with some further development. If we are able to remove all the limitations of our systems, they could be used in some security mechanisms. Our systems could be used with or without fingerprints, facial information, an identity card or a signature, which would improve the robustness of the security system. Leaving aside forensic and military security systems, there are many commercial possibilities for our two speaker recognition systems, for example safes, quick access to doors, car protection against theft, and the protection of electronic systems such as TV and video equipment.
System A can be used in an internal security system where it is assumed that all the users in that place have already been authenticated by another system. The reason is that System A is easier to implement but does not perform the speaker verification task. It only performs the speaker identification task. On the other hand, System B can be used as an external security system due to its ability to reject a speaker who has not been trained into the codebook database. However, this system is limited by the requirement to train each speaker's voice with a long recording.
The addition of a timed counter would enable these systems to be used in a workplace attendance logging application. We think System A would work better in this area, as no speaker verification is required there.
As the correct speaker detection rate of both systems is not 100%, we do not recommend our systems for use in a telephone banking system. However, if we can improve the detection rate of System B and add an automatic voice recording function, it could be used in a telephone banking system.
Conclusion
Speaker recognition is a complicated task and is still an active research area. We attempted to implement a simple solution within a short period of time, and we did not have enough sample voice files to test our systems thoroughly. The detection accuracy of our systems is also not 100%. However, we are happy to have implemented two systems that work in most of our tests. More research and work are necessary to reduce the limitations of those systems. Looking at the project as a whole, we consider it a success.
References
[1] P. M. Hermsen: "Speech Recognition vs. Speaker Recognition", December 1, 2000, http://www.govtech.net/magazine/story.php?id=94868&story_pg=2
[2] T. Kinnunen, P. Fränti: "Speaker discriminative weighting method for VQ-based speaker identification", Proc. 3rd International Conference on audio-and video-based biometric person authentication (AVBPA), pp. 150-156, Halmstad, Sweden, 2001
[3] M. N. Do: "Digital Signal Processing Mini-Project: An Automatic Speaker Recognition System", http://lcavwww.epfl.ch/~minhdo/asr_project/
[4] http://www.otolith.com/otolith/olt/lpc.html
[5] C.-H. Lee, F.K. Soong, K.K. Paliwal: "Automatic Speech and Speaker Recognition" - advanced topics. Kluwer Academic Publishers, pp. 42-44, Norwell, Massachusetts, USA, 1996
[6] S. Sookpotharom, S. Manas: "Codebook Design Algorithm for Classified Vector Quantization", Bangkok University, Pathumtani, Thailand, pp. 751-753, 2002
Group Documentation







![Conceptual Diagram of VQ codebook [3]](/mediawiki/images/6/6f/Example_codebook.gif)




