Team E Speaker recognition

From SoftwarePractice.org

TEAM E

Matthew Long

Jia Rong Hunag

Ka Ho Lo

Roger Cliffford

Our group has decided to undertake the project of "Speaker Recognition". We chose this project because we thought it would be fun and most related to real-time signal processing systems. We have great enthusiasm for this project, and hope to achieve a rewarding outcome.

Introduction and Brief History:

Identity verification in the 21st century is largely accepted as a normal part of everyday existence. Handwritten signature verification has long been a common method within modern society, and more than ever personal identification number (PIN) codes are being used for access to homes, vehicles, ATM service points and the like. The rapid rise of technology has enabled the pursuit of even more complex biometric identity verification methods. Today, the fields of science and technology pursue identification methods using individual biological characteristics such as DNA analysis, fingerprints, and voice analysis.

The overwhelming advantage of biometric identity verification would seem obvious: while you may forget a PIN over time, it would be rather difficult to forget ‘yourself’. With the extensive use and evolution of landline and wireless telephony, there has arisen the need to authenticate a speaker on the far end of the line based upon voice characteristics alone. When we as humans recognise the voice of someone we know, we match the speaker's name to his or her voice characteristics. We do it automatically, all the time. Recent developments within the field of digital signal processing have made speaker verification an increasingly effective and accurate identification method.

Speaker verification is used to determine, using voice alone, whether a given person is who they claim to be. Given a voice signal 'voice' (sound pressure as a function of time) and an unknown speaker 'speaker', the objective is to determine whether 'voice' was spoken by 'speaker'. The fundamental assumption here is that there are quantifiable characteristics in each person's voice that are unique to that individual and can therefore be measured.

The speaker recognition task.

Lawrence Kersta at Bell Labs made the first major step from speaker verification by humans toward speaker verification by computers in the early 1960s. During this time he introduced the term 'voiceprint' for a spectrogram generated by a complicated electro-mechanical device. The voiceprint was matched with a verification algorithm based on visual comparison.

A 3D graphic of what speech looks like.

Terminology:

Spectrogram

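A spectrogram displays how the frequency content of a signal changes over time. It is a three-dimensional representation: time and frequency lie on the two axes, and amplitude is shown as colour or intensity, so the relationships between frequency, time and amplitude can be seen together.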

By visual inspection of the plots, the male speaker produces more energy at the lower frequencies and less at the higher frequencies than the female speaker.

A normal spectrum shows only the frequency components of a signal and their amplitudes; it says nothing about when those components occurred in time (i.e. it has zero temporal resolution). By using a spectrogram in this project, we are able to analyse both female and male speech by looking at several spectral features. As a first-stage proposal, these may include:

  • Time of occurrence of speech.
  • Energy level concentration for each gender.
  • The frequency range in which most of the signal energy is concentrated.
  • Energy level spread: whether a phrase spoken by one gender has a wider spread of energy across the frequency range than the other.
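As a rough illustration of how features like those listed above might be measured, the following sketch builds a simple short-time spectrum with fft and derives three numbers from it. The synthetic two-tone signal, the 4 kHz split frequency and all variable names are assumptions made for this example; they are not taken from the project code.

```matlab
% Synthetic test signal standing in for a recorded voice (assumption).
fs = 22050;                          % sample rate used in this project
t  = (0:fs-1)'/fs;                   % one second of samples
x  = sin(2*pi*200*t) + 0.1*sin(2*pi*6000*t);

% Short-time power spectrum from non-overlapping 1024-sample frames.
nfft    = 1024;
nFrames = floor(length(x)/nfft);
frames  = reshape(x(1:nFrames*nfft), nfft, nFrames);
S       = abs(fft(frames)).^2;       % one column per frame
S       = S(1:nfft/2+1, :);          % keep 0 .. fs/2
f       = (0:nfft/2)'*fs/nfft;       % frequency of each row in Hz

% Energy per frame over time (indicates when the signal is active).
frameEnergy = sum(S, 1);

% Fraction of the spectral energy below a hypothetical 4 kHz split.
lowFraction = sum(sum(S(f < 4000, :))) / sum(S(:));

% Spectral centroid: a rough measure of where the energy is concentrated.
centroid = sum(f .* sum(S, 2)) / sum(S(:));
```

For this test signal almost all of the energy lies below the split, so lowFraction is close to 1 and the centroid sits near the low-frequency tone.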

Fast Fourier Transform

Both the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) algorithm can be applied to a complex signal to produce its frequency spectrum, a series of magnitude (amplitude) vs. frequency data points.

The frequency domain gives a better perspective on the properties of complex time-domain signals; hence the FFT was used to examine the sound signals in the frequency domain so that the characteristics that differ between male and female voices could be determined. The FFT was chosen for this project instead of direct computation of the DFT because it is a more efficient way of obtaining the same result: for an N-point signal, direct computation takes on the order of N^2 operations, whereas the FFT takes on the order of N log N. The discrete-time Fourier transform (DTFT) was not chosen because it is a continuous function of frequency, which cannot be computed or stored directly; the DFT samples it at N discrete frequencies.
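The claim that the FFT returns the same result as direct computation of the DFT can be checked with a small sketch. The signal, its length and the variable names below are arbitrary choices for illustration.

```matlab
N = 64;                               % number of samples (arbitrary)
n = (0:N-1)';                         % sample index as a column vector
x = cos(2*pi*5*n/N) + 0.5*sin(2*pi*12*n/N);   % synthetic test signal

% Direct DFT: X(k) = sum over n of x(n)*exp(-j*2*pi*k*n/N).
W       = exp(-1i*2*pi*(n*n')/N);     % N-by-N matrix of complex exponentials
Xdirect = W*x;

% FFT: the same transform computed with far fewer operations.
Xfft = fft(x);

maxError = max(abs(Xdirect - Xfft));  % difference is at rounding-error level
[~, peakBin] = max(abs(Xfft(1:N/2))); % index of the strongest component
```

The strongest component is the 5-cycle cosine, which appears at MATLAB index 6 because indexing starts at 1.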

Vector quantization (VQ)

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding. It is a fixed-to-fixed length algorithm. In earlier days, the design of a vector quantizer was considered a challenging problem because of the need for multi-dimensional integration.

1-dimensional VQ

VQ codebooks consisting of a small number of representative feature vectors are used as an efficient means of characterizing speaker-specific features. A speaker-specific codebook is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker, and the VQ distortion accumulated over the entire input utterance is used to make the recognition decision.
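The recognition stage described above can be sketched as follows. The two codebooks and the input feature vectors are made-up two-dimensional values chosen only to show the mechanics; a real system would obtain the codebooks by clustering training features, which is not shown here.

```matlab
% Hypothetical codebooks (rows are codewords), one per reference speaker.
codebookA = [1 1; 2 2; 3 1];
codebookB = [6 6; 7 5; 8 7];
codebooks = {codebookA, codebookB};

% Hypothetical feature vectors of the input utterance (one per row).
features = [1.1 0.9; 2.2 1.8; 2.9 1.2];

% Accumulate, per codebook, the squared distance from each feature
% vector to its nearest codeword (the VQ distortion).
distortion = zeros(1, numel(codebooks));
for k = 1:numel(codebooks)
    cb = codebooks{k};
    for i = 1:size(features, 1)
        diffs = cb - repmat(features(i, :), size(cb, 1), 1);
        distortion(k) = distortion(k) + min(sum(diffs.^2, 2));
    end
end

% The reference speaker with the smallest accumulated distortion wins.
[~, best] = min(distortion);
```

With these values the input lies close to the first codebook, so best is 1.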

Our Group Task:

The project asks us to implement a speaker recognition program in MATLAB that can identify the origin of an input sound waveform. In other words, the function should determine whether the sound comes from a male or a female speaker and return the corresponding string: male or female.

The aim of this report is to describe in detail how a complete working version of the MATLAB function 'voice_analysis.m' was developed. Appropriate signal processing theory was taken into account during the design and development stages of the project and applied to the design of the system, with the goal of a high level of accuracy. More importantly, throughout this assignment we wished to apply our knowledge in a way that gave us a high level of learning outcome. Below is a block diagram showing an overview of the voice analysis system.

Block diagram of the voice analysis system.

Inside the MATLAB processing block there are three filters.

Re-design of the system

Data Collection

For the voice collection process, the only equipment we needed was a notebook computer with an external microphone attached. All voices were recorded digitally into the computer via the microphone as 16-bit mono at a 22.05 kHz sample rate. We chose 22.05 kHz because aliasing at this rate is particularly low (it can represent frequencies up to 11.025 kHz), so no further noise cancellation is required at a later stage.

The criteria for choosing the appropriate sounds are as follows:

• No clipping

• A reasonable level of amplitude (above 0.2)

• Minimum noise recorded with the voice

• Sample rate of 22050 Hz and resolution of 16 bits, mono
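Some of these criteria could be checked automatically. The sketch below is one possible way to do so, assuming that samples are scaled to the range [-1, 1] and that "above 0.2" refers to the peak absolute amplitude on that scale; the noise and bit-depth criteria are not covered. The synthetic tone stands in for a real recording.

```matlab
fs = 22050;                         % sample rate of the recording in Hz
t  = (0:fs-1)'/fs;
x  = 0.5*sin(2*pi*150*t);           % synthetic mono "recording" (assumption)

peak       = max(abs(x(:)));        % largest absolute sample value
isMono     = size(x, 2) == 1;       % one column means one channel
noClipping = peak < 1;              % assumes samples scaled to [-1, 1]
loudEnough = peak > 0.2;            % amplitude criterion from the list
rateOk     = (fs == 22050);         % required sample rate

% The recording is accepted only if every check passes.
accept = isMono && noClipping && loudEnough && rateOk;
```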

Although the recording process went quite smoothly, it was not easy to obtain good-quality voice recordings with a high amplitude level and minimum noise. In particular, having to change the quality settings for every new person contributing to the recording process increased the difficulty of the task. In the end, we successfully collected 17 voices: 10 male and 7 female.

Analysis of Spectrogram

A large number of wave files for both male and female voices are available on our CD. In order to carry out an extended spectral analysis of the voices of the two genders, a large pool of files has to be tested. Below, we present the spectrograms and the corresponding MATLAB code for two of the best-quality voice files that we collected: one from a male (file name: male1.wav) and one from a female (file name: female1.wav). We present these two sound files in the report because of the quality of the wave files (e.g. no clipping, reasonable amplitude) and the distinct features in their spectrograms.

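[x,fs]=wavread('sound/filename.wav');   % x: samples, fs: sample rate in Hz (audioread in newer MATLAB)
specgram(x,1024,fs);                    % spectrogram with a 1024-point FFT (spectrogram in newer MATLAB)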

Image:Malespecgram.jpg

The male voice is concentrated at low frequencies and does not often reach high frequencies.

Image:femalespecgram.jpg

The female voice often reaches the high frequency range, as the female speaker has a higher-pitched voice.

Analysis of Fast Fourier Transform (FFT)

From the FFT plots of the male and female voices we were able to identify a couple of characteristics that differ between the two genders. Firstly, we noticed that in most of the plots the male voice had a wider frequency spectrum; this can be seen in the following plot, where the male voice file [man3.wav] had a frequency range of 0 to 12.5 kHz. Most of the female voices, in contrast, had a much narrower frequency range concentrated at the high-frequency end; for instance, the example below shows the female voice frequency range to be between 0 and around 12.5 kHz. This difference in the width of the frequency ranges helps us in designing the filters (stop and pass frequencies) for the voice analysis system. The other characteristic that our group noticed in a number of the plots was that in most cases the amplitude of the male voice was significantly higher than the amplitude of the female voice. However, our group did not pursue this idea, as we felt that there were not enough cases to support it.

[x,fs]=wavread('sound/filename.wav');
N=13230;                  % length of window (0.6 s at 22.05 kHz)
xx=x(N:2*N-1);            % section of the signal to be examined (N samples)
XX=fft(xx)/length(xx);    % takes the FFT of the section of signal
f=(-N/2:N/2-1)*fs/N;      % frequency axis from -fs/2 to just below fs/2
figure(1);
plot(f,fftshift(abs(XX)));   % plots amplitude against frequency
xlabel('Frequency (Hz)'); ylabel('Amplitude'); title('Frequency Spectrum');
figure(2);
plot(f,20*log10(fftshift(abs(XX))));   % plots amplitude in decibels against frequency
xlabel('Frequency (Hz)'); ylabel('Amplitude (Decibels)'); title('Frequency Spectrum');

Image:Comparemalefemale.JPG

Image:Compare2malefemale.JPG

Design of Filters

The purpose of the filters in the voice analysis system is to isolate part of the energy of the signal, which is then divided by the total energy in the signal to give an energy ratio. By finding the range of energy ratios for a particular filter we are able to establish an energy ratio threshold that determines whether the signal came from a male or a female.

For this system we designed three filters using FDATool in MATLAB: a band-pass filter, a low-pass filter and a high-pass filter, to cover different frequency ranges and to gain greater accuracy in our results. All the filters designed were IIR filters because they require less computation to implement than FIR filters.

Band-pass filter

Our group first decided to design a band-pass filter with a pass band from 3 to 7 kHz. Our reason for this design is that, from analysing the spectrograms and the FFT plots, we found that our collection of wave files had greater intensity within this frequency range. The parameters chosen for the band-pass filter were a sampling frequency of 22.05 kHz, a transition bandwidth of 200 Hz, a pass band ripple of 0.1 dB and a stop band attenuation of 50 dB.

We chose an elliptic filter because it had the lowest order, 18, meaning less computation time than the Butterworth response (order 202), the Chebyshev Type 1 response (order 40) and the Chebyshev Type 2 response (order 44).

Because of the narrow transition band (200 Hz), the Butterworth response had the highest order (the Butterworth response has the least sharp roll-off of all the responses tested). Another reason for choosing the elliptic response was that it gave the steepest roll-off and is the closest to an ideal filter. We also tried different parameters for the pass band ripple and the stop band attenuation; from our experimentation we found that the best results (distinct differences in the energy ratio) were obtained when the pass band ripple was 0.1 dB and the stop band attenuation was between 40 and 60 dB.

Image:bandpass.jpg
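The order comparison described above can be approximated from the command line with the order-estimation functions of the Signal Processing Toolbox (buttord, cheb1ord, cheb2ord, ellipord). This is only a sketch: it assumes the 200 Hz transition band applies on both sides of the pass band, giving stop band edges at 2.8 and 7.2 kHz, and the numbers it prints need not match those reported by FDATool exactly.

```matlab
fs  = 22050;                 % sampling frequency in Hz
nyq = fs/2;                  % band edges are normalised to the Nyquist frequency
Wp  = [3000 7000]/nyq;       % pass band edges
Ws  = [2800 7200]/nyq;       % stop band edges (assumed 200 Hz transition each side)
Rp  = 0.1;                   % pass band ripple in dB
Rs  = 50;                    % stop band attenuation in dB

% Minimum order needed by each response type to meet the same specification.
nButter = buttord(Wp, Ws, Rp, Rs);
nCheb1  = cheb1ord(Wp, Ws, Rp, Rs);
nCheb2  = cheb2ord(Wp, Ws, Rp, Rs);
nEllip  = ellipord(Wp, Ws, Rp, Rs);

% For a band-pass design the final filter order is twice the value returned.
fprintf('Butterworth %d, Chebyshev 1 %d, Chebyshev 2 %d, elliptic %d\n', ...
        2*nButter, 2*nCheb1, 2*nCheb2, 2*nEllip);
```

The elliptic response should need the lowest order and the Butterworth response the highest, which is the ordering the design decision above relies on.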

Low-pass filter

In order to increase the accuracy of the system our group decided to implement multiple filters, the second being a Chebyshev Type 1 low-pass filter. We decided to use the Chebyshev Type 1 response because we wanted to see whether a filter with a less steep roll-off would give more accurate results.

To ensure a gentler roll-off, our group designed the filter with a 6 kHz transition bandwidth (30 times that of the band-pass filter), the transition band lying between 4 and 10 kHz. We chose a stop band attenuation of 50 dB and not above this figure because when we experimented with 60 and 70 dB there were no distinct differences between the energy ratios for male and female. Furthermore, we chose a pass band from 0 to 4 kHz because most of the male voice's energy is below 4 kHz; hence the energy ratio for the male would be greater than the energy ratio for the female.

Image:lowpass.jpg
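A comparable low-pass filter could be designed without FDATool using cheb1ord and cheby1 from the Signal Processing Toolbox. The sketch below is not the filter stored in the project's lowpass.mat; the pass band ripple of 0.1 dB is an assumption, since the text does not state it for this filter.

```matlab
fs  = 22050;                 % sampling frequency in Hz
nyq = fs/2;                  % Nyquist frequency used for normalisation
Rp  = 0.1;                   % pass band ripple in dB (assumed, not stated in the text)
Rs  = 50;                    % stop band attenuation in dB

% Minimum order for a 0-4 kHz pass band and a stop band from 10 kHz.
[n, Wp] = cheb1ord(4000/nyq, 10000/nyq, Rp, Rs);
[b, a]  = cheby1(n, Rp, Wp, 'low');      % numerator b, denominator a

% Evaluate the gain at 2 kHz (pass band) and 10 kHz (stop band edge).
% b and a have equal length, so H(z) = polyval(b,z)./polyval(a,z).
fCheck = [2000 10000];
z      = exp(1i*2*pi*fCheck/fs);
gainDb = 20*log10(abs(polyval(b, z) ./ polyval(a, z)));
```

The first entry of gainDb should stay within the pass band ripple and the second should be at or below -50 dB.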

High-pass filter

Our group realised that in order to implement multiple filters in the voice analysis system we needed more than two filters, as there are situations in which two filters give contradicting results; hence a third filter is required to settle the result.

We chose a high-pass filter because, from analysing the spectrograms, we found that the male voice had a lower energy concentration above 5 kHz, and because the other two filters do not isolate the energy concentrated in the higher frequency ranges. Hence we made the assumption that if we were to filter out only the energy above 5 kHz, the resulting energy ratio of the male should be significantly less than that of the female. Like the previous filter, it uses the Chebyshev Type 1 response; the only difference is the transition band, which is between 5 and 10 kHz, and the stop band attenuation is again 50 dB.

Image:highpass.jpg

Determining the Energy Ratio Threshold for each Filter

In order to test the reliability of the filter and to determine the energy threshold of each of the three filters, we tested 17 wave files, 10 males and 7 females for the design.

The following code is used to find the energy ratio of a signal:

load lowpass.mat; 
y=filter(C,D,x); % Filters the signal x
lowpass_energy_filter=sum(y.^2);  % calculates the energy of the filtered signal 
energy_total=sum(x.^2);           % calculates the energy of the signal
lowpass_energy_ratio=lowpass_energy_filter/energy_total   % calculates the energy ratio of the signal

The same is done for the high-pass and band-pass filters.
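To see how the energy ratio behaves without the recorded files, the calculation can be tried on synthetic tones. In this sketch an 8-tap moving average is used as a stand-in low-pass filter; it is not the Chebyshev filter stored in lowpass.mat, and the two test tones are assumptions chosen for illustration.

```matlab
fs = 22050;
t  = (0:fs-1)'/fs;                    % one second of samples
xLow  = sin(2*pi*200*t);              % energy concentrated at a low frequency
xHigh = sin(2*pi*8000*t);             % energy concentrated at a high frequency

% Stand-in low-pass filter (8-tap moving average), NOT the project's lowpass.mat.
b = ones(1, 8)/8;                     % numerator coefficients
a = 1;                                % denominator (FIR filter)

% Energy ratio = energy of the filtered signal / energy of the original signal.
energyRatio = @(x) sum(filter(b, a, x).^2) / sum(x.^2);

ratioLow  = energyRatio(xLow);        % close to 1: most energy passes
ratioHigh = energyRatio(xHigh);       % close to 0: most energy is removed
```

A signal whose energy lies mainly in the pass band gives a ratio near 1, which is the property the thresholds rely on.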

Here is the table of energy ratios for the three filters:

Image:Tablev2.JPG

From an analysis of the results obtained, we set the thresholds as follows. Firstly, for the band-pass filter we found that the largest energy ratio for the male voice was 0.0035; hence we decided that a signal passed through this filter with an energy ratio less than or equal to 0.0035 is male (and one greater than this value is female).

For the low-pass filter we found that the maximum energy ratio for the human voice was 0.9964; hence we decided that a signal passed through this filter with an energy ratio greater than or equal to 0.9964 is male (and one less than this value is female).

Finally, for the high-pass filter we took 0.0000127, found from the male voice ratios, as the threshold. Since the male voice has less energy above 5 kHz, a signal with an energy ratio less than or equal to 0.0000127 is classified as male (and one greater than this value as female), which matches the comparison used in the code.

Note that we had one outlier, man2.wav, which had a distinctive voice.

Code of voice Analysis

The voice analysis system is constructed from the following MATLAB code. A counter is used to determine whether the signal's source is a male or a female. The counter is incremented each time the male threshold of a filter is satisfied; hence the maximum value of the counter is three (the energy ratio thresholds of all three filters are satisfied) and the minimum value is zero (none of the three thresholds is satisfied, hence a female). If at least two of the three filter thresholds are satisfied, the system determines that the signal is from a male and returns male; otherwise it returns female.

function origin = voice_analysis(x,fs)

% The sampling frequency must be 22.05kHz else error message is displayed

if fs ~= 22050;
    error('The sampling frequency must be 22.05 kHz please re-enter signal');
end

counter=0;                % This counter is used to record each time the if statement is true
energy_total=sum(x.^2);   % Calculates the total energy of the signal

% The first filter the signal is passed through is a band-pass filter

load bandpass2.mat;       % Loads the coefficients of band-pass filter
y=filter(A,B,x);          % Filters the vector x of the signal
bandpass_energy_filter=sum(y.^2);   % Calculates the energy of filter output
bandpass_energy_ratio=bandpass_energy_filter/energy_total % Calculates the energy ratio
if bandpass_energy_ratio <= 0.0035  % This is for a male voice
    counter=counter+1;    % counter is incremented
end;

% The signal will also be passed through a low-pass filter, following
% similar steps of the band-pass filter
load lowpass.mat; % Loads the coefficients of the low-pass filter
y=filter(C,D,x);
lowpass_energy_filter=sum(y.^2);
lowpass_energy_ratio=lowpass_energy_filter/energy_total
if lowpass_energy_ratio >=0.9964  % a ratio greater than or equal to this value is determined to be a male voice
    counter=counter+1;
end;

% Signal is passed through a high-pass filter using similar previous steps 
load highpass.mat; % Loads the coefficients of the high-pass filter
y=filter(Num,Den,x);
highpass_energy_filter=sum(y.^2);
highpass_energy_ratio=highpass_energy_filter/energy_total
if highpass_energy_ratio <=0.0000127  % a ratio less than or equal to this value counts as a male vote
    counter=counter+1;
end;
counter
if counter >= 2 % more than or equal to 2 filters, the system thinks it is a male voice
  origin = ('male');% return the string male
else
  origin = ('female'); % return the string female
end; 
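The two-out-of-three vote in the function above can also be tried in isolation. The thresholds below are the ones used in voice_analysis; the three ratio values are made-up inputs for illustration, not measurements from the project.

```matlab
% Energy ratios of one signal: [band-pass, low-pass, high-pass] (made-up values).
ratios = [0.0021, 0.9981, 0.0000300];

% One vote per filter; true means the male threshold is satisfied.
votes = [ratios(1) <= 0.0035, ...      % band-pass threshold
         ratios(2) >= 0.9964, ...      % low-pass threshold
         ratios(3) <= 0.0000127];      % high-pass threshold

counter = sum(votes);                  % number of filters voting male

% Majority decision: at least two of the three filters must agree.
if counter >= 2
    origin = 'male';
else
    origin = 'female';
end
```

Here the band-pass and low-pass thresholds are satisfied and the high-pass one is not, so the majority decision is male.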

Test Table

Image:Newtable1.JPG

The shaded files in the above table are those for which the actual output was incorrect (that is, not the same as the expected output). From the table it can be seen that the actual output was incorrect for only two of the seventeen data signals.

Overall Accuracy of the system

Hence, the overall accuracy of the system can be calculated as follows:

Percentage of Accuracy = 1 - (Number of Errors/Total data tested) = 1 - (2/17) = 0.8824 = 88.24%

The above calculation shows that the overall accuracy of the system is 88.24%.
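The same calculation can be written as a comparison of expected and actual labels. The counts (10 male files, 7 female files, 2 errors) come from the test table; which two entries are marked as wrong below is an arbitrary choice made only so that the example runs.

```matlab
% Expected labels: 10 male files followed by 7 female files.
expected = [repmat({'male'}, 1, 10), repmat({'female'}, 1, 7)];

% Actual labels: copy of the expected ones with two arbitrary errors.
actual     = expected;
actual{2}  = 'female';                 % arbitrary position, for illustration only
actual{12} = 'male';                   % arbitrary position, for illustration only

nErrors  = sum(~strcmp(expected, actual));   % number of mismatches
nTotal   = numel(expected);                  % number of files tested
accuracy = 1 - nErrors/nTotal;               % fraction classified correctly

fprintf('Accuracy = %.2f%%\n', 100*accuracy);
```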

Discussion of the system limitations

1. The system only tells us whether the input signal is from a male or a female; it cannot fully recognise the speaker.

2. The energy ratio thresholds are not 100% accurate; some samples do not match the thresholds.

3. We do not have enough sample source files (currently 17 files) to fully test the system.

4. There is a chance of recording an unusual voice; for example, a male can produce a female-sounding voice.

5. The system does not run in real time; the speaker has to be recorded digitally to a .wav file in order to be analysed.

6. The output is not 100% accurate; the accuracy of the system is slightly less than 90%.

7. We approached the task with only a single method, the energy ratio. As far as we know, there are other methods such as the power density function and VQ.

Future Innovation

Although many recent advances and successes in speaker recognition have been achieved, there are still many problems for which good solutions remain to be found. Most of these problems arise from variability, including speaker-generated variability and variability in channel and recording conditions. It is very important to investigate feature parameters that are stable over time, insensitive to the variation of speaking manner, including the speaking rate and level, and robust against variations in voice quality due to causes such as voice disguise or colds. It is also important to develop a method to cope with the problem of distortion due to telephone sets and channels, and background and channel noises.

From the human-interface point of view, it is important to consider how the users should be prompted, and how recognition errors should be handled. Studies on ways to automatically extract the speech periods of each person separately from a dialogue involving more than two people have recently appeared as an extension of speaker recognition technology.

Conclusion

We successfully implemented a simple text-independent speaker recognition system that can classify the input signal as a male or a female voice. However, this system is not perfect: it cannot fully identify the speaker, and owing to our small number of samples we obtained an accuracy rate of slightly less than 90%. More importantly, the system is not a real-time application. Nevertheless, the system does what the project asked, and we are quite happy with it as it is now; more research and work need to be done in order to reduce its limitations. Overall, we enjoyed the whole process of implementing the system; we all gained knowledge of how a speaker recognition system operates and enjoyed the experience of teamwork.

Group Documentation

Group Meeting Log