Team J: Speaker Recognition
From SoftwarePractice.org
Members
| Student Names | Student Numbers |
|---|---|
| Ali Hamze | 10037531 |
| Nawaz Isaji | 02040057 |
| Shivam Lall | 10143194 |
| Zheng Hang Liu | 10230732 |
Brief
We are required to implement a speaker recognition program using the capabilities of Matlab. This program will be used to distinguish between a male and a female voice, using solely the voiceprint of the given voice. The program is to have two different ‘modes’: training and recognition. Furthermore, the program must be text-independent, that is, it must work no matter what is spoken to it, or even how many spoken words it is given.
Theory of Speaker Recognition
The basic premise of speaker recognition is for a computer or other device to take a sample of a person's voice and determine from it various attributes of the speaker. Given a database, these may be the identity of the speaker, their age, or their sex. It is worth noting at this point that the brief relates to speaker recognition and not to speech recognition. Speaker recognition allows a device to determine who a speaker is, as distinct from speech recognition, which determines what a speaker is saying.
The basic technique used in speaker recognition is to extract various components (such as frequency, energy and spectral power) from a person's voice and compare them to a set of reference components, which are either hard-wired into the program or taken from known male or female voiceprints.
It is this basic idea that we will try to emulate using Matlab. It is also interesting to note that humans have this capability built in: almost 100% of the time, a listener can tell a male voice from a female one without seeing the speaker. That our brains can do this in a fraction of a second, almost without error, is only the smallest of the miracles they perform every second of every day.
A Brief History of Speaker Recognition
The tradition of both speaker and speech recognition dates back to the 1870s, when Alexander Graham Bell came up with the idea of a machine that would make speech visible to help people with hearing difficulties. By discovering how to convert air pressure waves (sound) into electrical impulses, he began the process of uncovering the scientific and mathematical basis of understanding speech. Despite this, he is more famous for the invention of the telephone.[1]
Modern development of voice identification systems began as early as the 1960s with exploration into voiceprint analysis, where the characteristics of an individual’s voice were thought to identify that individual uniquely, much like a fingerprint.
The early systems had many flaws, and research ensued to derive a more reliable method of distinguishing between two different types of speech or voice. Voice identification research continues today within the field of digital signal processing, where many advances have taken place in recent years. We are also seeing further use of speaker recognition as a biometric identifier.
Scope
Obviously, the main purpose of this project is to develop a core algorithm to recognize and identify different input voices. Additionally, to provide human access to the code, a simple graphical or textual user interface is also required. This will provide the user with the ability to easily perform the recognition process.
Other basic requirements of the program are:
- It must neither have complicated functions nor require high processing power – as this program is to be run on Matlab-enabled home computers, it should take no more than a reasonable time (a few seconds or even less) to process information, display graphs or return results.
- It must be able to add/update its database entries so that the recognition results can be improved – As the program is given more information about the speaker, it should be able to update its entries to include the new data.
- It must go through several iterations in order to get more information about a user's voiceprint – if the gender of the speaker cannot be determined to a reliable degree for whatever reason, the program should be able to ask for another voice sample.
- It must be reliable to at least some degree.
Planning and Problems
Abridged Methodology
Our basic methodology at this stage of the project amounts to a list of things that we might try in order to get the project to work. As we continue, it is likely that this methodology will change.
1. Record one word using the voices of everyone in the group (as everyone in the group is male), and ask some female volunteers to record the same word. This will give us a good set of control waveforms. We decided at this point to simply use the word "word".
2. Perform the Matlab waveform analysis on the waveforms to see the reference points that indicate differences between male and female voices. The analysis will be:
- Frequency analysis - assuming that females have a higher voice than males in terms of frequency;
- A power spectral density analysis - here we are not positive, but we believe that females will have a lower PSD than males for the same utterance. This will need to be verified using our control waveforms; and
- Energy analysis - again, we believe that females will have a lower amount of energy in their voices, giving us another threshold value to compare against.
3. Code a program that will record a person's voiceprint and then perform the same frequency, power and energy analyses. Compare these values to the now established reference values. If a value is under (or above) a certain threshold, then the program knows whether that aspect of the voiceprint is male or female and prints the decision.
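The comparison in step 3 can be sketched as follows. This is an illustrative Python sketch rather than the group's Matlab code: the threshold values are hypothetical placeholders, the directions of the comparisons follow the beliefs stated in step 2, and the final majority vote is an assumption added for illustration only.

```python
# Hypothetical thresholds; real values would come from the control recordings.
THRESHOLDS = {"frequency": 165.0, "psd": 0.5, "energy": 0.5}

# Direction assumed in step 2: female voices are expected ABOVE the frequency
# threshold and BELOW the PSD and energy thresholds.
FEMALE_IF_ABOVE = {"frequency": True, "psd": False, "energy": False}


def classify_aspects(features):
    """Return a male/female decision for each measured aspect."""
    decisions = {}
    for name, value in features.items():
        above = value > THRESHOLDS[name]
        decisions[name] = "female" if above == FEMALE_IF_ABOVE[name] else "male"
    return decisions


def majority_vote(decisions):
    """Combine per-aspect decisions (an assumption, not from the brief)."""
    votes = list(decisions.values())
    return "female" if votes.count("female") > votes.count("male") else "male"


sample = {"frequency": 210.0, "psd": 0.3, "energy": 0.7}
decisions = classify_aspects(sample)
overall = majority_vote(decisions)
print(decisions, overall)
```

Keeping the per-aspect decisions separate makes it easy to see which measurement disagreed when the overall answer is wrong.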
Foreseeable Problems
At the early stages of the project, before any code was written, the group as a whole voiced their concerns about problems that they thought would need to be overcome in order to create the program. At this early stage, the problems were limited to:
- Choice of word – would the word spoken need to be kept the same in both the training and recognition modes?
- Speed at which the word is spoken – would the speed at which the word was spoken in some way alter the frequency of a person's voice? This seemed to stand to reason – a 33⅓-RPM record played at 45 RPM sounds decidedly higher pitched.
- Intonation of certain syllables in the word – certain people stress certain syllables when they say a particular word, and this changes depending on accent and language.
- Volume of the spoken word – the group thought that the volume at which the word was spoken would likely differ between speakers, both because people naturally speak at different volumes and because of how close they hold the microphone to their mouths. It was also thought that talking too loudly might distort the words and so degrade the quality of the data.
- SNR – background noise was thought to be a major problem.
- ‘Blank’ time before and after the word is spoken – naturally, when a person speaks, they do not start as soon as the computer is ready to accept and record their voice, nor does the computer know exactly when they have stopped speaking, which is particularly true in the case of multiple words or a sentence. This ‘blank’ time might affect the data.
Setup of the system
The requirements for running the program are simple. Almost any computer (whether running an Apple or Microsoft platform) that has the following can run the program:
- Matlab
- A sound card
- A microphone attached
Minutes of meetings
A page giving an overview of the group's meeting minutes follows.
Code Analysis
At this point, we felt that time was passing us by, and that the only way to determine whether we were on the right track was to start coding and see what happened.
Following this is a link to a breakdown of our code, which provides some insight into how it works.
Terminology
The following is a list of definitions of some of the terminology used in our project.
- Frequency - Frequency is defined as the number of times a particular event occurs over a given period of time. In relation to our project we are using the voice frequency, which is used for the transmission of speech. This is usually given to be in the band of the electromagnetic spectrum in the range of 300 - 3000 Hz.[2]
- Energy - Energy is the work that a force can do. In this case, the force is that of a human voice and the energy is the work done in converting the movement of a person's vocal cords into sound. This energy can be estimated from the recorded waveform using signal processing.
- FFT - The fast Fourier transform (FFT) is any algorithm that efficiently computes the discrete Fourier transform (DFT) of a signal. The DFT in turn is used to analyse the frequencies present in a sampled signal.
- Specgram - This Matlab function computes the windowed discrete-time Fourier transform (DTFT) of a signal using a sliding window.
- SNR - The signal-to-noise ratio (SNR) is a power ratio between a signal and any background noise.
- Normalisation - In relation to our project, the normalisation that we are mostly using is audio normalisation. Audio normalisation is the process of increasing (or decreasing) the amplitude or volume of the waveform.
- Power - Power is simply the energy of the given waveform, divided by the length of time that waveform is produced.
- PSD - The power spectral density (PSD) describes how the power of a signal or time series is distributed with frequency. It states at what particular frequencies the power of a waveform lies.
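The energy and power definitions above can be made concrete for a sampled waveform. The following is a minimal Python sketch (not the project's Matlab code); the function names and the test tone are illustrative assumptions. Energy is approximated as the sum of squared samples multiplied by the sample period, and power is that energy divided by the duration.

```python
import math


def signal_energy(samples, sample_rate):
    """Approximate energy: sum of squared samples times the sample period."""
    return sum(x * x for x in samples) / sample_rate


def signal_power(samples, sample_rate):
    """Power: energy divided by the length of time the waveform lasts."""
    duration = len(samples) / sample_rate
    return signal_energy(samples, sample_rate) / duration


fs = 22050  # sampling rate used in the project, in Hz
# One second of a 441 Hz sine with amplitude 0.5 (a whole number of periods).
tone = [0.5 * math.sin(2 * math.pi * 441 * n / fs) for n in range(fs)]

print(signal_energy(tone, fs), signal_power(tone, fs))
```

For a sine of amplitude A the power is A²/2, so both values here come out close to 0.125. Recording the same tone for twice as long doubles the energy but leaves the power unchanged.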
Standardisation
There are a few overall standards to adhere to when using the system. They are as follows:
1. Keep the microphone approx. 3 cm away from the mouth.
2. Speak loudly and clearly.
3. Keep background noise to a minimum for accurate results.
4. Keep to the time limit set for the recording.
5. Use a good quality microphone to avoid any interference.
6. For pre-recorded sound files, use a sampling frequency of 22.05 kHz, 16-bit mono.
Rationalisation of procedures
Data acquisition
The parameters used for the acquisition of sound data were:
- 16 bit
- Single stream, i.e. mono
- 22.05 kHz sampling rate
The group chose 22.05 kHz because the propensity for the signal to alias at this rate is particularly low: the Nyquist frequency of 11.025 kHz lies well above the frequencies the analysis relies on. As a result, an anti-aliasing filter was judged unnecessary later in the system design, which also saves computation.
Normalisation
Normalisation of the signals is the first thing that occurs once the waveforms have been entered into Matlab. This is done for a number of reasons, all related to the volume of the input signal. The volume, or more precisely the amplitude, of the input signal naturally varies with how loudly the person speaks: if the person speaks loudly, the amplitude of the recorded signal will be higher than if that person had spoken softly. This in turn alters the energy, PSD etc. of the signal. The amplitude must therefore be normalised to a fixed level to remove the variability of a person's speech volume. The same is true if the person holds the microphone further away from their mouth when they speak; in that case the amplitude must be normalised upwards so that it can be properly compared with that of someone who held the microphone close to their mouth. This is why normalisation was employed in the project.
An unnormalised waveform:
A normalised waveform:
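One common way to carry out this kind of amplitude normalisation is peak normalisation, sketched below in Python rather than Matlab. The text does not say which scheme the project used, so scaling to a fixed peak is an assumption, as are the function name and the `target_peak` parameter.

```python
def normalise_peak(samples, target_peak=1.0):
    """Scale a waveform so that its largest absolute sample equals target_peak."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [x * gain for x in samples]


# The same waveform recorded quietly and four times louder.
quiet = [0.0, 0.1, -0.2, 0.05]
loud = [4 * x for x in quiet]

print(normalise_peak(quiet))
print(normalise_peak(loud))
```

After normalisation the quiet and loud recordings are the same up to rounding, so later energy and PSD measurements no longer depend on how loudly the word was spoken.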
Clipping
Another natural human interference with the system is the delay between when someone is asked to speak and when they actually start speaking, together with the program not knowing when they have finished speaking. If the person's word is shorter than the recording window, there will be blank stretches at the beginning and end of the waveform that contain nothing but noise. This noise creates two problems: it lowers the SNR, and it makes for longer, unnecessary computation later in the program. It must therefore be eliminated, which is why clipping was employed in the program.
To show evidence of this here is a graph without clipping:
Note the beginning, before the person spoke, where noise is evident but there is no actual signal. At the end there is again only noise.
These are taken away in the clipped version of the waveform:
Frequency analysis
To show that females do indeed have higher frequency speech patterns than those of a male, we show below the frequency spectrum of a female:
And that of a male saying the same thing:
It can clearly be seen that the main peak of the female frequency spectrum is at a higher frequency than that of the male. This is why frequency was used as an indicator of gender.
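Locating the main spectral peak can be sketched as below. This is a Python illustration under stated assumptions, not the project's Matlab code: it uses a direct DFT for clarity where an FFT would be used in practice, and the two pure tones (120 Hz and 220 Hz) are synthetic stand-ins for male-like and female-like voices, not real recordings.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2, computed directly from the definition."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest bin, ignoring the DC bin."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)


fs = 8000   # illustrative sampling rate, giving 20 Hz bins for 400 samples
n = 400
male_like = [math.sin(2 * math.pi * 120 * i / fs) for i in range(n)]
female_like = [math.sin(2 * math.pi * 220 * i / fs) for i in range(n)]

print(dominant_frequency(male_like, fs), dominant_frequency(female_like, fs))
# prints: 120.0 220.0
```

Real speech has many harmonics, so the strongest peak is not always the fundamental; the sketch only shows the mechanics of reading a peak from the spectrum.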
Specgram
Specgram is used in our program as it splits the signal into overlapping sections and applies the window specified in the program to each section.[3] When specgram is called in Matlab, it computes the Fourier transform of each section of the waveform, with the FFT length given by NFFT. It highlights the different frequency components present in the waveform, and their intensity, over a specified period of time.[4] The result is a two-dimensional representation of a sound, plotting frequency against time, with amplitude indicated by the brightness or colour of the plot. Furthermore, specgram assists in computing the power of the waveform by storing the power of the signal in a matrix defined by the user. See the code analysis for a better understanding.[5]
A male specgram shows the following:
Whilst a female shows this:
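The sliding-window computation behind a spectrogram can be sketched as follows. This Python sketch is not Matlab's specgram and makes several assumptions: a Hann window, a direct DFT instead of an FFT, and a time-major output layout (one row per time frame), which is the transpose of the frequency-by-time matrix a spectrogram plot shows.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(samples, nfft=64, overlap=32):
    """Return frames[t][k]: power in frequency bin k during time frame t."""
    window = hann(nfft)
    hop = nfft - overlap
    frames = []
    for start in range(0, len(samples) - nfft + 1, hop):
        seg = [samples[start + i] * window[i] for i in range(nfft)]
        row = []
        for k in range(nfft // 2 + 1):
            acc = sum(seg[i] * cmath.exp(-2j * math.pi * k * i / nfft)
                      for i in range(nfft))
            row.append(abs(acc) ** 2)
        frames.append(row)
    return frames


fs = 6400  # illustrative rate: with nfft=64 each bin is 100 Hz wide
# A 400 Hz tone followed by a 1200 Hz tone, 256 samples each.
signal = [math.sin(2 * math.pi * (400 if i < 256 else 1200) * i / fs)
          for i in range(512)]
frames = spectrogram(signal)
first_peak = max(range(len(frames[0])), key=lambda k: frames[0][k])
last_peak = max(range(len(frames[-1])), key=lambda k: frames[-1][k])
print(len(frames), first_peak * fs / 64, last_peak * fs / 64)
```

The strongest bin moves from 400 Hz in the first frame to 1200 Hz in the last, which is the change over time that a single spectrum of the whole recording would hide.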
PSD
The power spectral density (PSD) describes how the power of a signal or time series is distributed with frequency. It states at what particular frequencies the power of a waveform lies.[6] Mathematically, it is defined as the Fourier transform of the autocorrelation sequence of the time series. An equivalent definition of the PSD is the squared modulus of the Fourier transform of the time series, scaled by a suitable constant. We calculate the PSD in our program as it highlights the intricacies in the waveform.[7]
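The second definition (squared modulus of the Fourier transform, scaled by a constant) corresponds to the periodogram estimate, sketched below in Python. The choice of a one-sided periodogram with the scaling 1/(fs·N) is an assumption made for illustration; the project's Matlab code may estimate the PSD differently.

```python
import cmath
import math


def periodogram(samples, sample_rate):
    """One-sided periodogram: |X[k]|^2 / (fs * N), doubled for interior bins."""
    n = len(samples)
    psd = []
    for k in range(n // 2 + 1):
        xk = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                 for i, x in enumerate(samples))
        p = abs(xk) ** 2 / (sample_rate * n)
        if 0 < k < n / 2:
            p *= 2  # fold the matching negative-frequency bin into this one
        psd.append(p)
    return psd


fs = 8000   # illustrative sampling rate; bins are fs / n = 40 Hz wide
n = 200
tone = [math.sin(2 * math.pi * 400 * i / fs) for i in range(n)]

psd = periodogram(tone, fs)
peak_bin = max(range(len(psd)), key=lambda k: psd[k])
total_power = sum(psd) * fs / n  # integrate the density over frequency
print(peak_bin * fs / n, total_power)
```

Integrating the PSD over frequency recovers the mean power of the waveform (about 0.5 for a unit-amplitude sine), which is a useful sanity check on the scaling constant.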
Training iterations
As seen in the code, the program asks the user to train it. This is done by asking the user to either:
- Input a voice - either male or female
- Give the program an already recorded voice - either male or female
This is so that the program can establish highly accurate threshold levels to compare the final voice against. Iterations were used so that the program keeps asking for further training until it is requested by the user to stop, and use the data already in the program. Usually only a few training rounds are needed to give accurate results.
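How repeated training rounds might refine a threshold can be sketched as follows. This Python sketch is an assumption about one plausible scheme, not the project's Matlab code: it keeps a running mean of a single feature per gender and places the threshold midway between the two means. The class name, method names and feature values are hypothetical.

```python
class ThresholdTrainer:
    """Keeps running per-gender means of one feature and derives a threshold."""

    def __init__(self):
        self.totals = {"male": 0.0, "female": 0.0}
        self.counts = {"male": 0, "female": 0}

    def add_sample(self, label, feature_value):
        """Record one training voice with a known label."""
        self.totals[label] += feature_value
        self.counts[label] += 1

    def threshold(self):
        """Midpoint between the class means, or None until both are trained."""
        if 0 in self.counts.values():
            return None
        means = {g: self.totals[g] / self.counts[g] for g in self.totals}
        return (means["male"] + means["female"]) / 2


trainer = ThresholdTrainer()
trainer.add_sample("male", 110.0)
before = trainer.threshold()          # only one class seen so far
trainer.add_sample("female", 210.0)
trainer.add_sample("male", 130.0)
trainer.add_sample("female", 230.0)
print(before, trainer.threshold())
```

Each additional round shifts the class means, and therefore the threshold, towards the voices actually seen, which is consistent with the observation that a few rounds are usually enough.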
Problems Encountered
1. Sound Cropping
During sound recording we found that the recordings always have a blank period before the speaker starts speaking. We found background noise in those blank periods, and it turned out that this noise introduces high-frequency components that make the analysis inaccurate. Hence it is important to crop these blank periods.
To remedy this we combined the Moving Average algorithm with the Background Noise Learning algorithm.
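The text does not describe the two algorithms in detail, so the following Python sketch is only one plausible reading, with every parameter an assumption: the background level is "learned" from the opening samples (assumed to contain no speech), the rectified signal is smoothed with a trailing moving average, and the recording is cropped to the span where the smoothed level exceeds a multiple of the learned noise level.

```python
import math


def moving_average(values, width):
    """Trailing moving average; early outputs average the samples seen so far."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= width:
            running -= values[i - width]
        out.append(running / min(i + 1, width))
    return out


def find_active_span(samples, noise_len=100, width=20, factor=3.0):
    """Return (start, end) indices of the span judged to contain speech."""
    magnitudes = [abs(x) for x in samples]
    # Learn the background level from the opening samples (assumed silent).
    noise_floor = sum(magnitudes[:noise_len]) / noise_len
    envelope = moving_average(magnitudes, width)
    active = [i for i, e in enumerate(envelope) if e > factor * noise_floor]
    if not active:
        return None
    return active[0], active[-1] + 1


# Synthetic recording: 300 samples of low-level noise, 400 of "speech", 300 of noise.
noise = [0.01 if i % 2 else -0.01 for i in range(300)]
speech = [0.5 * math.sin(2 * math.pi * i / 40) for i in range(400)]
recording = noise + speech + noise

start, end = find_active_span(recording)
cropped = recording[start:end]
print(start, end, len(cropped))
```

Because the moving average trails the signal, the detected span starts a few samples late and ends a few samples late; a shorter window tightens the crop at the cost of a noisier envelope.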
2. Power Spectral Density Diagram and Calculations
To identify whether a vocal input is from a male or a female, one critical factor is the power spectral density in the low-frequency band from 0 to 2 kHz. In the early design stage, we calculated the sum of the PSD. Because the recordings differed in length, the results of the PSD analysis seemed random, which caused great confusion when we tried to set the threshold for the PSD. The problem persisted until we worked out that the average PSD is more significant and predictable.
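Why a sum depends on recording length while an average does not can be shown with a small Python sketch. It assumes that the "sum" was taken over all analysis frames of the recording, which the text does not state explicitly; the frame length, sampling rate and test tone are likewise illustrative.

```python
import cmath
import math


def band_power_per_frame(samples, sample_rate, nfft=64, f_max=2000.0):
    """Power in the 0..f_max band for each non-overlapping frame of nfft samples."""
    top_bin = int(f_max * nfft / sample_rate)
    powers = []
    for start in range(0, len(samples) - nfft + 1, nfft):
        frame = samples[start:start + nfft]
        total = 0.0
        for k in range(top_bin + 1):
            xk = sum(x * cmath.exp(-2j * math.pi * k * i / nfft)
                     for i, x in enumerate(frame))
            total += abs(xk) ** 2
        powers.append(total)
    return powers


def tone(num_samples, sample_rate, freq=400.0):
    """A steady sine standing in for a sustained voiced sound."""
    return [math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(num_samples)]


fs = 6400
short_p = band_power_per_frame(tone(512, fs), fs)
long_p = band_power_per_frame(tone(1024, fs), fs)

print(sum(short_p), sum(long_p))                                # sum doubles
print(sum(short_p) / len(short_p), sum(long_p) / len(long_p))   # mean is stable
```

The same sound recorded for twice as long doubles the summed value but leaves the per-frame average unchanged, so only the average can be compared against a fixed threshold.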
3. Removing Noise
Even when our cropping mechanism works well for the blank periods, it does not cancel the background noise recorded after the speaker starts speaking, so additional frequency components still appear in our frequency analysis. An effective way to damp the background noise further is to pass the recordings through an additional filter. However, in the early stage of the program design, we noticed that inappropriate filter settings caused us to lose useful sound components. It took us about a week to find the best possible filter setting, one that balances noise filtering against vocal loss.
Results
Overall Accuracy of the system
The overall accuracy of the system can be calculated in the following manner:
Accuracy = (1 - Number of errors / Number of times experiment run) × 100. This will give a percentage figure for the accuracy of the program.
Number of runs = 19
Number of errors = 4
Thus, accuracy = (1 - 0.2105) × 100 = 78.95%
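The accuracy formula can be checked with a few lines of Python; the function name is illustrative. Evaluating it for 19 runs and 4 errors gives 78.95% to two decimal places.

```python
def accuracy_percent(runs, errors):
    """Accuracy = (1 - errors / runs) * 100, as defined above."""
    return (1 - errors / runs) * 100


result = round(accuracy_percent(19, 4), 2)
print(result)  # prints: 78.95
```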
Benefits of the system
There are some benefits to our system over and above the high accuracy:
- It is live - it can record voices in real time, rather than requiring pre-recorded voices;
- It is quick to process the data and gives an output immediately; and
- It is highly robust.
Shortcomings of the system
There are some shortcomings or limitations of the program:
- The length of the window of time is limited. If one records for too long, then it uses too much processing power;
- If the amplitude of the sound is overly high (i.e. if one shouts into the microphone) then the program gives a divide-by-zero error; and
- It is highly CPU intensive.
References
1. Kim J, Using Voice EZ, 25 April 2006, http://me.kaist.ac.kr/upload/course/MAE683/MAE683_UsingVoiceEZ.pdf
2. Baken, R. J. (1987). Clinical Measurement of Speech and Voice. London: Taylor and Francis Ltd.
3. Wikipedia, The Free Encyclopedia, Spectrogram, Date accessed - 20 October 2006, http://en.wikipedia.org/wiki/Spectrogram
4. University of Indiana, Working Papers: Glossary, Date accessed - 17 October 2006, http://www.indiana.edu/~savail/workingpapers/glossary.html
5. Mathworks, Specgram, Date accessed - 26 October 2006, http://www.mathworks.com/access/helpdesk_r13/help/toolbox/signal/specgram.html
6. Wikipedia, The Free Encyclopedia, Power Spectral Density, Date accessed - 15 October 2006, http://en.wikipedia.org/wiki/Power_spectral_density
7. Kay SM, Marple SL Jr. (1981) Spectrum Analysis - A Modern Perspective, http://www.cbi.dongnocchi.it/glossary/PowerSpectralDensity.html











