Published on Jul 03, 2020
Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals, while generating a smooth transition between them.
In image morphing, the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. A speech morph should possess the same feature: one speech signal should change smoothly into another, keeping the shared characteristics of the starting and ending signals while smoothly changing the other properties.
The properties of main concern in a speech signal are its pitch and envelope information. These two reside in convolved form in the signal, so an efficient method for extracting each of them is necessary. We have adopted a straightforward approach, namely cepstral analysis, to do so: pitch and formant information in each signal is extracted using the cepstral approach. The processing needed to obtain the morphed speech signal includes cross-fading of envelope information, Dynamic Time Warping to match the major signal features (pitch), and signal re-estimation to convert the morphed speech signal back into an acoustic waveform.
INTROSPECTION OF THE MORPHING PROCESS
Speech morphing can be achieved by transforming the signal's representation from the acoustic waveform obtained by sampling the analog signal, with which many people are familiar, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames', that is, sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band.
Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound. A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches are simply cross-faded, it is highly likely that two separate sounds will be heard. This occurs because the signal will have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout.
The pitch information of each sound is compared to provide the best match between the two signals' pitches. To achieve this match, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, which creates the intermediate sounds in the morph. The final stage is then to convert the frames back into a normal waveform.
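The framing step described above can be sketched as follows. This is a minimal illustration, not the method used in the original work: the function names, the 50% overlap and the Hann window are all assumptions, since the text does not say how frames are sized or tapered.

```python
import math

def split_into_frames(signal, frame_len, hop):
    """Split a list of samples into frames of frame_len samples, hop samples apart.

    A trailing partial frame is dropped (an assumption made for simplicity).
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def hann_window(frame):
    """Taper a frame toward zero at both ends before it is transformed."""
    n = len(frame)
    return [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]

# Toy input: 100 samples of a sine wave, cut into 40-sample frames with 50% overlap.
samples = [math.sin(2.0 * math.pi * 5.0 * t / 100.0) for t in range(100)]
frames = split_into_frames(samples, frame_len=40, hop=20)
windowed = [hann_window(f) for f in frames]
```

Each windowed frame would then be passed to the frequency-domain transformation, one frame at a time.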
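One common way to obtain these two pieces of information is the real cepstrum: the inverse transform of the log-magnitude spectrum of a frame. The sketch below assumes that definition and uses a naive DFT so that it needs no external library. The names `real_cepstrum` and `split_cepstrum`, the cutoff value and the toy frame are illustrative choices, not taken from the original work.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform; adequate for short example frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-9):
    """Inverse DFT of the log-magnitude spectrum of one frame.

    Magnitudes are clamped to `floor` so the logarithm is always defined.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]

def split_cepstrum(cepstrum, cutoff):
    """Separate low quefrencies (envelope) from the remainder (pitch)."""
    n = len(cepstrum)
    envelope = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    pitch = [c - e for c, e in zip(cepstrum, envelope)]
    return envelope, pitch

# Toy voiced frame: a decaying pulse repeated every 20 samples.
period = 20
frame = [0.6 ** (t % period) for t in range(200)]
cep = real_cepstrum(frame)
envelope_part, pitch_part = split_cepstrum(cep, cutoff=10)

# The strongest peak in the pitch part sits at the pulse period.
peak_lag = max(range(10, 31), key=lambda q: pitch_part[q])
```

In this toy case the peak falls at a lag of 20 samples, the repetition period of the pulse. Real speech frames are noisier, and the cutoff between envelope and pitch would have to be chosen for the sampling rate and the speaker.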
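The interpolation stage can be pictured as a weighted blend of time-aligned frames, with the weight moving from the first sound to the second. The sketch below assumes a linear blend and a weight that rises evenly across the morph; both are illustrative assumptions, and the function names are hypothetical.

```python
def crossfade(values_a, values_b, alpha):
    """Linear blend of two equal-length lists; alpha=0 gives a, alpha=1 gives b."""
    if len(values_a) != len(values_b):
        raise ValueError("inputs must already be aligned to the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(values_a, values_b)]

def morph_frames(frames_a, frames_b):
    """Blend aligned frame pairs with a weight that rises from 0 to 1."""
    count = len(frames_a)
    if count != len(frames_b):
        raise ValueError("need the same number of aligned frames")
    morphed = []
    for i, (fa, fb) in enumerate(zip(frames_a, frames_b)):
        alpha = i / (count - 1) if count > 1 else 0.0
        morphed.append(crossfade(fa, fb, alpha))
    return morphed

# Three aligned frames of (toy) two-value envelope data for each sound.
env_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
env_b = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
result = morph_frames(env_a, env_b)
# result == [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
```

The blend is applied to the envelope data of frames that have already been matched in time, not to the raw waveforms.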
Matching and Warping: Background theory
Both signals will have a number of 'time-varying properties'. To create an effective morph, it is necessary to match one or more of these properties of each signal to those of the other signal in some way. The property of concern here is the pitch of the signal, although other properties such as the amplitude could be used, and it will have a number of features. It is almost certain that matching features do not occur at exactly the same point in each signal. Therefore, each feature must be moved to some point between its position in the first sound and its position in the second sound. In other words, to smoothly morph the pitch information, the pitch present in each signal needs to be matched and then the amplitude at each frequency cross-faded. To perform the pitch matching, a pitch contour for the entire signal is required. This is obtained by using the pitch peak location in each cepstral pitch slice.
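Building the pitch contour from the peak location in each cepstral pitch slice might look like the sketch below. It assumes each slice is a list indexed by quefrency in samples and that the search is restricted to a plausible lag range. The function name, the lag limits and the 8000 Hz sampling rate are hypothetical values chosen for illustration.

```python
def pitch_contour(cepstral_slices, min_lag, max_lag, sample_rate):
    """Return one pitch estimate in Hz per frame from the peak of each slice."""
    contour = []
    for cep in cepstral_slices:
        # Location of the largest cepstral value inside the allowed lag range.
        peak_lag = max(range(min_lag, max_lag + 1), key=lambda q: cep[q])
        contour.append(sample_rate / peak_lag)
    return contour

def make_slice(peak_lag, length=200):
    """Build a toy pitch slice with a single peak (for demonstration only)."""
    s = [0.0] * length
    s[peak_lag] = 1.0
    return s

slices = [make_slice(80), make_slice(100), make_slice(64)]
contour = pitch_contour(slices, min_lag=40, max_lag=160, sample_rate=8000)
# contour == [100.0, 80.0, 125.0]
```

A peak at a lag of 80 samples at 8000 Hz corresponds to a pitch of 100 Hz, so the contour is one frequency value per frame.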
Consider the simple case of two signals, each with two features occurring in different positions as shown in the figure below.
The match path shows the amount of movement (or warping) required to align corresponding features in time. Such a match path is obtained by Dynamic Time Warping (DTW).
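Given a match path, the position of each matched feature in an intermediate sound can be placed between its two original positions. The sketch below assumes a simple linear interpolation controlled by a morph factor `alpha`. Both the factor and the function name are hypothetical, and the frame positions are made-up values for the two-feature case.

```python
def intermediate_positions(match_path, alpha):
    """Place each matched pair at a time between its two original positions.

    alpha=0 keeps the first signal's timing, alpha=1 the second signal's.
    """
    return [(1.0 - alpha) * i + alpha * j for i, j in match_path]

# Two features: at frames 2 and 6 in the first signal, 4 and 9 in the second.
path = [(2, 4), (6, 9)]
halfway = intermediate_positions(path, 0.5)
# halfway == [3.0, 7.5]
```

Halfway through the morph, the first feature sits at frame 3 and the second at frame 7.5, between their positions in the two source signals.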
Dynamic Time Warping
Speaker recognition and speech recognition are two important applications of speech processing. These applications are essentially pattern recognition problems, a large field in itself. Some Automatic Speech Recognition (ASR) systems employ time normalization, the process by which time-varying features within the words are brought into line. The current method is time-warping, in which the time axis of the unknown word is non-uniformly distorted to match its features to those of the pattern word. The degree of discrepancy between the unknown word and the pattern (the amount of warping required to match the two words) can be used directly as a distance measure.
Such a time-warping algorithm is usually implemented by dynamic programming and is known as Dynamic Time Warping (DTW). Here, DTW is used to find the best match between the features of the two sounds, in this case their pitch. To create a successful morph, major features that occur at generally the same time in each signal ought to remain fixed, and intermediate features should be moved or interpolated. DTW enables a match path to be created, which shows how each element in one signal corresponds to each element in the second signal.
In order to understand DTW, two concepts need to be dealt with:
Features: The information in each signal has to be represented in some manner.
Distances: Some form of metric has to be used in order to obtain a match path. There are two types:
1. Local: a computational difference between a feature of one signal and a feature of the other.
2. Global: the overall computational difference between an entire signal and another signal of possibly different length.
Feature vectors are the means by which the signal is represented, and they are created at regular intervals throughout the signal. In this use of DTW, a path between two pitch contours is required, so each feature vector will be a single value. In other uses of DTW, however, such feature vectors could be large arrays of values. Since the feature vectors could have multiple elements, a means of calculating the local distance is required. The distance between two feature vectors is calculated using the Euclidean distance metric.
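The pieces above (feature vectors, a Euclidean local distance, a global distance and a match path) fit together as in the following sketch. It assumes the simplest step pattern, in which each cell may be reached diagonally, vertically or horizontally with equal weight. The text does not specify the constraints actually used, so this is an illustration rather than the original implementation.

```python
import math

def euclidean(u, v):
    """Local distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw(seq_a, seq_b):
    """Return (global distance, match path) between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Trace the cheapest route back from the end to recover the match path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path

# Two toy pitch contours in Hz; each feature vector holds a single value.
contour_a = [[100.0], [100.0], [120.0], [140.0]]
contour_b = [[100.0], [120.0], [120.0], [140.0]]
distance, match_path = dtw(contour_a, contour_b)
# distance == 0.0
# match_path == [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3)]
```

The two contours contain the same pitch values with different timing, so the global distance is zero. The match path shows where one contour has to be stretched: frames 0 and 1 of the first both map to frame 0 of the second, and frame 2 of the first maps to frames 1 and 2 of the second.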