Voice Isolation: When Noise Reduction Meets Deep Learning | EE Times
A new approach to old problems -- with the help of deep neural networks -- may make background noise a thing of the past.
Sometimes it's easy to forget that a smartphone is also a telephone. With all the fantastic functions and features, somehow people have grown accustomed to the occasional dropped syllable and garbled sounds that make us repeat ourselves time and again.
A recent article in Scientific American suggests that the fault is with the service providers. It's true that bandwidth is definitely a factor, but even when there is a relatively good connection, throw in a noisy environment like a coffee shop or morning traffic and communication starts to break down. A new approach to old problems -- with the help of deep neural networks -- may make background noise a thing of the past.
The voice band -- good enough?
Since the invention of the telegraph almost two centuries ago, there has been nearly exponential improvement in bandwidth, mobility, speed, and reliability. That said, a key aspect of voice telecommunications has lagged behind: the quality and intelligibility of transmitted voice. Very early on, the standard for human voice transmission was set as the "voice band" located between 300 Hz and 3.3 kHz (to put this in perspective, the natural frequency span of human voice during speech ranges from about 50 Hz to nearly 10 kHz).
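To make the gap concrete, here is a minimal sketch of the arithmetic, using only the band edges quoted above. The 120 Hz pitch is an assumed example value for a low speaking voice, not a figure from the article, and the function names are purely illustrative.

```python
import math

VOICE_BAND_HZ = (300.0, 3300.0)     # classic telephone "voice band"
SPEECH_SPAN_HZ = (50.0, 10000.0)    # approximate natural span of speech

def coverage(band, span):
    """Fraction of the span's width (in Hz) that the band covers."""
    return (band[1] - band[0]) / (span[1] - span[0])

def octaves(band):
    """Width of a band in octaves (doublings of frequency)."""
    return math.log2(band[1] / band[0])

def harmonics_inside(f0_hz, band):
    """Harmonics of the pitch f0_hz that fall inside the band."""
    highest = int(band[1] // f0_hz)  # last harmonic at or below the upper edge
    return [k * f0_hz for k in range(1, highest + 1) if k * f0_hz >= band[0]]

pitch_hz = 120.0  # assumed example pitch
kept = harmonics_inside(pitch_hz, VOICE_BAND_HZ)
print(f"linear coverage: {coverage(VOICE_BAND_HZ, SPEECH_SPAN_HZ):.0%}")
print(f"octaves: {octaves(VOICE_BAND_HZ):.2f} of {octaves(SPEECH_SPAN_HZ):.2f}")
print(f"harmonics kept: {len(kept)}, fundamental kept: {pitch_hz in kept}")
```

Under these assumptions the voice band spans roughly a third of the natural range, and the fundamental of a 120 Hz voice is not transmitted at all; the listener's ear reconstructs it from the harmonics that do get through.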
Apparently, this was satisfactory for landline usage in quiet settings, and generations of phone users came to expect poor call quality. When these standards were carried over for cellphone audio quality, and with the added woes of spotty network coverage and connection dropouts, cellphone users' expectations for call quality fell even lower.
Extending the frequency (for better or for worse)
Now that there are about as many cellphone subscriptions as there are people on earth, one would think that there really shouldn't be any more technological excuses for poor voice quality. New standards branded as HD Voice and VoLTE promise the eventual extension of the voice transmission frequency range up to 7 kHz. An IEEE Spectrum article from September 2014 gave an instructive, in-depth analysis of the causes of lousy voice quality, and placed hope in the deployment of these new technologies. Their implementation requires new hardware and new networks, hurdles that will be overcome in time, but broadening the voice band does nothing to solve the other major challenge preventing great-sounding calls -- in fact, HD Voice and its relatives may actually make the problem worse!
Noise and the boundless crusade for its cancellation
Nearly half of all phone users today employ their mobile phones as their primary voice connection (a number sure to grow). Mobile phones, by design, are used in many different environments: in planes, trains, and automobiles; at sporting events, offices, factories, and shopping centers; on playgrounds and (yeah, that guy) in public restrooms.
Just think of the noises you might encounter walking through an urban downtown, near a construction site, or in an airport lounge. While the narrow range of the current voice band standard impairs the quality of the voice that is transmitted, it also automatically filters out any noise that may be present in higher frequency bands. By doubling the frequency span of the voice band, HD Voice and relatives increase the environmental noise power that is transmitted and, ironically, can make voice quality and intelligibility worse in everyday use cases.
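A back-of-the-envelope sketch shows the size of the effect. It assumes the background noise is spread evenly across frequency (white noise), which real environments only approximate, and it assumes the lower band edge stays at 300 Hz when the upper edge moves to 7 kHz. Both assumptions are ours, not the article's.

```python
import math

def noise_power_change_db(old_band_hz, new_band_hz):
    """Change in transmitted noise power, in dB, when the passband widens.

    Assumes a flat noise spectrum, so noise power is proportional
    to the bandwidth in Hz.
    """
    old_width = old_band_hz[1] - old_band_hz[0]
    new_width = new_band_hz[1] - new_band_hz[0]
    return 10.0 * math.log10(new_width / old_width)

narrowband = (300.0, 3300.0)
wideband = (300.0, 7000.0)   # assumed lower edge; the text only gives the 7 kHz top

extra_db = noise_power_change_db(narrowband, wideband)
print(f"extra noise power let through: {extra_db:.1f} dB")  # roughly 3.5 dB
```

A little more than doubling the bandwidth lets through a little more than twice the noise power. Whether the call sounds worse depends on how much speech energy the added band carries compared with the noise it admits.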
The noise challenges facing cellphone users are a far cry from the relatively stable noise environments that exist around landlines and, by and large, noise reduction technologies have not caught up. From commonly used phase cancellation, to techniques using multiple microphones or statistical properties and mathematical assumptions about environmental noise, each attempt to isolate and cancel out noise has its deficiencies. Either some of the noise gets through or the voice suffers from audio artifacts.
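Both failure modes are easy to reproduce with a toy version of spectral subtraction, one well-known technique that relies on an assumed noise level. The article does not name it; it is used here only as a representative example. The four-bin "spectrum" and its values are invented for illustration, and treating speech and noise magnitudes as simply additive is a simplification.

```python
def spectral_subtract(noisy, noise_estimate):
    """Subtract an assumed noise magnitude from every bin, clamping at zero."""
    return [max(mag - noise_estimate, 0.0) for mag in noisy]

speech = [0.0, 5.0, 0.0, 1.0]   # strong harmonic in bin 1, weak one in bin 3
noise = [1.0, 1.0, 1.0, 1.0]    # true noise level in every bin
noisy = [s + n for s, n in zip(speech, noise)]

too_low = spectral_subtract(noisy, 0.5)    # noise underestimated
too_high = spectral_subtract(noisy, 2.0)   # noise overestimated

print(too_low)    # [0.5, 5.5, 0.5, 1.5] -> residual noise in every bin
print(too_high)   # [0.0, 4.0, 0.0, 0.0] -> the weak harmonic is erased
```

Only when the estimate matches the true noise exactly does the speech come back intact, and in a changing environment the estimate is rarely exact.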
Voice isolation instead of noise cancellation
The engineering team at Cypher took a different tack when developing its noise reduction technology. Instead of formulating the problem as one of capturing a signal and then eliminating the noise, they considered its mathematical dual: how to characterize and isolate the speech components of a noisy signal. Rather than trying to defend against all of the possible noise types -- an impossible task -- Cypher concentrates on elucidating and extracting common elements of human speech.
At the core of this approach is a sophisticated deep learning methodology that identifies mathematical descriptors, which can be used in training neural networks for audio pattern recognition. The deep learning stage takes place offline using a large database of human speech. The goal of the learning is to identify and separate human speech from any environmental noise.
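The idea of learning offline from labelled examples can be pictured with a deliberately tiny stand-in. The sketch below is not Cypher's method: a single logistic neuron replaces the deep network, and the one descriptor it uses (a hypothetical "flatness" score, low for tonal speech frames and high for hiss-like noise frames) and the synthetic training data are assumptions made purely for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_training_set(n_per_class, seed=0):
    """Synthetic labelled frames as (flatness, label); 1 = speech, 0 = noise.

    The value ranges are invented for illustration, not measured.
    """
    rng = random.Random(seed)
    speech = [(rng.uniform(0.1, 0.4), 1) for _ in range(n_per_class)]
    noise = [(rng.uniform(0.6, 0.9), 0) for _ in range(n_per_class)]
    return speech + noise

def train(frames, epochs=2000, learning_rate=1.0):
    """Fit one weight and one bias by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(frames)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in frames:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def speech_probability(flatness, w, b):
    """Score a new frame with the parameters learned offline."""
    return sigmoid(w * flatness + b)

w, b = train(make_training_set(50))
print(speech_probability(0.2, w, b))   # tonal frame: should score above 0.5
print(speech_probability(0.8, w, b))   # noise-like frame: should score below 0.5
```

The expensive part, the training loop, runs once ahead of time; what ships to the device is only the learned parameters and the cheap scoring function.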
The result is a deep neural network that can identify in real time precisely when and where in an audio signal the human voice is present. Despite its broad and robust pattern recognition capabilities, this deep neural network is fast enough and compact enough to run in software on the CEVA-TeakLite-4 DSP. The neural network also guides other algorithmic components of Cypher's patented technology as they isolate the person speaking from all other sources of noise -- even other nearby human speakers. Once the desired voice has been extracted, post-processing modules enhance the voice signal and remove artifacts created in the background noise elimination process. The final output has a balanced, full sound as close to the original speaker as possible. To experience the clarity, visit Cypher's website for demonstrations of this cutting-edge technology in a variety of different environments.
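One way to picture "when and where" is as a mask over a time-frequency grid: each frame is a moment in time, each bin a frequency, and the mask marks the cells where voice is judged to be present. The sketch below is a rough illustration under that assumption, not a description of Cypher's patented pipeline. The `presence_mask` heuristic is a hypothetical stand-in for the trained network, the toy spectrogram values are invented, and the post-processing stage is not modelled.

```python
from statistics import median

def presence_mask(frame, ratio=3.0):
    """Stand-in for the network: flag bins that rise well above the
    frame's median level. A trained model would replace this heuristic."""
    level = median(frame)
    return [1.0 if mag > ratio * level else 0.0 for mag in frame]

def isolate(spectrogram):
    """Keep only the cells the mask marks as voice; zero out the rest."""
    return [
        [gain * mag for gain, mag in zip(presence_mask(frame), frame)]
        for frame in spectrogram
    ]

# Toy magnitudes: rows are time frames, columns are frequency bins.
spectrogram = [
    [1.0, 1.2, 0.9, 1.1],   # background noise only
    [1.0, 8.0, 0.9, 1.1],   # a voiced harmonic appears in bin 1
    [1.1, 9.0, 1.0, 1.2],   # and persists in the next frame
]

for frame in isolate(spectrogram):
    print(frame)
```

The first frame is silenced entirely and only bin 1 survives in the other two. A hard on/off mask like this one produces exactly the kind of artifacts that a post-processing stage exists to smooth out.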
Eran Belaish serves as CEVA's Marketing Manager of Audio and Voice Product Line, overseeing Audio and Voice processing, Android interfaces, wearable devices, and Wireless Audio. Prior to this position, Eran served as CEVA's Senior Compiler Group Leader responsible for managing all compiler-related research and development, and before that he held several engineering and management positions at CEVA since 2003. Eran holds a B.Sc. in Electrical Engineering and Computer Science from Tel-Aviv University.
Dr. Erik Sherwood serves as Cypher's Chief Scientist. Erik is an applied mathematician with expertise in dynamical systems, computational neuroscience, scientific computing, algorithm design, statistics, and machine learning. Prior to joining Cypher, he worked in academia and held teaching, research, and faculty positions in the mathematics departments of Cornell University, Boston University, and, most recently, the University of Utah. Erik studied at Princeton, the University of Bremen (Germany), Cambridge University, and Cornell University. He earned an AB with honors in mathematics and certificates in applied mathematics, computer science, and German from Princeton University, and MS and PhD degrees in applied mathematics from Cornell University.