Blog Archive

Thursday, June 28, 2018

When Machines Know How You're Feeling: The Rise of Affective Computing

source:

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While the origins of the field may be traced as far back as early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing. A motivation for the research is the ability to simulate empathy: the machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions.


The key players in the market were identified through secondary research sources, such as the Association for the Advancement of Affective Computing (AAAC), the Deep Learning Summit, and the International Neural Network Society, and their market shares in the respective regions were determined through primary and secondary research. This procedure includes the study of annual and financial reports of the top market players and extensive interviews with industry leaders for key insights. All percentage shares, splits, and breakdowns were determined using secondary sources and verified through primary sources.

The prominent players in the affective computing ecosystem are Google Inc. (California, U.S.), IBM Corporation (New York, U.S.), Microsoft Corporation (Washington, U.S.), Saffron Technology (North Carolina, U.S.), SoftKinetic Systems S.A. (Brussels, Belgium), Affectiva (Waltham, U.S.), Elliptic Labs (Oslo, Norway), Eyesight Technologies Ltd. (Israel), Pyreos Ltd. (Edinburgh, U.K.), Cognitec Systems GmbH (Germany), Beyond Verbal Communication Ltd. (Tel Aviv, Israel), Numenta (California, U.S.), GestureTek (Canada), and SightCorp (Amsterdam, the Netherlands).

Tuesday, June 26, 2018

A Beginner’s Guide on Sentiment Analysis with RNN

https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e

SSD vs HDD


Samsung 850 PRO - 2TB - 2.5-Inch SATA III Internal SSD (MZ-7KE2T0BW)

Ultimate Sequential Read/Write Performance: up to 550 MB/s and 520 MB/s, respectively


WD Gold 8TB Datacenter Hard Disk Drive - 7200 RPM Class SATA 6 Gb/s 128MB Cache 3.5 Inch - WD8002FRYZ

Ultimate Sequential Read/Write Performance: less than 50 MB/s
Linux command to monitor disk throughput: iostat -d -k 1 10 (device report only, in KB/s, sampled every second for 10 reports)

SSD vs HDD

Most people now buy laptops for their computing needs and have to make the decision between getting either a Solid State Drive (SSD) or a Hard Disk Drive (HDD) as the storage component. So which of the two is the better choice, an SSD or an HDD? There’s no straightforward answer to this question; every buyer has different needs, and you have to evaluate the decision based on those needs, your preferences, and of course your budget. Even though the price of SSDs has been falling, the price-per-gigabyte advantage still lies firmly with HDDs. Yet if performance and fast bootup are your primary considerations and money is secondary, then an SSD is the way to go. For the remainder of this article, we will compare SSD and HDD storage and go over the good, the bad, and the ugly of both.
What is an SSD?
We’ll make no assumptions here and keep this article on a level that anyone can understand. You might be shopping for a computer and simply wondering what the heck SSD actually means. To begin, SSD stands for Solid State Drive. You’re probably familiar with USB memory sticks; an SSD can be thought of as an oversized and more sophisticated version of the humble USB memory stick. Like a memory stick, an SSD has no moving parts; information is stored in microchips. A hard disk drive, by contrast, uses a mechanical arm with a read/write head that moves around to read information from the right location on a storage platter. This difference is what makes an SSD so much faster. As an analogy, which is quicker: having to walk across the room to retrieve a book, or simply, magically, having that book open in front of you when you need it? That’s how an HDD compares to an SSD; it simply requires more physical labor (mechanical movement) to get information.
A typical SSD uses what is called NAND-based flash memory. This is a non-volatile type of memory. What does non-volatile mean, you ask? The simple answer is that you can turn off the disk and it won’t “forget” what was stored on it, which is of course an essential characteristic of any type of permanent storage. During the early days of SSDs, rumors floated around saying stored data would wear off and be lost after only a few years. In any case, that rumor is certainly not true of today’s technology: you can read and write to an SSD all day long and the data storage integrity will be maintained for well over 200 years. In other words, the data storage life of an SSD can outlive you!
An SSD has no mechanical arm to read and write data; it instead relies on an embedded processor (or “brain”) called a controller to perform a bunch of operations related to reading and writing data. The controller is a very important factor in determining the speed of the SSD. The decisions it makes about how to store, retrieve, cache, and clean up data determine the overall speed of the drive. We won’t get into the nitty-gritty details of the various tasks it performs, such as error correction, read and write caching, encryption, and garbage collection, to name a few. Suffice it to say, good controller technology is often what separates an excellent SSD from a merely good one. An example of a fast controller today is the SandForce SATA 3.0 (6Gb/s) SSD controller, which supports burst read and write speeds of up to 550 MB/s. The next-generation SandForce 3700 family of controllers was announced in late 2013 and is quoted to reach a blistering 1,800 MB/s sequential read/write speeds as well as 150K/80K random IOPS.
Finally, you may be wondering what an SSD looks like and how easy it is to replace a hard drive with an after-market device. If you look at the images below, you’ll see the top and undersides of a typically sized 2.5” SSD. The technology is encased in either a plastic or metal case and looks like nothing more than a battery might:
SSD Top Side
SSD Bottom Side
The form factor of the SSD is actually the same as a regular hard drive. It comes in a standard 1.8”, 2.5”, or 3.5” size that can fit into the housing and connectors for the same-sized hard drives. The connector used for these standard sizes is SATA. There are smaller SSDs available that use what’s called mini-SATA (mSATA) and fit into the mini-PCI Express slot of a laptop.
What is an HDD?
Hard Disk Drives, or HDDs in techno-parlance, have been around for donkey’s years relative to the technology world. HDDs were first introduced by IBM in 1956 - yes, folks, this is nearly 60-year-old technology; thank goodness vacuum tubes for TVs didn’t last as long! An HDD uses magnetism to store data on a rotating platter. A read/write head floats above the spinning platter, reading and writing data. The faster the platter spins, the faster an HDD can perform. Typical laptop drives today spin at either 5,400 RPM (revolutions per minute) or 7,200 RPM, though some server-based platters spin at up to 15,000 RPM!
The major advantage of an HDD is that it can store lots of data cheaply. These days, 1 terabyte (1,024 gigabytes) of storage is not unusual for a laptop hard drive, and the density continues to grow. The cost per gigabyte is hard to pin down nowadays since there are so many classes of drive to consider, but it is safe to say that all HDDs are substantially cheaper than SSDs. As a comparison, the popular WD Black (1TB) goes for roughly $69 on most websites, while the OCZ Trion 100 (960GB) and Samsung 850 EVO (1TB) SSDs go for $199 and $319 respectively, roughly three to five times the price of the WD Black. So if you want cheap storage and lots of it, a standard hard drive is definitely the more appealing way to go.
When it comes to appearance, HDDs essentially look the same from the outside as SSDs. HDDs predominantly use the SATA interface. The most common size for laptop hard drives is the 2.5” form factor, while a larger 3.5” form factor is used in desktop computers. The larger size allows for more platters inside and thus more storage capacity. Some desktop hard drives can store up to 6TB of data! Below is an example of what an HDD looks like, using the Hitachi Deskstar 7K4000 4TB hard drive:
HDD Top Side
HDD Bottom Side
SSD Vs HDD Comparison
Now it’s time to do some comparisons and determine which might be best for your individual needs: SSD or HDD? The best way to compare the two is attribute by attribute, side by side:
Attribute: SSD (Solid State Drive) vs. HDD (Hard Disk Drive)

Power Draw / Battery Life
  SSD: Less power draw, averages 2-3 watts, resulting in a 30+ minute battery boost
  HDD: More power draw, averages 6-7 watts, and therefore uses more battery
Cost
  SSD: Expensive, roughly $0.20 per gigabyte (based on buying a 1TB drive)
  HDD: Only around $0.03 per gigabyte, very cheap (based on buying a 4TB model)
Capacity
  SSD: Typically not larger than 1TB for notebook-size drives; 4TB max for desktops
  HDD: Typically around 500GB, and 2TB maximum, for notebook-size drives; 10TB max for desktops
Operating System Boot Time
  SSD: Around 10-13 seconds average bootup time
  HDD: Around 30-40 seconds average bootup time
Noise
  SSD: No moving parts and as such no sound
  HDD: Audible clicks and spinning can be heard
Vibration
  SSD: No vibration, as there are no moving parts
  HDD: The spinning of the platters can sometimes result in vibration
Heat Produced
  SSD: Lower power draw and no moving parts, so little heat is produced
  HDD: Doesn't produce much heat, but measurably more than an SSD due to moving parts and higher power draw
Failure Rate
  SSD: Mean time between failures of 2.0 million hours
  HDD: Mean time between failures of 1.5 million hours
File Copy / Write Speed
  SSD: Generally above 200 MB/s, and up to 550 MB/s for cutting-edge drives
  HDD: Anywhere from 50 to 120 MB/s
Encryption
  SSD: Full Disk Encryption (FDE) supported on some models
  HDD: Full Disk Encryption (FDE) supported on some models
File Opening Speed
  SSD: Up to 30% faster than an HDD
  HDD: Slower than an SSD
Magnetism Affected?
  SSD: Safe from any effects of magnetism
  HDD: Magnets can erase data
If we tally up the advantages, the SSD gets 9 and the HDD gets 3. Does that mean an SSD is three times better than an HDD? Not at all. As we mentioned earlier, it all depends on individual needs. The comparison here is just to lay out the pros and cons of both options. To help you even more, here are some rules to follow when deciding which drive is best for you:
An HDD might be the right choice if:
  • You need lots of storage capacity, up to 10TB 
  • You don’t want to spend much money
  • You don’t care too much about how fast a computer boots up or opens programs
An SSD might be the right choice if:
  • You are willing to pay for faster performance
  • You don’t mind limited storage capacity or can work around it (though consumer SSDs now go up to 4TB and enterprise models run as high as 60TB)
HDDs are still the popular choice for the majority of average consumers, who usually choose the HDD as the storage option in their new computer simply because of its much cheaper cost. However, more and more consumers desire top computing performance and are opting for an SSD inside their new setup or as an upgrade to their current one. As such, SSDs are well on their way to becoming the mainstream, standard storage mechanism, especially for laptops, given the advantages they present for a mobile device (they are currently the default storage device in the Ultrabook category). That said, there will always be a market for both HDDs and SSDs. The advent of mSATA SSD devices and hybrid drives that combine SSD and HDD features is another option for consumers seeking a bit of the best of both worlds, but that’s a topic for another day!
Curious about which SSD or hard drive to buy? Be sure to check out our constantly updated leaderboard that has a breakdown of the best SSD in categories like value, mainstream and enthusiast.
About The Author: Andrew Baxter is the Editor of LaptopReviews.com where he writes news and reviews covering the laptop industry. He is also a Contributing Editor at StorageReview.com.

Monday, June 25, 2018

Deepgram: The Google of Audio Search

Artificial intelligence, audio retrieval... this startup billed as "the Google of audio search" has raised $1.8 million

http://www.chinambn.com/show-3774.html

Sunday, June 24, 2018

Academia | A real-life Detective Conan "bow-tie voice changer": Google publishes transfer learning from speaker verification to multi-speaker speech synthesis

source

Harvard scientists say doing these five things can add 10 years to your life

source

Want to live 10 years longer? You may need to change your lifestyle.
A study released in May by the Harvard T.H. Chan School of Public Health found that a person who adopts and maintains five lifestyle habits can expect to gain more than 10 years of life expectancy. The good news: an extra 10 years is a lot of time. The bad news: you will have to give up junk food and stop being a couch potato.
Here are the recommendations from the study:
- Eat a healthy diet
- Exercise at least 30 minutes a day
- Maintain a healthy weight (keep your body mass index between 18.5 and 24.9; BMI = weight in kilograms divided by height in meters squared)
- Drink in moderation (no more than one glass of wine a day for women (5 ounces, about 142 ml), and no more than two for men)
- Never smoke
The study, published on the website of Circulation, a journal of the American Heart Association, found that compared with people who lived an unhealthy lifestyle for 30 years, those who followed a healthy lifestyle were 82% less likely to die of cardiovascular disease and 65% less likely to die of cancer.
Researchers analyzed habit data from more than 78,000 women tracked over 34 years and more than 44,000 men tracked over 27 years. They estimate that women who adopt all five habits can add 14 years to their lives, and men 12 years.
The healthy habits the Harvard researchers point to may seem unremarkable, but turning them into your own lifestyle is genuinely hard. Take the recommended BMI range: many Americans may struggle to reach it. The average BMI for American men is now 28.6, up from 25.1 in the early 1960s. A BMI above 24.9 counts as overweight, and above 30 as obese.
Still, the National Institutes of Health (NIH) says there are several ways to help you build these habits and gradually make them part of your life. At the same time, be wary of bad habits: don't wander over to the office vending machine at 3 p.m., and don't stay up so late that you miss your workout the next morning.
Also, don't go it alone. Get family and friends to take on the challenge with you. The NIH suggests looking ahead and imagining how you will feel once you have reached your goals. As its monthly internal newsletter puts it, never think you are in such bad shape, so overweight, or so old that your situation is hopeless.
To extend (or at least not shorten) your life, there are other factors to consider as well. Research shows that beyond exercising and eating well, people should stay socially active and get enough sleep. In the United States, more than 40% of adults suffer from loneliness, which can contribute to depression, dementia, anxiety, and cardiovascular disease. Lack of sleep can also lead to high blood pressure, diabetes, and obesity.
If your current habits run counter to the Harvard study's recommendations, consider what those bad habits can lead to:
- A study by University of Washington researchers published in The Lancet found that one in five deaths is attributable to poor diet. Poor diet can also contribute to high blood pressure and diabetes, both of which are linked to bad food choices. (The study found that a sound diet should include whole grains, fruit, nuts, and seeds.)
- Not exercising can likewise lead to high blood pressure and diabetes. According to Johns Hopkins Medicine, physically inactive people are prone to anxiety, depression, coronary heart disease, and even cancer.
- A BMI below or above the recommended range is unsafe either way. According to the U.S. health information platform Healthline, a below-average BMI means you are underweight, which not only signals malnutrition and a higher risk of osteoporosis, but is also associated with weakened immune function and fertility problems. A BMI that is too high, or obesity, can lead to chronic conditions such as asthma and bone disease.
- According to the American Society of Clinical Oncology, not only heavy drinking but even light drinking can cause cancer.
- A study published in The Lancet found that 11.5% of deaths worldwide are caused by smoking. In other words, more than one in ten people who die worldwide is killed by smoking.
All things considered, sticking to those five habits looks like the way to go.

Wednesday, June 20, 2018

C++ Passing Arrays & Vectors to Functions

Source:
https://www.tutorialspoint.com/cplusplus/cpp_passing_arrays_to_functions.htm

Note:
Way 1:
double getAverage(int arr[], int size) {
   int sum = 0;
   double avg;

   for (int i = 0; i < size; ++i) {
      sum += arr[i];
   }
   avg = double(sum) / size;

   return avg;
}

Way 2:
double getAverage(int *arr, int size) {
   int sum = 0;
   double avg;

   for (int i = 0; i < size; ++i) {
      sum += arr[i];
   }
   avg = double(sum) / size;

   return avg;
}

About calling the function (Way 2) with a vector:
vector<int> ivec({1, 2, 3});
getAverage(&ivec[0], ivec.size()); // Way 2 calling: &ivec[0] points to the vector's first element



Summary:
Note that the following two function signatures are the same, because an array argument is passed to the function as a pointer to its first element:
double getAverage(int arr[], int size)
double getAverage(int *arr, int size) 
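To see both calling styles in one place, here is a minimal, self-contained example (the sample values are arbitrary) that calls the same function with a built-in array and with a std::vector:

#include <iostream>
#include <vector>

// Either signature works: an array argument decays to a pointer to its first element.
double getAverage(int *arr, int size) {
   int sum = 0;
   for (int i = 0; i < size; ++i) {
      sum += arr[i];
   }
   return double(sum) / size;
}

int main() {
   int balance[5] = {1000, 2, 3, 17, 50};
   std::vector<int> ivec({1, 2, 3});

   // Calling with a built-in array: the array name decays to int*.
   std::cout << getAverage(balance, 5) << std::endl;

   // Calling with a vector: pass a pointer to its contiguous storage.
   std::cout << getAverage(&ivec[0], static_cast<int>(ivec.size())) << std::endl;
   return 0;
}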

Tuesday, June 19, 2018

Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant

source

The “Hey Siri” feature allows users to invoke Siri hands-free. A very small speech recognizer runs all the time and listens for just those two words. When it detects “Hey Siri”, the rest of Siri parses the following speech as a command or query. The “Hey Siri” detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds. It then uses a temporal integration process to compute a confidence score that the phrase you uttered was “Hey Siri”. If the score is high enough, Siri wakes up. This article takes a look at the underlying technology. It is aimed primarily at readers who know something of machine learning but less about speech recognition.

Hands-Free Access to Siri

To get Siri’s help, say “Hey Siri”. No need to press a button as “Hey Siri” makes Siri hands-free. It seems simple, but quite a lot goes on behind the scenes to wake up Siri quickly and efficiently. Hardware, software, and Internet services work seamlessly together to provide a great experience.
Figure 1. The Hey Siri flow on iPhone
A diagram that shows how the acoustical signal from the user is processed. The signal is first processed by Core Audio then sent to a detector that works with the Voice Trigger. The Voice Trigger can be updated by the server. The Voice Trigger Framework controls the detection threshold and sends wake up events to Siri Assistant. Finally, the Siri Server checks the first words to make sure they are the Hey Siri trigger.
Being able to use Siri without pressing buttons is particularly useful when hands are busy, such as when cooking or driving, or when using the Apple Watch. As Figure 1 shows, the whole system has several parts. Most of the implementation of Siri is “in the Cloud”, including the main automatic speech recognition, the natural language interpretation and the various information services. There are also servers that can provide updates to the acoustic models used by the detector. This article concentrates on the part that runs on your local device, such as an iPhone or Apple Watch. In particular, it focusses on the detector: a specialized speech recognizer which is always listening just for its wake-up phrase (on a recent iPhone with the “Hey Siri” feature enabled).

The Detector: Listening for “Hey Siri”

The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second. A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, plus silence and other speech, for a total of about 20 sound classes. See Figure 2.
The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each “hidden” layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.
Figure 2. The Deep Neural Network used to detect "Hey Siri." The hidden layers are actually fully connected. The top layer performs temporal integration. The actual DNN is indicated by the dashed box.
A diagram that depicts a deep neural network. The bottom layer is a stream of feature vectors. There are four sigmoidal layers, each of which has a bias unit. These layers feed into softmax function values, which in turn feed into units that output a trigger score. The last layer, for the trigger score, maintains recurrent state.
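To make the arithmetic concrete, here is a minimal C++ sketch of the two building blocks described above: a fully connected layer with a logistic (sigmoid) nonlinearity, and a log-softmax output over the sound classes. This is only an illustration of the math; the layer sizes and weights of the real detector are not reproduced here.

#include <algorithm>
#include <cmath>
#include <vector>

// One hidden layer: y = sigmoid(W x + b), i.e. a matrix multiplication followed
// by a logistic nonlinearity applied element-wise.
std::vector<float> denseSigmoid(const std::vector<std::vector<float>> &W,
                                const std::vector<float> &b,
                                const std::vector<float> &x) {
   std::vector<float> y(b.size());
   for (size_t i = 0; i < b.size(); ++i) {
      float z = b[i];
      for (size_t j = 0; j < x.size(); ++j) {
         z += W[i][j] * x[j];
      }
      y[i] = 1.0f / (1.0f + std::exp(-z));
   }
   return y;
}

// Output layer: log-softmax over the ~20 sound classes, so later stages can work
// directly with log probabilities.
std::vector<float> logSoftmax(const std::vector<float> &z) {
   float maxVal = *std::max_element(z.begin(), z.end());
   float sum = 0.0f;
   for (float v : z) sum += std::exp(v - maxVal);
   std::vector<float> logProbs(z.size());
   for (size_t i = 0; i < z.size(); ++i) {
      logProbs[i] = (z[i] - maxVal) - std::log(sum);
   }
   return logProbs;
}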
We choose the number of units in each hidden layer of the DNN to fit the computational resources available when the “Hey Siri” detector runs. Networks we use typically have five hidden layers, all the same size: 32, 128, or 192 units depending on the memory and power constraints. On iPhone we use two networks—one for initial detection and another as a secondary checker. The initial detector uses fewer units than the secondary checker.
The output of the acoustic model provides a distribution of scores over phonetic classes for every frame. A phonetic class is typically something like “the first part of an /s/ preceded by a high front vowel and followed by a front vowel.”
We want to detect “Hey Siri” if the outputs of the acoustic model are high in the right sequence for the target phrase. To produce a single score for each frame we accumulate those local values in a valid sequence over time. This is indicated in the final (top) layer of Figure 2 as a recurrent network with connections to the same unit and the next in sequence. Inside each unit there is a maximum operation and an add:
F_{i,t} = max{ s_i + F_{i,t-1}, m_{i-1} + F_{i-1,t-1} } + q_{i,t}     (Equation 1)
where
  • F_{i,t} is the accumulated score for state i of the model
  • q_{i,t} is the output of the acoustic model: the log score for the phonetic class associated with the i-th state, given the acoustic pattern around time t
  • s_i is a cost associated with staying in state i
  • m_i is a cost for moving on from state i
Both s_i and m_i are based on analysis of durations of segments with the relevant labels in the training data. (This procedure is an application of dynamic programming, and can be derived from ideas about Hidden Markov Models, HMMs.)
Figure 3. Visual depiction of the equation
A diagram that attempts to show a visual depiction of the mathematical equation.
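In code, one time step of this accumulation can be sketched as follows (a simplified illustration: the per-state costs s_i and m_i and the per-frame acoustic log scores q_{i,t} are assumed to be given, and the first state is treated as reachable only by staying in it):

#include <algorithm>
#include <vector>

// One time step of the temporal integration:
//   F[i] = max(s[i] + Fprev[i], m[i-1] + Fprev[i-1]) + q[i]
void updateScores(std::vector<float> &F,        // accumulated scores, one per state
                  const std::vector<float> &s,  // cost of staying in state i
                  const std::vector<float> &m,  // cost of moving on from state i
                  const std::vector<float> &q)  // acoustic log scores for this frame
{
   std::vector<float> Fprev = F;                // scores from the previous frame
   for (size_t i = 0; i < F.size(); ++i) {
      float stay  = s[i] + Fprev[i];
      float enter = (i == 0) ? stay             // simplification: no earlier state to enter from
                             : m[i - 1] + Fprev[i - 1];
      F[i] = std::max(stay, enter) + q[i];
   }
   // After each frame, the score for the last state is the accumulated evidence
   // that the whole phrase has just been completed.
}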
Each accumulated score F_{i,t} is associated with a labelling of previous frames with states, as given by the sequence of decisions made by the maximum operation. The final score at each frame is F_{I,t}, where the last state of the phrase is state I and there are N frames in the sequence leading to that score. (N could be found by tracing back through the sequence of max decisions, but in practice it is obtained by propagating forward the number of frames since the path entered the first state of the phrase.)
Almost all the computation in the “Hey Siri” detector is in the acoustic model. The temporal integration computation is relatively cheap, so we disregard it when assessing size or computational resources.
You may get a better idea of how the detector works by looking at Figure 4, which shows the acoustic signal at various stages, assuming that we are using the smallest DNN. At the very bottom is a spectrogram of the waveform from the microphone. In this case, someone is saying “Hey Siri what …” The brighter parts are the loudest parts of the phrase. The Hey Siri pattern is between the vertical blue lines.
Figure 4. The acoustic pattern as it moves through the detector
The acoustic pattern as it moves through the detector.
The second horizontal strip up from the bottom shows the result of analyzing the same waveform with a mel filter bank, which gives weight to frequencies based on perceptual measurements. This conversion also smooths out the detail that is visible in the spectrogram and due to the fine-structure of the excitation of the vocal tract: either random, as in the /s/, or periodic, seen here as vertical striations.
The alternating green and blue horizontal strips labelled H1 to H5 show the numerical values (activations) of the units in each of the five hidden layers. The 32 hidden units in each layer have been arranged for this figure so as to put units with similar outputs together.
The next strip up (with the yellow diagonal) shows the output of the acoustic model. At each frame there is one output for each position in the phrase, plus others for silence and other speech sounds. The final score, shown at the top, is obtained by adding up the local scores along the bright diagonal according to Equation 1. Note that the score rises to a peak just after the whole phrase enters the system.
We compare the score with a threshold to decide whether to activate Siri. In fact the threshold is not a fixed value. We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time. (We discuss testing and tuning for accuracy later.)
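The behavior of the two thresholds can be sketched as a small state machine (the actual threshold values and the length of the sensitive window are not published; the names and placeholder values below are assumptions):

// Two-threshold trigger logic with a temporary "second chance" window.
struct TriggerDecider {
   float normalThreshold  = 0.0f;   // primary threshold (placeholder value)
   float lowerThreshold   = 0.0f;   // lower threshold that alone does not trigger (placeholder)
   float sensitiveSeconds = 0.0f;   // how long the more sensitive state lasts (placeholder)
   float sensitiveUntil   = -1.0f;  // end time of the current sensitive window

   // Returns true if Siri should be activated for this score at time 'now' (in seconds).
   bool shouldTrigger(float score, float now) {
      bool sensitive = (now <= sensitiveUntil);
      float threshold = sensitive ? lowerThreshold : normalThreshold;
      if (score >= threshold) {
         return true;
      }
      if (score >= lowerThreshold) {
         // Possibly a missed "Hey Siri": lower the bar for a few seconds in case
         // the user repeats the phrase.
         sensitiveUntil = now + sensitiveSeconds;
      }
      return false;
   }
};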

Responsiveness and Power: Two Pass Detection

The “Hey Siri” detector not only has to be accurate, but it needs to be fast and not have a significant effect on battery life. We also need to minimize memory use and processor demand—particularly peak processor demand.
To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP), a small, low-power auxiliary processor (the embedded motion coprocessor), has access to the microphone signal (on iPhone 6s and later). We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN). When the score exceeds a threshold, the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN. In the first versions with AOP support, the first detector used a DNN with 5 layers of 32 hidden units and the second detector had 5 layers of 192 hidden units.
Figure 5. Two-pass detection
A diagram of the two-pass detection process. The first pass is fast and does not use a lot of computation power because it uses a small DNN. The second pass is more accurate and uses a larger DNN.
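The cascade itself is simple to express (a minimal sketch; the function names and stub scores below are illustrative only, not Apple's API):

#include <vector>

// Stub scoring functions standing in for the two acoustic models.
float smallDetectorScore(const std::vector<float> &) { return 0.0f; }   // would run on the AOP
float largeDetectorScore(const std::vector<float> &) { return 0.0f; }   // would run on the main CPU

// First pass on the low-power processor; the second pass runs only when the first fires.
bool detectHeySiri(const std::vector<float> &audioWindow,
                   float firstPassThreshold, float secondPassThreshold) {
   if (smallDetectorScore(audioWindow) < firstPassThreshold) {
      return false;                        // cheap path: the main processor stays asleep
   }
   // In the real system the main processor is woken here to run the larger DNN.
   return largeDetectorScore(audioWindow) >= secondPassThreshold;
}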
Apple Watch presents some special challenges because of the much smaller battery. Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices. The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on. At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget. It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.

“Hey Siri” Personalized

We designed the always-on “Hey Siri” detector to respond whenever anyone in the vicinity says the trigger phrase. To reduce the annoyance of false triggers, we invite the user to go through a short enrollment session. During enrollment, the user says five phrases that each begin with “Hey Siri.” We save these examples on the device.
We compare any possible new “Hey Siri” utterance with the stored examples as follows. The (second-pass) detector produces timing information that is used to convert the acoustic pattern into a fixed-length vector, by taking the average over the frames aligned to each state. A separate, specially trained DNN transforms this vector into a “speaker space” where, by design, patterns from the same speaker tend to be close, whereas patterns from different speakers tend to be further apart. We compare the distances to the reference patterns created during enrollment with another threshold to decide whether the sound that triggered the detector is likely to be “Hey Siri” spoken by the enrolled user.
This process not only reduces the probability that “Hey Siri” spoken by another person will trigger the iPhone, but also reduces the rate at which other, similar-sounding phrases trigger Siri.
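The comparison against the enrolled examples might look roughly like this (a sketch only: the article does not say which distance measure or combination rule is used, so the Euclidean distance and the averaging over the five enrollment vectors below are assumptions):

#include <cmath>
#include <vector>

// Euclidean distance between two speaker-space vectors (an assumed choice).
float speakerDistance(const std::vector<float> &a, const std::vector<float> &b) {
   float d2 = 0.0f;
   for (size_t i = 0; i < a.size(); ++i) {
      float diff = a[i] - b[i];
      d2 += diff * diff;
   }
   return std::sqrt(d2);
}

// Accept the trigger only if the new utterance is close enough, on average,
// to the speaker-space vectors saved during enrollment.
bool matchesEnrolledUser(const std::vector<float> &utteranceVec,
                         const std::vector<std::vector<float>> &enrollmentVecs,
                         float speakerThreshold) {
   float total = 0.0f;
   for (const auto &ref : enrollmentVecs) {
      total += speakerDistance(utteranceVec, ref);
   }
   return (total / enrollmentVecs.size()) <= speakerThreshold;
}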

Further Checks

If the various stages on the iPhone pass it on, the waveform arrives at the Siri server. If the main speech recognizer hears it as something other than “Hey Siri” (for example “Hey Seriously”) then the server sends a cancellation signal to the phone to put it back to sleep, as indicated in Fig 1. On some systems we run a cut-down version of the main recognizer on the device to provide an extra check earlier.

The Acoustic Model: Training

The DNN acoustic model is at the heart of the “Hey Siri” detector. So let’s take a look at how we trained it. Well before there was a Hey Siri feature, a small proportion of users would say “Hey Siri” at the start of a request, having started by pressing the button. We used such “Hey Siri” utterances for the initial training set for the US English detector model. We also included general speech examples, as used for training the main speech recognizer. In both cases, we used automatic transcription on the training phrases. Siri team members checked a subset of the transcriptions for accuracy.
We created a language-specific phonetic specification of the “Hey Siri” phrase. In US English, we had two variants, with different first vowels in “Siri”—one as in “serious” and the other as in “Syria.” We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: “Hey, Siri.” Each phonetic symbol results in three speech sound classes (beginning, middle and end) each of which has its own output from the acoustic model.
We used a corpus of speech to train the DNN, for which the main Siri recognizer provided a sound class label for each frame. There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), plus one large class for everything else. The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern. The training process adjusts the weights using standard back-propagation and stochastic gradient descent. We have used a variety of neural network training software toolkits, including Theano, TensorFlow, and Kaldi.
This training process produces estimates of the probabilities of the phones and states given the local acoustic observations, but those estimates include the frequencies of the phones in the training set (the priors), which may be very uneven, and have little to do with the circumstances in which the detector will be used, so we compensate for the priors before the acoustic model outputs are used.
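One standard way to do this compensation in hybrid DNN/HMM systems generally (the article does not spell out the exact form used here) is to subtract the log prior of each class from its log posterior, turning the outputs into scaled log likelihoods:

#include <cmath>
#include <vector>

// Convert log posteriors p(class | acoustics) into scaled log likelihoods by
// removing the class frequencies (priors) estimated from the training set.
std::vector<float> compensatePriors(const std::vector<float> &logPosteriors,
                                    const std::vector<float> &classPriors) {
   std::vector<float> scaled(logPosteriors.size());
   for (size_t i = 0; i < logPosteriors.size(); ++i) {
      scaled[i] = logPosteriors[i] - std::log(classPriors[i]);
   }
   return scaled;
}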
Training one model takes about a day, and there are usually a few models in training at any one time. We generally train three versions: a small model for the first pass on the motion coprocessor, a larger-size model for the second pass, and a medium-size model for Apple Watch.
“Hey Siri” works in all languages that Siri supports, but “Hey Siri” isn’t necessarily the phrase that starts Siri listening. For instance, French-speaking users need to say “Dis Siri” while Korean-speaking users say “Siri 야” (Sounds like “Siri Ya.”) In Russian it is “привет Siri “ (Sounds like “Privet Siri”), and in Thai “หวัดดี Siri”. (Sounds like “Wadi Siri”.)

Testing and Tuning

An ideal detector would fire whenever the user says “Hey Siri,” and not fire at other times. We describe the accuracy of the detector in terms of two kinds of error: firing at the wrong time, and failing to fire at the right time. The false-accept rate (FAR, or false-alarm rate) is the number of false activations per hour (or the mean hours between activations), and the false-reject rate (FRR) is the proportion of attempted activations that fail. (Note that the units we use to measure FAR are not the same as those we use for FRR; even the dimensions are different, so there is no notion of an equal error rate.)
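Concretely, the two rates are computed over different denominators, which is why no single equal-error-rate summary exists (the variable names below are illustrative):

// False-accept rate: false activations per hour of audio containing no genuine "Hey Siri".
float falseAcceptRate(int falseActivations, float hoursOfNegativeAudio) {
   return falseActivations / hoursOfNegativeAudio;   // activations per hour
}

// False-reject rate: fraction of genuine "Hey Siri" attempts that failed to trigger.
float falseRejectRate(int missedActivations, int attemptedActivations) {
   return static_cast<float>(missedActivations) / attemptedActivations;
}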
For a given model we can change the balance between the two kinds of error by changing the activation threshold. Figure 6 shows examples of this trade-off, for two sizes of early-development models. Changing the threshold moves along the curve.
During development we try to estimate the accuracy of the system by using a large test set, which is quite expensive to collect and prepare, but essential. There is “positive” data and “negative” data. The “positive” data does contain the target phrase. You might think that we could use utterances picked up by the “Hey Siri” system, but the system doesn't capture the attempts that failed to trigger it, and we want the test set to include as many such failed attempts as possible.
At first we used the utterances of “Hey Siri” that some users said as they pressed the Home button, but those users were not attempting to catch Siri’s attention (the button does that), and the microphone is bound to be within arm’s reach, whereas we also want “Hey Siri” to work across a room. So we made recordings specially in various conditions, such as in the kitchen (both close and far), car, bedroom, and restaurant, by native speakers of each language.
We use the “negative” data to test for false activations (and false wakes). The data represent thousands of hours of recordings, from various sources, including podcasts and non-“Hey Siri” inputs to Siri in many languages, to represent both background sounds (especially speech) and the kinds of phrases that a user might say to another person. We need such a lot of data because we are trying to estimate false-alarm rates as low as one per week. (If there are any occurrences of the target phrase in the negative data we label them as such, so that we do not count responses to them as errors.)
Figure 6. Detector accuracy. Trade-offs against detection threshold for small and larger DNNs
A graph that shows the trade-offs against detection threshold for large and small DNNs. The larger DNN is more accurate.
Tuning is largely a matter of deciding what thresholds to use. In Figure 6, the two dots on the lower trade-off curve for the larger model show possible normal and second-chance thresholds. The operating point for the smaller (first-pass) model would be at the right-hand side. These curves are just for the two stages of the detector, and do not include the personalized stage or subsequent checks.
While we are confident that models that appear to perform better on the test set probably are really better, it is quite difficult to convert offline test results into useful predictions of the experience of users. So in addition to the offline measurements described previously, we estimate false-alarm rates (when Siri turns on without the user saying “Hey Siri”) and imposter-accept rates (when Siri turns on when someone other than the user who trained the detector says “Hey Siri”) weekly by sampling from production data, on the latest iOS devices and Apple Watch. This does not give us rejection rates (when the system fails to respond to a valid “Hey Siri”) but we can estimate rejection rates from the proportion of activations just above the threshold that are valid, and a sampling of just-below threshold events on devices carried by development staff.
We continually evaluate and improve “Hey Siri,” and the model that powers it, by training and testing using variations of the approach described here. We train in many different languages and test under a wide range of conditions.
Next time you say “Hey Siri” you may think of all that goes on to make responding to that phrase happen, but we hope that it “just works!”