Nickolay Shmyrev
430 followers
Posts

Google acknowledged that neural networks do not generalize over phone/video recordings.

When information is already lost

In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, on call center recordings you still get error rates of 50% or 60% even with the best algorithms. Someone calls while driving, someone else from a noisy street, and some people use really bad phones. So the question arises: how do we improve accuracy on such recordings? People try complex algorithms and train models on GPUs for weeks, while the answer is simple: the information in such recordings is simply lost, and you cannot decode them accurately.

Data transfer and data storage are expensive, and VoIP providers, which all operate on very low margins, save every cent they can. That means they often use buggy codecs or bad transmission lines, and as a result you simply get unintelligible speech. Since everyone uses cell phones, you also get multiple encoding-decoding rounds: the signal from the microphone is encoded with AMR, then transcoded to G.729, and finally converted to MP3 for storage, with many codecs and frequent frame drops along the way. As a result the sound quality is sometimes very bad, and recognition accuracy is zero.
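
For intuition, here is a minimal Python sketch of how one might reproduce such a multi-codec chain offline and listen to the damage. It assumes an ffmpeg build with libopencore-amrnb and libmp3lame; since stock ffmpeg has no G.729 encoder, a second low-bitrate AMR pass stands in for that leg, and all file names are illustrative.

    # Simulate a lossy call chain: AMR -> (G.729 stand-in) -> MP3.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Narrowband AMR at 12.2 kbit/s, as a cell phone encodes the microphone signal.
    run(["ffmpeg", "-y", "-i", "clean.wav", "-ar", "8000",
         "-c:a", "libopencore_amrnb", "-b:a", "12.2k", "hop1.amr"])
    run(["ffmpeg", "-y", "-i", "hop1.amr", "hop1.wav"])

    # 2. A second lossy hop; AMR at 4.75 kbit/s stands in for the G.729 leg.
    run(["ffmpeg", "-y", "-i", "hop1.wav", "-ar", "8000",
         "-c:a", "libopencore_amrnb", "-b:a", "4.75k", "hop2.amr"])
    run(["ffmpeg", "-y", "-i", "hop2.amr", "hop2.wav"])

    # 3. Low-bitrate MP3 for storage, the last hop before the recording reaches you.
    run(["ffmpeg", "-y", "-i", "hop2.wav",
         "-c:a", "libmp3lame", "-b:a", "32k", "degraded.mp3"])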

The quality of speech is hard to measure, but there are ways to do it. The most direct way requires controlled experiments where you send audio from one endpoint and measure the distortion at the other. There are also single-ended tools developed by the ITU, such as P.563 (http://www.itu.int/rec/T-REC-P.563/en), which take just the audio file, account for many parameters, and return an estimated speech quality score. Such scores are rough, but they can still give you some idea of how noisy your audio is. And if it is really noisy, the way to improve accuracy is not to apply better algorithms or put more research into speech recognition, but simply to go to the VoIP provider and demand better sound quality.
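
As a toy illustration of the single-ended idea, the Python sketch below derives a rough signal-to-noise ratio from frame energies alone. This is emphatically not P.563, just a crude stand-in showing the "one audio file in, one quality number out" interface.

    # Crude single-ended quality proxy: SNR estimated from frame energies.
    import numpy as np
    from scipy.io import wavfile

    def snr_proxy(path, frame_ms=20):
        rate, samples = wavfile.read(path)
        x = samples.astype(np.float64)
        if x.ndim > 1:
            x = x.mean(axis=1)          # mix stereo down to mono
        frame = int(rate * frame_ms / 1000)
        n = len(x) // frame
        energies = (x[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
        noise = np.percentile(energies, 10) + 1e-10   # quietest frames ~ noise floor
        speech = np.percentile(energies, 90) + 1e-10  # loudest frames ~ speech peaks
        return 10.0 * np.log10(speech / noise)        # dB; higher means cleaner

    print(snr_proxy("call.wav"))  # roughly 30+ dB for clean speech, under 10 for junk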

Given such a tool, we might want to introduce a normalized word error rate that takes into account how good each recording is: you really want to decode high-quality recordings accurately, while you probably do not care much about bad-quality ones.
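
One hypothetical way to define it: weight each utterance's errors and reference length by a quality score in [0, 1] (say, a P.563 result rescaled from its 1-5 range), so that errors on clean audio count in full and errors on junk audio are discounted. The weighting scheme below is a sketch, not an established metric.

    # Quality-normalized WER over a list of (ref_words, hyp_words, quality) triples.
    def edit_distance(ref, hyp):
        # word-level Levenshtein distance, single rolling row
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                       d[j - 1] + 1,     # insertion
                                       prev + (r != h))  # substitution or match
        return d[-1]

    def normalized_wer(utterances):
        errors = sum(q * edit_distance(r, h) for r, h, q in utterances)
        words = sum(q * len(r) for r, h, q in utterances)
        return errors / words if words else 0.0

    # Example: errors on the low-quality utterance barely move the score.
    utts = [("the cat sat".split(), "the cat sat".split(), 0.9),
            ("call me back".split(), "tall we hack".split(), 0.1)]
    print(normalized_wer(utts))  # 0.1, versus 0.5 for the unweighted WER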

When accuracy matters, sound quality is really important. If possible, use your own VoIP stack and send audio directly from the mobile microphone to the server. But when phone calls are involved, it is usually hopeless.

The pain of training neural networks

Training Tips for the Transformer Model
Martin Popel, Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czechia

https://arxiv.org/pdf/1804.00247.pdf

A very impressive project adding Korean support has landed in Kaldi

https://github.com/goodatlas/zeroth

Kaldi 5.4 was announced pretty quietly, but it represents real progress in neural network architecture. The proper recipe is here, not the one that was announced in the group (just a TDNN without an LSTM):

https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/chain/tuning/run_tdnn_7n.sh

Rant about end-to-end training:

I have the impression that the standards for experimental comparisons are significantly lower if the paper claims to show that end-to-end training (such as CTC) works. For example: if I wanted to introduce a new nonlinearity for neural networks, I think the reviewers would expect that the paper would show ReLU as a baseline, and that the baseline would be trained on the same data as the experiment.
But if the paper is about CTC? All bets are off. Train on privately owned data so that the numbers can't be replicated? Fine. Fail to show any baseline number whatsoever? Accepted. Compare only with CTC results previously reported in the literature but ignore better numbers that come from non-CTC systems? Best paper material.
A student's paper on end-to-end training was rejected from ICASSP recently, apparently because of reviewer misunderstandings about appropriate baselines. He had compared his results to previously reported results in the literature, and he had limited that list to things trained on the same training set. The reviewers rejected the paper because he had failed to mention better numbers from the literature (even though those systems were trained on vastly more data). But I suspect his real sin was to suggest that maybe CTC doesn't actually work very well.

An interesting library from Google for articulatory analysis

https://github.com/googlei18n/language-resources/blob/fonbund/fonbund/README.md