Sound recognition of language. Understanding modern speech recognition systems in Linux

But the cart is still there to this day.
I.A. Krylov. Fable "The Swan, the Pike and the Crayfish"

The two main tasks of machine speech recognition - achieving guaranteed accuracy with a limited set of commands for at least one fixed voice, and speaker-independent recognition of arbitrary continuous speech with acceptable quality - have not yet been solved, despite the long history of their development. Moreover, there are doubts about the fundamental possibility of solving both problems, since even a person cannot always completely recognize the speech of an interlocutor.

Once upon a time, science fiction writers considered the possibility of a normal conversation with a computer so obvious and natural that the first computers, devoid of a voice interface, were perceived as something inferior.

It would seem, why not solve this problem programmatically, using "smart" computers? After all, there are manufacturers of such products, the power of computers is constantly growing, and the technologies are being improved. However, advances in automatic speech recognition and its conversion to text seem to be at the same level as 20-40 years ago. I remember that back in the mid-90s, IBM confidently announced the availability of such tools in OS/2, and a little later Microsoft joined in implementing such technologies. Apple also tried its hand at speech recognition, but in early 2000 it officially announced the abandonment of this project. IBM (Via Voice) and Philips continue to work in this area, and IBM not only built the speech recognition function into its OS/2 operating system (now sunk into oblivion), but still releases it as a separate product. The Via Voice package for continuous speech recognition from IBM (http://www-306.ibm.com/software/voice/viavoice) was notable in that, from the very beginning, even without training, it recognized up to 80% of words. With training, the probability of correct recognition increased to 95%, and besides, in parallel with tuning the program to a specific user, the future operator was mastering the skills of working with the system. Now there are rumors that such innovations will be implemented as part of Windows XP, although the head and founder of the corporation, Bill Gates, has repeatedly stated that he considers speech technologies not yet ready for mass use.

Once upon a time, the American company Dragon Systems created probably the first commercial speech recognition system - Naturally Speaking Preferred, which worked back in 1982 on the IBM PC (not even the XT!). True, this program was more like a game, and the company made no serious progress since; by 2000 it went bankrupt, and its latest version of Dragon Dictate Naturally Speaking was sold to Lernout & Hauspie Speech Products (L&H), which was also one of the leaders in the field of speech recognition and synthesis systems and methods (Voice Xpress). L&H, in turn, also went bankrupt, with its assets and property sold off (by the way, Dragon Systems was sold for almost $0.5 billion, and L&H for only $10 million, so on this scale what we are seeing in this area is not progress but regression!). L&H and Dragon Systems technologies were taken over by ScanSoft, a company that used to do optical image recognition (it now maintains some well-known text recognition programs like OmniPage), but nobody seems to be pursuing them seriously.

The Russian company Cognitive Technologies, which has achieved significant success in the field of character recognition, announced in 2001 a joint project with Intel to create Russian speech recognition systems - a speech corpus of the Russian language RuSpeech was prepared for Intel. Actually, RuSpeech is a speech database that contains fragments of continuous Russian speech with corresponding texts, phonetic transcription and additional information about speakers. Cognitive Technologies set itself the goal of creating a “speaker-independent” continuous speech recognition system, and the speech interface consisted of a dialogue script system, speech synthesis from text, and a speech command recognition system.

However, in practice, programs for real speech recognition (especially in Russian) still hardly exist, and they will obviously not be created soon. Moreover, even the inverse of the recognition problem - speech synthesis, which would seem much simpler than recognition - has not been completely solved. Any synthesized speech is perceived by a person worse than live speech, and this is especially noticeable when it is transmitted over a telephone channel, that is, exactly where it is most in demand today.

“Well, that’s it, you’re finished,” said Ivan Tsarevich, looking straight into the eyes of the third head of the Serpent Gorynych. It glanced in confusion at the other two. They grinned mischievously in response.

Joke

In 1997, the entry into the commercial market of the famous "Gorynych" (essentially an adaptation of the Dragon Dictate Naturally Speaking program, carried out by the until then little-known Russian company White Group, the official distributor of Dragon Systems) became something of a sensation. The program seemed quite workable, and its price seemed very moderate. However, time goes by, the "Gorynyches" change interfaces and versions, but do not acquire any valuable properties. Perhaps the core of Dragon Naturally Speaking was somehow tuned to the peculiarities of English speech, but even after the successive replacement of the dragon's head with three Gorynych heads, it recognizes no more than 30-40% of an average vocabulary, and only with careful pronunciation. And who needs that anyway? As is known, according to the statements of the developers at Dragon Systems, IBM and Lernout & Hauspie, their continuous-dictation programs were able to correctly recognize up to 95% of a text, but they have not been produced for a long time, because it is known that for comfortable work the recognition accuracy must be raised to 99%. Needless to say, conquering such heights in real conditions requires, to put it mildly, considerable effort.

In addition, the program requires a long period of training and customization for a specific user, is very capricious about equipment, and is more than sensitive to intonation and the speed of pronouncing phrases, so its ability to learn to recognize different voices varies greatly.

However, maybe someone will purchase this package as a kind of advanced toy, but it will not help fingers tired of working with the keyboard, even though the Gorynych manufacturers claim that the speed of entering speech and transforming it into text is 500-700 characters per minute, which is unattainable even for several experienced typists with their speeds added together.

Upon closer examination of the new version of this program, we did not manage to extract anything worthwhile from it. Even after a long "training" of the program (and the standard dictionary did not help us at all), it turned out that dictation must still be carried out word by word (that is, you need to pause after each word) and words must be pronounced clearly, which is not always typical for speech. Of course, "Gorynych" is a modification of an English-language system, and for English a different approach is simply unthinkable, but speaking Russian in such a manner seemed especially unnatural to us. In addition, in the course of a normal conversation in any language, the sound intensity almost never drops to zero (this can be seen from spectrograms), and after all, commercial programs learned to recognize dictation of texts on general subjects, performed as continuous speech, as far back as 5-10 years ago.

The system is focused primarily on input, but contains tools that allow you to correct a misheard word, for which Gorynych offers a list of options. You can correct the text from the keyboard, which, by the way, you constantly have to do. Words that are not in the dictionary are also entered from the keyboard. I remember that in previous versions it was stated that the more often you dictate, the more the system gets used to your voice, but neither then nor now did we notice anything of the kind. It even seemed to us that working with the Gorynych program is harder than, say, teaching a parrot to talk, and of the new features in version 3.0, only a flashier multimedia interface is worth noting.

In a word, there is only one manifestation of progress in this area: due to the increase in computer power, the time delay between pronouncing the word and displaying its written version on the screen has completely disappeared, and the number of correct hits, alas, has not increased.

Analyzing the capabilities of the program, we are more and more inclined to the opinion of experts that linguistic text analysis is an obligatory stage in the process of automatic input from dictation. Without it, the modern quality of recognition cannot be achieved, and many experts associate the prospects of speech systems precisely with the further development of the linguistic mechanisms contained in them. As a consequence, speech technologies are becoming increasingly dependent on the language they work with. This means, firstly, that the recognition, synthesis and processing of Russian speech is a business that Russian developers should be engaged in, and secondly, that only specialized domestic products, initially focused specifically on the Russian language, will be able to really solve this task. True, it should be noted here that the domestic specialists of the St. Petersburg Center for Speech Technologies (STC) believe that creating their own dictation system in the current Russian conditions will not pay off.

Other toys

So far, speech recognition technologies have been successfully used by Russian developers mainly in interactive learning systems and games like My Talking Dictionary, Talk to Me or Professor Higgins, created by IstraSoft. They are used to check students' English pronunciation and for user authentication. While developing the Professor Higgins program, IstraSoft employees learned to segment words into elementary segments that correspond to speech sounds and do not depend on either the speaker or the language (previously, speech recognition systems did not perform such segmentation, and the smallest unit for them was the word). In this case, the extraction of phonemes from the flow of continuous speech, their encoding and subsequent restoration takes place in real time. This speech recognition technology has found a rather ingenious application: it allows you to significantly compress files with voice recordings or speech messages. The method proposed by IstraSoft allows speech compression by a factor of up to 200, and at compression ratios below 40 the quality of the speech signal practically does not deteriorate. Intelligent speech processing at the phoneme level is promising not only as a compression method, but also as a step towards a new generation of speech recognition systems, because, theoretically, machine speech recognition, that is, its automatic representation in the form of text, is precisely the highest degree of compression of the speech signal.

Today, IstraSoft, in addition to training programs, offers on its website (http://www.istrasoft.ru/user.html) programs for compressing / playing sound files, as well as a demonstration program for voice-independent recognition of Russian language commands Istrasoft Voice Commander.

It would seem that now, in order to create a recognition system based on the new technology, there is very little left to do ...

The Center for Speech Technologies (STC), which has been working in this area since 1990, seems to have made some headway. STC has in its arsenal a whole set of software and hardware designed for noise cleaning and for improving the quality of sound, and primarily speech, signals: computer programs, stand-alone devices, and boards (DSPs) embedded in devices for recording or transmitting voice information (we already wrote about this firm in the article "How to improve speech intelligibility?" in No. 8'2004). The Center for Speech Technologies is known as a developer of noise reduction and sound editing tools: Clear Voice, Sound Cleaner, Speech Interactive Software, Sound Stretcher, etc. The company's specialists took part in the restoration of audio information recorded on board the sunken Kursk submarine and on aircraft involved in air crashes, as well as in the investigation of a number of criminal cases for which it was necessary to establish the content of speech phonograms.

The Sound Cleaner speech noise cleaning complex is a professional set of software and hardware tools designed to restore speech intelligibility and to purify audio signals recorded in difficult acoustic conditions or transmitted over communication channels. This truly unique software product is designed to denoise and enhance the sound quality of live (that is, incoming in real time) or recorded audio signal and can help improve the intelligibility and text interpretation of low-quality speech recordings (including archival ones) recorded in difficult acoustic conditions.

Naturally, Sound Cleaner works more effectively with respect to noise and sound distortions of a known nature, such as typical noises and distortions of communication and sound recording channels, noise from rooms and streets, working mechanisms, vehicles, household appliances, voice "cocktail", slow music, electromagnetic interference of power systems, computer and other equipment, reverberation and echo effects. In principle, the more uniform and “regular” the noise, the more successfully this complex will cope with it.

However, with two-channel data acquisition, Sound Cleaner significantly reduces the impact of noise of any type - for example, it has two-channel adaptive filtering methods designed to suppress both broadband non-stationary interference (such as speech, radio or television broadcasting, hall noise, etc.) and periodic (vibrations, network pickups, etc.). These methods are based on the fact that when extracting a useful signal, additional information about the properties of the interference, presented in the reference channel, is used.

Since we are talking about speech recognition, it is impossible not to mention another STC development - a family of computer transcribers, which, unfortunately, are not yet programs for automatic speech recognition and conversion into text, but rather computer digital tape recorders controlled from a specialized text editor. These devices are designed to increase the speed and comfort of documenting sound recordings of oral speech when preparing summaries, minutes of meetings, negotiations, lectures and interviews; they are also used in paperless office work and in many other cases. Transcribers are simple and easy to use and are accessible even to non-professional operators. At the same time, the speed of typing increases two to three times for professional operators who touch-type, and five to ten times for non-professionals! In addition, when the source is analog, mechanical wear of the tape recorder and the tape is significantly reduced. Computer transcribers also offer an interactive way to check the typed text against the corresponding sound track: the link between text and speech is established automatically, so that by moving the cursor to the part of the text under study you can find and listen to the corresponding sound fragments of the speech signal. Speech intelligibility can be improved here both by slowing down the playback speed without distorting the timbre of the voice and by repeatedly replaying unintelligible fragments in a loop.

Of course, it is much easier to implement a program that can only recognize a limited, small set of control commands and symbols. This, for example, can be numbers from 0 to 9 in the phone, the words "yes" / "no" and one-word commands to call the desired subscribers, etc. Such programs appeared the very first and have long been used in telephony for voice dialing or subscriber selection.

Recognition accuracy is usually improved by pre-tuning to the voice of a particular user, and in this way it is possible to achieve speech recognition even when the speaker has a speech defect or an accent. Everything seems to be fine, but noticeable progress in this area is visible only where individual use of the equipment or software by one or a few users is expected, with, at worst, an individual "profile" created for each of them.

In short, despite all the achievements of recent years, continuous speech recognition tools still make a large number of errors, require lengthy setup, are demanding on hardware and user skills, and refuse to work in noisy rooms, although the latter is important both for noisy offices, and for mobile systems and operation in a telephone environment.

However, speech recognition, like machine translation from one language to another, belongs to the so-called cult computer technologies, which receive special attention. Interest in these technologies is constantly fueled by countless works of science fiction writers, so constant attempts to create a product that matches our ideas about the technologies of tomorrow are inevitable. And even projects that essentially amount to nothing are often commercially quite successful, since the consumer is keenly interested in the very possibility of such implementations, regardless of whether he can put them into practice.


In this article, I want to review the basics of such an interesting area of software development as speech recognition. Naturally, I am not an expert in this topic, so my story will be replete with inaccuracies, errors and disappointments. Nevertheless, the main goal of my "work", as the name implies, is not a professional analysis of the problem, but a description of the basic concepts, problems and their solutions. In general, everyone who is interested is welcome under the cut!

Prologue

Let's start with the fact that our speech is a sequence of sounds. Sound, in turn, is an overlay (superposition) of sound vibrations (waves) of different frequencies. A wave, as we know from physics, is characterized by two attributes: amplitude and frequency.

By sampling the amplitude of the signal at regular intervals (analog-to-digital conversion), we convert mechanical vibrations into a set of numbers suitable for processing on modern computers.

It follows that the task of speech recognition is reduced to "matching" a set of numerical values ​​(digital signal) and words from some dictionary (Russian language, for example).

Let's see how, in fact, this very "mapping" can be implemented.

Input data

Let's say we have some file/stream with audio data. First of all, we need to understand how it works and how to read it. Let's look at the simplest option - a WAV file.

The format implies the presence of two blocks in the file. The first block is a header with information about the audio stream: bitrate, frequency, number of channels, file length, etc. The second block consists of "raw" data - the same digital signal, a set of amplitude values.

The logic for reading data in this case is quite simple. We read the header, check some restrictions (lack of compression, for example), save the data to a specially allocated array.
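A minimal sketch of such a reader in Python (this code is not from the original article; the use of the standard wave module, the 16-bit sample assumption and the function name are mine):

```python
import wave
import numpy as np

def read_wav(path):
    """Read an uncompressed PCM WAV file and return (samples, sample_rate).

    Samples are normalized to the range [-1; 1], which is what the
    entropy calculation further below assumes.
    """
    with wave.open(path, "rb") as wav:
        if wav.getcomptype() != "NONE":          # we only handle uncompressed PCM
            raise ValueError("compressed WAV files are not supported")
        sample_rate = wav.getframerate()
        n_channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())   # the "raw" data block

    # assume 16-bit samples; other bit depths would need a different dtype
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
    if n_channels > 1:                           # mix stereo down to mono
        samples = samples.reshape(-1, n_channels).mean(axis=1)
    return samples / 32768.0, sample_rate
```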

Recognition

Purely theoretically, now we can compare (element by element) the sample we have with some other one, the text of which we already know. That is, try to "recognize" speech ... But it's better not to do this :)

Our approach should be stable (well, at least a little) to changes in the timbre of the voice (the person pronouncing the word), volume and speed of pronunciation. Naturally, this cannot be achieved by element-by-element comparison of two audio signals.

Therefore, we will go in a slightly different way.

Frames

First of all, let's split our data into small time intervals - frames. Moreover, the frames should not go strictly one after another, but should "overlap": the end of one frame must overlap the beginning of the next.

Frames are a more appropriate unit of data analysis than specific signal values, since it is much more convenient to analyze waves at a certain interval than at specific points. The arrangement of frames “overlapping” makes it possible to smooth out the results of the analysis of frames, turning the idea of ​​frames into a kind of “window” moving along the original function (signal values).

It has been experimentally established that the optimal frame length should correspond to a gap of 10ms, "overlap" - 50%. Considering that the average word length (at least in my experiments) is 500ms, such a step will give us approximately 500 / (10 * 0.5) = 100 frames per word.
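A sketch of this framing step, assuming the normalized samples from the previous snippet (the frame length and overlap defaults are the values suggested above):

```python
import numpy as np

def split_into_frames(samples, sample_rate, frame_ms=10, overlap=0.5):
    """Split a signal into overlapping frames.

    frame_ms=10 and overlap=0.5 reproduce the values above:
    10 ms frames overlapping by 50%, i.e. a 5 ms step.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(frame_len * (1 - overlap))            # hop between frame starts
    n_frames = max(0, (len(samples) - frame_len) // step + 1)
    return np.array([samples[i * step:i * step + frame_len]
                     for i in range(n_frames)])
```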

Word breaking

The first task that has to be solved in speech recognition is the division of this very speech into separate words. For simplicity, let's assume that in our case speech contains some pauses (intervals of silence), which can be considered as “separators” of words.

In this case, we need to find some value, a threshold - values ​​above which are a word, below which are silence. There may be several options here:

  • set to a constant (works if the original signal is always generated under the same conditions, in the same way);
  • cluster signal values ​​by explicitly highlighting the set of values ​​corresponding to silence (it will work only if silence occupies a significant part of the original signal);
  • analyze entropy;

As you may have guessed, we will now talk about the last point :) Let's start with the fact that entropy is a measure of disorder, “a measure of the uncertainty of any experience” (c). In our case, entropy means how much our signal “fluctuates” within a given frame.

  • suppose that our signal is normalized and all its values lie in the range [-1;1];
  • build a histogram (distribution density) of the frame's signal values;
  • calculate the entropy as E = -Σ P[i] · log2(P[i]), where P[i] is the proportion of the frame's values that fall into the i-th histogram bin.
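A possible implementation of this entropy calculation (the number of histogram bins is an arbitrary choice of mine, not a value from the original article):

```python
import numpy as np

def frame_entropy(frame, n_bins=100):
    """Entropy of the distribution of signal values inside one frame.

    The frame is assumed to be normalized to [-1; 1].
    """
    hist, _ = np.histogram(frame, bins=n_bins, range=(-1.0, 1.0))
    probs = hist / len(frame)            # empirical distribution of values
    probs = probs[probs > 0]             # log2(0) is undefined, skip empty bins
    return -np.sum(probs * np.log2(probs))
```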

And so, we got the value of entropy. But this is just another characteristic of the frame, and in order to separate sound from silence we still need to compare it with something. In some articles, it is recommended to take the entropy threshold equal to the average of its maximum and minimum values (among all frames). However, in my case this approach did not give good results. Fortunately, entropy (unlike the mean square of the values) is a relatively independent quantity, which allowed me to pick its threshold as a constant (0.1).

Nevertheless, the problems do not end there: entropy can sag in the middle of a word (on vowels), or it can suddenly jump up because of a little noise. To deal with the first problem, we have to introduce the concept of a "minimum distance between words" and "glue" together nearby sets of frames that were separated by such a dip. The second problem is solved by using a "minimum word length" and cutting off all candidates that did not pass the selection (and were not used in the first step).
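A rough sketch of the whole segmentation step, building on the frame_entropy function above (the threshold, "minimum distance" and "minimum length" values are illustrative, not taken from the original article):

```python
def split_into_words(frames, threshold=0.1, min_gap=10, min_word=4):
    """Group frames into word candidates by entropy.

    min_gap is how many "silent" frames may separate voiced frames that
    still belong to the same word; min_word is the minimum word length
    in frames.
    """
    voiced = [i for i, f in enumerate(frames) if frame_entropy(f) > threshold]
    words = []
    for i in voiced:
        if words and i - words[-1][-1] <= min_gap:
            words[-1].append(i)          # "glue" frames separated by a short dip
        else:
            words.append([i])
    # drop candidates that are too short to be a word
    return [(w[0], w[-1]) for w in words if w[-1] - w[0] + 1 >= min_word]
```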

If the speech is not "discrete" in principle, one can try to break the original set of frames into subsequences prepared in a certain way, each of which is then subjected to the recognition procedure. But that's a completely different story :)

And so, we have a set of frames corresponding to a certain word. We can take the path of least resistance and use the root mean square (RMS) of all its values as the numerical characteristic of a frame. However, such a metric carries very little information suitable for further analysis.

This is where Mel-frequency cepstral coefficients come into play. According to Wikipedia (which, as you know, does not lie), MFCC is a kind of representation of the energy of the signal spectrum. The advantages of using it are as follows:

  • The spectrum of the signal is used (that is, the expansion in terms of the basis of orthogonal [co]sinusoidal functions), which makes it possible to take into account the wave “nature” of the signal in further analysis;
  • The spectrum is projected onto a special mel-scale, allowing you to highlight the most significant frequencies for human perception;
  • The number of calculated coefficients can be limited to any value (for example, 12), which allows you to “compress” the frame and, as a result, the amount of information being processed;

Let's look at the process of calculating the MFCC coefficients for a certain frame.

Let's represent our frame as a vector x = (x[0], x[1], ..., x[N-1]), where N is the size of the frame.

Fourier expansion

First of all, we calculate the signal spectrum using the discrete Fourier transform (preferably its “fast” FFT implementation).

That is, the result will be a vector of the following form: X[k] = Σ (n = 0 .. N-1) x[n] · e^(-2πi·k·n / N), for k = 0, ..., N-1.

It is important to understand that after this transformation, on the x-axis we have the frequency (Hz) of the signal, and on the y-axis we have the magnitude |X[k]| (as a way to get away from complex values).
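In code this step is essentially a one-liner (a sketch using NumPy's FFT; the original article may have used a different implementation):

```python
import numpy as np

def frame_spectrum(frame):
    """Magnitude spectrum of one frame via the fast Fourier transform."""
    return np.abs(np.fft.fft(frame))   # magnitudes hide the complex values
```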

Calculation of mel filters

Let's start with what a mel is. Again according to Wikipedia, a mel is a "psychophysical unit of pitch" based on the subjective perception of average people. It depends primarily on the frequency of the sound (as well as on its volume and timbre). In other words, it is a value showing how "significant" a sound of a certain frequency is to us.

You can convert frequency to mels using the following formula, in one of its common forms (remember it as "formula-1"): M = 1127 · ln(1 + F / 700).

The reverse transformation looks like this (remember it as "formula-2"): F = 700 · (e^(M / 1127) - 1).
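A small sketch of both conversions, assuming the 1127 · ln form of the formula given above:

```python
import math

def freq_to_mel(freq):
    """Formula-1: hertz to mel (one common form of the formula)."""
    return 1127.0 * math.log(1.0 + freq / 700.0)

def mel_to_freq(mel):
    """Formula-2: mel back to hertz."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)
```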

A plot of mel versus frequency is a curve that rises steeply at low frequencies and flattens out at higher ones.

But back to our task. Let's say we have a frame with a size of 256 elements. We know (from the audio format data) that the sampling frequency of the recording is 16000 Hz. Let's assume that human speech lies in the range from 300 to 8000 Hz. Let us set the number of sought mel coefficients at M = 10 (the recommended value).

In order to decompose the spectrum obtained above on a mel-scale, we need to create a “comb” of filters. In essence, each mel filter is a triangular window function that allows you to sum the amount of energy over a certain frequency range and thereby get the mel coefficient. Knowing the number of mel coefficients and the analyzed frequency range, we can build a set of such filters:

Note that the higher the mel coefficient number, the wider the base of the filter. This is because the division of the frequency range of interest into the ranges processed by the filters is done on the mel scale.

But we digress again. And so, for our case the range of frequencies of interest is [300; 8000] Hz. According to formula-1, on the mel scale this range turns into approximately [401; 2840].

m[i] = the M + 2 = 12 boundary points spaced evenly across this mel interval

Please note that the points are evenly spaced on the mel scale. Let's convert the scale back to hertz using formula-2:

h[i] = the same points converted back to hertz: approximately 300, 517, 782, 1104, 1496, 1973, 2554, 3262, 4123, 5171, 6447, 8000

As you can see, the scale now gradually stretches, leveling out the dynamics of the growth of "significance" at low and high frequencies.

Now we need to overlay the resulting scale on the spectrum of our frame. As we remember, on the X-axis we have the frequency. The length of the spectrum is 256 elements, and it covers 16000 Hz. By solving a simple proportion, you can get the following formula:

f(i) = floor((frameSize+1) * h(i) / sampleRate)

Which in our case is equivalent to

f(i) = 4, 8, 12, 17, 23, 31, 40, 52, 66, 82, 103, 128

That's all! Knowing the reference points on the X-axis of our spectrum, it is easy to construct the filters we need: the m-th filter equals 0 for k < f(m-1), rises linearly from 0 to 1 on [f(m-1); f(m)], falls linearly from 1 back to 0 on [f(m); f(m+1)], and equals 0 for k > f(m+1).
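A sketch of the whole filter-construction step, reusing freq_to_mel and mel_to_freq from above (the exact bin numbers may differ slightly from those in the text depending on the form of the mel formula and on rounding):

```python
import numpy as np

def mel_filterbank(n_filters=10, frame_size=256, sample_rate=16000,
                   low_freq=300.0, high_freq=8000.0):
    """Build a "comb" of triangular mel filters.

    The default values match the example in the text.
    """
    # M + 2 boundary points, evenly spaced on the mel scale
    mels = np.linspace(freq_to_mel(low_freq), freq_to_mel(high_freq),
                       n_filters + 2)
    freqs = np.array([mel_to_freq(m) for m in mels])                      # h[i]
    bins = np.floor((frame_size + 1) * freqs / sample_rate).astype(int)   # f(i)

    filters = np.zeros((n_filters, frame_size))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            filters[m - 1, k] = (k - left) / (center - left)    # rising slope
        for k in range(center, right):
            filters[m - 1, k] = (right - k) / (right - center)  # falling slope
    return filters
```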

Applying filters, logarithm of spectrum energy

The application of the filter consists in pairwise multiplication of its values ​​with the values ​​of the spectrum. The result of this operation is the mel coefficient. Since we have M filters, there will be the same number of coefficients.

However, we need to apply the mel filters not to the values of the spectrum, but to its energy, that is, to the squared magnitudes. Then we take the logarithm of the results. It is believed that this reduces the sensitivity of the coefficients to noise.
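A minimal sketch of this step (the small epsilon added before the logarithm is my own guard against empty filters, not part of the original description):

```python
import numpy as np

def log_mel_energies(spectrum, filters):
    """Apply the mel filters to the spectrum energy and take the logarithm."""
    energies = filters @ (spectrum ** 2)   # one value per mel filter
    return np.log(energies + 1e-10)        # epsilon avoids log(0)
```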

Cosine transform

The Discrete Cosine Transform (DCT) is used to get those "cepstral" coefficients. Its meaning is to “compress” the results obtained by increasing the significance of the first coefficients and decreasing the significance of the latter.

In this case, DCT-II is used without any multiplication by a scale factor (such as the usual sqrt(2/N) normalization).
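A sketch of the unscaled DCT-II written out explicitly (a library DCT such as scipy.fftpack.dct with type=2 computes the same sums up to a constant factor of 2):

```python
import numpy as np

def mfcc_from_log_energies(log_energies):
    """DCT-II of the log mel energies, without any scale factor."""
    m = len(log_energies)
    n = np.arange(m)
    # c[l] = sum_n log_energies[n] * cos(pi * l * (n + 0.5) / M)
    return np.array([np.sum(log_energies * np.cos(np.pi * l * (n + 0.5) / m))
                     for l in range(m)])
```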

Now for each frame we have a set of M mfcc coefficients that can be used for further analysis.

Examples of code for the methods described above can be found online.

Recognition algorithm

Here, dear reader, the main disappointment awaits you. On the Internet, I happened to see a lot of highly intelligent (and not so) disputes about which recognition method is better. Someone stands up for Hidden Markov Models, someone for neural networks, someone's thoughts are basically impossible to understand :)

In any case, a lot of preference is given to HMMs, and it is their implementation that I am going to add to my code... in the future :)

At the moment, I propose to stop at a much less effective, but many times simpler method.

And so, remember that our task is to recognize a word from some dictionary. For simplicity, we will recognize the names of the first ten numbers: “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”, “ten”.

Now let's pick up an iPhone/Android, go around to L colleagues and ask each of them to dictate these words for the record. Next, let's associate (in a local database or a simple file) with each word the L sets of MFCC coefficients of the corresponding recordings.

We will call this correspondence “Model”, and the process itself - Machine Learning! In fact, simply adding new samples to the database has an extremely weak connection with machine learning ... But the term is too trendy :)

Now our task is reduced to selecting the “closest” model for some set of mfcc-coefficients (recognizable word). At first glance, the problem can be solved quite simply:

  • for each model, we find the average (Euclidean) distance between the identified mfcc-vector and the model vectors;
  • we choose as the correct one the model, the average distance to which will be the smallest;

However, the same word can be pronounced both by Andrei Malakhov and by some Estonian colleague of his. In other words, the length of the MFCC sequence for the same word can be different.

Fortunately, the problem of comparing sequences of different lengths has already been solved in the form of the Dynamic Time Warping algorithm. This dynamic programming algorithm is beautifully described both on the bourgeois Wiki and on the Orthodox Habr.

The only change that should be made to it is the way the distance is found. We must remember that a model's MFCC vector is actually a sequence of MFCC "subvectors" of dimension M obtained from frames. So the DTW algorithm should find the distance between sequences of these "subvectors" of dimension M. That is, the (Euclidean) distances between the MFCC "subvectors" of frames should be used as the values of the distance matrix.
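Finally, a sketch of DTW-based matching tying the pieces together (the recognize helper and the structure of models are my illustrative assumptions, not the author's code):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two MFCC sequences.

    seq_a and seq_b are arrays of shape (n_frames, M): one M-dimensional
    mfcc "subvector" per frame.  The cost of matching frame i to frame j
    is the Euclidean distance between their subvectors.
    """
    n, m = len(seq_a), len(seq_b)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j],      # insertion
                                   dtw[i, j - 1],      # deletion
                                   dtw[i - 1, j - 1])  # match
    return dtw[n, m]

def recognize(word_mfcc, models):
    """Pick the model whose recordings are, on average, closest to the word.

    `models` is assumed to be a dict: word -> list of MFCC sequences,
    i.e. the "database" collected from colleagues above.
    """
    def avg_dist(samples):
        return sum(dtw_distance(word_mfcc, s) for s in samples) / len(samples)
    return min(models, key=lambda w: avg_dist(models[w]))
```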

Experiments

I did not have the opportunity to test the work of this approach on a large “training” sample. The results of tests on a sample of 3 instances for each word in non-synthetic conditions showed, to put it mildly, not the best result - 65% of correct recognitions.

However, my goal was to create the simplest possible speech recognition application. So to speak, a "proof of concept" :)

The SendPulse service is a marketing tool for creating a subscription base and converting random visitors to your site into regular ones. SendPulse combines the most important features for attracting and retaining customers on one platform:
● e-mail newsletters,
● web push notifications,
● SMS mailings,
● SMTP,
● mailings in Viber,
● messages in Facebook Messenger.

Email newsletters

You can use various tariffs for e-mail newsletters, including a free one. The free plan has limitations: a subscriber base of no more than 2,500 addresses.
The first thing to start with when working with an e-mail mailing service is to create your own address book. Set a title and upload a list of e-mail addresses.


SendPulse makes it easy to create subscription forms in the form of a pop-up window, embedded forms, floating and fixed in a certain part of the screen. With the help of subscription forms, you will collect a subscriber base from scratch or supplement your base with new addresses.
In the form builder, you can create exactly the subscription form that best suits your needs, and the service tips will help you cope with this task. It is also possible to use one of the available ready-made forms.


When creating subscription forms, it is mandatory to use an e-mail with a corporate domain. Read how.
Message Templates will help to beautifully design your letters to subscribers. You can create your own letter template in a special constructor.


Auto mailings. Content managers actively use automatic mailings, which help automate the process of working with clients. There are several ways to create an auto mailing:
A sequential series of letters. This is the simplest option: regardless of any conditions, several letters are written and sent to recipients in a certain order. There are variations here - a message series (a simple chain of messages), a special date (letters timed to certain dates), and a trigger letter (a letter sent depending on the subscriber's actions, such as opening a message).
Automation360 - mailing with certain filters and conditions, taking conversions into account.
Ready-made chains from a template. You can create a series of letters based on a given template, or modify the template and customize it to suit your needs.
A/B testing will help you experiment with different options for sending a series of emails and determine the best option for opens or transitions.

Sending Push Notifications

Push mailings are subscriptions in a browser window, a kind of replacement for RSS subscriptions. Web push technologies have rapidly entered our lives, and it is already difficult to find a site that does not use push mailings to attract and retain customers. After adding the subscription request script to your site, you can send messages both manually and as auto-broadcasts, by creating a series of messages or by collecting data from RSS. The second option implies that after a new article appears on your site, a notification with a brief announcement will automatically be sent to your subscribers.


New from SendPulse: you can now monetize your site with push notifications by embedding advertisements in them. Upon reaching $10, payments are made every Monday to one of the payment systems: Visa/Mastercard, PayPal or WebMoney.
Push messages on the service are absolutely free. Payment is taken only for White Label - mailings without mentioning the SendPulse service, but if the service logo does not bother you, then you can use push notifications for free without restrictions.

SMTP

The SMTP feature protects your mailing list from being blacklisted by using white IP addresses. The DKIM and SPF cryptographic signature technologies used in SendPulse mailings increase the credibility of the emails you send, making your emails less likely to end up in spam or blacklisted.

Facebook messenger bots

Facebook chatbot is in beta testing. You can connect it to your page and send messages to subscribers.

Sending SMS

Through the SendPulse service, it is easy to send mailings to a database of phone numbers. First you need to create an address book with a list of phone numbers. To do this, select the "Address book" section, create a new address book, upload phone numbers. Now you can create an SMS mailing list for this database. The price of SMS mailing varies depending on the telecom operators of the recipients and averages from 1.26 rubles to 2.55 rubles per 1 sent SMS.

Affiliate program

SendPulse implements an affiliate program in which a registered user using your link who has paid the tariff will bring you 4,000 rubles. The invited user receives a discount of 4000 rubles for the first 5 months of using the service.

On Facebook we were asked:
“To work with the text, I need to transcribe 3 hours of voice recording. I tried to upload an audio file with a picture to YouTube and use their text transcriber, but it turns out some kind of abracadabra. Can you please tell me how to solve this technically? Thank you!
Alexander Konovalov»

Alexander, there is a simple technical solution - but the result will depend solely on the quality of your recording. Let me explain what quality I'm talking about.

In recent years, Russian speech recognition technologies have advanced considerably. The percentage of recognition errors has decreased to a level at which it has become easier to "dictate" a text into a special mobile application or Internet service, manually correcting individual mistakes, than to type the whole text on the keyboard.

But for the artificial intelligence of the recognition system to be able to do its job, the user must do his own. Namely: speak into the microphone clearly and measuredly, avoid strong background noise, and if possible use a stereo headset or an external microphone attached to the lapel (for recognition quality it is important that the microphone is always at the same distance from the lips and that you speak at the same volume). Naturally, the higher the class of the audio device, the better.

It is easy to adhere to these conditions if, instead of accessing the speech recognition Internet service directly, you use a voice recorder as an intermediary device. By the way, such a "personal secretary" is especially indispensable when you have no online access. Naturally, it is better to use at least an inexpensive professional voice recorder than the recording device built into a cheap mp3 player or smartphone. This will give a much better chance of "feeding" the resulting recordings to the speech recognition service.

It is difficult, but you can persuade the interlocutor you are interviewing to follow these rules (one more piece of advice: if you don't have an external clip-on microphone in the kit, at least keep the recorder next to the interlocutor, not next to yourself).

But "outlining" a conference or a seminar at the proper level in automatic mode is, in my opinion, almost unrealistic (after all, you will not be able to control the speech of the speakers and the reaction of the audience). A rather interesting option, though, is turning professionally recorded audio lectures and audio books into text (provided background music and noise were not superimposed on them).

Let's hope that the quality of your dictaphone recording is high enough to be able to decipher it in automatic mode.

If not, with almost any recording quality, you can decrypt in semi-automatic mode.

In addition, in a number of situations, the greatest savings in time and effort will bring you, paradoxically, decoding in manual mode. More precisely, the version that I myself have been using for a dozen years. 🙂

So, in order.

1. Automatic speech recognition

Many advise transcribing voice recordings on YouTube. But this method forces the user to spend time downloading the audio file and background image, and then cleaning the resulting text from timestamps. Meanwhile, this time is easy to save. 🙂

You can recognize audio recordings directly from your computer using the capabilities of one of the Internet services running on the Google recognition engine (I recommend Speechpad.ru or Speechlogger.com). All you need to do is to do a little trick: instead of your voice playing from the microphone, redirect the audio stream played by your computer player to the service.

This trick relies on the software stereo mixer (usually used to record music playing on a computer or to broadcast it from the computer to the Internet).

The stereo mixer was part of Windows XP but was removed by the developers from later versions of this operating system (they say, to protect copyrights: so that gamers do not rip music from games, and so on). However, a stereo mixer often comes with the drivers for audio cards (for example, Realtek chips built into the motherboard). If you can't find the stereo mixer on your PC using the screenshots below, try reinstalling the audio drivers from the CD that came with your motherboard, or from the motherboard manufacturer's website.

If this does not help, install an alternative program on your computer, for example the free VB-CABLE Virtual Audio Device: the owner of the aforementioned Speechpad.ru service recommends using it.

The first step: you must disable the microphone for use in recording mode and enable the stereo mixer (or the virtual VB-CABLE) instead.

To do this, click on the speaker icon in the lower right corner (near the clock), or select the "Sound" section in the "Control Panel". In the "Recording" tab of the window that opens, right-click and check the boxes next to "Show Disabled Devices" and "Show Disconnected Devices". Right-click on the microphone icon and select "Disable" (in general, turn off all devices marked with a green icon).

Right-click on the stereo mixer icon and select "Enable". A green icon will appear on the icon, indicating that the stereo mixer has become the default device.

If you decide to use VB-CABLE, then enable it in the same way in the "Record" tab.

And also - in the "Playback" tab.

The second step. Start playback of the audio recording in any player (if you need to transcribe the audio track of a video, you can also start a video player). At the same time, load the Speechpad.ru service in the Chrome browser and click its "Enable Recording" button. If the recording is of high enough quality, you will see before your eyes how the service turns the speech into meaningful text close to the original. True, without punctuation marks, which you will have to insert yourself.

As an audio player, I advise you to use AIMP, which will be discussed in more detail in the third subchapter. Now I will only note that this player allows you to slow down the recording without speech distortion, as well as correct some other errors. This can somewhat improve the recognition of not very high-quality recordings. (Sometimes it is even advised to pre-process bad recordings in professional audio editing programs. However, in my opinion, this is too laborious a task for most users, who will type text much faster by hand. :))

2. Semi-automatic speech recognition

Everything is simple here. If the recording is of poor quality and recognition "chokes", or the service produces too many errors, help the cause yourself by embedding yourself, as the announcer, into the chain "audio player - announcer - recognition system".

Your task is to listen to the recorded speech through headphones and simultaneously dictate it through the microphone to the Internet recognition service. (Of course, you don't need to switch from the microphone to the stereo mixer or the virtual cable in the list of recording devices, as in the previous section.) As an alternative to the Internet services mentioned above, you can use smartphone applications like the free Yandex.Dictation or the dictation function on an iPhone running iOS 8 or later.

I note that in semi-automatic mode you have the opportunity to immediately dictate punctuation marks, which services are not yet capable of placing in automatic mode.

If you manage to dictate synchronously with the playback of the recording on the player, the preliminary transcription will take almost as much time as the recording itself (not counting the subsequent time spent on correcting spelling and grammatical errors). But even working according to the scheme: "listen to a phrase - dictate - listen to a phrase - dictate" can give you a good time saving compared to traditional typing.

As an audio player, I recommend using the same AIMP. First, you can use it to slow down playback to a speed that you're comfortable with in synchronous dictation. Secondly, this player can return the recording for a given number of seconds: this is sometimes necessary in order to better hear an unintelligible phrase.

3. Manual transcription of a voice recorder

You can find out in practice that you get tired of semi-automatic dictation too quickly. Or you make too many mistakes with the service. Or, thanks to your speed typing skills, it is much easier to create ready-made corrected text on the keyboard than using dictation. Or your voice recorder, stereo headset microphone, audio card do not provide acceptable sound quality for the service. Or maybe you just don't have the opportunity to dictate out loud in your work or home office.

In all these cases, my proprietary manual decoding method will help you (listen to the recording in AIMP - type in Word). With it, you can turn a note into text faster than many professional journalists can, whose typing speed is similar to yours! At the same time, you will spend much less energy and nerves than they do. 🙂

What is the main reason for wasting energy and time during the transcription of audio recordings in the traditional way? Due to the fact that the user makes a lot of unnecessary movements.

The user constantly reaches for the voice recorder, then for the computer keyboard: stop playback, type the passage just heard into a text editor, start playback again, rewind the illegible part of the recording, and so on.

Using a regular software player on a computer makes the process a little easier: the user has to constantly minimize / expand Word, stop / start the player, and even crawl back and forth with the player's slider to find an illegible fragment, and then return to the last listened place in the recording.

To reduce these and other losses of time, specialized IT companies are developing software and hardware transcribers. These are quite expensive solutions for professionals - the same journalists, court stenographers, investigators, etc. But, in fact, for our purposes, only two functions are required:

  • the ability to slow down the playback of a voice recorder without distorting it and lowering the tone (many players allow you to slow down the playback speed - but, alas, at the same time, the human voice turns into a monstrous robotic voice that is difficult to hear for a long time);
  • the ability to stop recording or roll it back for a specified number of seconds and return it back without stopping typing and without minimizing the text editor window.

I've tested dozens of audio programs in my time and found only two affordable paid applications that meet these requirements. I bought one of them. Then I searched a little more for my dear readers 🙂 and found a wonderful free solution: the AIMP player, which I still use myself.

“After entering the AIMP settings, find the Global Keys section and reconfigure Stop/Start to the Escape (Esc) key. Believe me, this is the most convenient, because you don’t have to think about it and your finger won’t accidentally fall on other keys. Set the items “Move backward a little” and “Move forward a little”, respectively, to the Ctrl + back / forward cursor keys (you have four arrow keys on your keyboard - select two of them). This function is needed to re-listen to the last fragment or skip forward a little.

Then, by calling up the EQ, you can decrease the Velocity and Tempo values ​​- and increase the Pitch value. In this case, you will notice that the playback speed will slow down, but the pitch of the voice (if you choose the “Pitch” value well) will not change. Choose these two parameters so that you have time to type almost simultaneously, only occasionally stopping it.

When everything is set up, typing will take you less time and your hands will tire less. You will be able to transcribe the audio recording calmly and comfortably, practically without lifting your fingers from typing on the keyboard.”

I can only add to what has been said that if the recording is not of very high quality, you can try to improve its playback by experimenting with other settings in AIMP's Sound Effects Manager.

And the number of seconds for which it will be most convenient for you to move backward or forward through the recording using hot keys - set in the “Player” section of the “Settings” window (which can be called up by pressing the hot keys “Ctrl + P”).

I wish you to spend less time on routine tasks and use it fruitfully for the main things! 🙂 And do not forget to turn the microphone back on in the list of recording devices when you are going to talk on Skype! 😉

3 ways to transcribe a voice recording: speech recognition, dictation, manual mode

“I would like to say right away that I am dealing with recognition services for the first time, so I will tell you about them from a layman's point of view,” our expert noted. “For testing recognition, I used three services: Google, Yandex and Azure.”

Google

The notorious IT corporation offers to test its Google Cloud Platform product online. Anyone can try the service for free. The product itself is convenient and easy to use.

Pros:

  • support for more than 80 languages;
  • fast processing of names;
  • high-quality recognition in conditions of poor communication and in the presence of extraneous sounds.

Minuses:

  • there are difficulties in recognizing speech with an accent or poor pronunciation, which makes the system difficult to use for anyone other than native speakers;
  • lack of clear technical support service.

Yandex

Speech recognition from Yandex is provided in several versions:

  • Cloud
  • Library for access from mobile applications
  • "Boxed" version
  • JavaScript API

But let's be objective. We are primarily interested not in the variety of possibilities of use, but in the quality of speech recognition. Therefore, we took advantage of the trial version of SpeechKit.

Pros:

  • ease of use and configuration;
  • good text recognition in Russian;
  • the system gives several answers and tries to find the most similar answer through neural networks.

Minuses:

  • when streaming, some words may not be recognized correctly.

Azure

Azure is developed by Microsoft. Against the background of its analogues, it stands out for its price. But be prepared to face some challenges. The instructions presented on the official website are either incomplete or outdated. We failed to launch the service properly, so we had to use a third-party launch window. Even then, you will need a key from the Azure service for testing.

Pros:

  • compared to other services, Azure processes messages in real time very quickly.

Minuses:

  • the system is very sensitive to accent, it is difficult to recognize speech from non-native speakers;
  • The system only works in English.

Review results:

After weighing all the pros and cons, we settled on Yandex. SpeechKit is more expensive than Azure, but cheaper than Google Cloud Platform. In the program from Google, a constant improvement in the quality and accuracy of recognition was noticed. The service is self-improving through machine learning technologies. However, recognition of Russian-language words and phrases from Yandex is a level higher.

How to use voice recognition in business?

There are a lot of options for using recognition, but we will focus on the one that, first of all, will affect the sales of your company. For clarity, let's analyze the process of recognition using a real example.

Not so long ago, one well-known SaaS service became our client (at the company's request, its name is not disclosed). With the help of F1Golos, they recorded two audio clips: one aimed at extending the life of warm customers, the other at processing customer requests.

How to extend the life of customers with voice recognition?

Often, SaaS services operate on a monthly subscription fee. Sooner or later, the trial period or the paid traffic ends, and the service needs to be extended. The company decided to warn users about the end of their traffic 2 days before the expiration of the period of use. Users were notified via voice messages. The clip sounded like this: "Good afternoon, we remind you that your period of paid use of the XXX service is ending. To extend the service, say yes; to cancel the services provided, say no."

Calls from users who said the code words YES, EXTEND, I WANT or DETAILS were automatically transferred to the company's operators. As a result, about 18% of users extended their subscription after just one call.

How to simplify the data processing system using speech recognition?

The second audio clip, launched by the same company, was of a different nature. They used voice messages to reduce the cost of verifying phone numbers. Previously, they verified user numbers with a bot call: the robot asked users to press certain keys on the phone. With the advent of recognition technologies, the company changed tactics. The text of the new clip was as follows: "You have registered on the XXX portal. If you confirm your registration, say yes. If you did not submit a registration request, say no." If the client uttered the words YES, I CONFIRM, YEAH or OF COURSE, this was instantly transferred to the company's CRM system, and the registration request was confirmed automatically within a couple of minutes. The introduction of recognition technologies reduced the duration of one call from 30 to 17 seconds. Thus, the company cut its costs almost in half.

If you are interested in other ways to use voice recognition, or want to learn more about voice mailings, follow the link. At F1Golos, you can send out your first newsletter for free and learn for yourself how new recognition technologies work.