Artificial Intelligence for Speech Recognition

Updated on May 29, 2026

Abstract

AI is the study of how computers can be made to perform tasks that are currently done better by humans. It is an interdisciplinary field where computer science intersects with philosophy, psychology, engineering and other fields. Humans make decisions based upon experience and intention. The essence of AI lies in enabling computers to mimic this learning process; this integration is known as Artificial Intelligence Integration.

When you dial the telephone number of a big company, you are likely to hear the sonorous voice of a cultured lady who responds to your call with great courtesy, saying "Welcome to company X. Please give me the extension number you want." You pronounce the extension number, your name, and the name of the person you want to contact. If the called person accepts the call, the connection is made quickly. This is artificial intelligence at work: an automatic call-handling system is used without employing any telephone operator.

The Technology

Artificial intelligence (AI) involves two basic ideas. First, it involves studying the thought processes of human beings. Second, it deals with representing those processes via machines (such as computers and robots). AI is behaviour by a machine which, if performed by a human being, would be called intelligent. It makes machines smarter and more useful, and can be less expensive than natural intelligence.

Natural language processing (NLP) refers to artificial intelligence methods of communicating with a computer in a natural language like English. The main objective of an NLP program is to understand input and initiate action. The input words are scanned and matched against internally stored known words. Identification of a keyword causes some action to be taken. In this way, one can communicate with the computer in one's own language. No special commands or computer language are required, and there is no need to enter programs in a special language for creating software.
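The keyword matching described above can be sketched in a few lines of Python. This is only a minimal illustration: the keyword table and the action names are hypothetical, and a real NLP program would do far more than look up single words.

```python
# Hypothetical keyword table: each known word maps to an action name.
KEYWORD_ACTIONS = {
    "open": "open_file",
    "print": "print_document",
    "close": "close_file",
}


def find_actions(sentence):
    """Scan the input words and return the action for each known keyword."""
    actions = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # drop surrounding punctuation
        if word in KEYWORD_ACTIONS:
            actions.append(KEYWORD_ACTIONS[word])
    return actions
```

Calling find_actions("Please open the report and print it.") would match the two stored keywords "open" and "print" and ignore every other word.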

VoiceXML takes speech recognition even further. Instead of talking to your computer, you're essentially talking to a web site, and you're doing this over the phone. OK, you say, well, what exactly is speech recognition? Simply put, it is the process of converting spoken input to text. Speech recognition is thus sometimes referred to as speech-to-text. Speech recognition allows you to provide input to an application with your voice. Just as clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by talking. In the desktop world, you need a microphone to be able to do this. In the VoiceXML world, all you need is a telephone.

The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things. It can interpret the result of the recognition as a command; in this case, the application is a command and control application. Alternatively, it can handle the recognized text simply as text; in this case, it is considered a dictation application.
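The difference between the two kinds of application can be illustrated with a short sketch. The command set, the mode names and the function below are assumptions made for this example; they do not come from any particular speech recognition engine.

```python
# Hypothetical command set for a command and control application.
COMMANDS = {"open file", "save file", "close window"}


def handle_recognition(text, mode, document):
    """Treat recognized text as a command or as dictated text.

    mode is "command" or "dictation" (names are assumptions for this sketch).
    Returns the command to run, or None when nothing should be executed.
    """
    normalized = text.strip().lower()
    if mode == "command":
        # Command and control: only known commands are acted upon.
        return normalized if normalized in COMMANDS else None
    # Dictation: the text is kept as text and never interpreted.
    document.append(text.strip())
    return None
```

The same recognized words, "Save file", are returned as a command in one mode and simply appended to the document in the other.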

The user speaks into a microphone connected to the computer, which identifies the words and sends them to the NLP component for further processing. Once recognized, the words can be used in a variety of applications such as display, robotics, commands to computers, and dictation.

What is a speech recognition system?

A speech recognition system is a type of software that allows the user to have their spoken words converted into written text in a computer application such as a word processor or spreadsheet. The computer can also be controlled by the use of spoken commands.

Speech recognition software can be installed on a personal computer of appropriate specification. The user speaks into a microphone (a headphone microphone is usually supplied with the product). The software generally requires an initial training and enrolment process in order to teach the software to recognise the voice of the user. A voice profile is then produced that is unique to that individual. This procedure also helps the user to learn how to ‘speak’ to a computer.

After the training process, the user’s spoken words will produce text; the accuracy of this will improve with further dictation and conscientious use of the correction procedure. With a well-trained system, around 95% of the words spoken could be correctly interpreted. The system can be trained to identify certain words and phrases and examine the user’s standard documents in order to develop an accurate voice file for the individual.

However, there are many other factors that need to be considered in order to achieve a high recognition rate. There is no doubt that the software works and can liberate many learners, but the process can be far more time-consuming than first-time users may appreciate, and the results can often be poor. This can be very demotivating, and many users give up at this stage. Quality support from someone who is able to show the user the most effective ways of using the software is essential.

When using speech recognition software, the user’s expectations and the advertising on the box may well be far higher than what will realistically be achieved. ‘You talk and it types’ can be achieved by some people only after a great deal of perseverance and hard work.

Terms and Concepts

Following are a few of the basic terms and concepts that are fundamental to speech recognition. It is important to have a good understanding of these concepts when developing VoiceXML applications.

3.2.1 Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input (in other words, a lack of silence), the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs.

Utterances are sent to the speech engine to be processed. If the user doesn’t say anything, the engine returns what is known as a silence timeout, which indicates that no speech was detected within the expected timeframe. The application then takes an appropriate action, such as reprompting the user for input. An utterance can be a single word, or it can contain multiple words (a phrase or a sentence).
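The way silence marks the start and end of an utterance can be sketched as follows. This is a simplified illustration, not how any specific engine works: it assumes the audio has already been reduced to one energy value per frame, and the threshold and the number of silent frames that end an utterance are made-up parameters.

```python
def find_utterances(frame_energies, threshold=0.1, end_silence_frames=3):
    """Return (start, end) frame index pairs, end exclusive, per utterance.

    A frame counts as silence when its energy is below `threshold`. An
    utterance starts at the first non-silent frame and ends once
    `end_silence_frames` consecutive silent frames follow it.
    """
    utterances = []
    start = None      # index of the first frame of the current utterance
    silent_run = 0    # consecutive silent frames seen inside an utterance
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i          # lack of silence: utterance begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run == end_silence_frames:
                # Enough silence: the utterance ended where the silence began.
                utterances.append((start, i - end_silence_frames + 1))
                start = None
                silent_run = 0
    if start is not None:  # audio ended before enough silence was seen
        utterances.append((start, len(frame_energies) - silent_run))
    return utterances


def is_silence_timeout(frame_energies, threshold=0.1):
    """True when no speech at all was detected in the listening window."""
    return not find_utterances(frame_energies, threshold)
```

Note that a short pause inside a phrase (fewer silent frames than end_silence_frames) does not split the utterance, which is why the amount of silence required matters.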

3.2.2 Pronunciations

The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like. Words can have multiple pronunciations associated with them. For example, the word “the” has at least two pronunciations in the U.S. English language: “thee” and “thuh.” As a VoiceXML application developer, you may want to provide multiple pronunciations for certain words and phrases to allow for variations in the ways your callers may speak them.
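One way to picture multiple pronunciations is as a lexicon that maps each word to a list of accepted sounds. The sketch below is purely illustrative: the dictionary, the function names and the informal respellings are assumptions, and real engines use phonetic alphabets and their own lexicon formats.

```python
# Hypothetical lexicon: each word maps to the pronunciations that are accepted.
# The respellings follow the informal style used in the text, not a real
# phonetic alphabet. The entry for "either" is an added illustration.
LEXICON = {
    "the": ["thee", "thuh"],
    "either": ["ee-ther", "eye-ther"],
}


def add_pronunciation(word, pronunciation):
    """Register an extra pronunciation for a word, avoiding duplicates."""
    entries = LEXICON.setdefault(word.lower(), [])
    if pronunciation not in entries:
        entries.append(pronunciation)


def words_for(pronunciation):
    """Return every word whose pronunciation list contains the given sound."""
    return sorted(w for w, prons in LEXICON.items() if pronunciation in prons)
```

A developer expecting callers to say a word in two ways would call add_pronunciation twice for that word, once per variant.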

3.2.3 Grammars

As a VoiceXML application developer, you must specify the words and phrases that users can say to your application. These words and phrases are defined to the speech recognition engine and are used in the recognition process. You can specify the valid words and phrases in a number of different ways, but in VoiceXML, you do this by specifying a grammar. A grammar uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural language capability.
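The idea of a grammar as a set of rules that defines what can be said can be sketched in Python. This is not the syntax of any real grammar format used with VoiceXML; the slot-based structure, the example phrases and the function names are assumptions chosen to keep the illustration short.

```python
from itertools import product

# Hypothetical grammar: a sequence of slots, each listing the allowed
# alternatives. An empty string marks a slot as optional.
ORDER_GRAMMAR = [
    ["", "please"],
    ["call", "dial"],
    ["sales", "support", "the operator"],
]


def expand(grammar):
    """Return the set of every phrase the grammar allows."""
    phrases = set()
    for choice in product(*grammar):
        # Skip the empty alternative of optional slots when joining.
        phrases.add(" ".join(part for part in choice if part))
    return phrases


def is_in_grammar(utterance, grammar):
    """True when the recognized utterance is one of the allowed phrases."""
    normalized = " ".join(utterance.lower().split())
    return normalized in expand(grammar)
```

Three short slots already allow twelve different phrases, which shows how a compact grammar can cover a good deal of variability in what callers say.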

3.2.4 Accuracy

The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design. Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken.
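The two measurements mentioned above can each be computed as a simple fraction. The sketch below assumes test data collected by hand: pairs of spoken and recognized text for the first function, and a list of true/false outcomes for the second. The function names and the choice to ignore letter case and extra spaces are assumptions.

```python
def exact_match_accuracy(pairs):
    """Fraction of utterances recognized exactly as spoken.

    `pairs` is a list of (spoken, recognized) strings. Comparison ignores
    letter case and extra whitespace.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1 for spoken, recognized in pairs
        if spoken.lower().split() == recognized.lower().split()
    )
    return hits / len(pairs)


def task_success_rate(outcomes):
    """Fraction of interactions in which the desired end result occurred.

    `outcomes` is a list of booleans, one per interaction.
    """
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)
```

The two figures can differ: a caller may reach the right extension even though one word was misrecognized, so task success can be higher than exact-match accuracy.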