Speechat: an algorithm for real-time Spoken Language Assessment in Python

Aug 1, 2019

By Shahab Sabahi

This algorithm (https://shahabks.github.io/Speechat/) was built for processing high-entropy speech (simultaneous free speech) using probabilistic machine learning and deep learning models to predict spoken English language proficiency. It measures a speaker’s pronunciation, prosody, use-of-language competency, and latent semantic index to rate their spoken proficiency from classification scores, and it also compares that rating with the average rates of non-native and native speakers.

These are the results of two years of study, whose overall achievement is an average assessment accuracy of 72% for non-native adult speakers. The correlation between the human scores and the machine scores for an overall measure of speaking was 0.86, which supports the reliability of the speaking measure in tests.

Introduction

The algorithm transforms sounds/language into vectors in an n-dimensional space, in which each feature is vectorized to represent pronunciation, prosody, and language for further evaluation.
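
As a rough illustration of this vectorization step, the sketch below builds a fixed-length feature vector from an audio file with librosa. The specific features and statistics (MFCC shape for pronunciation/language, pitch and energy statistics for prosody) are assumptions for illustration, not Speechat’s actual feature set.

```python
# A minimal sketch of turning an utterance into a feature vector.
# The chosen features are illustrative, not Speechat's own.
import numpy as np
import librosa

def utterance_to_vector(path):
    y, sr = librosa.load(path, sr=16000)           # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)  # pitch track
    rms = librosa.feature.rms(y=y)[0]              # frame energy
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),       # spectral shape
        [np.nanmean(f0), np.nanstd(f0)],           # pitch level and range
        [rms.mean(), rms.std()],                   # loudness dynamics
    ])
```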

The models range from parametric and non-parametric statistics to neural-network architectures. A scoring-rubric philosophy was adopted for judging the spoken language proficiency level. This framework can be changed and customized on demand.

The microphone input is crucial to the accuracy of the results. Pre-recorded sounds can certainly be analysed, but some acoustic features that carry key information vanish when the sound is compressed during the digitization process.

Here are the models generated by the algorithm (a sketch of how several of them could be compared follows the list):

  • CART
  • ETC
  • NN
  • LDA
  • LR
  • MLTRNL
  • CNN
  • RNN
  • myspsolution
  • PCA
  • REF
  • SVN
  • dfdg
  • forquil
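
As a hedged sketch, several of the abbreviations above map naturally onto scikit-learn estimators (CART as a decision tree, ETC as extra trees, LDA, LR, and NN as a multilayer perceptron). The comparison loop below is an assumption about how such a model set could be benchmarked, not the project’s actual training code.

```python
# Illustrative comparison of several listed model families with
# scikit-learn cross-validation. X is a matrix of utterance feature
# vectors, y the human-assigned proficiency labels (both assumed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

models = {
    "CART": DecisionTreeClassifier(),
    "ETC": ExtraTreesClassifier(n_estimators=200),
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
}

def compare(X, y):
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```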

INSIDE SPEECH RATER

The scientific way to measure one’s reading/speaking rate is in syllables per second.

Speech Rater’s estimate of the “Speaking Rate” is obtained by timing the user while they read a selection of text with a known syllable count.
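
For instance, the known-text timing check reduces to a few lines of Python. The syllable counter below is a crude vowel-group heuristic used purely for illustration; it is not Speechat’s syllable estimator.

```python
import re

def count_syllables(text):
    # Crude heuristic: count groups of consecutive vowels per word.
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
               for w in re.findall(r"[A-Za-z']+", text))

def speaking_rate(text, seconds):
    # Syllables per second for a timed reading of a known passage.
    return count_syllables(text) / seconds

passage = "The quick brown fox jumps over the lazy dog."
print(speaking_rate(passage, seconds=2.5))  # 11 syllables / 2.5 s = 4.4
```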

The algorithm evaluates the user’s competency by employing mathematical formulas and an independent speaking rubric and scoring philosophy.

The algorithm was trained on an audio dataset of non-native English speakers. The recordings, drawn from the speech audio of English-speaking trainees, started at just 1 minute in length each; topics varied widely, for a total of 13,762 minutes of audio.

The trainees’ speech audio had been rated by native English teachers.

There are three model sets:

  • SET-1: developed from non-native English speakers.
  • SET-2: developed from non-native and native English speakers in ordinary conversation situations.
  • SET-3: developed from non-native and native English speakers who spoke about specific topics of which they had background knowledge.

Definition

Total speaking fluency refers to the ability of speakers to produce words about specific topics in English effortlessly and efficiently (automaticity), with meaningful expression that enhances the meaning of the topics (prosody). Fluency takes phonics or word recognition to the next level. While many speakers can decode words accurately, they may not be fluent or automatic in their word recognition in simultaneous speech. These speakers tend to expend too much of their limited mental energy on figuring out the pronunciation and meaning of words, energy that is taken away from the more important task of conveying intelligible ideas: getting to the topic’s overall meaning. Thus, a lack of fluency often results in poor communication.

Fluent speakers, on the other hand, are able to speak words accurately and effortlessly. They produce words and phrases instantly, on the spot. A minimal amount of cognitive energy is expended on decoding the words, which means that the maximum amount of a speaker’s cognitive energy can be directed to the all-important task of making sense of their ideas.

The second component of fluency is prosody, or speaking with expression. A key characteristic of fluent speakers (or fluent speech, for that matter) is the ability to embed appropriate expression in the speaking. Fluent speakers raise and lower the volume and pitch of their voices, speed up and slow down at appropriate places, group words into meaningful phrases, and pause at appropriate points in the course of speech. All of these are elements of expression, or what linguists have termed prosody. Prosody is essentially the melody of language as it is spoken. By embedding prosody in our oral language (read or spoken), we add meaning to the communication.
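
Prosodic cues such as pausing can be approximated directly from the signal. The sketch below uses librosa’s silence splitting as a stand-in pause detector; the 30 dB threshold is an illustrative assumption, not a tuned value.

```python
# Illustrative pause detection: treat gaps between non-silent
# intervals as pauses and report how often the speaker pauses.
import librosa

def pause_stats(path, top_db=30):
    y, sr = librosa.load(path, sr=16000)
    voiced = librosa.effects.split(y, top_db=top_db)  # (start, end) samples
    pauses = [(s2 - e1) / sr for (_, e1), (s2, _) in zip(voiced, voiced[1:])]
    return {
        "num_pauses": len(pauses),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
    }
```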

Latent Semantic Analysis (LSA) is employed to analyze speech and find the underlying meaning or concepts of the words used. If each word meant only one concept, and each concept were described by only one word, LSA would be easy, since there would be a simple mapping from words to concepts. But because words, groups of words, and even the same spoken words in different intonations carry different meanings, semantic analysis becomes a difficult task: the same words can convey multiple meanings, which creates ambiguities in communication between people. At this stage, we use LSA on datasets represented as (1) “bags of words”, where the order of the words in a document is not important and only how many times each word appears is considered; (2) concepts, represented as patterns of words that usually appear together in documents; and (3) weighted abduction, which recognizes textual entailment and sentence structures.
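
A hedged sketch of the bag-of-words side of this pipeline: TF-IDF weighting followed by truncated SVD is the classic LSA construction in scikit-learn, though the component count and the tiny corpus here are illustrative assumptions.

```python
# Classic LSA: bag-of-words counts weighted by TF-IDF, then a
# low-rank SVD whose dimensions act as latent "concepts".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

transcripts = [
    "the lecture covered climate change and rising sea levels",
    "sea levels rise as the climate warms",
    "the speaker described her favourite holiday recipes",
]

lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                    TruncatedSVD(n_components=2))
concept_vectors = lsa.fit_transform(transcripts)  # one row per transcript
# Nearby rows in this concept space share underlying topics,
# even when they use different surface wording.
```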

SCOPE AND LIMITATIONS

  • Note, this is the “Speaking Rater” for evaluating simultaneous free speech. If a user reads out loud, the results should not be treated as equivalent to their spoken language proficiency.
  • The best way to determine the user’s speaking rate is to time the user delivering a free speech.
  • All the annotations analyzed by the current algorithm are based on the mentioned rubrics and the non-native English speaker audio. We do not claim that these are 100% accurate or the only way speech can be analyzed. We will keep upgrading the algorithm; your comments and feedback are most welcome. Please feel free to contact us and let us know your thoughts about the corpus.
  • The evaluation mode can be set to either flexible or stringent (a sketch of such a mode switch follows this list). The stringent mode is sensitive to highly accurate language production, a standard reading rate, and the ability to read sentences effortlessly and automatically, with little conscious attention to the mechanics of reading, such as decoding. The flexible mode, by contrast, was originally designed for beginners, to let them build confidence as their skills grow.
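
The flexible/stringent switch could plausibly be realized as two threshold profiles over the same measurements. Everything in the sketch below (names, thresholds, features) is a hypothetical illustration of that distinction, not Speechat’s actual configuration.

```python
# Hypothetical evaluation-mode profiles: the stringent profile
# demands tighter bounds on the same measurements.
MODES = {
    "flexible":  {"min_rate_syl_per_s": 2.0, "max_pause_s": 1.5},
    "stringent": {"min_rate_syl_per_s": 3.5, "max_pause_s": 0.8},
}

def passes(measurements, mode="flexible"):
    limits = MODES[mode]
    return (measurements["rate_syl_per_s"] >= limits["min_rate_syl_per_s"]
            and measurements["longest_pause_s"] <= limits["max_pause_s"])
```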

High-quality recording

  • Step 1. Find a quiet place for recording. Make sure to turn off all background machinery and electronic appliances, such as your TV set.
  • Step 2. Set up your recording equipment. Plug in and test your microphone. Do not put the microphone too close to your mouth (10–12 inches from the speaker is preferred) to avoid “p-pops”.
  • Step 3. Adjust the recording settings. Before starting your recording, make sure your sound recorder is set to DVD-quality mono settings (44.1 kHz, 24-bit, mono). A Python sketch of recording at these settings follows below.
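
If the recording step is scripted in Python, the sketch below captures audio at those settings using the sounddevice and soundfile packages; the choice of these libraries (and writing 24-bit PCM via soundfile) is an assumption for illustration, not part of Speechat.

```python
# Record N seconds of 44.1 kHz mono audio and save it as 24-bit PCM.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 44100  # DVD-quality rate from Step 3
SECONDS = 60

audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()  # block until the recording finishes
sf.write("take.wav", audio, SAMPLE_RATE, subtype="PCM_24")
```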
