The following is a high-level view of how we plan to stitch together the speech recognition system. I'm especially interested in feedback on MIME types for the various data flows. I'll also post separately on specific components.
Components
Feature Computation
* Data Input: audio/x-raw-int
* Metadata Input: None
* Data Output: mpf-audio-features/rasta
* Metadata Output: None
* Parameters: None
* Source and license: ICSI Rasta library. Owned by ICSI, so any open license.
* ETA: Initial version complete. Support for more variety in features as needed.
Given audio input, generate synchronized output features based on the input. Features are used by later analytic components.
Initially, we will only support 16-bit, 16 kHz, linear PCM encoded headerless audio and rasta features. However, the component will be based on the ICSI rasta library, which supports a huge variety of other options.
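To make the initial input format concrete, here is a minimal Python sketch of decoding headerless 16-bit linear PCM and cutting it into analysis frames. It is not the ICSI rasta code. The little-endian mono layout, the function names, and the 25 ms window with a 10 ms step are assumptions for illustration; the post itself only fixes 16-bit, 16 kHz linear PCM.

```python
import struct

SAMPLE_RATE_HZ = 16000  # from the post: 16 kHz input


def decode_pcm16(raw):
    """Decode headerless 16-bit linear PCM into a list of ints.

    Assumes little-endian, single-channel samples (not stated in the post).
    A trailing odd byte, if any, is ignored.
    """
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:2 * count]))


def split_frames(samples, window=400, step=160):
    """Split samples into overlapping frames.

    At 16 kHz, 400 samples is 25 ms and 160 samples is 10 ms (assumed values).
    A feature extractor would emit one feature vector per frame.
    """
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, step)]
```

With these assumed settings, one second of audio yields 98 frames, so downstream components would see roughly 100 feature vectors per second.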
Speech/Non-speech Detector
* Data Input: mpf-audio-features/rasta
* Metadata Input: None
* Data Output: mpf-audio-features/speech-nonspeech-scores
* Metadata Output: None
* Parameters: Pre-trained speech/non-speech acoustic models matching the features.
* Source and license: ICSI's fast online speech/non-speech package. Any open license.
* ETA: Very fast GMM system complete with initial models. Improved models in second installment. Increased accuracy vs. decreased speed in third installment if needed.
Given input features, estimate the likelihood that the acoustics contain speech vs. non-speech (e.g. silence, music, noise), and output frame-synchronous speech and non-speech scores. This output could be used, for example, in a user interface, for compression, or to restrict downstream analytics to the portions of the input that contain speech. Note that this component does no smoothing or temporal modeling and outputs synchronously with the input frames. Downstream components could use the speech/non-speech scores to, for example, generate a segmentation.
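As an illustration of frame-synchronous scoring with GMMs, the sketch below computes one speech score and one non-speech score per input frame, with no smoothing. It is not the ICSI package: the diagonal-covariance form, the model layout as a list of (weight, mean, variance) tuples, and all function names are assumptions made for this example.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def score_frames(frames, speech_gmm, nonspeech_gmm):
    """Return one (speech, non-speech) log-likelihood pair per input frame."""
    return [
        (gmm_log_likelihood(f, speech_gmm), gmm_log_likelihood(f, nonspeech_gmm))
        for f in frames
    ]
```

Because the output has exactly one score pair per input frame, it stays synchronized with the feature stream, which is what lets a separate component impose the temporal model later.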
Speech/Non-speech Segmenter
* Data Input: mpf-audio-features/speech-nonspeech-scores
* Metadata Input: None
* Data Output: None
* Metadata Output: mpf-audio-segments/speech-nonspeech
* Parameters: None
* Source and license: To be written by ICSI based on existing scripts.
* ETA: Simple version in second installment. Improved version with higher accuracy in third installment.
Imposes a temporal model on the speech/non-speech scores to produce segments suitable for use with speech recognition (e.g. smoothing and a minimum duration for speech).
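One simple way to picture such a temporal model is a moving average over the per-frame log-likelihood ratio followed by a minimum-duration filter. The sketch below is only that, a toy illustration; the segmenter itself has not been written yet, and the threshold of zero, the window size, the 10 ms frame step, and the function names are all assumptions.

```python
def smooth(values, half_width=2):
    """Moving average over a window of up to 2 * half_width + 1 values."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def speech_segments(scores, frame_step=0.01, half_width=2, min_frames=3):
    """Turn per-frame (speech, non-speech) scores into (start, end) times.

    A frame counts as speech when its smoothed score difference is positive.
    Speech runs shorter than min_frames are discarded.
    """
    ratio = [speech - nonspeech for speech, nonspeech in scores]
    is_speech = [value > 0.0 for value in smooth(ratio, half_width)]
    segments = []
    start = None
    # The trailing False is a sentinel that closes a run ending at the last frame.
    for i, flag in enumerate(is_speech + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame_step, i * frame_step))
            start = None
    return segments
```

Note that the output is a list of time spans rather than a per-frame stream, which matches the component emitting metadata instead of data.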
Utterance Normalizer
* Data Input: mpf-audio-features/*
* Metadata Input: mpf-audio-segments/speech-nonspeech
* Data Output: None
* Metadata Output: mpf-audio-features/speech-segments
* Parameters: None
* Source and license: Excerpted and ported from the ICSI quicknet package. Any open license.
* ETA: Second installment
This component takes features and segments and performs mean and variance normalization of each segment. This has been shown to help significantly with the accuracy of later analytic components by equalizing some of the effects of volume and speaker variability.
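Per-segment mean and variance normalization can be sketched as follows. This is a plain-Python illustration rather than the quicknet code; the function name, the list-of-lists feature layout, and the variance floor are assumptions.

```python
import math


def normalize_segment(frames, floor=1e-8):
    """Scale each feature dimension to zero mean and unit variance.

    frames is a list of equal-length feature vectors from one segment.
    The floor guards against division by zero for constant dimensions.
    """
    if not frames:
        return []
    count = len(frames)
    dims = range(len(frames[0]))
    means = [sum(f[d] for f in frames) / count for d in dims]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / count for d in dims]
    stds = [math.sqrt(max(v, floor)) for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in dims] for f in frames]
```

Since the statistics are computed over the whole segment, the component has to see the full segment before it can emit any normalized frame.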
Decoder
* Data Input: None
* Metadata Input: mpf-audio-features/speech-segments
* Data Output: None
* Metadata Output: text/x-asr-word-list
* Parameters: Pre-trained acoustic models matching the features. Pre-trained language models. A pronunciation dictionary.
* Source and license: AMI Juicer Decoder (BSD license) ported to mpf by ICSI.
* ETA: Software port underway. "Bootstrap" models complete. Complete port by second installment with ICSI Meeting Corpus models. Better quality models in third installment.
Given a segment, compute the most likely sequence of words spoken during that segment.
Data and Metadata
mpf-audio-features (data)
Features are time synchronous data computed from the audio. For ASR work, they are typically related to the spectrum (e.g. how much energy is there at each frequency). Since features are a direct representation of the audio and are synchronized with the input stream, we consider features to be data rather than metadata. Features consist of a fixed number of floating point values per time step.
The type of feature is encoded in the caps.
mpf-audio-segments (metadata)
A segment is simply a span of time. It consists of a start time, an end time, a type, and a label. This is used, for example, to identify speech segments from non-speech segments. Segments need not be synchronized with the input stream.
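As a rough picture of the fields listed above, a segment could be modeled like this. The class is hypothetical and says nothing about the actual MPF wire format; times in seconds and the example type and label strings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of time with a type and a label (hypothetical representation)."""

    start: float  # start time, assumed to be in seconds
    end: float    # end time, assumed to be in seconds
    type: str     # e.g. "speech-nonspeech" (assumed value)
    label: str    # e.g. "speech" or "non-speech" (assumed values)

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("segment ends before it starts")

    @property
    def duration(self):
        return self.end - self.start
```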
ASR Word List (metadata)
This represents the most likely stream of words according to the decoder. It consists of start time, end time, the word, and a confidence score. It need not be synchronized with the input stream.
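A consumer of the word list might look something like the sketch below, which sorts entries by start time and optionally drops low-confidence words. The class and function names are hypothetical, and the confidence range of 0 to 1 is an assumption; the post does not specify the scale or the serialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One entry of the ASR word list (hypothetical representation)."""

    start: float       # start time, assumed to be in seconds
    end: float         # end time, assumed to be in seconds
    word: str
    confidence: float  # assumed to lie between 0 and 1


def transcript(words, min_confidence=0.0):
    """Join words in time order, skipping those below min_confidence."""
    ordered = sorted(words, key=lambda w: w.start)
    return " ".join(w.word for w in ordered if w.confidence >= min_confidence)
```

Sorting first reflects the point that the list need not arrive synchronized with the input stream.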
Example Dataflow
The image below demonstrates a simple dataflow for an ASR system which uses one set of features for speech/non-speech and another for decoding. This demonstrates one reason we chose to separate detection from segmentation in the speech/non-speech component. The dotted lines refer to file system resident models. The solid lines are data/metadata flow.


Naming conventions so we can check in the code
Hello Adam,
I picked up your tarballs and intend to plug them into our vcs and automated build system. Before I do, though, I'd like to settle on the names. For consistency, our mpf packages all start with "mpf-", and try to be descriptive of what they do. So I'm thinking that rasta_mpf should probably be called something like mpf-feature-gen and spnsp should probably be called mpf-spnsp. Do you have any different suggestions?
Curt
don't use mpf-feature-gen
I like the mpf-<descriptive> pattern, but there are other "feature generators" we can have besides the ones from mystt; so, I might 'namespace' it like 'mpf-mystt-*' (mpf-mystt-feature-gen, etc)
Our most-flexible pattern is that the mpf-* package is the code needed to connect the analytic to MPF rather than having to be the underlying analytic code itself. (An example would be mpf-opencv; that package would depend on some opencv package, it wouldn't statically bind in a copy of opencv. This has some nice license-management features, and helps to decouple the MPF code from GPL or proprietary code.) So I would expect to have mpf-mystt-feature-gen actually depend on some other package that isn't in MPF itself, and might not be Appscio-local.
Hence, one question here is where that code lives - is the MySTT group going to leave code on our servers, or are they hosting a repository we can pull from when they make changes?
(Some days I wish we had adopted git ...)
Sounds like you are
Sounds like you are suggesting:
mpf-mystt-feature-gen
mpf-mystt-spnsp
Names
Since there are many potential ways of generating features and doing speech/non-speech, maybe go even longer?
mpf-mystt-feature-gen-rasta
mpf-mystt-spnsp-gmm
Adam
P.S. Should this go to the forums rather than comments in the blog?
re: Speech Recognition overview
Adam, thanks very much for posting. We'll take a look at the new MPF components; we also need to get you hooked into our source control and build system so that you can continue to evolve these components.
Gareth