Building an OpenEars-compatible language model - iPhone

I am doing some development on speech to text and text to speech and I found the OpenEars API very useful.
The principle of this CMU-SLM based API is that it uses a language model to map the speech heard by the iPhone device. So I decided to find a big English language model to feed the API's speech recognizer engine. But I failed to understand the format of the VoxForge English data model to use with OpenEars.
Does anyone have any idea how I can get the .languagemodel and .dic files for English to work with OpenEars?

Regarding LM Formats:
AFAIK most language models use the ARPA standard for language models. Sphinx / CMU language models are compiled into a binary format; you'd need the source format to convert a Sphinx LM into another format. Most other language models are in text format.
I'd recommend using the HTK Speech Recognition Toolkit; detailed documentation here: http://htk.eng.cam.ac.uk/ftp/software/htkbook_html.tar.gz
Here's also a description of CMU's SLM Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
Here's an example of a language model in ARPA format I found on the net: http://www.arborius.net/~jphekman/sphinx/full/index.html
You probably want to create an ARPA LM first, then convert it into any binary format if needed.
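For reference, this is roughly what an ARPA-format file looks like; the counts, words, and log10 probabilities below are purely illustrative, not taken from any real model:

    \data\
    ngram 1=4
    ngram 2=2

    \1-grams:
    -0.6021 <s> -0.3010
    -0.6021 </s>
    -0.6021 hello -0.3010
    -0.6021 world -0.3010

    \2-grams:
    -0.3010 <s> hello
    -0.3010 hello world

    \end\

Each n-gram line is a log10 probability, the n-gram itself, and (except for the highest order) a back-off weight.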
In General:
To build a language model, you need lots and lots of training data, so you can determine the probability of any other word in your vocabulary after observing the input up to the current point in time.
You can't just "make" a language model by adding the words you want to recognize - you also need a lot of training data (= typical input you observe when running your speech recognition application).
A language model is not just a word list -- it estimates the probability of the next token (word) in the input.
To estimate those probabilities, you need to run a training process, which goes over training data (e.g. historic data) and observes word frequencies there to estimate the probabilities mentioned above.
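As a toy illustration of that training step (not any particular toolkit's implementation), here is a sketch that estimates unigram probabilities from raw counts in a tiny made-up corpus:

    from collections import Counter

    # Toy training corpus; a real language model needs far more data.
    corpus = "the cat sat on the mat the cat slept".split()

    counts = Counter(corpus)
    total = sum(counts.values())

    # Maximum-likelihood unigram estimate: P(word) = count(word) / total count.
    unigram_probs = {word: count / total for word, count in counts.items()}

    print(unigram_probs["the"])  # 3/9 = 0.333...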
For your problem, maybe as a quick solution, just assume all words have the same frequency / probability (a sketch of this follows after the steps below):
- create a dictionary with the words you want to recognize (N words in the dictionary)
- create a language model which has 1/N as the probability of each word (a uni-gram language model)
- you can then interpolate that uni-gram language model (LM) with another LM for a bigger corpus using the HTK Toolkit
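A minimal sketch of that quick solution, assuming you just want a uniform uni-gram model in ARPA format built from a word list (the word list and file name are placeholders):

    import math

    # Hypothetical word list; replace with the words you want to recognize.
    words = ["YES", "NO", "HELLO", "GOODBYE"]
    n = len(words)
    logprob = math.log10(1.0 / n)  # same probability for every word

    with open("uniform.arpa", "w") as f:
        f.write("\\data\\\n")
        f.write("ngram 1=%d\n\n" % n)
        f.write("\\1-grams:\n")
        for word in words:
            f.write("%.4f %s\n" % (logprob, word))
        f.write("\n\\end\\\n")

A real model for Pocketsphinx would normally also include the <s> and </s> sentence markers, and the matching .dic file maps each word to its phonemes (e.g. HELLO HH AH L OW), which you can look up in the CMU Pronouncing Dictionary.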

Old question, but maybe the answer is still interesting. OpenEars now has built-in language model generation, so one option is to create models dynamically in your app as you need them using the LanguageModelGenerator class, which uses the MITLM library and NSScanner to accomplish the same task as the CMU toolkit mentioned above. Processing a corpus with more than 5000 words on the iPhone is going to take a very long time, but you could always use the Simulator to run it once, grab the output from the documents folder, and keep it.
Another option for large vocabulary recognition is explained here:
Creating ARPA language model file with 50,000 words
Having said that, I need to point out as the OpenEars developer that the CMU tool's limit of 5000 words corresponds pretty closely to the maximum vocabulary size that is likely to have decent accuracy and processing speed on the iPhone when using Pocketsphinx. So, the last suggestion would be to either reconceptualize your task so that it doesn't absolutely require large-vocabulary recognition (for instance, since OpenEars lets you switch models on the fly, you may find that you don't need one enormous model but can get by with multiple smaller ones that you switch in for different contexts), or to use a network-based API that can do large-vocabulary recognition on a server (or make your own API that uses Sphinx4 on your own server). Good luck!

Related

What is the use of the "domain" attribute when we create a custom translation model

After reading the docs regarding creation of a custom translation model, it is not clear what the use of the domain attribute is when we create an IBM Cloud Translation custom translation model.
As stated in the documentation here
Most of the provided translation models in Language Translator can be extended to learn custom terms and phrases or a general style that's derived from your translation data. Follow these instructions to create your own custom translation model.
Use a parallel corpus when you want your custom model to learn from general translation patterns in parallel sentences in your samples. What your model learns from a parallel corpus can improve translation results for input text that the model has not been trained on. You can upload multiple parallel corpora files with a request. To successfully train with parallel corpora, the corpora files must contain a cumulative total of at least 5000 parallel sentences. The cumulative size of all uploaded corpus files for a custom model is limited to 250 MB.
Check the customizing your model documentation for more info.
Make sure that your Language Translator service instance is on an Advanced or Premium pricing plan.
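As an illustration only (not taken from the documentation above), here is roughly what creating a custom model with a parallel corpus looks like with the ibm-watson Python SDK; the credentials, URL, file name, and model name are placeholders, and exact calls may differ between SDK versions:

    from ibm_watson import LanguageTranslatorV3
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # Placeholder credentials and service URL.
    authenticator = IAMAuthenticator("YOUR_APIKEY")
    translator = LanguageTranslatorV3(version="2018-05-01", authenticator=authenticator)
    translator.set_service_url("YOUR_SERVICE_URL")

    # Train a custom model on top of a base model using a parallel corpus file.
    with open("my_corpus.tmx", "rb") as corpus:
        result = translator.create_model(
            base_model_id="en-fr",      # base model to customize
            name="my-custom-en-fr",     # name of the custom model
            parallel_corpus=corpus,
        ).get_result()

    print(result["model_id"])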

What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?

It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is correctly translated. For example, the term server should be translated differently when the document is about data centers, vs when the document is about restaurants.
With a transformer model, this is not very obvious to do, since words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best practices for this problem?
I am afraid you cannot easily do that. You cannot easily add new words to the vocabulary because you don't know what embeddings they would get during training. You can try to remove some words, or alternatively you can manually change the bias in the final softmax layer to prevent some words from appearing in the translation. Anything else would be pretty difficult to do.
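For the second idea (suppressing particular words), here is a rough sketch of what that bias manipulation could look like in PyTorch, assuming you have access to the model's final output projection layer (created with bias=True) and the vocabulary indices of the words to ban:

    import torch

    def ban_words(output_projection, banned_ids):
        """Push the final-layer bias for banned vocabulary ids to a large negative
        value so those tokens get essentially zero probability after the softmax.

        output_projection: the model's final torch.nn.Linear layer (assumed to
        have a bias term); banned_ids: list of vocabulary indices to suppress.
        """
        with torch.no_grad():
            output_projection.bias[banned_ids] = -1e9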
What you want to do is called domain adaptation. To get an idea of how domain adaptation is usually done, you can have a look at a survey paper.
The most commonly used approaches are probably model fine-tuning or ensembling with a language model. If you have parallel data in your domain, you can try to fine-tune your model on that parallel data (with simple SGD and a small learning rate).
If you only have monolingual data in the target language, you train a language model on that data. During decoding, you can mix the probabilities from the domain-specific language model and the translation model. Unfortunately, I don't know of any tool that could do this out of the box.
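The mixing itself is often called shallow fusion and is conceptually simple. A sketch of the per-step score combination, assuming you can obtain next-token log-probabilities over the target vocabulary from both models at each decoding step (the interfaces and weight are placeholders):

    import numpy as np

    def fused_scores(tm_log_probs, lm_log_probs, lm_weight=0.3):
        """Combine translation-model and domain language-model scores for one
        decoding step. Both inputs are arrays of log-probabilities over the
        target vocabulary; lm_weight controls how much the domain LM matters."""
        return tm_log_probs + lm_weight * lm_log_probs

    # At each step of beam search you would rank candidate tokens by the fused
    # scores instead of the translation model's scores alone, e.g.:
    # next_token = np.argmax(fused_scores(tm_log_probs, lm_log_probs))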

Google Cloud NL API term/classification quality and batch processing on Traditional Chinese (zh-Hant) data

I've been testing the Google Cloud NL API v1.0 these days, mainly on Traditional Chinese (a.k.a. zh-Hant) data. After the testing, I find the quality is not satisfactory: classification is often wrong, there are too many one-character terms (many of which should be stop words), and the worst quality is for unknown-word recognition.
Also, some analysis methods (e.g. entity sentiment) don't support zh-Hant, so I can only use 'en' to run zh-Hant data, which is a pity.
Does anyone know if the NL API provides any way, e.g. setting configuration, setting parameters, or running some process, to improve the results?
Does anyone actually have experience using NL API results to add a value-added feature to a business product or service?
Also, if I want to feed high-volume data, is there a library or SDK that I can use to write code for batch-in, batch-out processing?

IBM Watson Language Translation - correct way to train using parallel corpus

I have a bunch of articles that are translated, which I want to use as training data for IBM Watson language translation. What is the correct way to use these articles for training? Do I use the whole article and its translation as an entry in the parallel corpus, or do I have to split the article into sentences and use each sentence with its translation as an entry?
You have two choices.
Either split up the text into phrase pairs with a from and to for each phrase, and create either a forced_glossary or a parallel_corpus.
Or send all the translated text as a single file to create a monolingual_corpus.
Detailed documentation is available at https://www.ibm.com/watson/developercloud/doc/language-translator/customizing.html#training
and the API documentation is available at https://www.ibm.com/watson/developercloud/language-translator/api/v2/?curl#create-model
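As a sketch of the first option (splitting the articles into aligned sentence pairs and packaging them as a parallel corpus), assuming your articles are already segmented into aligned source/target sentences; the TMX structure below is the usual corpus format, but double-check the current Watson docs, and the sentence pairs are made up:

    from xml.sax.saxutils import escape

    # Hypothetical aligned sentence pairs (source, translation).
    pairs = [
        ("Hello, how are you?", "Bonjour, comment allez-vous ?"),
        ("The meeting is at noon.", "La reunion est a midi."),
    ]

    # Build a minimal TMX document for an English->French parallel corpus.
    units = []
    for src, tgt in pairs:
        units.append(
            "    <tu>\n"
            '      <tuv xml:lang="en"><seg>%s</seg></tuv>\n'
            '      <tuv xml:lang="fr"><seg>%s</seg></tuv>\n'
            "    </tu>" % (escape(src), escape(tgt))
        )

    tmx = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<tmx version="1.4">\n'
        '  <header srclang="en" segtype="sentence" creationtool="script"\n'
        '          creationtoolversion="1.0" datatype="plaintext" o-tmf="none"\n'
        '          adminlang="en"/>\n'
        "  <body>\n" + "\n".join(units) + "\n  </body>\n</tmx>\n"
    )

    with open("parallel_corpus.tmx", "w", encoding="utf-8") as f:
        f.write(tmx)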

machine learning and code generator from strings

The problem: given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download-install-and-go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details, as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs - error messages, for example, being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
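A sketch of that manual routing table approach (the error types and recipients below are made up):

    # Maps known error-message types to the people/groups to notify.
    ROUTING_TABLE = {
        "DiskFullError": ["ops-team@example.com"],
        "PaymentTimeout": ["payments-oncall@example.com"],
        "NullPointerException": ["backend-devs@example.com"],
    }

    FALLBACK = ["triage@example.com"]  # unknown errors go here for manual routing

    def route(error_type):
        """Return the recipients for an error type, falling back to manual triage."""
        return ROUTING_TABLE.get(error_type, FALLBACK)

    # Each time an error is routed manually, add an entry to ROUTING_TABLE so
    # that errors of that type are handled automatically in the future.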
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification organized by programming language:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (a minimal sketch follows after this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
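To give a feel for the Python/NLTK option above, here is a minimal sketch of a Naive Bayes classifier over hand-labeled strings; the training examples, labels, and feature function are toy placeholders, not a recommendation for your real data:

    import nltk

    # Hand-categorized training strings: (message, label).
    train_data = [
        ("ERROR: disk full on /dev/sda1", "ops"),
        ("ERROR: payment gateway timeout", "payments"),
        ("WARN: retrying payment capture", "payments"),
        ("ERROR: no space left on device", "ops"),
    ]

    def features(message):
        """Very simple bag-of-words features; real data may need smarter tokenization."""
        return {"contains(%s)" % word.lower(): True for word in message.split()}

    train_set = [(features(msg), label) for msg, label in train_data]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(features("ERROR: disk almost full")))  # likely "ops"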
Mallet has a bunch of classifiers which you can train and deploy entirely from the command line.
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with.
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.