Uima Ruta StringList - uima

Is there a way to iterate over a StringList in Ruta, given that some of the strings in the StringList are not present in the input document?
Sample StringList
Television
Computer
Tablet
Sound
Sample Input Document
Flat-screen televisions for sale at a consumer electronics store in 2008.
Television (TV), sometimes shortened to tele or telly is a telecommunication medium used for transmitting moving images in monochrome (black and white), or in colour, and in two or three dimensions and sound. The term can refer to a television set, a television program ("TV show"), or the medium of television transmission. Television is a mass medium for advertising, entertainment and news.
Problem
I want to get the values Computer and Tablet as a result in the output CAS (say, as an annotation or a feature). Is there a way to do so?

There is currently (2.6.1) no way to iterate over a StringList in Ruta as far as I know.
You want to return all entries that are not present in the text?
I have not tried it, but you could maybe use multiple lists, and add entries to one list if they occur in the text and to the other if they do not. Then you store the second StringList in a feature.
(I would probably use a simple Java analysis engine instead of Ruta)
DISCLAIMER: I am a developer of UIMA Ruta
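To illustrate the "two lists" idea outside of Ruta, here is a minimal sketch in plain Python. It is not Ruta or UIMA code: the function name `missing_entries` is made up for this example, and it assumes a case-insensitive substring match is good enough (so "Television" also counts as present when the text only contains "televisions"). A real analysis engine would read the document text from the CAS and write the result to a feature.

```python
def missing_entries(entries, document_text):
    """Return the entries that do not occur in the text (case-insensitive)."""
    lowered = document_text.lower()
    # Keep only the entries whose lowercase form is not a substring of the text.
    return [entry for entry in entries if entry.lower() not in lowered]


if __name__ == "__main__":
    string_list = ["Television", "Computer", "Tablet", "Sound"]
    text = (
        "Television (TV) is a telecommunication medium used for "
        "transmitting moving images and sound."
    )
    print(missing_entries(string_list, text))
```

With the sample list and a shortened version of the sample document, only Computer and Tablet remain, because "television" and "sound" both occur in the text.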

Related

Mozilla Deep Speech STT suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: The scorer matches the output from the acoustic model to words and word combinations present in the textual language model that is part of the scorer. If you leave out the scorer argument, you won't get real-world words and sentences back. And bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise update to 0.9.0, which has it as well. Maybe your environment has changed in some way. I would start in a new directory and venv.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
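If you prefer to keep calling the command-line client as in the question, a minimal sketch could build the argument list with the scorer included. The `--model`, `--scorer` and `--audio` flags are the ones from the 0.8.1 command in the question; the helper names `build_deepspeech_command` and `transcribe` and the file names in the usage part are assumptions for illustration only.

```python
import subprocess


def build_deepspeech_command(model, audio, scorer=None):
    """Build the argument list for the deepspeech command-line client."""
    command = ["deepspeech", "--model", model]
    if scorer is not None:
        # Without a scorer the output comes straight from the acoustic model.
        command += ["--scorer", scorer]
    command += ["--audio", audio]
    return command


def transcribe(model, audio, scorer=None):
    """Run the client and return its output as a string."""
    output = subprocess.check_output(build_deepspeech_command(model, audio, scorer))
    return output.decode("utf-8").rstrip("\n")


if __name__ == "__main__":
    # Hypothetical file names; use the model and scorer you actually downloaded.
    print(build_deepspeech_command(
        "deepspeech-0.8.2-models.pbmm",
        "audio/2830-3980-0043.wav",
        scorer="deepspeech-0.8.2-models.scorer",
    ))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names, which is why the sketch does not use shell=True.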

Scanning invoices using OCR in swift

I am currently working on scanning invoices with OCR scanning. All invoices use the "OCRB" font, and have the same formatting.
The bottom of a sample invoice looks like this
This is what the user needs to scan.
I have tried many different libraries to detect what I want, but most of them don't give me the correct result. The best result came from Firebase ML Vision text recognition.
But the resulting output I get is this:
I can calculate whether the values are correct, except for the amount presented in the middle. In this case it's presented as "3557 00", but if the user moves the camera a bit further to the right, the result I get is "557 00". Since both MLKit and other libraries crop tightly around the word, I have no way of knowing whether the full sum is presented or not.
If I got a single space before the word, I could tell that it is a full "word", in this case a sum.
Does anyone have an idea which library to use to get the best result?

Why does the FastText test of a model return only 1 example when my test file contains 135

I'm trying to test the model (model.bin) I've made with fastText on a test file (test.txt). This test file contains 135 labelled examples. I expect fastText to test my model on that number of examples, but instead it only tests it on 1 example. Where does this problem come from?
I've already tried to do the same thing with another model and another testing file and it all worked nicely.
This is how I test my model. model_baby.bin is the model, and test.data.txt is my testing file.
./fasttext test model_baby.bin test.data.txt
N 1
P#1 1
R#1 0.0164
Number of examples: 1
And here is an extract from my testing file
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great. __label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat. __label__3.0
As I have more than 1 labelled example in my testing file, I expect the output "Number of examples: " to be more than 1, but the actual value is "1".
From the official documentation (https://fasttext.cc/docs/en/supervised-tutorial.html): Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start by the __label__ prefix, which is how fastText recognize what is a label or what is a word.
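I don't quite understand your extract. It looks like all examples are on a single line, so fastText sees one example. I think it should look like this, with one example per line: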
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great.
__label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat.
__label__3.0 ...
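If the file really has everything on one line, a small script can split it back into one example per line. This is only a sketch under assumptions: the function name `split_examples` is made up, and it assumes the text of a document never contains a token that starts with `__label__`. A new example starts at a label token that follows a non-label token, so several labels at the start of one example stay together.

```python
def split_examples(text, prefix="__label__"):
    """Split a run of labelled examples into one string per example."""
    examples = []
    current = []
    previous_was_label = False
    for token in text.split():
        is_label = token.startswith(prefix)
        # A label that follows ordinary text marks the start of a new example.
        if is_label and current and not previous_was_label:
            examples.append(" ".join(current))
            current = []
        current.append(token)
        previous_was_label = is_label
    if current:
        examples.append(" ".join(current))
    return examples


if __name__ == "__main__":
    line = "__label__4.0 I love the fact you can hide your stuff. __label__5.0 This hid our ipad wonderfully."
    for example in split_examples(line):
        print(example)
```

Writing the returned strings to a new file, one per line, should give fastText the layout described in the documentation.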

Text classification using Weka

I'm a beginner to Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, looking like:
The_det dog_n barks_v ._p
So you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other would be "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks ." and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
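Such a joining script could look like the sketch below. It is written in Python rather than as Weka code, the function name `interleave_tagged` is made up, and it takes a shortcut: instead of reading two files it recovers each plain word from the tagged token by cutting at the last "_", which assumes the tags themselves contain no underscore.

```python
def interleave_tagged(tagged_line, separator="_"):
    """Emit each plain word followed by its word_tag form."""
    output = []
    for token in tagged_line.split():
        # Cut at the last separator so words containing "_" stay intact.
        word = token.rsplit(separator, 1)[0]
        output.append(word)
        output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    print(interleave_tagged("The_det dog_n barks_v ._p"))
```

The resulting line can then go through StringToWordVector, as long as "_" is not among the tokenizer's delimiters.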

Single barcode with Code128B and Code128C with iTextSharp

I wish to generate a barcode mixing code128B and code128C with the iTextSharp DLL. Do you know how to do that? I currently only know how to do it with a single code set.
For example, I wish to generate a barcode with the value 8L1 91450 883421 0550 001065
where "8L1 91450" is in code128B and "883421 0550 001065" is in code128C.
Thanks
Barcode128 will actually automatically switch from B to C if and when it can but it sounds like you don't want this. For the control that you're looking for you'll need to set your barcode's CodeType property to Barcode.CODE128_RAW and manually set the raw values.
There are a couple of posts out there that give the basic idea, but unfortunately they tend to assume too much knowledge of iText or too much knowledge of barcodes.
I'm not a barcode expert either but the basic idea is to create a string that starts with Barcode128.START_B, then the first part of your text, then Barcode128.START_C and then the second. When in raw mode, text isn't ASCII, however. You can use this site to get the character codes for various ASCII values. But basically instead of sending the letter L you'd send (char)44.
Hopefully this gets you started at least.
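To make the raw values more concrete, here is a sketch in Python (not iTextSharp code) that computes the Code 128 symbol values for the example. The assumptions: in code set B a character's value is its ASCII code minus 32 (which gives 44 for L, as above), in code set C each pair of digits is one value, and the spaces in the second part are only for display, because code set C can only encode digits. Also note that in the Code 128 symbology a switch from B to C in the middle of a barcode uses value 99 rather than the start value 105; as far as I recall iText exposes this as Barcode128.CODE_AB_TO_C, but please verify that name. The function names are made up for this example.

```python
START_B = 104  # Code 128 start symbol for code set B
CODE_C = 99    # symbol that switches from code set B to code set C


def encode_b(text):
    """Code set B: ASCII 32..127 map to symbol values 0..95."""
    values = []
    for ch in text:
        code = ord(ch)
        if not 32 <= code <= 127:
            raise ValueError("character not available in code set B: %r" % ch)
        values.append(code - 32)
    return values


def encode_c(digits):
    """Code set C: every pair of digits becomes one symbol value 0..99."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("code set C needs an even number of digits")
    return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]


def raw_values(part_b, part_c):
    """Start in B, encode part_b, switch to C, encode part_c without spaces."""
    return [START_B] + encode_b(part_b) + [CODE_C] + encode_c(part_c.replace(" ", ""))


if __name__ == "__main__":
    print(raw_values("8L1 91450", "883421 0550 001065"))
```

Each value would then be turned into a character, along the lines of (char)44, to build the raw string. The check symbol and stop symbol are not part of this sketch.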