NLP ELMo model pruning input

I am trying to retrieve embeddings for words based on the pretrained ELMo model available on tensorflow hub. The code I am using is modified from here: https://www.geeksforgeeks.org/overview-of-word-embedding-using-embeddings-from-language-models-elmo/
The sentence that I am inputting is
bod =" is coming up in and every project is expected to do a video due on we look forward to discussing this with you at our meeting this this time they have laid out the selection criteria for the video award s go for the top spot this time "
and these are the keywords I want embeddings for:
words=["do", "a", "video"]
import tensorflow as tf
import tensorflow_hub as hub

# load the pretrained ELMo module from TF Hub (module handle assumed, as in the linked article)
elmo = hub.Module("https://tfhub.dev/google/elmo/2")
embeddings = elmo([bod],
                  signature="default",
                  as_dict=True)["elmo"]
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
This sentence is 236 characters in length. But when I put it into the ELMo model, the returned tensor only contains 48 entries, and this becomes a problem when I try to extract embeddings for keywords that fall beyond that 48-entry limit, because the indices I compute for those keywords are larger than 48.
This is the code I used to get the indices for the words in 'bod' (as shown above):
num_list = []
for item in words:
    print(item)
    index = bod.index(item)   # character offset of the keyword within bod
    num_list.append(index)
num_list
But I keep running into an error when I use these indices to look up embeddings.
I tried looking for ELMo documentation to explain why this is happening but I have not found anything related to this problem of pruned input.
Any advice is much appreciated!
Thank you!

This is not really an AllenNLP issue since you are using a tensorflow-based implementation of ELMo.
That said, I think the problem is that ELMo embeds tokens, not characters. You are getting 48 embeddings because the string has 48 tokens.
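A minimal sketch of indexing by tokens instead of characters (assuming the "default" signature splits the input string on spaces, and reusing the sess/embeddings variables from the question):

tokens = bod.split()   # 48 whitespace-separated tokens, matching the 48 embeddings
token_indices = [tokens.index(w) for w in ["do", "a", "video"]]   # token positions, not character offsets
vectors = sess.run(embeddings)[0]          # shape (48, 1024): one vector per token
keyword_vectors = vectors[token_indices]

In other words, bod.index(item) gives you a character offset, which is why your indices can exceed 48; what you need is the position of each keyword in the tokenized sentence.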

Related

Mozilla DeepSpeech STT suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
import subprocess

byte_encoding = subprocess.check_output(
    "deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav",
    shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are rife with misspellings that make me think I am now getting a character-level model where I used to get a word-level model. The errors are in a direction that suggests the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't be able to recognize real-world sentences, because the scorer is what matches the output of the acoustic model to words and word combinations present in the textual language model it contains. Also bear in mind that each scorer ships with specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should still accept the scorer argument; otherwise, update to 0.9.0, which has it as well. Maybe your environment has changed in some way; I would start over in a new directory and virtualenv.
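For example, the call from the question with the scorer re-added (assuming the matching 0.8.2 scorer file has been downloaded):

byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm',
     '--scorer', 'deepspeech-0.8.2-models.scorer', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")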
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
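A minimal end-to-end sketch of that Python API route (assuming the deepspeech package and the 0.8.2 model/scorer files; the commented alpha/beta values are placeholders, not tuned defaults):

import wave
import numpy as np
from deepspeech import Model

ds = Model('deepspeech-0.8.2-models.pbmm')
ds.enableExternalScorer('deepspeech-0.8.2-models.scorer')   # re-attach the language model
# ds.setScorerAlphaBeta(0.93, 1.18)                         # optional, scorer-specific values

with wave.open('audio/2830-3980-0043.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))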

Need to create dictionary of idf values, associating words with their idf values

I understand how to get the idf values and the vocabulary using the vectorizer. With the vocabulary, the word is the key of a dictionary and the frequency of the word is the value; however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
import math

total_dict = {}
for string in text_array:
    for word in string.split():    # split each document into words (assuming text_array holds strings)
        if word not in total_dict: # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1

for word in total_dict:            # calculate the idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can directly use vectorizer.fit_transform(text) the first time.
What it does is build a vocabulary from all the words/tokens in the text.
And then you can use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation:
fit() is to learn vocabulary and idf from training set. transform() is to transform the documents based on the learned vocabulary from the previous fit().
So you should only do fit() once, and can transform many times.
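To get the dictionary you described (word -> idf), you can zip the learned feature names with idf_ after fitting. A minimal sketch (get_feature_names_out is the newer scikit-learn name; older versions use get_feature_names):

vectorizer = TfidfVectorizer()
vectorizer.fit(text)
idf_dict = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
print(idf_dict["fox"])   # idf value for the word "fox"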

Why does FastText test of a model return only 1 example when my test file contains 135?

I'm trying to test the model (model.bin) I've made with fastText on a test file (test.txt). This test file contains 135 labelled examples. I'm expecting fastText to test my model on all of them, but instead it only tests it on 1 example. Where does this problem come from?
I've already done the same thing with another model and another test file, and everything worked nicely.
This is how I test my model; model_baby.bin is the model, and test.data.txt is my testing file.
./fasttext test model_baby.bin test.data.txt
N 1
P@1 1
R@1 0.0164
Number of examples: 1
And here is an extract from my testing file
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great. __label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat. __label__3.0
As I have more than 1 labelled example in my testing file, I expect the output "Number of examples:" to be more than 1, but the actual value is 1.
From the official documentation (https://fasttext.cc/docs/en/supervised-tutorial.html): Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start by the __label__ prefix, which is how fastText recognize what is a label or what is a word.
Your extract is hard to read as posted, but I think it should look like this:
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great.
__label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat.
__label__3.0 ...
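If the whole test set really ended up on a single line, a small (hypothetical) repair script can split it back into one example per line, e.g. (the zero-width split needs Python 3.7+):

import re

with open("test.data.txt") as f:
    content = f.read()

# start a new example before every __label__ marker
examples = [chunk.strip() for chunk in re.split(r"(?=__label__)", content) if chunk.strip()]

with open("test.fixed.txt", "w") as f:
    f.write("\n".join(examples) + "\n")

print(len(examples))   # should now match the 135 expected examples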

Text classification using Weka

I'm a beginner to Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is: is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have POS-tagged your texts in advance, so they look like:
The_det dog_n barks_v ._p
You can then build a specific tokenizer (see weka.core.tokenizers) that generates two tokens per word: one would be "The" and the other "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and the tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
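A minimal sketch of such a joining script (Python, with hypothetical file names; it assumes the plain and tagged files are line-aligned and whitespace-tokenized):

with open("plain.txt") as plain, open("tagged.txt") as tagged, open("joined.txt", "w") as out:
    for plain_line, tagged_line in zip(plain, tagged):
        pairs = zip(plain_line.split(), tagged_line.split())
        # interleave word and word_tag: "The The_det dog dog_n barks barks_v . ._p"
        out.write(" ".join(tok for pair in pairs for tok in pair) + "\n")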

Solr search error when dealing with Arabic string

I have been struggling with Solr search for Arabic for several days and have run some experiments. Here is a simple illustration of the problem.
After I store an Arabic sentence (for now only one word, السوري) in the database and have Solr index it, then query it with q=*:*&wt=python (without the wt part, the response is garbled characters), the response is:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
The actual word I stored for indexing is encoded in another way:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
As you can tell, there is a one-to-one correspondence \xd8↔\u00d8, but I don't know the name of this encoding, so I cannot convert it. And when I do the search as <>/select/?q=السوري&wt=python, the response is:
{'responseHeader':{'status':0,'QTime':0,'params':{'wt':'python','q':u'\u0627\u0644\u0633\u0648\u0631\u064a'}},'response':{'numFound':0,'start':0,'docs':[]}}
No docs are found, and it seems to use a third encoding, u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take that and call encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my code (Python) or in the database (MySQL),
it appears as 'form1':
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
When it is indexed by Solr, it converts to form2:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
And when I use <>/select/?q=السوري&wt=python to query from the browser (Google Chrome), it becomes form3:
'\u0627\u0644\u0633\u0648\u0631\u064a'
(which can be converted back to form1 with encode('utf8')). But since the forms are different, the search matches nothing.
Therefore, those three different encodings may be the core problem. Could anyone help me figure it out and solve the search problem?
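For reference, here is a minimal sketch of how the three forms seem to relate; my guess is that form2 is just the UTF-8 bytes of the word mis-decoded as cp1252 (mojibake), not a separate encoding:

s = u'\u0627\u0644\u0633\u0648\u0631\u064a'   # form3: the Unicode code points of السوري
form1 = s.encode('utf8')                      # '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
form2 = form1.decode('cp1252')                # u'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'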
Thanks in advance.