I followed this tutorial: How to train a new language model from scratch using Transformers and Tokenizers.
In Section 2 (Train a tokenizer), after training on my own Vietnamese text data, I looked at the generated .vocab file, and all the tokens look like this:
"ĠÄij":268,"nh":269,"ủ":270,"Ãł":271,"Ġch":272,"iá»":273,"á":274,"Ġl":275,"Ġb":276,"Æ°":277,"Ġh":278,"ế":279,
Any idea how to fix this?
You can read these tokens by using a ByteLevel decoder (the part used by the ByteLevelBPETokenizer) as follows:
a = tokenizer.encode("thầy giáo rất tốt.").tokens
print(a)
>> ['<s>', 'thầy', 'Ġgiáo', 'Ġrất', 'Ġtá»ijt', '.', '</s>']
from tokenizers.decoders import ByteLevel
decoder = ByteLevel()
decoder.decode(a)
>> <s>thầy giáo rất tốt.</s>
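If you want to inspect the entries of the .vocab file itself, a small sketch along the same lines (assuming the same tokenizer object from the tutorial) is to run each stored token through the ByteLevel decoder individually:
from tokenizers.decoders import ByteLevel

decoder = ByteLevel()
vocab = tokenizer.get_vocab()  # token string -> id

# Decode each byte-level symbol back to readable text
readable = {decoder.decode([token]): idx for token, idx in vocab.items()}
print(readable)  # e.g. the entry stored as "Ãł" comes back as "à"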
I am trying to retrieve embeddings for words based on the pretrained ELMo model available on TensorFlow Hub. The code I am using is modified from here: https://www.geeksforgeeks.org/overview-of-word-embedding-using-embeddings-from-language-models-elmo/
The sentence that I am inputting is
bod =" is coming up in and every project is expected to do a video due on we look forward to discussing this with you at our meeting this this time they have laid out the selection criteria for the video award s go for the top spot this time "
and these are the keywords I want embeddings for:
words=["do", "a", "video"]
embeddings = elmo([bod],
                  signature="default",
                  as_dict=True)["elmo"]

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
This sentence is 236 characters long.
But when I feed this sentence into the ELMo model, the returned tensor only has a length of 48, and this becomes a problem when I try to extract embeddings for keywords that lie beyond that limit, because the indices I compute for those keywords fall outside this length.
This is the code I used to get the indices for the words in bod (as shown above):
num_list = []
for item in words:
    print(item)
    index = bod.index(item)
    num_list.append(index)
num_list
But I keep running into an error.
I tried looking through the ELMo documentation for an explanation of why this is happening, but I have not found anything related to this apparent truncation of the input.
Any advice is much appreciated!
Thank You
This is not really an AllenNLP issue, since you are using a TensorFlow-based implementation of ELMo.
That said, I think the problem is that ELMo embeds tokens, not characters. You are getting 48 embeddings because the string has 48 tokens.
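As a rough sketch of what that implies for the indexing code (assuming the bod, words, embeddings, and sess variables from the question): look keywords up by token position rather than character offset, since the "default" signature splits the input on whitespace and returns one 1024-dimensional vector per token.
# Sketch only: token-level lookup instead of the character-level bod.index(...)
tokens = bod.split()                         # whitespace tokens, as ELMo's default signature sees them
token_indices = [tokens.index(w) for w in words if w in tokens]

vectors = sess.run(embeddings)               # shape: [1, num_tokens, 1024]
keyword_vectors = vectors[0][token_indices]  # one row per keyword
print(keyword_vectors.shape)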
I created a library for updating the column descriptions of an input dataset. The function takes three parameters (input_dataset, output_dataset, config file) and eventually writes the descriptions back onto the output dataset. We now want to import this library across various use cases. How do I handle the cases where the Spark transformation takes its inputs through transform_df, since there we can't assign to an output variable? In that situation, how can I call my description library function? How should I proceed in Palantir Foundry? Any suggestions?
This isn't currently supported with the @transform_df decorator; at the moment you'll have to use the @transform decorator.
The reasoning is that this kind of functionality needs the broader access to metadata APIs that the @transform decorator already allows. Since the @transform_df decorator is inherently higher-level, it seemed more in line with that pattern to keep the capability with @transform.
You can always simply move over your transformations from...
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ...
    return df
...to...
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
...in which only a few lines of code need to change.
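Tying this back to the question, here is a minimal sketch of calling the description library from inside an @transform compute function. The update_descriptions helper, its module, and the config path are hypothetical stand-ins for the library described in the question, not Foundry API; the point is that with @transform you hold the raw input and output objects, so both can be handed to the library together with its config file.
from transforms.api import transform, Input, Output

# Hypothetical import of the question's library; the real module and function
# names will differ.
from my_description_lib import update_descriptions


@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... transformation logic ...
    my_output.write_dataframe(df)

    # Unlike @transform_df, the raw output object is in scope here, so the
    # library can update descriptions after the dataframe has been written.
    update_descriptions(my_input, my_output, "/path/to/config.yaml")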
I'm trying to sign a transaction skeleton Blockcypher returns, in order to send it along, following https://www.blockcypher.com/dev/bitcoin/#creating-transactions.
For this example, I'll use the completely unsafe 'raw raw raw raw raw raw raw raw raw raw raw raw' mnemonic, which, using the Dart bip32 package, yields a BIP32 node with private key 0x05a2716a8eb37eb2aaa72594573165349498aa6ca20c71346fb15d82c0cbbf7c and address mpQfiFFq7SHvzS9ebxMRGVohwHTRJJf9ra on BTC testnet.
Blockcypher Tx Skeleton tosign is 1cbbb4d229dcafe6dc3363daab8de99d6d38b043ce62b7129a8236e40053383e.
Using Blockcypher signer tool:
$ ./signer 1cbbb4d229dcafe6dc3363daab8de99d6d38b043ce62b7129a8236e40053383e 05a2716a8eb37eb2aaa72594573165349498aa6ca20c71346fb15d82c0cbbf7c
304402202711792b72547d2a1730a319bd219854f0892451b8bc2ab8c17ec0c6cba4ecc4022058f675ca0af3db455913e59dadc7c5e0bd0bf1b8ef8c13e830a627a18ac375ab
On the other hand, using bip32 I get:
String toSign = txSkel['tosign'][0];
var uToSign = crypto.hexToBytes(toSign);
var signed = fromNode.sign(uToSign);
var signedHex = bufferToHex(signed);
var signedHexNo0x = signedHex.substring(2);
where fromNode is the bip32.BIP32 node. Output is signedHexNo0x = 2711792b72547d2a1730a319bd219854f0892451b8bc2ab8c17ec0c6cba4ecc458f675ca0af3db455913e59dadc7c5e0bd0bf1b8ef8c13e830a627a18ac375ab.
At first sight they seem to be completely different buffers, but on closer inspection the Blockcypher signer output only has a few extra characters compared to the bip32 output:
Blockcypher signer output (I split it into several lines for you to see it clearly):
30440220
2711792b72547d2a1730a319bd219854f0892451b8bc2ab8c17ec0c6cba4ecc4
0220
58f675ca0af3db455913e59dadc7c5e0bd0bf1b8ef8c13e830a627a18ac375ab
bip32 output (also intentionally split):
2711792b72547d2a1730a319bd219854f0892451b8bc2ab8c17ec0c6cba4ecc4
58f675ca0af3db455913e59dadc7c5e0bd0bf1b8ef8c13e830a627a18ac375ab
I'd expect two 64-character numbers to give a 128-character signature, which the bip32 output matches. The Blockcypher signer output, however, has 140 characters, i.e. 12 more than the former, which is clear when it is split into lines as above.
I'd be really thankful to anyone throwing some light on this issue, which I need to understand and correct. I need to implement the solution in dart, I cannot use the signer script other than for testing.
The dart bip32 package doesn't seem to encode the signature in DER format, but rather as a plain (r, s) concatenation. However, DER is what Bitcoin requires. For more information see:
https://bitcoin.stackexchange.com/questions/92680/what-are-the-der-signature-and-sec-format
You can either add the DER framing bytes yourself around your r and s, or check whether the dart bip32 library offers a DER encoding.
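For reference, here is a minimal Python sketch of that DER framing (to be ported to Dart). It assumes the raw signature is the 64-byte (r, s) concatenation obtained above, and it reproduces the extra 3044 / 0220 bytes visible in the Blockcypher signer output:
# Minimal DER framing for a 64-byte (r, s) signature. Short-form lengths are
# enough here because 32-byte r and s always keep the body under 128 bytes.
def der_encode(r: bytes, s: bytes) -> bytes:
    def encode_int(x: bytes) -> bytes:
        x = x.lstrip(b"\x00") or b"\x00"
        if x[0] & 0x80:                         # prepend 0x00 so the value isn't read as negative
            x = b"\x00" + x
        return b"\x02" + bytes([len(x)]) + x    # 0x02 = DER INTEGER tag
    body = encode_int(r) + encode_int(s)
    return b"\x30" + bytes([len(body)]) + body  # 0x30 = DER SEQUENCE tag

raw = bytes.fromhex(
    "2711792b72547d2a1730a319bd219854f0892451b8bc2ab8c17ec0c6cba4ecc4"
    "58f675ca0af3db455913e59dadc7c5e0bd0bf1b8ef8c13e830a627a18ac375ab"
)
print(der_encode(raw[:32], raw[32:]).hex())
# 30440220<r>0220<s>, matching the Blockcypher signer output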
I understand how to get the idf values and the vocabulary from the vectorizer. The vocabulary is a dictionary with the word as the key and its frequency as the value; however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
import math

total_dict = {}
for string in text_array:
    for word in string.split():              # iterate over words, not characters
        if word not in total_dict:           # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1

# calculate the idf of each word in the dictionary, following https://nlpforhackers.io/tf-idf/
for word in total_dict:
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can directly use vectorizer.fit_transform(text) the first time.
What it does is build a vocabulary from all the words/tokens in the text.
You can then use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation: fit() learns the vocabulary and idf from the training set; transform() transforms documents based on the vocabulary learned by the previous fit(). So you should only call fit() once, and can call transform() many times.
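To get the word-to-idf dictionary the question asks for, a small sketch reusing the fitted vectorizer from the code above is to zip the learned feature names with idf_ (on older scikit-learn versions, get_feature_names_out() is called get_feature_names()):
# Pair each vocabulary term with its learned inverse document frequency
word_to_idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
print(word_to_idf["fox"])  # idf of "fox" across the three example documents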
I have the following code, which takes a mix of categorical and numeric features, string-indexes the categorical features, one-hot encodes them, assembles both the one-hot encoded categorical features and the numeric features into a single vector, runs that through a random forest, and prints the resulting tree. I want the tree nodes to show the original feature names (i.e. Frame_Size etc.). How can I do that? In general, how can I decode one-hot encoded and assembled features?
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, IndexToString
from pyspark.ml.classification import RandomForestClassifier as RF

# categorical features: string indexing and one-hot encoding
column_vec_in = ['Commodity', 'Frame_Size', 'Frame_Shape', 'Frame_Color', 'Frame_Color_Family', 'Lens_Color', 'Frame_Material', 'Frame_Material_Summary', 'Build', 'Gender_Global', 'Gender_LC']  # Article_Desc not selected because its cardinality is too high
column_vec_out = ['Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec', 'Frame_Color_Family_catVec', 'Lens_Color_catVec', 'Frame_Material_catVec', 'Frame_Material_Summary_catVec', 'Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
indexers = [StringIndexer(inputCol=x, outputCol=x+'_tmp') for x in column_vec_in ]
encoders = [OneHotEncoder(dropLast=False, inputCol=x+"_tmp", outputCol=y) for x,y in zip(column_vec_in, column_vec_out)]
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
#categorical and numeric features
cols_now = ['SODC_Regular_Rate','Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
labelIndexer = StringIndexer(inputCol='Lens_Article_Description_reduced', outputCol="label")
tmp += [assembler_features, labelIndexer]
# converter = IndexToString(inputCol="featur", outputCol="originalCategory")
# converted = converter.transform(indexed)
pipeline = Pipeline(stages=tmp)
all_data = pipeline.fit(df_random_forest_P_limited).transform(df_random_forest_P_limited)
all_data.cache()
trainingData, testData = all_data.randomSplit([0.8,0.2], seed=0)
rf = RF(labelCol='label', featuresCol='features',numTrees=10,maxBins=800)
model = rf.fit(trainingData)
print(model.toDebugString)
After I run the Spark machine learning pipeline I want to print out the random forest as a tree. Currently the debug string refers to the features only by index (feature 0, feature 1, ...).
What I actually want to see is the original categorical feature names instead of feature 1, feature 2, etc. The fact that the categorical features are one-hot encoded and vector-assembled makes it hard for me to un-index/decode the feature names. How can I un-index/decode one-hot encoded and assembled feature vectors in PySpark? I have a vague idea that I have to use IndexToString(), but I am not exactly sure, because there is a mix of numeric and categorical features and they are one-hot encoded and assembled.
Export the Apache Spark ML pipeline to a PMML document using the JPMML-SparkML library. A PMML document can be inspected and interpreted by humans (e.g. using Notepad), or processed programmatically (e.g. using other Java PMML API libraries).
The "model schema" is represented by the /PMML/MiningModel/MiningSchema element. Each "active feature" is represented by a MiningField element; you can retrieve their "type definitions" by looking up the corresponding /PMML/DataDictionary/DataField element.
Edit: Since you were asking about PySpark, you might consider using the JPMML-SparkML-Package package for export.
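If you'd rather stay inside PySpark instead of going through PMML, here is a rough sketch of an alternative approach (not part of the PMML answer above, and assuming the all_data and model variables from the question): VectorAssembler records a name for every vector position in the ML attribute metadata of the features column, and those names can be substituted into the debug string.
# Build an index -> original feature name map from the 'features' column metadata
attrs = all_data.schema["features"].metadata["ml_attr"]["attrs"]
idx_to_name = {attr["idx"]: attr["name"]
               for attr_group in attrs.values()   # "numeric", "binary", ... groups
               for attr in attr_group}

# Substitute the recovered names into the printed tree
tree_str = model.toDebugString
for idx, name in sorted(idx_to_name.items(), reverse=True):
    tree_str = tree_str.replace("feature %d " % idx, "%s " % name)
print(tree_str)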