pyspark unindex one-hot encoded and assembled columns

I have the following code, which takes in a mix of categorical and numeric features, string-indexes the categorical features, one-hot encodes them, assembles both the one-hot encoded categorical features and the numeric features into a single vector, runs them through a random forest, and prints the resulting tree. I want the tree nodes to show the original feature names (i.e. Frame_Size, etc.). How can I do that? In general, how can I decode one-hot encoded and assembled features?
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier as RF

# categorical features: start string indexing and one-hot encoding
column_vec_in = ['Commodity', 'Frame_Size', 'Frame_Shape', 'Frame_Color', 'Frame_Color_Family', 'Lens_Color', 'Frame_Material', 'Frame_Material_Summary', 'Build', 'Gender_Global', 'Gender_LC'] # frame Article_Desc not selected because the cardinality is too high
column_vec_out = ['Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec', 'Frame_Color_Family_catVec', 'Lens_Color_catVec', 'Frame_Material_catVec', 'Frame_Material_Summary_catVec', 'Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
indexers = [StringIndexer(inputCol=x, outputCol=x + '_tmp') for x in column_vec_in]
encoders = [OneHotEncoder(dropLast=False, inputCol=x + '_tmp', outputCol=y) for x, y in zip(column_vec_in, column_vec_out)]
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
#categorical and numeric features
cols_now = ['SODC_Regular_Rate','Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
labelIndexer = StringIndexer(inputCol='Lens_Article_Description_reduced', outputCol="label")
tmp += [assembler_features, labelIndexer]
# converter = IndexToString(inputCol="featur", outputCol="originalCategory")
# converted = converter.transform(indexed)
pipeline = Pipeline(stages=tmp)
all_data = pipeline.fit(df_random_forest_P_limited).transform(df_random_forest_P_limited)
all_data.cache()
trainingData, testData = all_data.randomSplit([0.8,0.2], seed=0)
rf = RF(labelCol='label', featuresCol='features',numTrees=10,maxBins=800)
model = rf.fit(trainingData)
print(model.toDebugString)
After I run the Spark machine learning pipeline I want to print out the random forest as a tree. Currently the output looks like below.
What I actually want to see are the original categorical feature names instead of feature 1, feature 2, etc. The fact that the categorical features are one-hot encoded and vector-assembled makes it hard for me to unindex/decode the feature names. How can I unindex/decode one-hot encoded and assembled feature vectors in PySpark? I have a vague idea that I have to use IndexToString(), but I am not exactly sure, because there is a mix of numeric and categorical features and they are one-hot encoded and assembled.

Export the Apache Spark ML pipeline to a PMML document using the JPMML-SparkML library. A PMML document can be inspected and interpreted by humans (e.g. using Notepad), or processed programmatically (e.g. using other Java PMML API libraries).
The "model schema" is represented by the /PMML/MiningModel/MiningSchema element. Each "active feature" is represented by a MiningField element; you can retrieve their "type definitions" by looking up the corresponding /PMML/DataDictionary/DataField element.
Edit: Since you are asking about PySpark, you might consider using the JPMML-SparkML-Package package for the export.
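Alternatively (this is not part of the answer above), a minimal PySpark sketch that reads the ML attribute metadata which VectorAssembler attaches to the assembled features column; the expanded names shown (e.g. Frame_Size_catVec_Medium) are illustrative, but the metadata lookup itself maps each "feature N" in model.toDebugString back to a name:
# Hedged sketch: recover names for 'feature N' from the metadata of the
# assembled 'features' column produced by the pipeline above.
attrs = all_data.schema["features"].metadata["ml_attr"]["attrs"]
# attrs typically holds "numeric" and "binary" groups, each a list of
# {"idx": ..., "name": ...} dictionaries
index_to_name = {a["idx"]: a["name"] for group in attrs.values() for a in group}
for idx in sorted(index_to_name):
    print(idx, index_to_name[idx])  # e.g. 12 Frame_Size_catVec_Medium (name shown is illustrative)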

Related

Need to create dictionary of idf values, associating words with their idf values

I understand how to get the idf values and the vocabulary using the vectorizer. With vocabulary_, the word is the key of the dictionary and the value is a number; however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
import math

total_dict = {}
for string in text_array:
    for word in string.split():  # split each document string into words
        if word not in total_dict.keys():  # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1
for word in total_dict.keys():  # calculate the tf-idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can directly use vectorizer.fit_transform(text) the first time.
What it does is build a vocabulary set from all the words/tokens in the text.
Then you can use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation:
fit() learns the vocabulary and idf from the training set; transform() transforms documents based on the vocabulary learned by the previous fit().
So you should only call fit() once, and you can call transform() many times.
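To get the word-to-idf dictionary the question asks for, a small sketch (assuming the fitted vectorizer from the code above): vocabulary_ maps each term to its column index, and idf_ is indexed by those same columns, so the two can be paired directly.
# Build {word: idf} by pairing each term's column index with the idf_ array
word_to_idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}
print(word_to_idf)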

Write Matrix Data to Each Member of Datatype in HDF5 file via MATLAB

This is my first go at trying to create an HDF5 file from scratch using the low-level commands via MATLAB.
My issue is that I am having a hard time writing data to each specific member of the datatype in my dataset.
First, I create a new HDF5 file, and set the right layer of groups:
new_h5 = H5F.create('new_hdf5_file.h5','H5F_ACC_TRUNC','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'first','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'second','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
Then, I create my datatype:
datatype = H5T.create('H5T_compound',20);
H5T.insert(datatype,'first_element',0,'H5T_NATIVE_INT');
H5T.insert(datatype,'second_element',4,'H5T_NATIVE_DOUBLE');
H5T.insert(datatype,'third_element',12,'H5T_NATIVE_DOUBLE');
Then, I format that into my dataset:
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create('H5S_SCALAR'),'H5P_DEFAULT');
subset = H5D.get_type(H5D.open(new_h5,'/first/second/location'));
mem_type = H5T.get_member_type(subset,0);
I receive an error with the following command:
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data);
Error using hdf5lib2
Unhandled HDF5 class (H5T_NO_CLASS) encountered. It is not possible to write to this attribute or dataset.
So, I try this method instead:
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create_simple(2,dims,dims),'H5P_DEFAULT'); %where dims are the dimensions of all matrices of data structure
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data); %where data is a structure
I receive an error with this following command:
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data);
Error using hdf5lib2
Attempted to transfer too many values to or from the library buffer.
When looking here for the XML tags for the error messages, it describes the above error as "illegalArrayAccess." Apparently, according to this question, you can only write to 4 members without the buffer throwing an error?
Is this correct? How can I correctly write to each member? I am about to reach my mental limit trying to figure this one out.
EDIT:
References kept here for general information:
HDF5 Compound Datatypes Example
HDF5 Compound Datatypes
H5D.write MATLAB Command
I found out why I could not write the data, and I have solved the problem: I had my dimensions set incorrectly (code I forgot to include originally). My apologies. I had my dimensions like this:
dims = fliplr(size(data_matrix));
where data_matrix was a 15x250 matrix, so dims came out as [250 15]. The error was that the buffer then tried to write a 250x15 matrix for each member, when it only had data for a 250x1 column per member.
The following code will (generically) work for writing data to each member:
new_h5 = H5F.create('new_hdf5_file.h5','H5F_ACC_TRUNC','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'first','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'second','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
datatype = H5T.create('H5T_compound',20);
H5T.insert(datatype,'first_element',0,'H5T_NATIVE_INT');
H5T.insert(datatype,'second_element',4,'H5T_NATIVE_DOUBLE');
H5T.insert(datatype,'third_element',12,'H5T_NATIVE_DOUBLE');
dims = fliplr(size(data_matrix)); dims = [1 dims(1,2)];
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create_simple(2,dims,dims),'H5P_DEFAULT');
H5D.write(new_h5,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data_structure);
where data_matrix is a 15x250 matrix containing all the data, and data_structure is a structure containing 15 fields, each 250x1 in size.

Is it possible to load word2vec pre-trained available vectors into spark?

Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to do the loading and the operations from scratch?
In this post, Load Word2Vec model in Spark, Tom Lous suggests converting the bin file to txt and starting from there; I have already done that, but then what is next?
In a question I posted yesterday I got an answer saying that models in Parquet format can be loaded in Spark, so I'm posting this question to make sure there is no other option.
Disclaimer: I'm pretty new to spark, but the below at least works for me.
The trick is figuring out how to construct a Word2VecModel from a set of word vectors as well as handling some of the gotchas in trying to create the model this way.
First, load your word vectors into a Map. For example, I have saved my word vectors to a parquet format (in a folder called "wordvectors.parquet") where the "term" column holds the String word and the "vector" column holds the vector as an array[float], and I can load it like so in Java:
// Loads the dataset with the "term" column holding the word and the "vector" column
// holding the vector as an array[float]
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");
// convert the dataset to a map
Map<String, List<Float>> vectorMap = Arrays.stream((Row[]) vectorModel.collect())
        .collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));
// convert to the format that the word2vec model expects: float[] rather than List<Float>
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
        .collect(Collectors.toMap(Map.Entry::getKey, entry -> (float[]) Floats.toArray(entry.getValue())));
// need to convert to a scala immutable map because that's what word2vec needs
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    final List<Tuple2<K, V>> list = pFromMap.entrySet().stream()
            .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    Seq<Tuple2<K, V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}
Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and do so in a weird way; otherwise it defaults to 100 and you get an error when trying to invoke .transform(). Here is a way I've found that works; I'm not sure if everything in it is necessary:
// not used for fitting, only used for setting the vector size param (not sure if this is needed or if result.set is enough)
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);
Word2VecModel result = new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);
Now you should be able to use result.transform() as you would with a self-trained model.
I haven't tested the other Word2VecModel functions to see if they work correctly; I have only tested .transform().
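Not part of the answer above, but for a PySpark-only route here is a rough sketch of the brute-force alternative: load the converted text file of vectors and compute a findSynonyms-style cosine similarity by hand. The file layout (one "word v1 v2 ... v300" line per row) is an assumption based on the txt conversion mentioned in the question.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def load_vectors(path):
    # parse each "word v1 v2 ... vN" line into a (word, numpy vector) pair
    def parse(line):
        parts = line.rstrip().split(" ")
        return parts[0], np.array(parts[1:], dtype=np.float32)
    return spark.sparkContext.textFile(path).map(parse)

def find_synonyms(vectors_rdd, query_vec, top_n=10):
    # brute-force cosine similarity of the query vector against every stored vector
    qnorm = float(np.linalg.norm(query_vec))
    sims = vectors_rdd.map(lambda wv: (wv[0], float(np.dot(wv[1], query_vec)) / (float(np.linalg.norm(wv[1])) * qnorm)))
    return sims.takeOrdered(top_n, key=lambda x: -x[1])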

How to give column names after one hot encoding with sklearn?

Here is my question; I hope someone can help me figure it out.
To explain, there are more than 10 categorical columns in my data set and each of them has 200-300 categories. I want to convert them into binary values. For that, I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.
After the LabelEncoder, I used OneHotEncoder from scikit-learn and it worked. BUT THE PROBLEM IS, I need the column names after one-hot encoding. For example, take column A with categorical values before encoding:
A = [1,2,3,4,..]
After encoding it should look like this:
A-1, A-2, A-3
Does anyone know how to assign column names (old column name - value name or number) after one-hot encoding? Here is my one-hot encoding and its output;
I need columns with names because I trained an ANN, but every time new data comes in I cannot convert all of the past data again and again. So I just want to add the new columns each time. Thanks anyway.
As @Vivek Kumar mentioned, you can use the pandas function get_dummies() instead of OneHotEncoder. I wanted to preserve a version of my initial DataFrame, so I did the following:
import pandas as pd
DataFrame2 = pd.get_dummies(DataFrame)
I used the following code to rename each one-hot encoded column to "original name_one-hot encoded name". So for your example it would give A_1, A_2, A_3. Feel free to change the "_" below to "-".
import numpy as np
import pandas as pd

# Create list of columns with "object" dtype
cat_cols = [col for col in df_pro.columns if df_pro[col].dtype == object]  # np.object is deprecated in recent NumPy
# Find the array of new columns from one-hot encoding
cat_labels = ohenc.categories_
# Convert the array of categories into a flat list
cat_labels = np.concatenate(cat_labels).ravel().tolist()
# Use list comprehension to generate the new list of "column_label" names
cat_labels_new = [(col + "_" + label) for label in cat_labels for col in cat_cols if label in df_pro[col].values.tolist()]
# Create new DataFrame of transformed columns using the new labels
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new)
# Concat with original DataFrame and drop original columns (only columns with "object" dtype)
df_pro = pd.concat([df_pro.drop(columns=cat_cols), cat_ohc], axis=1)
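Alternatively (not shown in the answer above), recent scikit-learn versions can produce the prefixed names directly from the fitted encoder; a small sketch assuming the same ohenc, cat_cols and cat_arr as above:
# OneHotEncoder can emit names like "A_1", "A_2" itself (scikit-learn >= 1.0)
cat_labels_new = ohenc.get_feature_names_out(cat_cols)
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new)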

Modelica vector parameters from a file

Is it possible to read a vector of parameters from a file?
I'm trying to create a vector of objects, as shown in the linked document starting on page 49. However, I would like to pull the specific resistance and capacitance values from a text file. (I'm only using this as an example of how to read the values in.)
So, the example fills in data like this:
A.Basic.Resistor R[N + 1](R = vector([Re/2; fill(Re,N-1); Re/2]) );
A.Basic.Capacitor C[N](each C = c*L/N);
But instead I have a text file that contains something like the following, where the first column is the index, the second the R values, and the third the C values:
#1
double test1(4,3) #First set of data (row then col)
1.0 1.0 10.0
2.0 2.0 30.0
3.0 5.0 50.0
4.0 7.0 100.0
I know that I can read this data in using a CombiTable1D or CombiTable2D. But is there a way to essentially convert each column of data to a vector, so that I can do something analogous to:
ReadInTableFromDisk
A.Basic.Resistor R[N + 1](R = FirstDataColumnOfDataOnDisk );
A.Basic.Capacitor C[N](each C = SecondDataColumnOfDataOnDisk);
I would recommend the ExternData library if you want to load external data files into your Modelica tool. It is a Modelica library for data I/O of INI, JSON, XML, MATLAB MAT and Excel XLS/XLSX files.
There is also the vector() function, which converts arrays to vectors.