RandomForestClassifier has no attribute transform, so how to get predictions? - pyspark

How do you get predictions out of a RandomForestClassifier? Loosely following the latest docs here, my code looks like...
# Split the data into training and test sets (30% held out for testing)
SPLIT_SEED = 64 # some const seed just for reproducibility
(trainingData, testData) = df.randomSplit([TRAIN_RATIO, 1-TRAIN_RATIO], seed=SPLIT_SEED)
print(f"Training set ({trainingData.count()}):")
print(f"Test set ({testData.count()}):")
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)
preds = rf.transform(testData)
When running this, I get the error
AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Examining the Python API docs, I see nothing that looks like it relates to generating predictions from the trained model (nor feature importance, for that matter). I don't have much experience with MLlib, so I'm not sure what to make of this. Does anyone with more experience know what to do here?

Look closely at the example in the documentation:
>>> model = rf.fit(td)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
You will notice that rf.fit returns a fitted model, which is a different object from the original RandomForestClassifier.
The fitted model has the transform method and also exposes featureImportances.
So in your code:
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)
model = rf.fit(trainingData)
preds = model.transform(testData)
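The split between an unfitted estimator and the fitted model it returns can be mimicked in plain Python. The classes below are hypothetical stand-ins, not Spark code; they only illustrate why calling transform on the estimator raises AttributeError while calling it on the result of fit works.

```python
class ToyModel:
    """Fitted state: remembers the majority label, so it can predict."""

    def __init__(self, majority_label):
        self.majority_label = majority_label

    def transform(self, rows):
        # Attach the same (majority) prediction to every row.
        return [(row, self.majority_label) for row in rows]


class ToyClassifier:
    """Unfitted estimator: it can be fitted, but it has no transform()."""

    def fit(self, labels):
        majority = max(set(labels), key=labels.count)
        return ToyModel(majority)


estimator = ToyClassifier()
model = estimator.fit([1, 0, 1, 1])

print(hasattr(estimator, "transform"))  # False: only the fitted model predicts
print(model.transform(["a", "b"]))
```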


How to know that the token ids in a gensim pre-trained word2vec will match the ids of a tokenizer's vocabulary

I am building a PyTorch BiLSTM that uses pre-trained gensim word2vec embeddings. I first used an nn.Embedding layer that was trained from scratch with the model, but I decided to switch to pre-trained word2vec embeddings to improve accuracy.
My model architecture follows a simple BiLSTM architecture, where the first layer is the embedding layer followed by a BiLSTM layer(s), and lastly two feed forward layers.
import torch
import gensim
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
word2vec = gensim.models.Word2Vec.load('path_to_word2vec/wikipedia_cbow_100')
weights = torch.FloatTensor(word2vec.wv.vectors)
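class BiLSTM_model(torch.nn.Module):
    def __init__(self, max_features, embedding_dim, hidden_dim, num_layers, lstm_dropout):
        super().__init__()  # must run before submodules are registered
        # max_features is the vocabulary size (num of tokens/words).
        # self.embeddings = nn.Embedding(max_features, embedding_dim, padding_idx=0)
        self.embeddings = nn.Embedding.from_pretrained(weights)
        # Only the input size and num_layers survived in the post; the other
        # LSTM arguments below are assumed from the rest of the model.
        self.lstm = nn.LSTM(word2vec.wv.vector_size,
                            hidden_dim,
                            num_layers = num_layers,
                            dropout = lstm_dropout,
                            bidirectional = True,
                            batch_first = True)
        self.relu = nn.ReLU()  # used in forward(); its definition was missing
        self.fc1 = nn.Linear(hidden_dim * 2, 64)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, config['num_classes'])

    def forward(self, input):
        embeddings_out = self.embeddings(input)
        lstm_out, (hidden, cell) = self.lstm(embeddings_out)
        # Concatenate the last layer's forward and backward hidden states.
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        rel = self.relu(hidden)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        final_out = self.fc2(drop)
        return final_out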
I use a Keras tokenizer to tokenize the text and obtain the token ids.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
## Tokenize the sentences
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(train_X)  # builds the vocabulary; this call was missing from the snippet
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
Finally, I use a standard training loop with an optimizer and a loss function. The code runs fine, but there are no performance gains from using the pre-trained embeddings.
I suspect this is because the token ids do not match between the keras.preprocessing.text tokenizer and the gensim pre-trained embeddings. My question is: how do I confirm (or rule out) this inconsistency and, if it is the case, how do I handle it?
Note: I am using custom word2vec embeddings for the Arabic language. You can find the embeddings here.
After looking into jhso's comment, it seems that the solution to this problem is to use word2vec.wv.index2word, which returns the vocabulary (words) as a list ordered so that a word's position matches the row of its embedding.
For example, the following code:
pretrained_embedding = gensim.models.Word2Vec.load('path/to/embedding')
word_vectors= pretrained_embedding.wv
for i in range(0, 4):
    print(f"{i}: '{word_vectors.index2word[i]}'")
will print:
0: 'this'
1: 'is'
2: 'an'
3: 'example'
where the token 'this' has the id 0, and so on.
You then pass the index2word list to the keras.preprocessing.text.Tokenizer object's .fit_on_texts() method as follows:
vocabulary = word_vectors.index2word
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(vocabulary)
This should keep the token order consistent between the gensim word2vec model and the Keras tokenizer. Note, however, that the Keras tokenizer numbers words starting from 1 (0 is reserved), so check for an offset of one against the embedding rows.
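One way to confirm or rule out the mismatch, and to sidestep the tokenizer altogether, is to build the word-to-id mapping directly from the embedding's vocabulary list. The following is a minimal sketch under a few assumptions: index2word is treated as an ordinary Python list (in gensim 4.x the same list is, as far as I know, exposed as index_to_key), the toy vocabulary stands in for the real one, and the helper names texts_to_ids and mismatches as well as the unk_id fallback are made up for illustration.

```python
# Stand-in for word_vectors.index2word: position i is the row of word i
# in the embedding matrix. (Toy data; the real list comes from the model.)
index2word = ["this", "is", "an", "example"]

# Row index of every word, so ids line up with nn.Embedding.from_pretrained.
word_to_id = {word: i for i, word in enumerate(index2word)}


def texts_to_ids(texts, word_to_id, unk_id=None):
    """Map whitespace-tokenized texts to embedding row ids.

    Unknown words are dropped when unk_id is None, otherwise mapped to unk_id.
    """
    sequences = []
    for text in texts:
        ids = []
        for token in text.split():
            if token in word_to_id:
                ids.append(word_to_id[token])
            elif unk_id is not None:
                ids.append(unk_id)
        sequences.append(ids)
    return sequences


def mismatches(tokenizer_word_index, word_to_id):
    """Return words whose tokenizer id differs from their embedding row."""
    return {
        word: (tok_id, word_to_id[word])
        for word, tok_id in tokenizer_word_index.items()
        if word in word_to_id and tok_id != word_to_id[word]
    }


sequences = texts_to_ids(["this is an example", "an unseen example"], word_to_id)
print(sequences)  # [[0, 1, 2, 3], [2, 3]]
```

Calling mismatches(tokenizer.word_index, word_to_id) on a fitted Keras tokenizer would list every word whose id disagrees with its embedding row; an empty result means the two vocabularies line up.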

How can I get all outputs of the last transformer encoder in a BERT pretrained model and not just the CLS token output?

I'm using PyTorch, and this is the model from Hugging Face transformers:
from transformers import BertTokenizerFast, BertForSequenceClassification
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased",
and in the forward function I'm building, I'm calling x1, x2 = self.bert(sent_id, attention_mask=mask)
Now, as far as I know, x2 is the CLS output (which is the output of the first transformer encoder), but then again, I don't think I understand the output of the model.
What I want is the outputs of all 12 transformer encoders.
How can I do that in PyTorch?
Ideally, if you want to look into the outputs of all the layers, you should use BertModel and not BertForSequenceClassification, because BertForSequenceClassification wraps a BertModel and adds a linear classification layer on top of it.
from transformers import BertModel
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
### Add your code to map the model to device, data to device, and obtain input_ids and mask
sequence_output, pooled_output = my_bert_model(ids, attention_mask=mask)
# sequence_output has the following shape: (batch_size, sequence_length, 768), which contains output for all tokens in the last layer of the BERT model.
sequence_output contains output for all tokens in the last layer of the BERT model.
In order to obtain the outputs of all the transformer encoder layers, you can use the following:
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
sequence_output, pooled_output, all_layer_output = my_bert_model(ids, attention_mask=mask, output_hidden_states=True)
all_layer_output is a tuple containing the output of the embedding layer followed by the outputs of all the encoder layers. Each element in the tuple has shape (batch_size, sequence_length, 768).
Hence, to get the sequence of outputs at layer 5, you can use all_layer_output[5], since all_layer_output[0] contains the output of the embeddings.
This is detailed in the docs: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.
from transformers import BertModel, BertConfig
config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
outputs = model(inputs)
print(len(outputs)) # 3
hidden_states = outputs[2]
print(len(hidden_states)) # 13
embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]

Issue/Bug when loading and applying MultilayerPerceptronClassifier in Spark Version 3.0.0

IllegalArgumentException: MultilayerPerceptronClassifier_... parameter solver given invalid value auto
I believe I have discovered a bug when loading MultilayerPerceptronClassificationModel in Spark 3.0.0, Scala 2.12, which I have tested and can see is not present in at least Spark 2.4.3, Scala 2.11.
I am using PySpark on a Databricks cluster and importing the class with "from pyspark.ml.classification import MultilayerPerceptronClassificationModel".
When running model = MultilayerPerceptronClassificationModel.load(...) and then model.transform(df), I get the following error: IllegalArgumentException: MultilayerPerceptronClassifier_8055d1368e78 parameter solver given invalid value auto.
This issue can be easily replicated by running the example given in the Spark documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
Then add save-model, load-model and transform statements, like so:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load training data
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_multiclass_classification_data.txt")
# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
# train the model
model = trainer.fit(train)
# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
from pyspark.ml.classification import MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel
# Save the fitted model, load it back, and apply the loaded copy
model.save(Save_location)
model2 = MultilayerPerceptronClassificationModel.load(Save_location)
result_from_loaded = model2.transform(test)
The bug has been confirmed and a Jira issue opened: https://issues.apache.org/jira/browse/SPARK-32232

How to implement an exponentially decaying learning rate in Keras by following the global steps

Look at the following example
# encoding: utf-8
import numpy as np
import pandas as pd
import random
import math
from keras import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam, RMSprop
from keras.callbacks import LearningRateScheduler
X = [i*0.05 for i in range(100)]
def step_decay(epoch):
    initial_lrate = 1.0
    drop = 0.5
    epochs_drop = 2.0
    lrate = initial_lrate * math.pow(drop,
                                     math.floor((1+epoch)/epochs_drop))
    return lrate
def build_model():
    model = Sequential()
    model.add(Dense(32, input_shape=(1,), activation='relu'))
    model.add(Dense(1, activation='linear'))
    adam = Adam(lr=0.5)
    model.compile(loss='mse', optimizer=adam)
    return model
model = build_model()
lrate = LearningRateScheduler(step_decay)
callback_list = [lrate]
for ep in range(20):
    X_train = np.array(random.sample(X, 10))
    y_train = np.sin(X_train)
    X_train = np.reshape(X_train, (-1,1))
    y_train = np.reshape(y_train, (-1,1))
    model.fit(X_train, y_train, batch_size=2, callbacks=callback_list,
              epochs=1, verbose=2)
In this example, the LearningRateScheduler does not change the learning rate at all because in each iteration of ep, epochs=1. Thus the learning rate stays constant (1.0, according to step_decay). Instead of setting epochs>1 directly, I have to use an outer loop as shown in the example, and inside each iteration I run just 1 epoch. (This is the case when I implement deep reinforcement learning instead of supervised learning.)
My question is how to set an exponentially decaying learning rate in my example, and how to get the learning rate in each iteration of ep.
The function you pass to the LearningRateScheduler can actually take two arguments.
According to Keras documentation, the scheduler is
a function that takes an epoch index as input (integer, indexed from
0) and current learning rate and returns a new learning rate as output
So, basically, simply replace your initial_lrate with a function parameter, like so:
def step_decay(epoch, lr):
    # initial_lrate = 1.0 # no longer needed
    drop = 0.5
    epochs_drop = 2.0
    lrate = lr * math.pow(drop,math.floor((1+epoch)/epochs_drop))
    return lrate
The actual function you implement is not exponential decay (as you mention in your title) but a staircase function.
Also, you mention your learning rate does not change inside your loop. That's true because you set model.fit(..., epochs=1,...) and your epochs_drop = 2.0 at the same time. I am not sure this is your desired case or not. You are providing a toy example and it's not clear in that case.
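To see why the rate never moves in the question's loop, it helps to evaluate the original one-argument step_decay by hand. This is a small sketch using the constants from the question; feeding it the outer-loop counter ep instead of the epoch index Keras passes in is my own illustration, not something the question's code does.

```python
import math


def step_decay(epoch, initial_lrate=1.0, drop=0.5, epochs_drop=2.0):
    # Halve the rate every `epochs_drop` epochs (a staircase, not a smooth curve).
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))


# Inside fit(..., epochs=1) the callback only ever sees epoch index 0.
seen_by_keras = [step_decay(0) for ep in range(6)]
# Feeding the outer-loop counter instead produces the intended staircase.
seen_by_loop = [step_decay(ep) for ep in range(6)]

print(seen_by_keras)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(seen_by_loop)   # [1.0, 0.5, 0.5, 0.25, 0.25, 0.125]
```

One way to get the same effect inside Keras, if I read the fit documentation correctly, is to call model.fit(..., initial_epoch=ep, epochs=ep+1) so that the scheduler receives ep as its epoch index.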
I would like to add the more common case where you don't mix a for loop with fit() and just provide a different epochs parameter in your fit() function. In this case you have the following options:
First of all, Keras provides a decay option itself in its predefined optimizers. For example, in your case with Adam(), the actual code is:
lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
which is not exactly exponential either, and differs somewhat from TensorFlow's version. Also, it is obviously applied only when decay > 0.0.
To follow the TensorFlow convention of exponential decay, you should implement:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
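TensorFlow-style exponential decay multiplies the initial rate by decay_rate raised to the power global_step / decay_steps. A minimal plain-Python sketch of that rule follows; the function name and the sample numbers are illustrative only, and the staircase flag mirrors the option of the same name in TensorFlow's exponential decay.

```python
import math


def exponential_decay(learning_rate, decay_rate, global_step, decay_steps,
                      staircase=False):
    """Return learning_rate * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # Integer division of the step count gives a piecewise-constant rate.
        exponent = math.floor(exponent)
    return learning_rate * decay_rate ** exponent


# The rate halves every 10 steps when decay_rate=0.5 and decay_steps=10.
rates = [exponential_decay(0.1, 0.5, step, decay_steps=10) for step in (0, 10, 20)]
print(rates)
```

Note that Python spells exponentiation as **; the ^ in the formula is mathematical notation, and in Python code it would be a bitwise XOR.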
Depending on your needs, you could implement a Callback subclass and define a function within it (as described below), or use LearningRateScheduler, which is exactly this with some checking: a Callback subclass that updates the learning rate at the beginning of each epoch.
If you want finer control over your learning rate policy (per batch, for example), you would have to implement your own subclass, since as far as I know there is no built-in subclass for this task. The good part is that it's super easy:
Create a subclass
class LearningRateExponentialDecay(Callback):
and add the __init__() function, which initializes your instance with all the needed parameters and also creates a global_step variable to keep track of the iterations (batches):
    def __init__(self, init_learining_rate, decay_rate, decay_steps):
        super().__init__()
        self.init_learining_rate = init_learining_rate
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps
        self.global_step = 0
Finally, add the actual function inside the class:
    def on_batch_begin(self, batch, logs=None):
        # Decay from the initial rate; decaying the current rate would compound on every batch.
        # ** is exponentiation in Python (^ would be a bitwise XOR).
        decayed_learning_rate = self.init_learining_rate * self.decay_rate ** (self.global_step / self.decay_steps)
        K.set_value(self.model.optimizer.lr, decayed_learning_rate)
        self.global_step += 1
The really cool part is that if you want the above subclass to update every epoch, you can use on_epoch_begin(self, epoch, logs=None), which conveniently takes epoch as a parameter in its signature. This case is even easier, as you can skip global_step altogether (no need to keep track of it unless you want a fancier way to apply your decay) and use epoch in its place.

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train decision tree, random forest and gradient-boosted tree classifiers. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark is telling me that DecisionTree currently only supports maxDepth <= 30. What is the reason for limiting it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDFs, so I want to try higher values for the maximum depth. Sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
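paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [50,100,150,250,300]) \
    .build()
crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             # evaluator is assumed; this argument was lost from the post
                             evaluator=MulticlassClassificationEvaluator(labelCol="indexedLabel"),
                             numFolds= 5)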
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implementation imposes a restriction of maxDepth <= 30:
You could ask for that limit to be increased on the GitHub forum!
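A likely reason for the cap, offered here as an assumption rather than something the quoted thread states: if tree nodes are numbered like a complete binary tree (root 1, children 2i and 2i+1) and the index is stored in a signed 32-bit integer, then depth 30 is the deepest tree whose largest node index still fits. The arithmetic can be checked in a few lines; the function names are made up for illustration.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def max_node_index(depth):
    # With the root numbered 1 at depth 0, the last node of a full tree
    # of the given depth has index 2**(depth + 1) - 1.
    return 2 ** (depth + 1) - 1


def fits_in_int32(depth):
    return max_node_index(depth) <= INT32_MAX


print(fits_in_int32(30), fits_in_int32(31))  # True False
```

Under that assumption the limit cannot be lifted through a parameter, so the grid in the question would have to stay at or below 30, for example [5, 10, 20, 30]. A full tree of depth 30 already has more than two billion nodes, so deeper trees are unlikely to be practical in any case.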