Error in saving Random Forest Classifier model in Pyspark

rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("label")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, rf])
model = pipeline.fit(training)
model.save(sc, '<path_to_save>')
I am trying to save the model file using the above code, but I am getting an unexpected error:
TypeError: save() takes exactly 2 arguments (3 given)
I don't understand this error. I am passing only 2 arguments, but I still get it. Does anybody have an idea? What am I doing wrong here?

I don't know why this works, but removing the first argument, 'sc', fixed it for me.
model.save('<path_to_save>')
I am able to save the model file with this command.
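For completeness, here is a minimal sketch of the full save-and-reload flow under the ML (DataFrame-based) API; tokenizer, hashingTF, idf and training are the objects from the question, and the path is a placeholder:
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, rf])
model = pipeline.fit(training)

# ML persistence takes only a path; the SparkContext is picked up internally,
# which is why passing sc as a first argument triggers the TypeError.
model.save("<path_to_save>")
# or, to allow overwriting an existing directory:
# model.write().overwrite().save("<path_to_save>")

# Load it back later as a PipelineModel:
reloaded_model = PipelineModel.load("<path_to_save>")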

Related

Gatling how to create a list and pass it to the session?

I'd like to create a list of random numbers from -7 to 7 and then iterate through them. I am running the following to create the list and save it in the session:
.exec(session => {session.set("randomDays", Random.shuffle(-7 to 7 toList))})
But then when I try to iterate through the list with:
.foreach("randomDays", "counter")
I am getting the following error message:
" Condition evaluation crashed with message 'Can't cast value randomDays of type class java.lang.String into interface scala.collection.Seq'"
When I look at the session values I see what looks like a list of random values, as expected (see attached screenshot). What am I getting wrong here?
If anyone is interested, what I ended up doing was creating a randomDays val in my test class and fetching a different day using a repeat counter.
.repeat(14, "counter") {
  exec(session => { session.set("randomDay", randomDays(session("counter").as[Int])) })
}
I would still be very happy to understand what I got wrong with my initial attempt :)

Keras infinite loop

The code reads my images from Colab folders, then splits them into a training set and a validation set using a generator. I used an existing pretrained model, DenseNet201, to train on them. However, I am not sure why the generator remains caught in an infinite loop and the loop that generates the validation data never executes. Does anyone know how to circumvent this?
import tensorflow as tf

IMAGE_SIZE = 224
BATCH_SIZE = 64
# IMG_SHAPE was not defined in the posted snippet; assuming a standard RGB input shape
IMG_SHAPE = (IMAGE_SIZE, IMAGE_SIZE, 3)

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)

train_generator = datagen.flow_from_directory(
    base_dir,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
    subset='training')

val_generator = datagen.flow_from_directory(
    base_dir,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
    subset='validation')

base_model = tf.keras.applications.DenseNet201(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_generator,
                    epochs=2,
                    steps_per_epoch=100,
                    validation_data=val_generator)
In the line:
history = model.fit(train_generator,
                    epochs=2,
                    steps_per_epoch=100,
                    validation_data=val_generator)
change steps_per_epoch=100 to steps_per_epoch=(len(train_generator)//BATCH_SIZE)
It finally worked!
!pip uninstall tensorflow
!pip install tensorflow==2.1.0
This issue arises because your validation generator is stuck in an infinite loop, unable to exit. While the training data generator exits because of the steps_per_epoch=100 argument you provided, you haven't specified how many times the validation generator must be called before your validation loss is calculated. There is a similar argument that fixes this, called validation_steps:
history = model.fit(train_generator,
                    epochs=2,
                    steps_per_epoch=100,
                    validation_data=val_generator,
                    validation_steps=50)
This way your validation loss is calculated from the data your validation generator returns over 50 calls, and it won't get stuck in an infinite loop.
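If you'd rather not hardcode the 50, a common variant (assuming the train_generator and val_generator created above, which are Keras DirectoryIterator objects) derives both step counts from the generators, since len() on a DirectoryIterator is the number of batches it yields per epoch:
# Sketch: derive the step counts from the generators instead of hardcoding them.
steps_per_epoch = len(train_generator)      # number of training batches per epoch
validation_steps = len(val_generator)       # number of validation batches per evaluation

history = model.fit(train_generator,
                    epochs=2,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=val_generator,
                    validation_steps=validation_steps)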

Why does Spark fail with "value write is not a member of org.apache.spark.sql.DataFrameReader [error]"?

I have two almost identical write-to-DB Scala statements; however, one throws an error and the other doesn't, and I don't understand how to fix it. Any ideas?
This statement works:
df_pm_visits_by_site_trn.write.format("jdbc").option("url", db_url_2).option("dbtable", "pm_visits_by_site_trn").option("user", db_user).option("password", db_pwd).option("truncate","true").mode("overwrite").save()
This one doesn't work and throws a compile error:
df_trsnss.write.format("jdbc").option("url", db_url_2).option("dbtable", "df_trsnss").option("user", db_user).option("password", db_pwd).option("truncate","true").mode("overwrite").save()
_dev.scala:464: value write is not a member of org.apache.spark.sql.DataFrameReader
[error]     df_trsnss.write.format("jdbc").option("url", db_url_2).option("dbtable", "trsnss").option("user", db_user).option("password", db_pwd).option("truncate","true").mode("overwrite").save()
If I delete my second write statement or simply comment it out, the whole thing compiles with no errors.
Based on the error message, df_trsnss is a DataFrameReader, not a DataFrame. You likely forgot to call load.
val df_trsnss = spark.read.format("csv")
instead of
val df_trsnss = spark.read.format("csv").load("...")

PySpark DataFrame of model statistics cannot be collected or converted to RDD

I'm encountering confusing PySpark errors when trying to extract the threshold value associated with the highest recall value from the DataFrame returned by recallByThreshold. Interestingly, these errors only occur when running the application in cluster mode.
training, testing = data.randomSplit([0.7, 0.3], seed=100)
train = training.coalesce(200)
test = testing.coalesce(100)
train.persist()
test.persist()
model = LogisticRegression(labelCol='label',
                           featuresCol='features',
                           weightCol='importance',
                           maxIter=30,
                           regParam=0.3,
                           elasticNetParam=0.2)
trained_model = model.fit(train)
threshold = trained_model.summary.recallByThreshold.rdd.max(key=lambda x: x["recall"])["threshold"]
The final line of code produces AttributeError: 'NoneType' object has no attribute 'setCallSite'. Broken down further, when I attempt trained_model.summary.recallByThreshold.rdd I get a different error: *** AttributeError: 'NoneType' object has no attribute 'sc'.
This problem appears related to Spark (PySpark) having difficulty calling statistics methods on worker nodes, but in this case I'm not able to collect the DataFrame at all (it produces the same error). I launched my application from IPython on the master node, so shouldn't the SparkContext be available through the SparkSession (I'm using Spark version 2.1.0)?
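For reference, the same lookup can also be expressed purely in the DataFrame API; a sketch, assuming trained_model is the fitted model from the code above (not verified against the cluster-mode error described here):
# Sketch: equivalent query using the DataFrame API instead of .rdd.max.
# recallByThreshold is a DataFrame with 'threshold' and 'recall' columns.
from pyspark.sql import functions as F

recall_df = trained_model.summary.recallByThreshold
best_row = recall_df.orderBy(F.col("recall").desc()).first()
threshold = best_row["threshold"]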

Pyspark: ValueError

I have a dictionary of PySpark RDDs and am trying to convert them to data frames, save them as variables, and then join them. When I attempt to convert one of my RDDs to a data frame I get the following error:
File "./spark-1.3.1/python/pyspark/sql/types.py",
line 986, in _verify_type
"length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (52) does not match with length of fields (7)
Does anyone know what this exactly means or can help me with a work around?
I agree that we need to see more code (obfuscated data is fine).
You seem to be using Spark SQL (the sql types); mapped onto what, HDFS/text?
From the error, it would appear that the schema you created is incorrect, leading to an error when the DataFrame is created.
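To illustrate the kind of mismatch the error describes, here is a contrived sketch (made-up column names; sc and sqlContext assumed from a Spark 1.x shell): the schema declares a fixed number of fields, so every row in the RDD must carry exactly that many values.
# Contrived example of the length mismatch behind the ValueError.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True),
])

rows_ok = sc.parallelize([(1, "a", 10), (2, "b", 20)])       # 3 values per row, matches the schema
rows_bad = sc.parallelize([(1, "a", 10, "extra", "cols")])   # 5 values per row, does not match

df = sqlContext.createDataFrame(rows_ok, schema)             # fine
# sqlContext.createDataFrame(rows_bad, schema).collect()
# -> ValueError: Length of object (5) does not match with length of fields (3)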
This was due to passing the wrong RDD, sorry everyone. The RDD I was passing didn't fit the code I was using.