pyspark error returned when checking if a df column contains values in broadcast.value.values()

I have the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Small lookup dict, broadcast to all executors
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")]
columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

def state_convert(code):
    return broadcastStates.value[code]

result = df.rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).toDF(columns)
result.show(truncate=False)

dfFilter = result.where(result.state.isin(broadcastStates.value.values()))
dfFilter.display()
When I run the above code in a notebook, it fails at the line below:
dfFilter = result.where(result.state.isin(broadcastStates.value.values()))
The error is:
Unsupported literal type class java.util.HashMap {FL=Florida, NY=New York, CA=California}
I then checked the type of broadcastStates.value.values() with:
print(type(broadcastStates.value.values()))
The returned type is <class 'dict_values'>.
How should I convert the dict_values to a list that can be passed to the isin() function?
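A minimal fix, sketched below: pyspark's isin() accepts either varargs or a single list (or set) of literals, but a dict_values view is neither, so the whole view is shipped to the JVM as one HashMap literal, which produces the error above. Wrapping the view in list() (or unpacking it as varargs) resolves it:
# Convert the dict_values view to a plain Python list before calling isin()
state_names = list(broadcastStates.value.values())
dfFilter = result.where(result.state.isin(state_names))
# Equivalently, unpack as varargs: result.state.isin(*broadcastStates.value.values())
dfFilter.show(truncate=False)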

Related

Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification, but I am getting an error saying the features in my training set should be of type numeric instead of type struct.
Here is the minimal reproducible example that I tried:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml._
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
val df8 = Seq(
  ("2022-08-22 10:00:00",417.7,419.97,419.97,417.31,"nothing"),
  ("2022-08-22 11:30:00",417.35,417.33,417.46,416.77,"buy"),
  ("2022-08-22 13:00:00",417.55,417.68,418.04,417.48,"sell"),
  ("2022-08-22 14:00:00",417.22,417.8,421.13,416.83,"sell")
)
val df77 = spark.createDataset(df8).toDF("30mins_date","30mins_close","30mins_open","30mins_high","30mins_low", "signal")
val assembler_features = new VectorAssembler()
  .setInputCols(Array("30mins_close","30mins_open","30mins_high","30mins_low"))
  .setOutputCol("features")
val output2 = assembler_features.transform(df77)
val indexer = new StringIndexer()
  .setInputCol("signal")
  .setOutputCol("signalIndex")
val indexed = indexer.fit(output2).transform(output2)
val assembler_label = new VectorAssembler()
  .setInputCols(Array("signalIndex"))
  .setOutputCol("signalIndexV")
val output = assembler_label.transform(indexed)
val dt = new DecisionTreeClassifier()
  .setLabelCol("features")
  .setFeaturesCol("signalIndexV")
val Array(trainingData, testData) = output.select("features", "signalIndexV").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)
Output error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:78)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:73)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema(Classifier.scala:43)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema$(Classifier.scala:39)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:51)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:34)
at org.apache.spark.ml.classification.DecisionTreeClassifier.org$apache$spark$ml$tree$DecisionTreeClassifierParams$$super$validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema(treeParams.scala:245)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema$(treeParams.scala:241)
at org.apache.spark.ml.classification.DecisionTreeClassifier.validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
... 61 elided
I tried the above code in a spark-shell environment:
spark v 3.3.1
scala v 2.12.15
Here is what trainingData looks like:
+-----------------------------+------------+
|features |signalIndexV|
+-----------------------------+------------+
|[417.7,419.97,419.97,417.31] |[2.0] |
|[417.35,417.33,417.46,416.77]|[1.0] |
|[417.55,417.68,418.04,417.48]|[0.0] |
|[417.22,417.8,421.13,416.83] |[0.0] |
+-----------------------------+------------+
So what did I do wrong? How can I convert the features column into a numeric type?
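It looks like the label and feature columns are swapped: setLabelCol is given the assembled feature vector ("features"), while setFeaturesCol is given the vector-assembled label ("signalIndexV"). Tree classifiers expect a plain numeric label (the StringIndexer output) and a vector features column, so the second VectorAssembler is unnecessary. A minimal sketch of the intended wiring, written in PySpark for illustration (the Scala setter calls are analogous) and assuming a DataFrame named output with the same columns as above:
# Hypothetical PySpark sketch: numeric label from StringIndexer, vector
# features from VectorAssembler -- not the other way around
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="signalIndex", featuresCol="features")
train, test = output.select("features", "signalIndex").randomSplit([0.7, 0.3])
model = dt.fit(train)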

custom pipeline in pyspark

I would like to create a custom pipeline in pyspark for label encoding. It works well except that I get an error when saving the pipeline object:
from typing import Iterable

from pyspark.sql import DataFrame, functions as f
from pyspark.sql.types import IntegerType, StringType
from pyspark.ml import Transformer

class CatTransform(Transformer):
    """
    A custom Transformer which maps all categorical features to numeric features
    """
    def __init__(self, cat_feats: Iterable[str]):
        super().__init__()
        self.cat_feats = cat_feats
        self.dict_ = {}

    def _transform(self, df: DataFrame) -> DataFrame:
        for feat in self.cat_feats:
            if not self.dict_.get(feat):
                this_dict = df.groupby(feat).count().toPandas().set_index(feat).to_dict()["count"]
                self.dict_[feat] = this_dict
            this_dict = self.dict_[feat]
            this_udf = f.udf(lambda x: this_dict[x] if x in this_dict.keys() else -1, IntegerType())
            df = df.withColumn(feat, this_udf(feat))
        return df
Then I got this error:
ValueError: ('Pipeline write will fail on this pipeline because stage %s of type %s is not MLWritable', 'CatTransform_d08e899887b0', <class 'CatTransform'>)
Can someone help me, please?
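One common way past this error, sketched below under the assumption that Spark's default param-based persistence is acceptable: mix DefaultParamsReadable and DefaultParamsWritable (from pyspark.ml.util, available since Spark 2.3) into the transformer. Note that this mechanism only serializes Param values, so plain instance attributes such as cat_feats and dict_ would need to be redefined as Params, or persisted separately, to survive a save/load round trip:
# Sketch: make the custom transformer MLWritable via Spark's default
# param-based persistence; only Param values are written, not plain attributes
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class CatTransform(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    # same __init__ and _transform as above
    ...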

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I started using AWS Glue to read data using the Data Catalog and GlueContext and transform it as per requirements.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line throws the error
While executing the above code I got the following error:
Error: Syntax Error: error: value toDf is not a member of com.amazonaws.services.glue.DynamicFrame
val revisedDF = stGBDyf.toDf()
one error found
I followed this example to convert DynamicFrame to Spark dataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with a capital F in toDF:
val revisedDF = stGBDyf.toDF()

Cannot access Spark dataframe methods

In Zeppelin I am using a dataframe created in another paragraph. I display the type of my df variable and get:
res35: String = DataFrame
suggesting it is a dataframe. But when I try to use select on the df variable I get an error:
<console>:62: error: value select is not a member of Object
Do I have to convert Object to Dataframe or something? Can someone tell me what I am missing? TIA!
My code is:
val df = z.get("wds")
df.getClass.getSimpleName
df.select(explode($"filtered").as("value")).groupBy("value").count.show
This gives the following (edited) output:
df: Object = [racist: boolean, contributors:
string, coordinates: string, ...n: Int = 20
res35: String = DataFrame
<console>:62: error: value select is not a member of Object
df.select(explode($"filtered").as("value")).groupBy("value").count.show
It seems I was missing a cast: z.get returns an Object, so it has to be cast back to a DataFrame with
.asInstanceOf[DataFrame]
i.e.
import org.apache.spark.sql.DataFrame
val df = z.get("wds").asInstanceOf[DataFrame]

Scala java.lang.String cannot be cast to java.lang.Double error when converting double type dataframe to LabeledPoint in Spark

I have a dataset of 2002 variables. All variables are numeric. I first read the dataset into Spark 1.5.0 and created a Double type dataframe following the instructions here. Then I converted the dataframe to LabeledPoint following the instructions here and here. However, when I tried to print out sample rows of the generated LabeledPoint, I got the "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double" error. Sorry for the long code, but I hope it will help with debugging.
Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!
Below is the Scala code I used:
// Read in dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in header file to get column names. Store in an Array.
import scala.io.Source
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
  arrName = line.split('\t').map(_.trim).toArray
}
// Create dataframe using programmatically specifying the schema method
// Encode schema in a string
var schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD and convert String type to Double type
val arrVar = sc.broadcast(0 to 2001 toArray)
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF: org.apache.spark.sql.DataFrame): org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint
  val targetInd = dataDF.columns.indexOf("target")
  val ignored = List("target")
  val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
  val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
    Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
  return dataLP
}
// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows of the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Update:
Thanks a lot to David Griffin and zero323 for their comments below. David is correct. I found that the exception is indeed caused by null values in the data. I replaced the following original code:
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}
with this one, which imputes null values to 0.0, and the problem is gone:
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  // Fall back to 0.0 when a value cannot be parsed as a Double (e.g. null or empty)
  val rowRDD = rdd.map(_.split("\t")).map(_.map({y => try {y.toDouble} catch {case _: Throwable => 0.0}})).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}