Kmeans, failed to execute user defined function($anonfun$4:...) - scala

I am trying to apply the k-means algorithm.
Code
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val dfJoin_products_items = df_products.join(df_items, "product_id")
dfJoin_products_items.createGlobalTempView("products_items")
val weightFreight = spark.sql("SELECT cast(product_weight_g as double) weight, cast(freight_value as double) freight FROM global_temp.products_items")
case class Rows(weight:Double, freight:Double)
val rows = weightFreight.as[Rows]
val assembler = new VectorAssembler().setInputCols(Array("weight", "freight")).setOutputCol("features")
val data = assembler.transform(rows)
val kmeans = new KMeans().setK(4)
val model = kmeans.fit(data)
Values
dfJoin_products_items
scala> dfJoin_products_items.printSchema
root
|-- product_id: string (nullable = true)
|-- product_category_name: string (nullable = true)
|-- product_name_lenght: string (nullable = true)
|-- product_description_lenght: string (nullable = true)
|-- product_photos_qty: string (nullable = true)
|-- product_weight_g: string (nullable = true)
|-- product_length_cm: string (nullable = true)
|-- product_height_cm: string (nullable = true)
|-- product_width_cm: string (nullable = true)
|-- order_id: string (nullable = true)
|-- order_item_id: string (nullable = true)
|-- seller_id: string (nullable = true)
|-- shipping_limit_date: string (nullable = true)
|-- price: string (nullable = true)
|-- freight_value: string (nullable = true)
weightFreight
scala> weightFreight.printSchema
root
|-- weight: double (nullable = true)
|-- freight: double (nullable = true)
Error
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_1 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 WARN BlockManager:66 - Block rdd_126_1 could not be removed as it was not found on disk or in memory
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_2 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 ERROR Executor:91 - Exception in task 1.0 in stage 16.0 (TID 23)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
I don't understand this error. Can someone explain it to me, please?
Thanks a lot!
UPDATE 1: Full stacktrace
The stacktrace is huge, so you can find it here: https://pastebin.com/PhmZPtDk
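One thing worth ruling out here (my assumption, not something the excerpt above proves) is null values in weight or freight: VectorAssembler's assembling UDF throws when it hits a null, and that surfaces as exactly this kind of "Failed to execute user defined function" error. A minimal guard, assuming nulls are indeed the culprit:
// Sketch: drop rows with null weight/freight before assembling (only helps if nulls are the cause)
val cleaned = weightFreight.na.drop(Seq("weight", "freight"))
val assembler = new VectorAssembler()
  .setInputCols(Array("weight", "freight"))
  .setOutputCol("features")
val data = assembler.transform(cleaned)
val model = new KMeans().setK(4).fit(data)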

Related

Spark - Scala Remove special character from the beginning and end from columns in a dataframe

I have a dataframe like this,
scala> df.printSchema
root
|-- Protocol ID: decimal(12,0) (nullable = true)
|-- Protocol #: string (nullable = true)
|-- Eudract #: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- # Non-US Count: decimal(7,0) (nullable = true)
|-- # US Count: decimal(7,0) (nullable = true)
Here the column names have spaces and special characters in them. I wanted to replace those with underscores, like this:
scala> newdf.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol: string (nullable = true)
|-- Eudract: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- Non-US_Count: decimal(7,0) (nullable = true)
|-- US_Count: decimal(7,0) (nullable = true)
So I used the below steps,
val df=spark.read.format("parquet").load("<s3 path>")
val regex_string="""[+._(),!#$%&"*./:;<-> ]+"""
val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(df.columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}
resultDF.printSchema
But I am getting the df like this:
scala> resultDF.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol_: string (nullable = true)
|-- Eudract_: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- _Non-US_Count: decimal(7,0) (nullable = true)
|-- _US_Count: decimal(7,0) (nullable = true)
If the space or special character is at the beginning or end, I don't want the underscore.
In Python I can use:
starts_with = [i.replace("_","",1) if i.startswith("_") else i for i in df.columns]
[(i[::-1].replace("_","",1)[::-1]) if i.endswith("_") else i for i in starts_with]
As I am new to Scala, I am not sure how to fix this. Any help would be appreciated.
You can use the (^_|_$) regex to replace a leading or trailing _ with an empty string.
val regex_string = """[+._(),!#$%&"*./:;<-> ]+"""
val col = regex_string.r.replaceAllIn("#Non-US Count##", "_")
println(col)
println("(^_|_$)".r.replaceAllIn(col, ""))
// _Non-US_Count_
// Non-US_Count
val replacingColumns = df.columns.map(s=>"(^_|_$)".r.replaceAllIn(regex_string.r.replaceAllIn(s, "_"),""))
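As a usage sketch (assuming the cleaned names end up unique), the names in replacingColumns can then be applied in a single pass with toDF instead of the foldLeft:
// Rename every column at once (illustrative; assumes no duplicate names after cleaning)
val renamedDF = df.toDF(replacingColumns: _*)
renamedDF.printSchema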

spark scala convert a nested dataframe to nested dataset

I have a nested dataframe "inputFlowRecordsAgg" which has the following schema:
root
|-- FlowI.key: string (nullable = true)
|-- FlowS.minFlowTime: long (nullable = true)
|-- FlowS.maxFlowTime: long (nullable = true)
|-- FlowS.flowStartedCount: long (nullable = true)
|-- FlowI.DestPort: integer (nullable = true)
|-- FlowI.SrcIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.DestIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.L4Protocol: byte (nullable = true)
|-- FlowI.Direction: byte (nullable = true)
|-- FlowI.Status: byte (nullable = true)
|-- FlowI.Mac: string (nullable = true)
I wanted to convert it into a nested dataset of the following case classes:
case class InputFlowV1(val FlowI: FlowI,
                       val FlowS: FlowS)
case class FlowI(val Mac: String,
                 val SrcIP: IPAddress,
                 val DestIP: IPAddress,
                 val DestPort: Int,
                 val L4Protocol: Byte,
                 val Direction: Byte,
                 val Status: Byte,
                 var key: String = "")
case class FlowS(var minFlowTime: Long,
                 var maxFlowTime: Long,
                 var flowStartedCount: Long)
but when I try converting it using
inputFlowRecordsAgg.as[InputFlowV1]
org.apache.spark.sql.AnalysisException: cannot resolve '`FlowI`' given input columns: [FlowI.DestIP,FlowI.Direction, FlowI.key, FlowS.maxFlowTime, FlowI.SrcIP, FlowS.flowStartedCount, FlowI.L4Protocol, FlowI.Mac, FlowI.DestPort, FlowS.minFlowTime, FlowI.Status];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
One comment asked me for the full code; here it is:
def getReducedFlowR(inputFlowRecords: Dataset[InputFlowV1],
                    @transient spark: SparkSession): Dataset[InputFlowV1] = {
  val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "FlowI.key")
    .agg(min("FlowS.minFlowTime") as "FlowS.minFlowTime", max("FlowS.maxFlowTime") as "FlowS.maxFlowTime",
      sum("FlowS.flowStartedCount") as "FlowS.flowStartedCount",
      first("FlowI.Mac") as "FlowI.Mac",
      first("FlowI.SrcIP") as "FlowI.SrcIP", first("FlowI.DestIP") as "FlowI.DestIP",
      first("FlowI.DestPort") as "FlowI.DestPort",
      first("FlowI.L4Protocol") as "FlowI.L4Protocol",
      first("FlowI.Direction") as "FlowI.Direction", first("FlowI.Status") as "FlowI.Status")
  inputFlowRecordsAgg.printSchema()
  inputFlowRecordsAgg.as[InputFlowV1]
}
The reason is that your case class schema does not match the actual data schema. Please check the case class schema below and make the data schema match it; then it will work.
Your case class schema is:
scala> df.printSchema
root
|-- FlowI: struct (nullable = true)
| |-- Mac: string (nullable = true)
| |-- SrcIP: string (nullable = true)
| |-- DestIP: string (nullable = true)
| |-- DestPort: integer (nullable = false)
| |-- L4Protocol: byte (nullable = false)
| |-- Direction: byte (nullable = false)
| |-- Status: byte (nullable = false)
| |-- key: string (nullable = true)
|-- FlowS: struct (nullable = true)
| |-- minFlowTime: long (nullable = false)
| |-- maxFlowTime: long (nullable = false)
| |-- flowStartedCount: long (nullable = false)
Try changing your code as below; it should work now.
val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "key")
  .agg(min("FlowS.minFlowTime") as "minFlowTime", max("FlowS.maxFlowTime") as "maxFlowTime",
    sum("FlowS.flowStartedCount") as "flowStartedCount",
    first("FlowI.Mac") as "Mac",
    first("FlowI.SrcIP") as "SrcIP", first("FlowI.DestIP") as "DestIP",
    first("FlowI.DestPort") as "DestPort",
    first("FlowI.L4Protocol") as "L4Protocol",
    first("FlowI.Direction") as "Direction", first("FlowI.Status") as "Status")
  // Rebuild the nested structs so the columns match InputFlowV1; adjust to your columns
  .select(struct($"key", $"Mac", $"SrcIP", $"DestIP", $"DestPort", $"L4Protocol", $"Direction", $"Status").as("FlowI"),
          struct($"flowStartedCount", $"minFlowTime", $"maxFlowTime").as("FlowS"))
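With the columns reshaped into the two structs, the typed conversion from the question should then resolve. A quick follow-up check (assuming spark.implicits._ is in scope and the field types match the case classes):
// The dotted top-level names are gone, so the encoder can now find FlowI and FlowS
val typed: Dataset[InputFlowV1] = inputFlowRecordsAgg.as[InputFlowV1]
typed.printSchema()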

Cannot save spark dataframe as CSV

I am attempting to save a Spark DataFrame as a CSV. I have looked up numerous different posts and guides and am, for some reason, still getting an issue. The code I am using to do this is
endframe.coalesce(1).
write.
mode("append").
csv("file:///home/X/Code/output/output.csv")
I have also tried including .format("com.databricks.spark.csv"), as well as changing .csv() to .save(), and strangely none of these work. The most unusual part is that running this code creates an empty folder called "output.csv" in the output folder.
The error message that Spark gives is:
Job aborted due to stage failure:
Task 0 in stage 281.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 281.0 (TID 22683, X.x.local, executor 4): org.apache.spark.SparkException:
Task failed while writing rows.
I have verified that the dataframe schema is properly initialized. However, when I use the .format, I do not import com.databricks.spark.csv, but I do not think that is the problem. Any advice on this would be appreciated.
The schema is as follows:
|-- kwh: double (nullable = true)
|-- qh_end: double (nullable = true)
|-- cdh70: double (nullable = true)
|-- norm_hbu: double (nullable = true)
|-- precool_counterprecoolevent_id6: double (nullable = true)
|-- precool_counterprecoolevent_id7: double (nullable = true)
|-- precool_counterprecoolevent_id8: double (nullable = true)
|-- precool_counterprecoolevent_id9: double (nullable = true)
|-- event_id10event_counterevent: double (nullable = true)
|-- event_id2event_counterevent: double (nullable = true)
|-- event_id3event_counterevent: double (nullable = true)
|-- event_id4event_counterevent: double (nullable = true)
|-- event_id5event_counterevent: double (nullable = true)
|-- event_id6event_counterevent: double (nullable = true)
|-- event_id7event_counterevent: double (nullable = true)
|-- event_id8event_counterevent: double (nullable = true)
|-- event_id9event_counterevent: double (nullable = true)
|-- event_idTestevent_counterevent: double (nullable = true)
|-- event_id10snapback_countersnapback: double (nullable = true)
|-- event_id2snapback_countersnapback: double (nullable = true)
|-- event_id3snapback_countersnapback: double (nullable = true)
|-- event_id4snapback_countersnapback: double (nullable = true)
|-- event_id5snapback_countersnapback: double (nullable = true)
|-- event_id6snapback_countersnapback: double (nullable = true)
|-- event_id7snapback_countersnapback: double (nullable = true)
|-- event_id8snapback_countersnapback: double (nullable = true)
|-- event_id9snapback_countersnapback: double (nullable = true)
|-- event_idTestsnapback_countersnapback: double (nullable = true)
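One side note on the empty "output.csv" folder: that part is expected, because Spark's CSV writer treats the path as a directory and commits part files inside it; the folder stays empty here only because the tasks fail before anything is committed. For reference, a minimal version of the same write (an illustrative sketch, not a fix for the task failure, and it assumes the file:/// path is reachable from every executor):
// Illustrative sketch: the target path is a directory; Spark writes part-xxxxx files inside it
endframe
  .coalesce(1)
  .write
  .mode("overwrite")                // "append" in the question; either way Spark writes part files
  .option("header", "true")         // include column names in the output
  .csv("file:///home/X/Code/output/output_dir")   // hypothetical directory name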

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Datset.select don't work in my case)

My question is whether there are any approaches to updating the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions but not Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame is not updated automatically. The output of calling Dataset.printSchema on the new DataFrame is still the original one:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to updating the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

// Encoder carrying the new schema for the mapped rows
val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
} (encoderForNewDF)

newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html
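The target schema in the question also has a nested struct (column25); the same approach covers that, since StructType fields nest. A hypothetical sketch using a few of the question's field names (the types are my assumption):
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical: declare the nested column25 struct directly in the encoder schema
val nestedEncoder = RowEncoder(StructType(Array(
  StructField("column21", StringType),
  StructField("column22", LongType),
  StructField("column25", StructType(Array(
    StructField("column251", StringType),
    StructField("column252", StringType)
  )))
)))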

scala.MatchError during Spark 2.0.2 DataFrame union

I'm attempting to merge 2 DataFrames, one with old data and one with new data, using the union function. This used to work until I tried to dynamically add a new field to the old DataFrame because my schema is evolving.
This means that my old data will be missing a field and the new data will have it. In order for the union to work, I'm adding the field using the evolveSchema function below.
This resulted in the output/exception I pasted below the code, including my debug prints.
The column ordering and making fields nullable are attempts to fix this issue by making the DataFrames as identical as possible, but it persists. The schema prints show that they are both seemingly identical after these manipulations.
Any help to further debug this would be appreciated.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.{DataFrame, SQLContext}
object Merger {

  def apply(sqlContext: SQLContext, oldDataSet: Option[DataFrame], newEnrichments: Option[DataFrame]): Option[DataFrame] = {
    (oldDataSet, newEnrichments) match {
      case (None, None) => None
      case (None, _) => newEnrichments
      case (Some(existing), None) => Some(existing)
      case (Some(existing), Some(news)) => Some {
        val evolvedOldDataSet = evolveSchema(existing)
        println("EVOLVED OLD SCHEMA FIELD NAMES:" + evolvedOldDataSet.schema.fieldNames.mkString(","))
        println("NEW SCHEMA FIELD NAMES:" + news.schema.fieldNames.mkString(","))
        println("EVOLVED OLD SCHEMA FIELD TYPES:" + evolvedOldDataSet.schema.fields.map(_.dataType).mkString(","))
        println("NEW SCHEMA FIELD TYPES:" + news.schema.fields.map(_.dataType).mkString(","))
        println("OLD SCHEMA")
        existing.printSchema()
        println("PRINT EVOLVED OLD SCHEMA")
        evolvedOldDataSet.printSchema()
        println("PRINT NEW SCHEMA")
        news.printSchema()
        val nullableEvolvedOldDataSet = setNullableTrue(evolvedOldDataSet)
        val nullableNews = setNullableTrue(news)
        println("NULLABLE EVOLVED OLD")
        nullableEvolvedOldDataSet.printSchema()
        println("NULLABLE NEW")
        nullableNews.printSchema()
        val unionData = nullableEvolvedOldDataSet.union(nullableNews)
        val result = unionData.sort(
          unionData("timestamp").desc
        ).dropDuplicates(
          Seq("id")
        )
        result.cache()
      }
    }
  }

  def GENRE_FIELD: String = "station_genre"

  // Handle missing fields in old data
  def evolveSchema(oldDataSet: DataFrame): DataFrame = {
    if (!oldDataSet.schema.fieldNames.contains(GENRE_FIELD)) {
      val columnAdded = oldDataSet.withColumn(GENRE_FIELD, lit("N/A"))
      // Columns should be in the same order for union
      val columnNamesInOrder = Seq("id", "station_id", "station_name", "station_timezone", "station_genre", "publisher_id", "publisher_name", "group_id", "group_name", "timestamp")
      val reorderedColumns = columnAdded.select(columnNamesInOrder.head, columnNamesInOrder.tail: _*)
      reorderedColumns
    } else {
      oldDataSet
    }
  }

  def setNullableTrue(df: DataFrame): DataFrame = {
    // get schema
    val schema = df.schema
    // create new schema with all fields nullable
    val newSchema = StructType(schema.map {
      case StructField(columnName, dataType, _, metaData) => StructField(columnName, dataType, nullable = true, metaData)
    })
    // apply new schema
    df.sqlContext.createDataFrame(df.rdd, newSchema)
  }
}
EVOLVED OLD SCHEMA FIELD NAMES: id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
NEW SCHEMA FIELD NAMES: id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
EVOLVED OLD SCHEMA FIELD TYPES: StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
NEW SCHEMA FIELD TYPES: StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT EVOLVED OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = false)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT NEW SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE EVOLVED OLD
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE NEW
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
2017-01-18 15:59:32 ERROR org.apache.spark.internal.Logging$class Executor:91 - Exception in task 1.0 in stage 2.0 (TID 4)
scala.MatchError: false (of class java.lang.Boolean) at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
at
...
com.companystuff.meta.uploader.Merger$.apply(Merger.scala:49)
...
Caused by: scala.MatchError: false (of class java.lang.Boolean) at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
...
It's because of the ordering of the actual data, even though the schemas are the same.
So simply select all the required columns in the same order on both sides, then do the union.
Something like this:
import org.apache.spark.sql.functions.col

val columns: Seq[String] = ....
val df = oldDf.select(columns.map(col): _*).union(newDf.select(columns.map(col): _*))
Hope it helps you
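Applied to the code in the question (using the same col import), the fix would look roughly like the sketch below. The column list is the question's columnNamesInOrder, and the variable names evolvedOldDataSet and news come from the question; this is an illustration, not tested code.
// Force an identical column order on both sides before the union
val columns = Seq("id", "station_id", "station_name", "station_timezone", "station_genre",
  "publisher_id", "publisher_name", "group_id", "group_name", "timestamp")

val merged = evolvedOldDataSet.select(columns.map(col): _*)
  .union(news.select(columns.map(col): _*))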