How to load a csv directly into a Spark Dataset? - scala

I have a csv file [1] which I want to load directly into a Dataset. The problem is that I always get errors like
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, specifically for the phrases field (check the case class [2]), I get
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String then everything works fine but this is not what I want. Is there a simple way to do it [3]?
References
[1] An example row
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet is as follows
import spark.implicits._
val INPUT_TF = "<SOME_URI>/my_file.csv"
final case class TFFormat (
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)

val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  .as[TFFormat]
ds.take(1).map(println)
[3] I have found ways to do it by first defining columns at the DataFrame level and then converting things to a Dataset (like here or here or here), but I am almost sure this is not the way things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I don't have a clue how.

TL;DR With CSV input, transforming with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format that is more expressive (Parquet or even JSON).
In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to do that is to provide the schema argument for the CSV reader:
val schema: StructType = ???
val ds = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(path)
  .as[T]
where schema could be inferred by reflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
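For instance, with a hypothetical case class that contains only atomic fields (a simplified sketch, not the asker's TFFormat), the reflected schema comes back ready to pass to .schema(...):
// Hypothetical simplified record with atomic fields only
final case class FlatFormat(doc_id: String, brand: String, probability: Float)

val flatSchema = ScalaReflection.schemaFor[FlatFormat].dataType.asInstanceOf[StructType]
// StructType with doc_id: string, brand: string, probability: float (non-nullable)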
Unfortunately it won't work with your data and class because the CSV reader doesn't support ArrayType (it would work for atomic types like FloatType), so you have to do it the hard way. A naive solution could be expressed as below:
import org.apache.spark.sql.functions._
val df: DataFrame = ??? // Raw data
df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases",
    split(regexp_replace($"phrases", "[\\['\\]]", ""), ","))
  .as[TFFormat]
but you may need something more sophisticated depending on the content of phrases.
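For example, a slightly more robust variant (a sketch reusing the raw df from above, assuming Spark 2.2+ where from_json accepts a DataType) turns the Python-style list literal into JSON and lets from_json do the parsing:
import org.apache.spark.sql.functions.{from_json, regexp_replace}
import org.apache.spark.sql.types.{ArrayType, StringType}

df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases",
    // ['a', 'b'] -> ["a", "b"] so the value parses as a JSON array of strings
    from_json(regexp_replace($"phrases", "'", "\""), ArrayType(StringType)))
  .as[TFFormat]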

Related

Unsupported operation exception from spark: Schema for type org.apache.spark.sql.types.DataType is not supported

Spark Streaming:
I am receiving a dataframe that consists of two columns. The first column is of string type and contains a JSON string, and the second column holds the schema for each value (first column).
Batch: 0
-------------------------------------------
+--------------------+--------------------+
| value| schema|
+--------------------+--------------------+
|{"event_time_abc...|`event_time_abc...|
+--------------------+--------------------+
The table is stored in the val input (non-mutable variable). I am using the DataType.fromDDL function to convert the string type to a JSON DataFrame in the following way:
val out= input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
where ddl is a predefined function, DataType.fromDDL(_: String): DataType, in Spark (Scala), but I have registered it so that I can use it on a whole column instead of a single string only. I have done it in the following way:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
and here is the final transformation on both columns, value and schema, of the input table:
val out = input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
However, I get an exception from the registration at this line:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
The error is:
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
If I use:
val out = input.select(from_json(col("value").cast("string"),DataType.fromDDL("`" + "event_time_human2"+"`" +" STRING")).alias("value"))
then it works, but as you can see I am only using a string (manually typed, coming from the schema column) inside the function DataType.fromDDL(_: String): DataType.
So how can I apply this function to a whole column without registration, or is there any other way to register the function?
EDIT: The from_json function's first argument requires a column, while the second argument requires a schema and not a column. Hence, I guess a manual approach is required to parse each value field with each schema field. After some investigation I found out that DataFrames do not support DataType.
Since a bounty has been set on this question, I would like to provide additional information regarding the data and schema. The schema is defined in DDL (string type) and can be parsed with the fromDDL function. The value is a simple JSON string that will be parsed with the schema we derive using the fromDDL function.
The basic idea is that each value has its own schema and needs to be parsed with the corresponding schema. A new column should be created where the result will be stored.
Data:
Here is one example of the data:
value = {"event_time_human2":"09:45:00 +0200 09/27/2021"}
schema = "`event_time_human2` STRING"
There is no need to convert to a correct time format; just a string will be fine.
This is in a streaming context, so not all approaches work.
Schemas are applied and validated before runtime, that is, before the Spark code is executed on the executors. Parsed schemas must be part of the execution plan, therefore schema parsing can't be executed dynamically as you intended. This is the reason that you see the exception
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
only for the UDF. Consequently, that implies that DataType.fromDDL should be used only inside driver code and not in runtime/executor code, which is the code within your UDF function. Inside the UDF function, Spark has already executed the transformation of the imported data, applying the schemas that you specified on the driver side. This is the reason you can't use DataType.fromDDL directly in your UDF; it is essentially useless there. All of the above means that inside UDF functions we can only use primitive Scala/Java types and some wrappers provided by the Spark API, e.g. WrappedArray.
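As a small illustration (a sketch, not the asker's code), a UDF returning a plain String is fine because Spark can derive a schema for it, while a UDF returning a DataType is exactly what triggers the exception above:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DataType

// OK: String is a supported return type for a Scala UDF
val normalizeDdl = udf((ddl: String) => ddl.trim)

// Not OK: Spark has no schema for org.apache.spark.sql.types.DataType,
// so defining this UDF throws the UnsupportedOperationException above
// val ddl = udf((s: String) => DataType.fromDDL(s))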
An alternative could be to collect all the schemas on the driver. Then create a map with the pair (schema, dataframe) for each schema.
Keep in mind that collecting data to the driver is an expensive operation, and it makes sense only if you have a reasonable number of unique schemas, i.e. a few thousand at most. Also, applying these schemas to each dataset needs to be done sequentially in the driver, which is quite expensive too, therefore it is important to realize that the suggested solution will only work efficiently if you have a limited number of unique schemas.
Up to this point, your code could look like this:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._
val df = Seq(
  ("""{"event_time_human2":"09:45:00 +0200 09/27/2021", "name":"Pinelopi"}""", "`event_time_human2` STRING, name STRING"),
  ("""{"first_name":"Pin", "last_name":"Halen"}""", "first_name STRING, last_name STRING"),
  ("""{"code":993, "full_name":"Sofia Loren"}""", "code INT, full_name STRING")
).toDF("value", "schema")
val schemaDf = df.select("schema").distinct()
val dfBySchema = schemaDf.collect().map { row =>
  val schemaValue = row.getString(0)
  val ddl = StructType.fromDDL(schemaValue)
  val filteredDf = df.where($"schema" === schemaValue)
    .withColumn("value", from_json($"value", ddl))

  (schemaValue, filteredDf)
}.toMap
// Map(
// `event_time_human2` STRING, name STRING -> [value: struct<event_time_human2: string, name: string>, schema: string],
// first_name STRING, last_name STRING -> [value: struct<first_name: string, last_name: string>, schema: string],
// code INT, full_name STRING -> [value: struct<code: int, full_name: string>, schema: string]
// )
Explanation: first we gather each unique schema with schemaDf.collect(). Then we iterate through the schemas and filter the initial df based on the current schema. We also use from_json to convert the current string value column to the specific schema.
Note that we can't have one common column with different data types; this is the reason we create a different df for each schema rather than one final df.
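As a quick usage sketch, one of the per-schema frames from the dfBySchema map above can then be looked up and its parsed struct flattened:
val firstLastDf = dfBySchema("first_name STRING, last_name STRING")

firstLastDf.select($"value.first_name", $"value.last_name").show()
// +----------+---------+
// |first_name|last_name|
// +----------+---------+
// |       Pin|    Halen|
// +----------+---------+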

Fetching a DataFrame into a Case Class instead results in reading a Tuple1

Given a case class :
case class ScoringSummary(MatchMethod: String = "",
                          TP: Double = 0,
                          FP: Double = 0,
                          Precision: Double = 0,
                          Recall: Double = 0,
                          F1: Double = 0)
We are writing summary records out as:
summaryDf.write.parquet(path)
Later we (attempt to) read the parquet file into a new dataframe:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary]
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
But this fails - for some reason spark believes the contents of the data were Tuple1 instead of ScoringSummary:
Try to map struct<MatchMethod:string,TP:double,FP:double,Precision:double,
Recall:double,F1:double> to Tuple1,
but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$
.org$apache$spark$sql$catalyst$analysis$Analyzer$
ResolveDeserializer$$fail(Analyzer.scala:2168)
What step / setting is missing/incorrect for the correct translation?
Use import spark.implicits._ instead of registering an Encoder
I had forgotten that it is required to import spark.implicits._. The incorrect approach was to add the Encoder, i.e. do not include the following line:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary] // Do NOT add this Encoder
Here is the error when removing the Encoder line
Error:(59, 113) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
val summaryDf = ParquetLoader.loadParquet(sparkEnv,res.state.dfs(ScoringSummaryTag).copy(df=None)).df.get.as[ScoringSummary]
Instead the following code should be added
import spark.implicits._
And then the same code works:
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
As an aside: explicit encoders are not required for case classes or primitive types, and the above is a case class. kryo becomes handy for complex object types.
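For instance, a Kryo encoder would only come into play for a type that is neither a primitive nor a Product, such as this hypothetical plain class (a sketch):
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical non-case class with no built-in encoder
class LegacyScore(val method: String, val f1: Double) extends Serializable

implicit val legacyScoreEncoder: Encoder[LegacyScore] = Encoders.kryo[LegacyScore]
val legacyDs = spark.createDataset(Seq(new LegacyScore("exact", 0.91)))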

Spark failing to deserialize a record when creating Dataset

I'm reading a large number of CSVs from S3 (everything under a key prefix) and creating a strongly-typed Dataset.
val events: DataFrame = cdcFs.getStream()
events
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
where TradeRecord is a case class that can normally be deserialized into by SparkSession implicits. However, for a certain batch, a record is failing to deserialize. Here's the error (stack trace omitted)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "deal")
- root class: "com.company.trades.TradeRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
deal being a field of TradeRecord that should never be null in source data (S3 objects), so it's not an Option.
Unfortunately the error message doesn't give me any clue as to what the CSV data looks like, or even which CSV file it's coming from. The batch consists of hundreds of files, so I need a way to narrow this down to at most a few files to investigate the issue.
As suggested by user10465355 you can load the data:
val events: DataFrame = ???
Filter
val mismatched = events.where($"deal".isNull)
Add file name
import org.apache.spark.sql.functions.input_file_name
val tagged = mismatched.withColumn("_file_name", input_file_name)
Optionally add the chunk and offset:
import org.apache.spark.sql.functions.{spark_partition_id, monotonically_increasing_id, shiftLeft, shiftRight}

df
  .withColumn("chunk", spark_partition_id())
  .withColumn(
    "offset",
    monotonically_increasing_id() - shiftLeft(shiftRight(monotonically_increasing_id(), 33), 33))
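A short follow-up (a sketch reusing the tagged frame from above) can then list just the distinct files that contain the offending rows:
// Distinct input files that contain a null deal
tagged.select("_file_name").distinct().show(false)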
Here's the solution I came up with (I'm using Spark Structured Streaming):
val stream = spark.readStream
  .format("csv")
  .schema(schema) // a StructType defined elsewhere
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corruptRecord")
  .load(path)

// If debugging, check for any corrupted CSVs
if (log.isDebugEnabled) { // org.apache.spark.internal.Logging trait
  import spark.implicits._
  stream
    .filter($"corruptRecord".isNotNull)
    .withColumn("input_file", input_file_name)
    .select($"input_file", $"corruptRecord")
    .writeStream
    .format("console")
    .option("truncate", false)
    .start()
}

val events = stream
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
Basically, if the Spark log level is set to Debug or lower, the DataFrame is checked for corrupted records, and any such records are printed out together with their file names. Eventually the program tries to cast this DataFrame to a strongly typed Dataset[TradeRecord] and fails.

Adding new column using existing one using Spark Scala

Hi, I want to add a new column using the existing columns in each row of a DataFrame. I am trying this in Spark Scala like this...
df is a DataFrame containing a variable number of columns, which can be decided only at run time.
// Added new column "docid"
val df_new = appContext.sparkSession.sqlContext.createDataFrame(df.rdd, df.schema.add("docid", DataTypes.StringType))
df_new.map(x => {
  import appContext.sparkSession.implicits._
  val allVals = (0 to x.size).map(x.get(_)).toSeq
  val values = allVals ++ allVals.mkString("_")
  Row.fromSeq(values)
})
But this is giving an error in Eclipse itself:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$7.
Please help.
concat_ws from the functions object can help.
This code adds the docid field
df = df.withColumn("docid", concat_ws("_", df.columns.map(df.col(_)):_*))
assuming all columns of df are strings.
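Here is a minimal runnable sketch of that one-liner, with a made-up two-column frame (assumes a SparkSession named spark; non-string columns could be cast to strings first):
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._

val demo = Seq(("a1", "b1"), ("a2", "b2")).toDF("colA", "colB")
val withDocId = demo.withColumn("docid", concat_ws("_", demo.columns.map(demo.col): _*))
withDocId.show()
// +----+----+-----+
// |colA|colB|docid|
// +----+----+-----+
// |  a1|  b1|a1_b1|
// |  a2|  b2|a2_b2|
// +----+----+-----+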

How to change the schema of a DataFrame (to fix the names of some nested fields)?

I have an issue where we load a JSON file into Spark, store it as Parquet, and then try to access the Parquet file from Impala; Impala complains about the column names as they contain characters which are illegal in SQL.
One of the "features" of the JSON files is that they don't have a predefined schema. I want Spark to create the schema, and then I have to modify the field names that have illegal characters.
My first thought was to use withColumnRenamed on the names of the fields in the DataFrame but this only works on top level fields I believe, so I could not use that as the Json contains nested data.
So I created the following code to recreate the DataFrames schema, going recursively through the structure. And then I use that new schema to recreate the DataFrame.
(Code updated with Jacek's suggested improvement of using the Scala copy constructor.)
def replaceIllegal(s: String): String = s.replace("-", "_").replace("&", "_").replace("\"", "_").replace("[", "_").replace("]", "_")
def removeIllegalCharsInColumnNames(schema: StructType): StructType = {
  StructType(schema.fields.map { field =>
    field.dataType match {
      case struct: StructType =>
        field.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(struct))
      case _ =>
        field.copy(name = replaceIllegal(field.name))
    }
  })
}
sparkSession.createDataFrame(df.rdd, removeIllegalCharsInColumnNames(df.schema))
This works. But is there a better / simpler way to achieve what I want to do?
And is there a better way to replace the existing schema on a DataFrame? The following code did not work:
df.select($"*".cast(removeIllegalCharsInColumnNames(df.schema)))
It gives this error:
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'cast'
I think the best bet would be to convert the Dataset (before you save as a parquet file) to an RDD and use your custom schema to describe the structure as you want.
val targetSchema: StructType = ...
val fromJson: DataFrame = ...
val targetDataset = spark.createDataFrame(fromJson.rdd, targetSchema)
See the example in SparkSession.createDataFrame as a reference; however, it uses an RDD directly, while you're going to create it from a Dataset.
val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Row(p(0), p(1).trim.toInt))
val dataFrame = sparkSession.createDataFrame(people, schema)
dataFrame.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)
But as you mentioned in your comment (that I later merged to your question):
JSON files don't have a predefined schema.
With that said, I think your solution is a correct one. Spark does not offer anything similar out of the box and I think it's more about developing a custom Scala code that would traverse a StructType/StructField tree and change what's incorrect.
What I would suggest changing in your code is to use the copy constructor (a feature of Scala's case classes - see A Scala case class ‘copy’ method example) so that only the incorrect name is changed, with the other properties left untouched.
Using copy constructor would (roughly) correspond to the following code:
// was
// case s: StructType =>
// StructField(replaceIllegal(field.name), removeIllegalCharsInColumnNames(s), field.nullable, field.metadata)
s.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(s))
There are some design patterns in functional languages (in general) and Scala (in particular) that could deal with the deep nested structure manipulation, but that might be too much (and I'm hesitant to share it).
I therefore think that the question is in its current "shape" more about how to manipulate a tree as a data structure not necessarily a Spark schema.