Spark : ClassCastException while converting string into date - scala

I am trying to read a Parquet file into a DataFrame with the following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of date format and DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+----------------------+
| id| text| created_date|
+--------------------+--------------------+----------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date as a DateType. Do I need to change anything else?

The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", TimestampType, false)
))
val data = spark
.read
.schema(schema)
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
.parquet("s3a://thisisnotabucket")
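If the Parquet column is physically stored as a string (the addBinary frame in the stack trace suggests it is), another option is to read it as a string and convert it afterwards. This is a minimal sketch, assuming Spark 2.2 or later, where to_timestamp and to_date accept a format string. The helper name withParsedDates and the two new column names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date, to_timestamp}

// Hypothetical helper: expects a string column "created_date"
// holding values such as "2017-01-01T00:08:09Z".
def withParsedDates(df: DataFrame): DataFrame = {
  // 'Z' is matched as a literal, so values are interpreted in the session time zone.
  val pattern = "yyyy-MM-dd'T'HH:mm:ss'Z'"
  df.withColumn("created_ts", to_timestamp(col("created_date"), pattern))
    .withColumn("created_day", to_date(col("created_date"), pattern))
}
```

With this approach the schema passed to the reader would keep created_date as StringType, and the typed columns are derived after the read.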
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Date(fmt.parseDateTime(d).getMillis)
}
def cvtTs(d: String): java.sql.Timestamp = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}
sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))
sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")
val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this in an application, I recommend extracting the formatter into an object.
object DtFmt {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
}
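The same "formatter in an object" idea can be sketched without Joda, assuming Java 8 or later is available. The object name IsoFmt and its two methods are hypothetical. The formatter lives in a singleton object, so it is created once per JVM and is not captured in a serialized closure.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical holder object for a shared, immutable formatter.
object IsoFmt {
  // 'Z' is matched as a literal character, so no time-zone conversion happens here.
  val fmt: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")

  // Parses e.g. "2017-01-01T00:08:09Z" into a java.sql.Timestamp (local time).
  def toTimestamp(s: String): java.sql.Timestamp =
    java.sql.Timestamp.valueOf(LocalDateTime.parse(s, fmt))

  // Parses the same input and keeps only the date part.
  def toDate(s: String): java.sql.Date =
    java.sql.Date.valueOf(LocalDateTime.parse(s, fmt).toLocalDate)
}
```

These methods could be registered the same way as cvtDt and cvtTs above, for example sqlContext.udf.register("mkdate", IsoFmt.toDate(_: String)).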

Related

Define StructType as input datatype of a Function Spark-Scala 2.11 [duplicate]

I'm trying to write a Spark UDF in Scala and need to define the function's input data type.
I have a schema variable of type StructType, shown below.
import org.apache.spark.sql.types._
val relationsSchema = StructType(
Seq(
StructField("relation", ArrayType(
StructType(Seq(
StructField("attribute", StringType, true),
StructField("email", StringType, true),
StructField("fname", StringType, true),
StructField("lname", StringType, true)
)
), true
), true)
)
)
I'm trying to write a function like the one below:
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The above code throws the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
If I give the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, as _.map gives me metadata, field names, etc.
Please advise how to define relationsSchema as the input data type in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
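Each element of your relation array is a struct, which Spark represents as a Row, and the array itself reaches a Scala UDF as a Seq rather than an Array (declaring Array[Row] fails at runtime with a ClassCastException). So your function should have the following signature:
val relationsFunc: Seq[Row] => Seq[String]
Then you can access your data either by position or by name, i.e.:
{r: Row => r.getAs[String]("email")}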
Check the mapping table in the documentation to determine the data type representations between Spark SQL and Scala: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
Your relation field is a Spark SQL ArrayType whose elements are of type StructType. According to that table, an ArrayType is represented in Scala as scala.collection.Seq and a StructType as org.apache.spark.sql.Row, so Seq[Row] is the input type you should be using.
I used your code to create this complete example that extracts email values:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

val relationsSchema = StructType(Seq(
  StructField("relation", ArrayType(
    StructType(Seq(
      StructField("attribute", StringType, true),
      StructField("email", StringType, true),
      StructField("fname", StringType, true),
      StructField("lname", StringType, true)
    )), true
  ), true)
))

// One row whose "relation" column holds an array with a single struct.
val data = Seq(
  Row(Seq(Row("1", "johnny#example.com", "Johnny", "Appleseed")))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relationsSchema
)

// Spark hands array columns to Scala UDFs as Seq, not Array.
val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)
df.withColumn("relation", relationUdf(col("relation")))
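When the goal is only to pull one field out of every struct in the array, a UDF may not be needed at all: selecting a struct field through an array column yields an array of that field. This is a minimal sketch under that assumption. The helper name emailsOf and the output column name emails are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: `df` is assumed to have a column "relation"
// of type array<struct<..., email: string, ...>>.
def emailsOf(df: DataFrame): DataFrame =
  // Field access through an array of structs returns array<string>.
  df.withColumn("emails", col("relation.email"))
```

A UDF over Seq[Row] is still the more general tool when the per-element logic combines several fields.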

Converting RDD into Dataframe

I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
The schema of the DataFrame should be:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Can anyone please help?
I have tried defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException :3"
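If you treat all your columns as String, you can create the DataFrame as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}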
val rdd : RDD[Row] = ???
val df = spark.createDataFrame(rdd, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] before the compiler lets you call createDataFrame with a schema. For the missing fields, you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(). This method returns an RDD[String], meaning each element of your RDD is a line, but you need an RDD[Row]. So you need to split each line on commas, like this:
val list =
List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
"875643,5485254,JHDSFJD543514KJKJ4",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
"264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
"732543,8765984,UJHSG4240323545144","564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // Keep the fields in their original order and pad short lines with "" up to four columns.
  Row.fromSeq(arr.toSeq.padTo(4, ""))
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
df.show()
+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+
With rows rdd you will be able to create the dataframe.
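The "ArrayIndexOutOfBoundsException :3" in the question is what you would expect from reading index 3 of a line that only has three fields. One way to make the missing ProductId explicit is to parse each line into a record with an optional last field. This is a minimal sketch in plain Scala. The case class Click and the function parseClick are hypothetical names, not part of the original code.

```scala
// Hypothetical record type; productId is optional because some lines have only three fields.
final case class Click(
  userId: String,
  entityId: String,
  webSessionId: String,
  productId: Option[String]
)

// Returns None for lines with fewer than three fields instead of throwing.
def parseClick(line: String): Option[Click] =
  line.split(",").map(_.trim).toList match {
    case user :: entity :: session :: rest =>
      Some(Click(user, entity, session, rest.headOption.filter(_.nonEmpty)))
    case _ => None
  }
```

With import spark.implicits._ in scope and the case class defined at the top level, something like rdd.flatMap(parseClick).toDF() should then give a DataFrame whose ProductId column is null for the short lines.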

Spark error when using except on a dataframe with MapType

I am seeing the error Cannot have map type columns in DataFrame which calls set operations when using Spark MapType.
Below is the sample code I wrote to reproduce it. I understand this happens because MapType objects are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
StructField("a", MapType(StringType, StringType, true)),
StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?

Error in creating dataframe: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

I have created a schema with following code
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created a RDD from
val data = spark.sparkContext.textFile("cities.txt")
Converted to RDD of Row to apply schema
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
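You don't need a schema to create a Row; you need the schema when you create the DataFrame. In your code, row.zip(schema.toSeq) pairs every string with a StructField, so each Row ends up holding Tuple2 values, which is what the error complains about. You also need some logic to convert the split line (which produces 3 strings) into the right types, here integers.
Here is a minimal solution without exception handling: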
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
val Array(city,female,male) = line.split(";")
Row(
city,
female.toInt,
male.toInt
)
}
)
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case classes to create a DataFrame, because Spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()
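Both versions above throw on a malformed line (wrong number of fields, or a count that is not a number). A minimal sketch of the missing exception handling, in plain Scala, could look like this. The names CityCount and parseCity are hypothetical.

```scala
import scala.util.Try

// Hypothetical record type mirroring the (city, female, male) schema.
final case class CityCount(city: String, female: Int, male: Int)

// Returns None for lines with the wrong number of fields or non-numeric counts.
def parseCity(line: String): Option[CityCount] =
  line.split(";").map(_.trim) match {
    case Array(city, female, male) =>
      Try(CityCount(city, female.toInt, male.toInt)).toOption
    case _ => None
  }
```

Using data.flatMap(parseCity) instead of data.map(...) would then silently drop bad lines. If they need to be reported instead, returning an Either or a Try from the parser is an alternative.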

Update Dataframe Schema Read Spark Scala

I am trying to read in a schema from hdfs to load into my dataframe. This allows the schema to be updated and reside outside the Spark Scala code. I was wondering what the best way was to do this? Below is what I have currently inside the code.
val schema_example = StructType(Array(
StructField("EXAMPLE_1", StringType, true),
StructField("EXAMPLE_2", StringType, true),
StructField("EXAMPLE_3", StringType, true)))
def main(args: Array[String]): Unit = {
val df_example = get_df("example.txt", schema_example)
}
def get_df(filename: String, schema: StructType): DataFrame = {
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter","~")
.schema(schema)
.option("quote", "'")
.option("quoteMode", "ALL")
.load(filename)
df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
}
It would be better to read the schema from a HOCON config file, which can be updated as and when required.
schema = [
{
columnName = EXAMPLE_1
type = string
},
{
columnName = EXAMPLE_2
type = string
},
{
columnName = EXAMPLE_3
type = string
}
]
Then you can read this file using ConfigFactory.
This is a cleaner way to maintain the file schema.
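As a rough illustration of the ConfigFactory approach, the sketch below turns HOCON text with a schema list of {columnName, type} objects into a StructType. It assumes the Typesafe Config library (com.typesafe.config) is on the classpath. The function names toSparkType and schemaFromHocon, and the set of supported type names, are hypothetical choices for this example.

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

// Hypothetical mapping from the type names used in the config to Spark types.
def toSparkType(name: String): DataType = name.toLowerCase match {
  case "string"          => StringType
  case "int" | "integer" => IntegerType
  case "long"            => LongType
  case "double"          => DoubleType
  case "boolean"         => BooleanType
  case other             => throw new IllegalArgumentException(s"Unsupported type: $other")
}

// Builds a StructType from HOCON text containing a `schema` list
// of { columnName = ..., type = ... } objects. All columns are made nullable.
def schemaFromHocon(hocon: String): StructType = {
  val fields = ConfigFactory.parseString(hocon).getConfigList("schema").asScala.map { c =>
    StructField(c.getString("columnName"), toSparkType(c.getString("type")), nullable = true)
  }
  StructType(fields.toSeq)
}
```

The resulting StructType can be passed to get_df in place of schema_example. Reading the HOCON text from HDFS is left out here; the file contents would need to be loaded into a string first.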