Spark: Populating Cassandra UDTValue as a DataFrame column - scala

I am trying to create a Cassandra UDT from some columns of my DataFrame. I want to add this UDT column to the DataFrame and save it to the Cassandra table.
My code looks like:
val asUDT = udf((keys: Seq[String], values: Seq[String]) =>
  UDTValue.fromMap(keys.zip(values).filter {
    case (k, null) => false
    case _ => true
  }.toMap))

val keys = array(mapKeys.map(lit): _*)
val values = array(mapValues.map(col): _*)
return df.withColumn("targetColumn", asUDT(keys, values))
However, I am receiving the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type AnyRef is not supported.
Please let me know how I can save a UDT value as a column in my data frame.
Any pointers on how I can get this to work will be really helpful.
Thanks
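The exception appears to come from Spark trying to derive a schema for the UDF's return type (UDTValue), which it cannot do. One workaround that is often suggested (not part of the original post) is to build the column as a plain Spark SQL struct instead, since the spark-cassandra-connector can generally map a struct column onto a UDT column when the DataFrame is written. A minimal sketch, assuming mapKeys holds the UDT field names and mapValues the matching source column names; unlike the UDF above, it does not drop null entries:
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical sketch: one struct field per UDT field, named after the UDT field.
// mapKeys (field names) and mapValues (source column names) are the question's own variables.
val udtStruct = struct(
  mapKeys.zip(mapValues).map { case (fieldName, colName) => col(colName).as(fieldName) }: _*
)

val withUdt = df.withColumn("targetColumn", udtStruct)

// Keyspace and table names below are placeholders.
withUdt.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .mode("append")
  .save()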

Related

Writing null values to Parquet in Spark when the NullType is inside a StructType

I'm importing a collection from MongoDB to Spark. All the documents have a field 'data', which in turn is a structure and has a field 'configurationName' (which is always null).
val partitionDF = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("database", "db").option("collection", collectionName).load()
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
When I try to save the dataframe as Parquet
partitionDF.write.mode("overwrite").parquet(collectionName + ".parquet")
I get the following error:
AnalysisException: Parquet data source does not support struct<configurationName:null, ...
It looks like the problem is that I have that NullType buried in the data column's type. I'm looking at How to handle null values when writing to parquet from Spark, but it only shows how to solve this NullType problem for top-level columns.
But how do you solve this problem when a NullType is not at the top level? The only idea I have so far is to flatten the dataframe completely (exploding arrays and so on), so that all the NullTypes would pop up at the top level. But then I would lose the original structure of the data (which I don't want to lose).
Is there a better solution?
@Roman Puchkovskiy: rewrote your function using pattern matching.
def deNullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, fixNullType(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def fixNullType(dt: DataType): DataType = {
  dt match {
    case struct: StructType => deNullifyStruct(struct)
    case array: ArrayType   => ArrayType(fixNullType(array.elementType), array.containsNull)
    case _: NullType        => StringType
    case _                  => dt
  }
}
Building on How to handle null values when writing to parquet from Spark and How to pass schema to create a new Dataframe from existing Dataframe? (the second was suggested by @pasha701, thanks!), I constructed this:
def denullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def denullify(dt: DataType): DataType = {
  dt match {
    case struct: StructType => denullifyStruct(struct)
    case array: ArrayType   => ArrayType(denullify(array.elementType), array.containsNull)
    case _: NullType        => StringType
    case _                  => dt
  }
}
which effectively replaces all NullType instances with StringType ones.
And then
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
fixedDF.printSchema
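With the NullType fields mapped to StringType (the underlying values are all null, so they still fit the new schema), the Parquet write from the question should now go through; reusing the question's own variables:
fixedDF.write.mode("overwrite").parquet(collectionName + ".parquet")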

How to create an encoder for type Iterator[org.apache.spark.sql.Row]

I am using Spark 2.4.4 in a Databricks notebook.
I have data in a dataframe which I want to use to update records in a Postgres table.
I am following the approach given in this post: Spark Dataframes UPSERT to Postgres Table.
Here is my code:
import spark.implicits._
val update_query = s"""UPDATE scored_fact.f_learner_assessment_item_response_classifications_test SET is_deleted = ? where f.learner_assigned_item_classification_attempt_sk = ?::uuid AND f.root_org_partition= ?::int"""
changedSectionLearnerDF.coalesce(8).mapPartitions((d) => Iterator(d)).foreach { batch =>
  val dbc: Connection = DriverManager.getConnection(connectionUrl)
  val stmt: PreparedStatement = dbc.prepareStatement(update_query)
  batch.grouped(100).foreach { session =>
    session.foreach { row =>
      stmt.setBoolean(0, row.getAs[Boolean]("is_deleted"))
      stmt.setString(1, row.getAs[String]("learner_assigned_item_classification_attempt_sk"))
      stmt.setString(2, row.getAs[String]("root_org_partition"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  }
  dbc.close()
}
I am getting the error below:
Unable to find encoder for type Iterator[org.apache.spark.sql.Row]. An implicit Encoder[Iterator[org.apache.spark.sql.Row]] is needed to store Iterator[org.apache.spark.sql.Row] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
changedSectionLearnerDF.coalesce(8).mapPartitions((d) => Iterator(d)).foreach { batch =>
I am sure I am missing something. How can I resolve this error by creating an encoder?
The signature of mapPartitions is:
def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
So d in d => Iterator(d) is an Iterator[Row], and the function returns a Dataset[Iterator[Row]], which can't reasonably exist.
I think the mapPartitions call is simply wrong, and .mapPartitions((d) => Iterator(d)).foreach should be replaced by foreachPartition (as @shridharama's comment on the linked answer says), as sketched below.
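A minimal sketch of the same loop body rewritten around foreachPartition, reusing connectionUrl and update_query from the question (untested; note also that JDBC PreparedStatement parameters are 1-based, so the indices are shifted relative to the question's code):
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.Row

// An explicitly typed function value keeps the Scala overload of foreachPartition unambiguous.
val updatePartition: Iterator[Row] => Unit = { batch =>
  val dbc: Connection = DriverManager.getConnection(connectionUrl)
  val stmt: PreparedStatement = dbc.prepareStatement(update_query)
  try {
    batch.grouped(100).foreach { group =>
      group.foreach { row =>
        // PreparedStatement parameters start at 1, not 0.
        stmt.setBoolean(1, row.getAs[Boolean]("is_deleted"))
        stmt.setString(2, row.getAs[String]("learner_assigned_item_classification_attempt_sk"))
        stmt.setString(3, row.getAs[String]("root_org_partition"))
        stmt.addBatch()
      }
      stmt.executeBatch()
    }
  } finally {
    dbc.close()
  }
}

changedSectionLearnerDF.coalesce(8).foreachPartition(updatePartition)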

How to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and the keys are currently Strings, RDD[String], and mainly contain Maps. I would like to make them of type Row so I can eventually make a dataframe. Here is my RDD:
removed
Most of the RDD follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}

val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a dataframe you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternatively:
case class MyData(....) // all the fields of the Schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
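The edit in the question asks how to match on a Map; one hedged way to fill in the apply body for Map-shaped values is sketched below. The key names ("type", "libVersion", "id") are assumptions taken from the case class in the question's edit, not from the actual RDD:
object MyDataBuilder {
  def apply(s: Any): MyData = s match {
    case m: Map[_, _] =>
      // Cast once, then pull out the (assumed) keys with safe defaults.
      val fields = m.asInstanceOf[Map[String, Any]]
      MyData(
        fields.get("type").map(_.asInstanceOf[List[String]]).getOrElse(Nil),
        fields.get("libVersion").map(_.toString.toDouble).getOrElse(0.0),
        fields.get("id").map(v => BigInt(v.toString)).getOrElse(BigInt(0))
      )
    case _ => null
  }
}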
There is a method for converting an RDD to a DataFrame.
Use it like below:
import spark.implicits._

val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want on it using SQL-like operations, as below:
import org.apache.spark.sql.functions.col
import spark.implicits._

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Scala Spark Filter RDD using Cassandra

I am new to Spark-Cassandra and Scala. I have an existing RDD, let's say:
((url_hash, url, created_timestamp)).
I want to filter this RDD based on url_hash. If url_hash exists in the Cassandra table then I want to filter it out from the RDD, so I can do processing only on the new URLs.
The Cassandra table looks like the following:
url_hash| url | created_timestamp | updated_timestamp
Any pointers will be great.
I tried something like this:
case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)

def timestamp = new java.util.Date()

val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
I am getting a Cassandra error:
java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
There are no null values in the Cassandra table.
Thanks The Archetypal Paul!
I hope somebody finds this useful. I had to add Option to the case class.
Looking forward to better solutions.
case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

def timestamp = new java.util.Date()

val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
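After the subtractByKey, the remaining pairs still carry the (url, timestamp) payload as the value; a small follow-up sketch, assuming the shape built in rdd1 above, flattens them back into tuples for further processing:
val newUrls = newUrlsRDD.map { case (urlHash, (url, createdTs)) => (urlHash, url, createdTs) }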

HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight Spark using Tableau. I was following this video where they describe how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvFile is an RDD of lists, each list representing a line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer#mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")
// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)
// Map the values in the .csv file to the schema
val myData = csvLines
  .map(s => s.split(","))
  .filter(s => s(0) != "Timestamp")
  .map(s => MyData(s(0), s(1), s(2), s(3), s(4).toDouble, s(5)))
  .toDF()
// Register as a temporary table called "processdata"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried to use the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still gives me an error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave it as an answer since it was not readily available in any of the blogs or forum answers. Hopefully it will help someone like me starting with Spark.
I figured out that .toDF() actually creates a sqlContext-based and not a hiveContext-based DataFrame, so I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines
  .map(s => s.split(","))
  .filter(s => s(0) != "Timestamp")
  .map(s => MyData(s(0), s(1), s(2), s(3), s(4).toDouble, s(5)))

// Register as a temporary table called "mydata_stored" and persist it via the Hive context
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) has a proper double value, else add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: Throwable => 0.00 }
parseDouble(s(4))
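For completeness, a minimal sketch of dropping that helper into the mapping shown above (same fields as before, just with the guarded parse):
val myData = csvLines
  .map(s => s.split(","))
  .filter(s => s(0) != "Timestamp")
  .map(s => MyData(s(0), s(1), s(2), s(3), parseDouble(s(4)), s(5)))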
Regards
Kiran