Write to Cassandra with writetime using dataframe in spark - scala

I have the following code:
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
val collection = kafkaStream.map(_._2).map(parser)
collection.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    try {
      val dfs = rdd.toDF()
      dfs.write.format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "tablename", "keyspace" -> "dbname"))
        .mode(SaveMode.Append).save()
    } catch {
      case e: Exception => e.printStackTrace
    }
  } else {
    println("blank rdd")
  }
})
In the above example I'm saving the Spark streaming output to Cassandra using a DataFrame. Now I want each row of the DataFrame to have its own specific writetime, similar to this command:
insert into table (imei , date , gpsdt ) VALUES ( '1345','2010-10-12','2010-10-12 10:10:10') USING TIMESTAMP 1530313803922977;
So basically the writetime of each row should be equal to the gpsdt column of that row. While searching I found this link, but it only shows an RDD example; I want a similar use case for DataFrames: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md Any suggestions? Thanks.

As far as I'm aware, there is no such functionality in the DataFrame version (there is a corresponding JIRA: https://datastax-oss.atlassian.net/browse/SPARKC-416). But you already have the RDD that you convert into the DataFrame, so why not use saveToCassandra as described in the link that you cited? A sketch is shown below.
P.S. You may run into performance problems because of the way you check for emptiness (http://www.waitingforcode.com/apache-spark/isEmpty-trap-spark/read).
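For example, here is a minimal sketch of the RDD route with a per-row write timestamp, following the WriteConf/TimestampOption API from the connector's saving guide. It assumes parser yields a case class with imei, date and gpsdt fields plus a ts field that holds the desired writetime (for example gpsdt converted to an epoch value); the case class shape and the ts field are assumptions for illustration, not part of your original code.
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TimestampOption, WriteConf}

collection.foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {
    rdd.saveToCassandra(
      "dbname",
      "tablename",
      SomeColumns("imei", "date", "gpsdt"),
      // per-row writetime taken from the assumed ts field of each parsed object
      writeConf = WriteConf(timestamp = TimestampOption.perRow("ts"))
    )
  }
}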

Related

Input data received all in lowercase on spark streaming in databricks using DataFrame

My Spark streaming application consumes data from an AWS Kinesis stream and is deployed in Databricks. I am using the org.apache.spark.sql.Row.mkString method to consume the data, and the whole payload is received in lowercase. The actual input has camel-case field names and values, but it is received in lowercase on consuming.
I have tried consuming from a simple Java application, and it receives the data in the correct form from the Kinesis queue. The issue only occurs in the Spark streaming application that uses DataFrames and runs in Databricks.
// scala code
val query = dataFrame
  .selectExpr("lcase(CAST(data as STRING)) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = {
      true
    }
    def process(row: Row) = {
      logger.info("Record received in data frame is -> " + row.mkString)
      processDFStreamData(row.mkString, outputHandler, kBase, ruleEvaluator)
    }
    def close(errorOrNull: Throwable): Unit = {
    }
  })
  .start()
The expectation is that the Spark streaming input JSON should be in the same case (camel case) as the data in Kinesis; it should not be converted to lowercase once received via the DataFrame.
Any thoughts on what might be causing this?
Fixed the issue: the lcase used in the select expression was the culprit. I updated the code as below and it worked.
val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    .........

How to handle missing columns in spark sql

We are dealing with schema-free JSON data, and sometimes the Spark jobs fail because some of the columns we refer to in Spark SQL are not available for certain hours of the day. During those hours the job fails because the referenced column is not present in the DataFrame. How can I handle this scenario? I have tried a UDF, but too many columns are missing to check each and every column for availability. I have also tried inferring a schema on a larger data set and applying it to the DataFrame, expecting that missing columns would be filled with null, but the schema application fails with strange errors.
Please suggest
This worked for me. I created a function that checks all expected columns and adds them to the DataFrame if they are missing:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    // Add the column as a null string if the DataFrame does not already have it
    if (!df.columns.contains(column)) df.withColumn(column, lit(null).cast(StringType))
    else df
  }
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)
Here is an improved version of the answer @rads provided:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

@tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
  def addMissingField(field: String)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(StringType))

  // Adds a missing struct column that holds the nested field as a null string
  def addMissingStructField(field: String, schema: StructType)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(schema))

  fields match {
    case Nil =>
      df
    case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
      val parts = c.split('.')
      // it only supports one level of nesting, but it can be extended
      val schema = StructType(Array(StructField(parts(1), StringType)))
      addMissingFields(cs)(addMissingStructField(parts(0), schema)(df))
    case c :: cs if !df.columns.contains(c.split('.')(0)) =>
      addMissingFields(cs)(addMissingField(c)(df))
    case _ :: cs =>
      addMissingFields(cs)(df)
  }
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt there is, but if there is any, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
  .builder()
  .appName("Spark SQL json example")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._
val df = spark.read.json(path) // `path` is a placeholder for your JSON input location
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto-generated schema. If it is a flat table, just use df.columns.
Compare the found columns to the expected columns and add the missing fields like this:
val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType) )
.withColumn("MissingString2", lit(null).cast(StringType) )
.withColumn("MissingDouble1", lit(0.0).cast(DoubleType) )
Maybe there is a faster way to add the missing columns in one operation instead of one by one, but the withColumns() method that does that is private.
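As a workaround, a single select can append all missing columns in one projection. This is only a sketch; the helper name addMissingInOneSelect and the null-string default are illustrative assumptions, not an API of Spark.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

// Keep the existing columns and append the missing ones as null strings
// in a single select, instead of chaining withColumn calls.
def addMissingInOneSelect(df: DataFrame, expected: Seq[String]): DataFrame = {
  val missing = expected.filterNot(df.columns.contains)
  df.select(df.columns.map(col) ++ missing.map(c => lit(null).cast(StringType).as(c)): _*)
}
Each withColumn call adds another projection to the plan, so collapsing them into one select can help when many columns are missing.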
Here's a PySpark solution based on this answer. It checks a list of names (from a configDf, transformed into the list of columns it should have, parameterColumnsToKeepList); it assumes all missing columns are ints, but you could look this up in configDf dynamically too. My default is null, but you could also use 0.
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType

for column in parameterColumnsToKeepList:
    if column not in processedAllParametersDf.columns:
        print('Json missing column: {0}'.format(column))
        processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming dataset, read from Kafka, that I am trying to write to CSV:
case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
val eventDataset: Dataset[Event] = spark
.readStream
.format("kafka")
.load()
.select("value")
.as[Array[Byte]]
.map(decodeEvent)
Event holds a Map[String,String] inside, and to write to CSV I'll need some schema.
Let's say all the fields are of type String, so I tried the example from the Spark repo:
val columns = List("year","month","date","topic","field1","field2")
// Prepare the schema programmatically (StructType.add returns a new StructType)
val schema = columns.foldLeft(new StructType())((s, field) => s.add(field, "string"))
val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives an error at runtime on the line "eventDataset.rdd":
Caused by: org.apache.spark.sql.AnalysisException: Queries with
streaming sources must be executed with writeStream.start();;
The following doesn't work either, because '.map' produces a List[String], not a Tuple:
eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
  .toDF(columns: _*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?
I'd use a much simpler approach:
import org.apache.spark.sql.functions._
eventDataset.select(columns.map(
c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
but if you want something closer to the current solution, skip the RDD conversion:
import org.apache.spark.sql.catalyst.encoders.RowEncoder

eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)

Issue storing AVRO Kafka streams to File System

I want to store my Avro Kafka streams to the file system in delimited format using the Spark streaming API with the following Scala code, but I'm facing some challenges in achieving this:
record.write.mode(SaveMode.Append).csv("/Users/Documents/kafka-poc/consumer-out/")
Since record (a GenericRecord) is not a DataFrame or an RDD, I am not sure how to proceed with this.
Code
val messages = SparkUtilsScala.createCustomDirectKafkaStreamAvro(ssc, kafkaParams, zookeeper_host, kafkaOffsetZookeeperNode, topicsSet)
val requestLines = messages.map(_._2)
requestLines.foreachRDD((rdd, time: Time) => {
  rdd.foreachPartition { partitionOfRecords =>
    val recordInjection = SparkUtilsJava.getRecordInjection(topicsSet.last)
    for (avroLine <- partitionOfRecords) {
      val record = recordInjection.invert(avroLine).get
      println("Consumer output...." + record)
      println("Consumer output schema...." + record.getSchema)
    }
  }
})
The following is the output and schema:
{"username": "Str 1-0", "tweet": "Str 2-0", "timestamp": 0}
{"type":"record","name":"twitter_schema","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"int"}]}
Thanks in advance; I appreciate your help.
I found a solution for this.
val jsonStrings: RDD[String] = sc.parallelize(Seq(record.toString()))
val result = sqlContext.read.json(jsonStrings).toDF()
result.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/")
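A hedged sketch of how this could be wired into the streaming job per micro-batch: decode each Avro payload to its JSON form on the executors, then build the DataFrame at the driver. It reuses requestLines, SparkUtilsJava, topicsSet and sqlContext from the code above, and assumes the record injection can be created once per partition and that record.toString yields the JSON shown in the question.
requestLines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Decode each Avro payload to its JSON representation on the executors
    val jsonStrings = rdd.mapPartitions { partition =>
      val recordInjection = SparkUtilsJava.getRecordInjection(topicsSet.last)
      partition.map(avroLine => recordInjection.invert(avroLine).get.toString)
    }
    // Build a DataFrame for this batch and append it to the output directory
    val batchDf = sqlContext.read.json(jsonStrings)
    batchDf.write.mode(SaveMode.Append).csv("/Users/Documents/kafka-poc/consumer-out/")
  }
}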

Spark Streaming csv to Data Frame

I have a streaming CSV data set that comes in this format:
2,C4653,C5030
2,C5782,C16712
6,C1191,C419
15,C3380,C22841
18,C2436,C5030
I am trying to take the DStream and convert it into a DataFrame where each field becomes a column, something like this:
col1  col2   col3
2     C4653  C5030
2     C5782  C16712
and so on.
I am using the following code but cannot get it to work:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
val lines = messages.map(_._2)
val seperator = lines.map(_.split(","))
lines.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.map(_.split(",")).toDF().show()
}
I am getting the following output for the code I am using:
+-----------------+
| value|
+-----------------+
|[2, C4653, C5030]|
+-----------------+
However, I am trying to make it into three columns. Please help.
You can try something like this.
val wordsDataFrame = rdd.map { record =>
  val recordArr = record.split(",")
  (recordArr(0), recordArr(1), recordArr(2))
}.toDF("col1", "col2", "col3")
Please provide column names to toDF. Note that toDF needs one tuple element per column name, so the split array has to be mapped to a tuple first; something like val wordsDataFrame = rdd.map(_.split(",")).map(a => (a(0), a(1), a(2))).toDF("col1","col2","col3") should then work.
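If the set of columns is only known at runtime, building an explicit schema and using createDataFrame is another option. This is only a sketch under the assumption that every field is a string; the column names and the reuse of lines from the question are illustrative.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columnNames = Seq("col1", "col2", "col3")
// Build the schema programmatically, one nullable string field per expected column
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

lines.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  // Turn each CSV line into a Row matching the schema, then apply the schema
  val rowRdd = rdd.map(_.split(",")).map(fields => Row(fields: _*))
  spark.createDataFrame(rowRdd, schema).show()
}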