Scala Spark Filter RDD using Cassandra - scala

I am new to Spark, Cassandra and Scala. I have an existing RDD, let's say:
((url_hash, url, created_timestamp)).
I want to filter this RDD based on url_hash: if the url_hash already exists in the Cassandra table, I want to filter it out of the RDD so I can do processing only on the new URLs.
The Cassandra table looks like the following:
url_hash | url | created_timestamp | updated_timestamp
Any pointers will be great.
I tried something like this:
case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
def timestamp = new java.util.Date()
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
I am getting a Cassandra error:
java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
There are no null values in the Cassandra table.

Thanks, The Archetypal Paul!
I hope somebody finds this useful. I had to add Option to the case class.
Still looking forward to better solutions.
case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])
def timestamp = new java.util.Date()
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
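For reference, the same filter can also be expressed with a left outer join instead of subtractByKey. A minimal sketch using the rdd1 and rdd3 pairs defined above, keeping only the hashes for which Cassandra returned no row:

val newUrlsRDD = rdd1.leftOuterJoin(rdd3).collect {
  // keep the entry only when there is no matching url_hash in the Cassandra table
  case (urlHash, (urlAndTs, None)) => (urlHash, urlAndTs)
}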

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" has a large number of records, and hence the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is: first read the "mainTable" from BQ, store it in a Spark view, then apply the "wherePart" filter to this view in Spark.
["subDate" is a function that subtracts one date from another and returns the number of days in between.]
val Df = getFb(config, mainTable, ds)

def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect

  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
    map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
    mkString(" OR ")

  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here?
Are you using the spark-bigquery-connector? If so, the right syntax is:
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)
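Putting that together with the question's getFb, a rough sketch (Target.Pfb, subDate and Q are the question's own helpers, and an implicit Encoder for Target.Pfb is assumed to be in scope):

def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect

  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1)))
    .map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')")
    .mkString(" OR ")

  val tempView = "tempView"
  spark.read.format("bigquery")
    .load(mainTable)          // single read of mainTable
    .where(wherePart)         // filter expressed once against that read
    .createOrReplaceTempView(tempView)

  new Q().mainTableLogs(tempView)
}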

scala insert to redis gives task not serializable

I have the following code:
case class event(imei: String, date: String, gpsdt: String,
                 entrygpsdt: String, lastgpsdt: String)

val result = rdd.map(row => {
  val imei = row.getString(0)
  val date = row.getString(1)
  val gpsdt = row.getString(2)
  event(imei, date, gpsdt, lastgpsdt, "2018-04-06 10:10:10")
}).collect()

val collection = sc.parallelize(result)
collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "lastgpsdt", "dt"))
This works fine. Now I'm inserting this result value into Cassandra, but I also want to insert part of each row into Redis. When I try to use the Redis insert inside the loop, it gives an error that the task is not serializable.
I want something like this:
case class event(imei: String, date: String, gpsdt: String,
                 entrygpsdt: String, lastgpsdt: String)

val result = rdd.map(row => {
  val imei = row.getString(0)
  val date = row.getString(1)
  val gpsdt = row.getString(2)
  val zscore = Calendar.getInstance().getTimeInMillis
  val value = row.getString(0) + ',' + row.getString(2)
  val key = row.getString(1)
  client.zadd(key, zscore, value)
  event(imei, date, gpsdt, lastgpsdt, "2018-04-06 10:10:10")
}).collect()

val collection = sc.parallelize(result)
collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "lastgpsdt", "dt"))
So, how can I do that? "client" is an object of the scala-redis library.
Thanks,
Since no answer was provided, I found a solution for my case. I don't know whether the approach is good or not, but it worked for me. The idea is to collect the data by iterating over the RDD, which gives an Array[event]. Then start a loop over that result and insert each row into Redis, and finally save "result" to Cassandra. This flow solves both purposes I was looking for; a rough sketch of the Redis part is below.
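A sketch of the driver-side Redis loop described above, reusing the question's event fields and the same client object (since this runs on the driver after collect(), there is no serialization issue):

result.foreach { e =>
  val zscore = Calendar.getInstance().getTimeInMillis
  // same key/value layout as in the question: key = date, value = "imei,gpsdt"
  client.zadd(e.date, zscore, e.imei + ',' + e.gpsdt)
}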
Thanks,
The serializable exception is generally caused by the creation of the connection object.
Although your code does not show it, I guess you have created the client object outside the foreachRDD.
If so, the client object is created on the driver while the foreach is executed on the executors, where the client object cannot be found, so the "task not serializable" exception occurs.
What you could do is create the client object inside the foreach, but that creates a connection for each record, which is also not good for performance.
So what you can do is:
rdd.foreachPartition(partition => {
  // Create a connection here for Redis
  partition.foreach(record => {
    // send the data here
  })
})
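For example, a minimal sketch assuming the scala-redis client mentioned in the question (com.redis.RedisClient); the host, port and column layout are placeholders taken from the question, not a tested setup:

import java.util.Calendar
import com.redis.RedisClient

rdd.foreachPartition { partition =>
  // one connection per partition, created on the executor
  val client = new RedisClient("localhost", 6379)
  partition.foreach { row =>
    val zscore = Calendar.getInstance().getTimeInMillis
    val value = row.getString(0) + ',' + row.getString(2)
    val key = row.getString(1)
    client.zadd(key, zscore, value)
  }
  client.disconnect
}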
Hope this helps!

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and the keys are currently Strings, RDD[String], and mainly contain Maps. I would like to make them of type Row so I can eventually make a DataFrame. Here is my RDD:
removed
Most of the RDD follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDDs, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}

val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2: RDD[Array[String]] = rdd.map(x => getParsedRow(x))
val rddFinal: RDD[Row] = rdd2.map(x => Row.fromSeq(x))
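From there, a minimal sketch of building the DataFrame from that RDD[Row] with an explicit schema (the field names are placeholders, since the actual RDD contents were removed from the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// placeholder schema: one StructField per element of each Array[String] row
val schema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("field2", StringType, nullable = true)
))
val df = spark.createDataFrame(rddFinal, schema)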
Alternate
case class MyData(....) // all the fields of the schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
There is a method for converting an RDD to a DataFrame.
Use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want on it with DataFrame operations, like below:
// assuming a spark-shell style session where sc and spark are available
import spark.implicits._
import org.apache.spark.sql.functions.col

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => { // proceed with pipeline
    df.filter($"activityLabel" > 0)
  }
  case None => println("could not create dataframe")
}

val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)
I need df2 to be of type DataFrame, otherwise later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5), seed = 12345).
However, the case None branch is not of type DataFrame; it returns Unit, so the code won't compile. But if I don't declare the type of df2, the later code won't compile because df2 is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - I've been going round in circles with this for some time. Thanks
What you need is a map. If you map over an Option[T] you are saying something like: "if it's None, do nothing; otherwise transform the content of the Option into something else". In your case that content is the DataFrame itself. So inside this myDFOpt.map() function you can put all your DataFrame transformations and only at the end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2: Option[Array[DataFrame]] = createDataFrame("filename.txt").map { df =>
  val filteredDF = df.filter($"activityLabel" > 0)
  filteredDF.randomSplit(Array(0.5, 0.5), seed = 12345)
}
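As a usage note (a sketch; the println mirrors the question's None branch), the single pattern match can then be done once on the wrapped result:

df2 match {
  case Some(Array(trainData, testData)) =>
    // continue the pipeline with the two splits
    println(s"train: ${trainData.count()}, test: ${testData.count()}")
  case None =>
    println("could not create dataframe")
}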

Spark: Populating Cassandra UDTValue as a DataFrame column

I am trying to create a Cassandra UDT from some columns from my data frame. I want to add this UDT column to the data frame and save this to the Cassandra table.
My code looks like:
val asUDT = udf((keys: Seq[String], values: Seq[String]) =>
  UDTValue.fromMap(keys.zip(values).filter {
    case (k, null) => false
    case _ => true
  }.toMap))

val keys = array(mapKeys.map(lit): _*)
val values = array(mapValues.map(col): _*)

return df.withColumn("targetColumn", asUDT(keys, values))
However, I am receiving the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type AnyRef is not supported.
Please let me know how I can save a UDT value as a column in my data frame.
Any pointers on how I can get this to work will be really helpful.
Thanks