Unit testing for spark-bigquery-connector - scala

I am trying to write unit test cases for my Spark BigQuery implementation.
Say I want to test the following piece of code:
val wordsDF = spark.read.format("bigquery")
  .option("table", "bigquery-public-data:samples.shakespeare")
  .load()
  .cache()
And then do some assertions on wordsDF.
I am thinking along the lines of:
Embedded Kafka for Kafka
Embedded Cassandra for Cassandra
Is there any such library, or an alternative approach, that can be used to write Spark BigQuery unit tests?
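One approach that avoids needing an embedded BigQuery at all (a sketch under assumptions: the loadWords wrapper and the fixture values below are made up for illustration, and the column names mirror the public shakespeare sample) is to put the BigQuery read behind a small seam and have tests inject a locally built DataFrame with the same schema:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Production code: the BigQuery read lives behind a function that tests can swap out.
def loadWords(spark: SparkSession): DataFrame =
  spark.read.format("bigquery")
    .option("table", "bigquery-public-data:samples.shakespeare")
    .load()
    .cache()

// Test code: build a DataFrame with the same columns on a local SparkSession
// and run the assertions against it instead of the real table.
def wordsFixture(spark: SparkSession): DataFrame = {
  import spark.implicits._
  Seq(("brave", 3L), ("new", 5L), ("world", 2L)).toDF("word", "word_count")
}

The code under test then accepts a DataFrame (or the loader function) as a parameter, so a plain ScalaTest suite with a local[*] SparkSession never has to talk to BigQuery.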

Related

How to perform Unit testing on PySpark Structured Streaming?

I'm using PySpark Structured Streaming with Kafka as a reader stream. For the sake of unit testing, I would like to replace the Kafka reader stream with a mock. How can I do this?
The following question is equivalent to mine, but I am using Python (PySpark) instead of Scala, and I couldn't find MemoryStream in PySpark.
How to perform Unit testing on Spark Structured Streaming?
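For reference, the MemoryStream pattern that the linked Scala question relies on looks roughly like this (a sketch; the query name and test records are made up, and MemoryStream lives in an internal Spark package):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[*]").appName("memstream-test").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

// MemoryStream stands in for the Kafka source: the test pushes records in,
// and the query under test reads them as a streaming Dataset.
val input = MemoryStream[String]
input.addData("""{"id": 1}""", """{"id": 2}""")

val query = input.toDS()
  .writeStream
  .format("memory")        // in-memory sink, queryable as a temp table
  .queryName("output")
  .outputMode("append")
  .start()

query.processAllAvailable()
spark.table("output").show()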

How to extract the field values from a dataset in spark using scala?

I have a DataFrame which reads a stream from Kafka as a source, and it is then converted to a Dataset after applying a schema. Now, how do I get a particular field value from the Dataset in order to work with it?
case class Fruitdata(id: Int, name: String, color: String, price: Int)
// say this function reads streams from Kafka and gives me the DataFrame
val df = readFromKafka(sparkSession, inputTopic)
// say this converts the DataFrame to a Dataset with the schema defined accordingly
val ds: Dataset[Fruitdata] = getDataSet[Fruitdata](df, schema)
// and say the incoming stream data is:
// {"id":1,"name":"Grapes","color":"Green","price":15}
// Now how do I get a particular field like name, price and so on?
// this doesn't work, it says "Queries with streaming sources must be executed with writeStream.start()"
ds.first()
// same here
ds.show
// Also, can I get the complete string as input? This gives me Dataset[String]
val ds2 = ds.flatMap((f: Fruitdata) => List(s"${f.id},${f.name}"))
I think it's because you're trying to read from Kafka.
When you run with Spark streaming, you cannot run some of these commands, because they don't apply to streaming sources. For example, when reading from Kafka there is nothing like first: the data arrives as micro-batches, so first would only refer to a single micro-batch. Try something like the "console" sink to output your records to the console. Also, make sure to read a few sample records rather than a real Kafka feed.
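A minimal sketch of what the suggested console sink could look like for the Dataset from the question (the selected columns and options are just illustrative):

// Instead of ds.first() / ds.show(), write the streaming Dataset to the
// "console" sink and inspect the records in the driver's console output.
val query = ds
  .select("name", "price")      // pick the fields you are interested in
  .writeStream
  .format("console")
  .outputMode("append")
  .option("truncate", "false")
  .start()

query.awaitTermination()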

Saving a streaming DataFrame to MongoDB using MongoSpark

Some backstory: for a university homework project we are tasked with implementing an algorithm of our choice in a scalable way. We chose to use Scala, Spark, MongoDB and Kafka, as these were recommended during the course. To read data from our MongoDB, we opted to use MongoSpark as it allows for easy and scalable operations on data. We also use Kafka to simulate streaming from an outside source. We need to perform multiple operations on every entry that Kafka produces. The issue comes from saving the results back to MongoDB.
We have the following code:
val streamDF = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "aTopic")
  .load
  .selectExpr("CAST(value AS STRING)")
From here on, we're at a loss. We cannot use a .map, as MongoSpark only operates on DataFrames, Datasets and RDDs and is not serializable, and MongoSpark.save does not work on streaming DataFrames like the one above. We also cannot use the default MongoDB Scala driver, as it conflicts with MongoSpark when we add the dependency. Note that the rest of the algorithm relies heavily on joins and group-bys.
How can we get the data from here to our MongoDB?
Edit:
For an easy to reproduce example, one could try the following:
val streamDF = sparkSession
  .readStream
  .format("rate")
  .load
Adding a .write to that, which is required for MongoSpark.save, will cause an exception because write cannot be called on a streaming DataFrame.
The save() method of the MongoDB Connector for Spark accepts an RDD (as of the current version, 2.2). When utilising a DStream with MongoSpark, you need to fetch the 'batches' of RDDs in the stream and write each of them.
// Assumes a DStream of (word, count) pairs with a matching case class:
case class WordCount(word: String, count: Int)

wordCounts.foreachRDD({ rdd =>
  import spark.implicits._
  // Convert each micro-batch RDD to a DataFrame and append it via the
  // connector's DataFrameWriter helper (.mongo(), from the connector's implicits).
  val wordCountsDF = rdd.map({ case (word: String, count: Int) =>
    WordCount(word, count)
  }).toDF()
  wordCountsDF.write.mode("append").mongo()
})
See also:
Design Patterns for using foreachRDD
MongoDB: Spark Streaming
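On Spark 2.4 or later, Structured Streaming's foreachBatch offers a similar hook directly on the streaming DataFrame from the question. This is only a sketch of that alternative (it assumes the connector's .mongo() writer helper shown above is in scope), not something covered by the links above:

import org.apache.spark.sql.DataFrame

// Each micro-batch arrives as a plain (non-streaming) DataFrame,
// so the connector's batch write path can be used inside this function.
def writeBatchToMongo(batchDF: DataFrame, batchId: Long): Unit =
  batchDF.write.mode("append").mongo()

val query = streamDF.writeStream
  .foreachBatch(writeBatchToMongo _)
  .start()

query.awaitTermination()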

Spark Streaming Update on Kafka Direct Stream parameter

I have the following code:
// Set basic Spark parameters
val conf = new SparkConf()
  .setAppName("Cartographer_jsonInsert")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))

val messagesDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple4[String, Int, Long, String]](
  ssc,
  getKafkaBrokers,
  getKafkaTopics("processed"),
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.partition, mmd.offset, mmd.message().toString)
)
getKafkaBrokers and getKafkaTopics call an API that checks the database for new topics as we add them to our system. While it is running, does the StreamingContext re-evaluate those variables on each iteration, so that messagesDStream gets re-created with the new values each time?
It does not look like it does; is there any way to make that happen?
Tathagata Das, one of the creators of Spark Streaming, answered a similar question in the Spark User List regarding modifications of existing DStreams:
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.
I don't see a straightforward way of implementing this with Spark Streaming, as you have no way of updating your graph; you need much more control than is currently available. Maybe a solution based on Reactive Kafka, the Akka Streams connector for Kafka, or any other streaming-based solution where you control the source, would work.
Is there any reason you are not using an Akka graph with reactive-kafka (https://github.com/akka/reactive-kafka)? It is very easy to build a reactive stream where the source is given a topic, a flow processes the messages, and a sink writes out the result.
I have built a sample application using the same approach: https://github.com/asethia/akka-streaming-graph
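A minimal consumer sketch with reactive-kafka (now akka-stream-kafka); the broker address, group id and topic name below are assumptions, and since the source is just a value, the stream can be rebuilt and restarted whenever the topic list in the database changes:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system = ActorSystem("kafka-consumer")
implicit val materializer = ActorMaterializer()

val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")   // assumed broker address
  .withGroupId("cartographer-group")        // assumed group id

// Source: the "processed" topic; Flow: extract the metadata and payload; Sink: print it.
val done = Consumer.plainSource(consumerSettings, Subscriptions.topics("processed"))
  .map(record => (record.topic, record.partition, record.offset, record.value))
  .runForeach(println)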

Stream the most recent data in cassandra with spark streaming

I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)

dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}

ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
fetches the entire table from Cassandra every time, not just the newest data saved to the table.
What I want to do is have Spark Streaming read only the latest data, i.e., the data added after its previous read.
How can I achieve this? I tried to Google it but found very little documentation on the subject.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting spark streaming and cassandra and this question is about streaming only the latest data. BTW, streaming from cassandra IS possible by using the code I provided. However, it takes the entire table every time and not just the latest data.
Some low-level work is planned on Cassandra that will allow notifying external systems (an indexer, a Spark stream, etc.) of new mutations coming into Cassandra; see https://issues.apache.org/jira/browse/CASSANDRA-8844
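Until something like that exists, one workaround sometimes used (a sketch only, not from the answer above, and it assumes the table has a time column that Cassandra can filter on server-side, hypothetically called inserted_at here) is to keep a high-water mark in the driver and push a time filter down on every batch:

import com.datastax.spark.connector._

// Naive high-water mark kept on the driver; every batch only scans rows
// newer than the last timestamp that was processed.
var lastSeen: Long = 0L

dstream.foreachRDD { _ =>
  val upTo = System.currentTimeMillis()
  val newRows = ssc.sparkContext
    .cassandraTable("keyspace2", "feeds")
    // The predicate is only pushed down efficiently if inserted_at is a
    // clustering or indexed column; otherwise Cassandra needs ALLOW FILTERING.
    .where("inserted_at > ? AND inserted_at <= ?", lastSeen, upTo)
  println("\n" + newRows.count())
  lastSeen = upTo
}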