Consider the code:
val reads = Future.traverse(mycollection)(p => Future {
  session.read
    .json(p)
    .write
    .mode(SaveMode.Append)
    .parquet(destinationTableLocation)
}(ExecutionContext.fromExecutor(Executors.newFixedThreadPool(session.sparkContext.defaultParallelism))))
Await.ready(reads, Duration.Inf)
And here the Spark driver stops. Is there a way to await all the futures?
P.S. Please do not suggest passing mycollection to the json function; that is another story :)
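For reference, a minimal sketch of the same pattern with a single shared thread pool, reusing the names from the snippet above (mycollection, session, destinationTableLocation); whether this resolves the hang depends on what actually stops the driver, so treat it as an illustration rather than a confirmed fix:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SaveMode

// create the pool once instead of once per element; the original builds a new
// fixed thread pool inside every future and never shuts it down
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(
  Executors.newFixedThreadPool(session.sparkContext.defaultParallelism))

val reads = Future.traverse(mycollection) { p =>
  Future {
    session.read
      .json(p)
      .write
      .mode(SaveMode.Append)
      .parquet(destinationTableLocation)
  }
}

Await.ready(reads, Duration.Inf) // blocks until every read/write future has completed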
I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List.
Now, I want to pass this List as a parameter to the methods.
This is what I have so far
def doA(mylist: List[String]): Unit = { /* something for A */ }
def doB(mylist: List[String]): Unit = { /* something for B */ }
And this is how I call my streaming queries:
//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
val schema = new StructType()
.add("s",StringType)
.add("v",StringType)
val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")
val parseDF = parseStringDF.select(from_json(col("value"), schema).as("data"))
.select("data.*")
parseDF.writeStream
.format("console")
.outputMode("append")
.start()
//fails here
val listOfTopics = parseDF.select("s").map(row => (row.getString(0))).collect.toList
//unable to call the below methods
for (t <- listOfTopics) {
  if (t == "a")
    doA(listOfTopics)
  else if (t == "b")
    doB(listOfTopics)
  else
    println("do nothing")
}
spark.streams.awaitAnyTermination()
Questions:
How can I call a stand-alone (non-streaming) method in a streaming job?
I cannot use ForeachWriter here because I want to pass a SparkSession to the methods, and SparkSession is not serializable. What are the alternatives for calling the methods doA and doB in parallel?
If you want to be able to collect data to the local Spark driver, you need to use parseDF.writeStream.foreachBatch or a ForeachWriter.
It's unclear what you need the SparkSession for within your two methods, but since they work on non-Spark data types, you probably shouldn't be using a SparkSession instance there anyway.
Alternatively, you could .select() and filter on your topic column, then apply the functions to separate "topic-a" and "topic-b" dataframes, thus parallelizing the workload. Otherwise, you would be better off just using a regular KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
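For the filter-and-split approach, a rough sketch (not a drop-in fix: doA/doB would have to work per micro-batch, and collecting each batch to the driver only makes sense if the batches are small):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._

// one branch per key value; inside foreachBatch each micro-batch is a bounded,
// non-streaming DataFrame, so collect() and plain method calls are allowed
val queryA = parseDF.filter(col("s") === "a").writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    doA(batchDF.select("v").as[String].collect().toList)
  }
  .start()

val queryB = parseDF.filter(col("s") === "b").writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    doB(batchDF.select("v").as[String].collect().toList)
  }
  .start()

spark.streams.awaitAnyTermination()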
In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is to accumulate all the JSON data in a list/rule session and send it to the rule engine for processing as a batch on the executor side.
//Scala Code Example:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreachBatch(function)
.start()
query.awaitTermination()
val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session, created on the driver side
  batchDF.foreach(row => { // this closure runs on the executors
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // NullPointerException here, as ruleSession is not available on the executor
  })
  ruleHandler.processRule(ruleSession) // again in driver scope
}
In the above code, the problem I am facing is that the function used in foreachBatch is executed on the driver side, while the code inside batchDF.foreach runs on the worker/executor side, and thus fails to get the ruleSession.
Is there any way to run the whole function on each executor?
OR
Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to the next process from within the executor/worker?
I think this might work... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In that portion, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send that whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).
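A rough sketch of that idea with foreachBatch plus foreachPartition, reusing the names from the question (kBase, jsonHandler, ruleHandler, JSONData); it assumes kBase can be serialized to, or rebuilt on, the executors, which is the part that needs verifying for your setup:
import org.apache.spark.sql.{DataFrame, Row}

// this handler runs entirely on the executor that owns the partition
val handlePartition: Iterator[Row] => Unit = rows => {
  val ruleSession = kBase.newKieSession() // "open the connection" once per partition
  rows.foreach { row =>
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData)
  }
  ruleHandler.processRule(ruleSession) // fire the rules for the whole partition
  ruleSession.dispose()                // "close the connection" when the partition is done
}

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.foreachPartition(handlePartition)
  }
  .start()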
@codeaperature, this is how I achieved batching, inspired by your answer; posting it as an answer as it exceeds the word limit of a comment.
Using foreach on the dataframe and passing in a ForeachWriter.
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON to the rule session in the process method.
Executing the rules in the close method, with the rule session loaded with the batch of data.
//Scala code:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreach(dataConsumer)
.start()
val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null

  def open(partitionId: Long, version: Long): Boolean = { // open is called once for every batch
    ruleSession = kBase.newKieSession()
    true
  }

  def process(row: Row): Unit = { // process is called for each record in the batch
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add each input JSON to the rule session
  }

  def close(errorOrNull: Throwable): Unit = { // after process has run for all records in the batch, close is called
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}
I am implementing a use case to try out the Spark Structured Streaming API.
The source data is read from a Kafka topic and, after applying some transformations, the results are written to the console.
I want to print the intermediate output along with the final results of the structured streaming query.
Here is the code snippet:
val trips = getTaxiTripDataframe() //this function consumes the kafka topic and deserializes the byte array to create a dataframe with the required columns
val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)
val output = filteredTrips
.groupBy("taxiCompany","pickupArea")
.agg(Map("pickupArea" -> "count"))
val query = output.writeStream.format("console")
.option("numRows","50")
.option("truncate","false")
.outputMode("update").start()
query.awaitTermination()
I want to print the 'filteredTrips' dataframe to the console. I tried using the dataframe's .show() method, but since it is a dataframe created on streaming data, it throws the exception below:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Is there any other workaround?
Yes, you can create two streams (I am using Spark 2.4.3)
val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)
val query1 = filteredTrips
  .writeStream
  .format("console")
  .option("numRows", "50")
  .option("truncate", "false")
  .outputMode("update")
  .start()
val query2 = filteredTrips
  .groupBy("taxiCompany", "pickupArea")
  .agg(Map("pickupArea" -> "count"))
  .writeStream
  .format("console")
  .option("numRows", "50")
  .option("truncate", "false")
  .outputMode("update")
  .start()
query1.awaitTermination()
query2.awaitTermination()
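One detail worth noting: query1.awaitTermination() blocks until the first query ends, so query2.awaitTermination() is only reached after that. If the goal is simply to keep the driver alive while both queries run, waiting on the stream manager is an alternative:
// wait until any active streaming query terminates, instead of blocking on them one by one
spark.streams.awaitAnyTermination()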
When the following snippet executes:
...
stream
  .map(_.value())
  .flatMap(MyParser.parse(_))
  .foreachRDD(rdd => {
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    val dataFrame = rdd.toDF()
    val countsDf = dataFrame.groupBy($"action", window($"time", "1 hour")).count()
    val query = countsDf.write.mode("append").jdbc(url, "stats_table", prop)
  })
....
This error happens: java.lang.IllegalArgumentException: Can't get JDBC type for struct<start:timestamp,end:timestamp>
How would one go about saving the output of org.apache.spark.sql.functions.window() function to a MySQL DB?
I ran into the same problem using Spark SQL:
val query3 = dataFrame
.groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
.count()
.writeStream
.outputMode(OutputMode.Complete())
.options(prop)
.option("checkpointLocation", "file:///tmp/spark-checkpoint1")
.option("table", "temp")
.format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
.start
I solved it by adding a user-defined function:
val toString = udf { (window: GenericRowWithSchema) => window.mkString("-") }
For me a String works, but you can change the function according to your needs; you could even have two functions that return the start and end separately.
My query changed to:
val query3 = dataFrame
.groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
.count()
.withColumn("window",toString($"window"))
.writeStream
.outputMode(OutputMode.Complete())
.options(prop)
.option("checkpointLocation", "file:///tmp/spark-checkpoint1")
.option("table", "temp")
.format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
.start
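If you need the window bounds as real columns (for example, MySQL timestamp columns) rather than a concatenated string, another option is to select the struct's fields directly; a minimal sketch based on the same aggregation (the separate checkpoint location is just a placeholder):
val query4 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  // pull start/end out of the window struct as plain timestamp columns,
  // which map cleanly onto JDBC types
  .withColumn("window_start", $"window.start")
  .withColumn("window_end", $"window.end")
  .drop("window")
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint2")
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start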
I need to establish a connection from Spark Streaming to the Neo4j graph database. The RDDs are of type ((is,I),(am,Hello),(sam,happy)...). I need to create an edge between each pair of words in Neo4j.
In the Spark Streaming documentation I found:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
to push the data to an external database.
I am doing this in Scala. I am a little confused about how to go about it. I found AnormCypher and the Neo4jScala wrapper. Can I use these to get the work done? If so, how can I do that? If not, are there any better alternatives?
Thank you all....
I did an experiment with AnormCypher
Like this:
implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(FILE, 4).cache()
val count = logData
  .flatMap(_.split(" "))
  .map(w =>
    Cypher("CREATE(:Word {text:{text}})")
      .on("text" -> w)
      .execute()
  )
  .filter(identity) // keep only the statements that executed successfully
  .count()
Neo4j 2.2.x has great concurrent write performance that you can use from Spark, so the more concurrent threads you can have writing to Neo4j, the better. If you can batch statements in batches of 100 to 1000 per request, even better.
Take a look at MazeRunner (http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html) as it will give you some ideas.
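Putting the pieces together, here is a hedged sketch of the foreachRDD/foreachPartition pattern from the question combined with AnormCypher as it is used in the experiment above, creating one edge per word pair; wordPairs stands for your DStream of word pairs and the :FOLLOWS relationship name is just a placeholder:
import org.anormcypher._

wordPairs.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // one REST connection per partition, reused for every statement in it
    implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
    partition.foreach { case (w1, w2) =>
      Cypher(
        """MERGE (a:Word {text: {w1}})
          |MERGE (b:Word {text: {w2}})
          |MERGE (a)-[:FOLLOWS]->(b)""".stripMargin)
        .on("w1" -> w1, "w2" -> w2)
        .execute()
    }
  }
}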