Spark: How to get data with foreachRDD from a DStream created by ReceiverLauncher.launch?

// consume data from Kafka
val tmp_stream = ReceiverLauncher.launch(ssc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY)
tmp_stream.foreachRDD(rdd => {
  rdd.collect()
  val count = rdd.count() // here I can get the count of records
  // How can I get the data itself here?
})
Any idea how to complete this code so that foreachRDD can read the data from the stream created by ReceiverLauncher.launch?

You can call getPayload to get the raw byte[]. Something like this:
val stream = tmp_stream.map(x => { val s = new String(x.getPayload); s })
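Once the payload is decoded, the mapped stream behaves like any other DStream inside foreachRDD. Below is a minimal sketch of reading the data; pulling a few records back to the driver with take is only for illustration, and normally you would process the records on the executors instead:
val stream = tmp_stream.map(x => new String(x.getPayload))

stream.foreachRDD(rdd => {
  val count = rdd.count() // number of records in this micro-batch
  rdd.take(10).foreach(msg => println(s"sample record: $msg"))
  println(s"records in this batch: $count")
})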

Related

How to associate some data to each partition in spark and re-use it?

I have a partitioned RDD and I want to extract some data from each partition so that I can re-use it later. An over-simplification could be:
val rdd = sc.parallelize(Seq("1-a", "2-b", "3-c"), 3)
val mappedRdd = rdd.mapPartitions { dataIter =>
  val bufferedIter = dataIter.buffered
  // extract data which we want to re-use inside each partition
  val reusableData = bufferedIter.head.charAt(0)
  // use that data and return (but this does not allow me to re-use it)
  bufferedIter.map(_ + reusableData)
}
My solution is to extract the re-usable data into an RDD:
val reusableDataRdd = rdd.mapPartitions { dataIter =>
  // return an iterator with only one item on each partition
  Iterator(dataIter.buffered.head.charAt(0))
}
and then zip the partitions:
rdd.zipPartitions(reusableDataRdd) { (dataIter, reusableDataIter) =>
  val reusableData = reusableDataIter.next
  dataIter.map(_ + reusableData)
}
This gives me the same result as mappedRdd, but I also get my reusable-data RDD. Is there a better option for extracting and re-using the data? Maybe something more elegant or optimized?
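This does not answer whether something more elegant exists, but for reference here is a self-contained version of the zipPartitions approach above. The SparkConf/SparkContext setup and the local master are additions made only so the example can run as a small standalone app; inside spark-shell the existing sc can be used instead:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("reuse-partition-data").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq("1-a", "2-b", "3-c"), 3)

// one reusable value per partition (each partition holds exactly one element here)
val reusableDataRdd = rdd.mapPartitions { dataIter =>
  Iterator(dataIter.buffered.head.charAt(0))
}

// zip each data partition with its per-partition value
val mappedRdd = rdd.zipPartitions(reusableDataRdd) { (dataIter, reusableDataIter) =>
  val reusableData = reusableDataIter.next()
  dataIter.map(_ + reusableData)
}

mappedRdd.collect().foreach(println) // prints 1-a1, 2-b2, 3-c3
sc.stop()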

EsHadoopException: Could not write all entries for bulk operation Spark Streaming

I want to traverse the stream of data, run a query on it, and write the results to Elasticsearch. I tried to use the mapPartitions method to create the connection to the database; however, I get the following error, which indicates that the partition returns None to the RDD (I guess some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into the RDD and send it to Elasticsearch without any trouble?
Also, I had a variant of the solution using flatMap in foreachRDD; however, it creates a connection to the database for each RDD, which is not efficient in terms of performance.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { part => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      part.map(
        data => {
          val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
          val calendarTime = Calendar.getInstance.getTime
          val recommendationsMap = convertDataToMap(recommendations, calendarTime)
          recommendationsMap
        })
    }
  }
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into an Array, even though it holds multiple rows of my records. That is why Elasticsearch was not able to map this collection of records to the defined single-record schema.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { partition => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      val result = partition.map(data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        convertDataToMap(recommendations, calendarTime)
      }).toList.flatten
      result.iterator
    }
  }.saveToEs("rdd-timed/output")
})
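As a side note, saveToEs comes from the elasticsearch-hadoop (elasticsearch-spark) connector, which also needs to be told where the cluster lives. Below is a minimal configuration sketch; the host, port, and application name are placeholders rather than values from the original setup:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

val sparkConf = new SparkConf()
  .setAppName("neo4j-recommendations-to-es") // placeholder name
  .set("es.nodes", "localhost")              // placeholder Elasticsearch host
  .set("es.port", "9200")                    // placeholder port
  .set("es.index.auto.create", "true")       // let the connector create the index

val ssc = new StreamingContext(sparkConf, Seconds(5))
// build wordsArrays as in the question, then apply the foreachRDD /
// mapPartitions / saveToEs("rdd-timed/output") pipeline shown above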

Scala - Tweets subscribing - Kafka Topic and Ingest into HBase

I have to consume tweets from a Kafka topic and ingest them into HBase. The following is the code that I wrote, but it is not working properly.
The main code is not calling the "convert" method, and hence no records are ingested into the HBase table. Can someone help me, please?
tweetskafkaStream.foreachRDD(rdd => {
  println("Inside For Each RDD")
  rdd.foreachPartition(record => {
    println("Inside For Each Partition")
    val data = record.map(r => (r._1, r._2)).map(convert)
  })
})
def convert(t: (String, String)) = {
  println("in convert")
  //println("first param value ", t._1)
  //println("second param value ", t._2)
  val hConf = HBaseConfiguration.create()
  hConf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName)
  hConf.set("hbase.zookeeper.quorum", "192.168.XXX.XXX:2181")
  hConf.set("hbase.master", "192.168.XXX.XXX:16000")
  hConf.set("hbase.rootdir", "hdfs://192.168.XXX.XXX:9000/hbase")
  val today = Calendar.getInstance.getTime
  val printformat = new SimpleDateFormat("yyyyMMddHHmmss")
  val id = printformat.format(today)
  val p = new Put(Bytes.toBytes(id))
  p.add(Bytes.toBytes("data"), Bytes.toBytes("tweet_text"), (t._2).getBytes())
  (id, p)
  val mytable = new HTable(hConf, hbaseTableName)
  mytable.put(p)
}
I don't want to use (t._1) as the key; I want the current datetime instead, and hence I construct it in my convert method.
Thanks
Bala
Instead of foreachPartition, I changed it to foreach. This worked well.
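A likely explanation for convert never being called: in foreachPartition, the record parameter is an Iterator over the whole partition, and Iterator.map is lazy, so nothing ever consumes the mapped iterator and convert is never invoked. Switching to foreach processes each element eagerly. A sketch of that version, using the identifiers from the question and assuming the stream elements are (String, String) pairs as the original map suggests:
tweetskafkaStream.foreachRDD(rdd => {
  println("Inside For Each RDD")
  rdd.foreach(record => {
    // record is a single (key, tweet) pair here, not an iterator over the partition
    convert(record)
  })
})
Keeping foreachPartition would also work as long as the iterator is actually consumed, for example record.foreach(r => convert(r)).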

stopping spark streaming after reading first batch of data

I am using Spark Streaming to consume Kafka messages. I want to get some messages as a sample from Kafka instead of reading all messages. So I want to read a batch of messages, return them to the caller, and stop Spark Streaming. Currently I am passing the batchInterval time to the awaitTermination method of the Spark streaming context. I don't know how to return the processed data to the caller from Spark Streaming. Here is the code that I am currently using:
def getsample(params: scala.collection.immutable.Map[String, String]): Unit = {
  if (params.contains("zookeeperQourum"))
    zkQuorum = params.get("zookeeperQourum").get
  if (params.contains("userGroup"))
    group = params.get("userGroup").get
  if (params.contains("topics"))
    topics = params.get("topics").get
  if (params.contains("numberOfThreads"))
    numThreads = params.get("numberOfThreads").get
  if (params.contains("sink"))
    sink = params.get("sink").get
  if (params.contains("batchInterval"))
    interval = params.get("batchInterval").get.toInt
  val sparkConf = new SparkConf().setAppName("KafkaConsumer").setMaster("spark://cloud2-server:7077")
  val ssc = new StreamingContext(sparkConf, Seconds(interval))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  var consumerConfig = scala.collection.immutable.Map.empty[String, String]
  consumerConfig += ("auto.offset.reset" -> "smallest")
  consumerConfig += ("zookeeper.connect" -> zkQuorum)
  consumerConfig += ("group.id" -> group)
  var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
  val streams = data.window(Seconds(interval), Seconds(interval)).map(x => new String(x))
  streams.foreach(rdd => rdd.foreachPartition(itr => {
    while (itr.hasNext && size >= 0) {
      var msg = itr.next
      println(msg)
      sample.append(msg)
      sample.append("\n")
      size -= 1
    }
  }))
  ssc.start()
  ssc.awaitTermination(5000)
  ssc.stop(true)
}
So instead of saving the messages in a StringBuilder called "sample", I want to return them to the caller.
You can implement a StreamingListener and then, inside its onBatchCompleted method, call ssc.stop():
private class MyJobListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) = synchronized {
    ssc.stop(true)
  }
}
This is how you attach the listener to your StreamingContext:
val listen = new MyJobListener(ssc)
ssc.addStreamingListener(listen)
ssc.start()
ssc.awaitTermination()
We can get sample messages using the following piece of code:
var sampleMessages = streams.repartition(1).mapPartitions(x => x.take(10))
and if we want to stop after the first batch, we should implement our own StreamingListener interface and stop the streaming context in its onBatchCompleted method.
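Putting the two answers together, one way getsample could hand the sampled messages back to the caller is sketched below; the sampleFirstBatch helper, its parameters, and the Seq[String] return type are assumptions for illustration, not part of the original code:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Collects up to n messages from the first batch, stops the context via the
// MyJobListener shown above, and returns the sample to the caller.
def sampleFirstBatch(ssc: StreamingContext, stream: DStream[String], n: Int): Seq[String] = {
  val sample = ArrayBuffer.empty[String]
  stream.foreachRDD { rdd =>
    // foreachRDD runs on the driver and take() brings the records back to it,
    // so appending to this driver-side buffer is safe (unlike foreachPartition,
    // which runs on the executors)
    if (sample.isEmpty) sample ++= rdd.take(n)
  }
  ssc.addStreamingListener(new MyJobListener(ssc))
  ssc.start()
  ssc.awaitTermination()
  sample.toSeq
}
Depending on the Spark version, if the synchronous ssc.stop inside the listener blocks, it may help to trigger the stop from a separate thread instead.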

Play 2.1: Await result from enumerator

I'm working on testing my WebSocket code in Play Framework 2.1. My approach is to get the iteratee/enumerator pair that is used for the actual WebSocket, and just test pushing data in and pulling data out.
Unfortunately, I just cannot figure out how to get data out of an Enumerator. Right now my code looks roughly like this:
val (in, out) = createClient(FakeRequest("GET", "/myendpoint"))
in.feed(Input.El("My input here"))
in.feed(Input.EOF)
//no idea how to get data from "out"
As far as I can tell, the only way to get data out of an enumerator is through an iteratee. But I can't figure out how to just wait until I get the full list of strings coming out of the enumerator. What I want is a List[String], not a Future[Iteratee[A,String]] or an Expectable[Iteratee[String]] or yet another Iteratee[String]. The documentation is confusing at best.
How do I do that?
You can consume an Enumerator like this:
val out = Enumerator("one", "two")
val consumer = Iteratee.getChunks[String]
val appliedEnumeratorFuture = out.apply(consumer)
val appliedEnumerator = Await.result(appliedEnumeratorFuture, 1.seconds)
val result = Await.result(appliedEnumerator.run, 1.seconds)
println(result) // List("one", "two")
Note that you need to await a Future twice because the Enumerator and the Iteratee each control the speed of producing and consuming values, respectively.
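If several tests need this, the two awaits can be wrapped in a small helper; the consume name and the default timeout are arbitrary choices for this sketch, not part of the Play API:
import play.api.libs.iteratee.{Enumerator, Iteratee}
import scala.concurrent.Await
import scala.concurrent.duration._

// Runs an Enumerator[String] through Iteratee.getChunks and blocks until the
// chunks are available. Intended for tests only.
def consume(out: Enumerator[String], timeout: FiniteDuration = 1.second): List[String] = {
  val appliedIteratee = Await.result(out.apply(Iteratee.getChunks[String]), timeout)
  Await.result(appliedIteratee.run, timeout)
}

// usage
println(consume(Enumerator("one", "two"))) // List(one, two)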
A more elaborate example for an Iteratee -> Enumerator chain where feeding the Iteratee results in the Enumerator producing values:
// create an enumerator to which messages can be pushed
// using a channel
val (out, channel) = Concurrent.broadcast[String]
// create the input stream. When it receives a string, it
// will push the info into the channel
val in =
  Iteratee.foreach[String] { s =>
    channel.push(s)
  }.map(_ => channel.eofAndEnd())
// Instead of using the complex `feed` method, we just create
// an enumerator that we can use to feed the input stream
val producer = Enumerator("one", "two").andThen(Enumerator.eof)
// Apply the input stream to the producer (feed the input stream)
val producedElementsFuture = producer.apply(in)
// Create a consumer for the output stream
val consumer = Iteratee.getChunks[String]
// Apply the consumer to the output stream (consume the output stream)
val consumedOutputStreamFuture = out.apply(consumer)
// Await the construction of the input stream chain
Await.result(producedElementsFuture, 1.second)
// Await the construction of the output stream chain
val consumedOutputStream = Await.result(consumedOutputStreamFuture, 1.second)
// Await consuming the output stream
val result = Await.result(consumedOutputStream.run, 1.second)
println(result) // List("one", "two")