Consuming RabbitMQ messages with Spark streaming - scala

I'm new to Scala and trying to hack my way around sending serialized Java objects over a RabbitMQ queue to a Spark Streaming application.
I can successfully enqueue my objects, which are serialized with an ObjectOutputStream. To receive them on the Spark end I downloaded a custom RabbitMQ InputDStream and Receiver implementation from here - https://github.com/Stratio/rabbitmq-receiver
However, as far as I understand, that codebase only supports String messages, not binary ones. So I started hacking on that code to make it read a binary message and store it as a byte array, so that I can deserialize it on the Spark end. That attempt is here - https://github.com/llevar/rabbitmq-receiver
I then have the following code in my Spark driver program:
val conf = new SparkConf().setMaster("local[6]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val receiverStream: ReceiverInputDStream[scala.reflect.ClassTag[AnyRef]] =
  RabbitMQUtils.createStreamFromAQueue(ssc,
    "localhost",
    5672,
    "mappingQueue",
    StorageLevel.MEMORY_AND_DISK_SER_2)
val parsedStream = receiverStream.map { m =>
  SerializationUtils.deserialize(m.asInstanceOf[Array[Byte]]).asInstanceOf[SAMRecord]
}
parsedStream.print()
ssc.start()
Unfortunately this does not seem to work. The data is consumed off the queue, and I don't get any errors, but I don't get any of the output I expect either.
This is all I get:
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218845 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218846 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218847 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218848 replicated to only 0 peer(s) instead of 1 peers
I was able to successfully deserialize my objects before calling the store() method here - https://github.com/llevar/rabbitmq-receiver/blob/master/src/main/scala/com/stratio/receiver/RabbitMQInputDStream.scala#L106 -
simply by invoking SerializationUtils on the bytes returned by the delivery.getBody call, but I can't seem to get the same data back from the DStream in my driver program.
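For reference, the mechanism that check relies on is just a serialize/deserialize round trip over a byte array. A minimal, self-contained sketch, assuming Commons Lang 3's SerializationUtils and using a plain String as a stand-in for SAMRecord (inside the receiver the bytes would come from delivery.getBody instead):

import org.apache.commons.lang3.SerializationUtils

// Serialize to Array[Byte] on the producer side, deserialize the same
// bytes on the consumer side.
val original = "some payload"
val bytes: Array[Byte] = SerializationUtils.serialize(original)
val restored: String = SerializationUtils.deserialize(bytes)
assert(restored == original)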
Any help is appreciated.

Related

Close GRPC Netty channel in Scala Spark App

I'm trying to cleanly close a set of channels created during the processing of a Spark map function. When I introduce the shutdown/awaitTermination calls after the main processing part (right before returning the "Result"), I get errors in other calls, as if the channel had been shut down prematurely (due to Spark's scheduling of the actual tasks, I guess). Any recommendation? I have the following flow:
val someDF = initialDF.mapPartitions(iterator => {
  val caller = createChannel(certificate, URI, port)
  val innerDF = iterator.map(row => {
    // do stuff with caller created above
    result
  }).toDF()
If I don't shut down the channel, it runs fine (except that I get some error messages in unit testing). But if I create a new channel during execution after the code above, I end up with the following error:
ERROR io.grpc.internal.ManagedChannelOrphanWrapper - ~~~ Channel ManagedChannelImpl{logId=41, target=blablabla:443} was not shutdown properly!!! ~~~
How should I shut down these channels? I'm not too bright with Spark...
Thanks!
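For reference, here is a hedged sketch of the placement the question seems to describe (createChannel, certificate, URI, port and result are names taken from the question; the exact shutdown placement and the timeout are assumptions). Because iterator.map is lazy, a shutdown issued at that point runs before any row has actually been processed, which would match the premature-shutdown symptom above:

import java.util.concurrent.TimeUnit

val someDF = initialDF.mapPartitions { iterator =>
  // Assumed to return an io.grpc.ManagedChannel (or something wrapping one).
  val caller = createChannel(certificate, URI, port)
  val results = iterator.map { row =>
    // do stuff with caller created above
    result
  }
  // iterator.map is lazy: no row has been processed yet when control reaches
  // this point, so shutting down here closes the channel before first use.
  caller.shutdown()
  caller.awaitTermination(5, TimeUnit.SECONDS)
  results
}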

FileSink in Apache Flink not generating logs in output folder

I am using Apache Flink to read data from a Kafka topic and store it in files on a server. I am using FileSink to store the files; it creates the date- and time-wise directory structure, but no log files are being created.
When I run the program it creates the directory structure below, but no log files are stored in it.
/flink/testlogs/2021-12-08--07
/flink/testlogs/2021-12-08--06
I want a new log file to be written every 15 minutes.
Below is the code.
DataStream<String> kafkaTopicData = env.addSource(
        new FlinkKafkaConsumer<String>("MyTopic", new SimpleStringSchema(), p));

OutputFileConfig config = OutputFileConfig
        .builder()
        .withPartPrefix("prefix")
        .withPartSuffix(".ext")
        .build();

DataStream<Tuple6<String, String, String, String, String, Integer>> newStream =
        kafkaTopicData.map(new LogParser());

final FileSink<Tuple6<String, String, String, String, String, Integer>> sink = FileSink
        .forRowFormat(new Path("/flink/testlogs"),
                new SimpleStringEncoder<Tuple6<String, String, String, String, String, Integer>>("UTF-8"))
        .withRollingPolicy(DefaultRollingPolicy.builder()
                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                .withMaxPartSize(1024 * 1024 * 1024)
                .build())
        .withOutputFileConfig(config)
        .build();

newStream.sinkTo(sink);
env.execute("DataReader");
LogParser returns a Tuple6.
When used in streaming mode, Flink's FileSink requires that checkpointing be enabled. To do this, you need to specify where you want checkpoints to be stored, and at what interval you want them to occur.
To configure this in flink-conf.yaml, you would do something like this:
state.checkpoints.dir: s3://checkpoint-bucket
execution.checkpointing.interval: 10s
Or in your application code you can do this:
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoint-bucket");
env.enableCheckpointing(10000L);
Another important detail from the docs:
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this:
val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)
val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client A reading frame #$index")
}).run()
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client B reading frame #$index")
}).run()
I get :
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is split between the two clients, whereas I'd expect both of my clients to be able to see all of the source stream's frames.
Did I miss something, or is there another solution?
The issue is the combination of Iterator with BroadcastHub. I assume your myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue here for you is that materializing the graph does not consume the data from the iterator up front.
The thing with an iterator is that once its data has been consumed, you can't go back to the beginning. Add to that the fact that both graphs run in parallel, and it looks like the elements are "divided" between the two, but the split is actually completely random. For example, if you add a one-second sleep between client A and client B, only client A will print anything.
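You can see the exhaustion with a plain Scala iterator, independent of Akka Streams; a second consumer of the same iterator simply sees nothing:

val it = Iterator("A", "B", "C")
println(it.toList) // List(A, B, C)
println(it.toList) // List() -- the iterator has already been consumed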
To get this to work, you need a source that can be iterated more than once, for example a Seq or a List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)

Alpakka Kafka stream never getting terminated

We are using Alpakka Kafka streams for consuming events from Kafka. Here is how the stream is defined:
ConsumerSettings<GenericKafkaKey, GenericKafkaMessage> consumerSettings =
    ConsumerSettings
        .create(actorSystem, new KafkaJacksonSerializer<>(GenericKafkaKey.class),
            new KafkaJacksonSerializer<>(GenericKafkaMessage.class))
        .withBootstrapServers(servers).withGroupId(groupId)
        .withClientId(clientId).withProperties(clientConfigs.defaultConsumerConfig());

CommitterSettings committerSettings = CommitterSettings.create(actorSystem)
    .withMaxBatch(20)
    .withMaxInterval(Duration.ofSeconds(30));

Consumer.DrainingControl<Done> control =
    Consumer.committableSource(consumerSettings, Subscriptions.topics(topics))
        .mapAsync(props.getMessageParallelism(), msg ->
            CompletableFuture.supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
                .thenCompose(param -> CompletableFuture.supplyAsync(() -> msg.committableOffset())))
        .toMat(Committer.sink(committerSettings), Keep.both())
        .mapMaterializedValue(Consumer::createDrainingControl)
        .run(materializer);
Here is the piece of code that is shutting down the stream:
CompletionStage<Done> completionStage = control.drainAndShutdown(actorSystem.dispatcher());
completionStage.toCompletableFuture().join();
I tried calling get on the completable future as well, but neither join nor get ever returns. Has anyone else faced a similar problem? Is there something I am doing wrong here?
If you want to control stream termination from outside the stream, you need to use a KillSwitch: https://doc.akka.io/docs/akka/current/stream/stream-dynamic.html
Your usage looks correct and I can't identify anything that would hinder draining.
A common thing to miss with Alpakka Kafka consumers is the stop-timeout, which defaults to 30 seconds.
When using the DrainingControl you can safely set it to 0 seconds.
See https://doc.akka.io/docs/alpakka-kafka/current/consumer.html#draining-control
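For illustration, here is roughly how that tweak would look with the Scala DSL (the question uses the Java DSL, which has an equivalent withStopTimeout); consumerSettings refers to the settings object built in the question:

import scala.concurrent.duration._

// The Alpakka Kafka stop-timeout defaults to 30 seconds; when the stream's
// lifecycle is managed through DrainingControl it can safely be set to zero.
val tunedConsumerSettings = consumerSettings.withStopTimeout(Duration.Zero)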

Spark broadcast isn't being saved in the executors' memory

I used spark-shell on EMR - Spark version 2.2.0 / 2.1.0.
While trying to broadcast a simple object (my CSV file contains only one column and is less than 2 MB), I noticed it isn't being kept in each executor's memory, only in the driver's memory, although it should be, as suggested in the documentation: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-TorrentBroadcast.html
Attached are screenshots from before the broadcast (i.e. sc.broadcast(arr_collected)) and after it, which show my conclusion. Additionally, I checked the worker machines' memory usage and, same as in the Spark UI, it does not change after the broadcast.
1 - screenshot before broadcast
2 - screenshot after broadcast
Attached is the log of the broadcast process after adding 'log4j.logger.org.apache.spark.storage.BlockManager=TRACE', as suggested here -
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-blockmanager.html
3 - screenshot of the broadcast logging
Below is the code:
val input = "s3://bucketName/pathToFile.csv"
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", ",").load(input)
val df_2 = df.withColumn("is_exist", lit("true").cast("Boolean"))
val arr_collected = df_2.collect()
val broadcast_map_fraud_locations4 = sc.broadcast(arr_collected)
Any ideas?
Can you please use the broadcast variable to join the data or do some other kind of operation? It might be lazy, so it is not using any memory yet.
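A minimal sketch of that suggestion, reusing sc and broadcast_map_fraud_locations4 from the question (the filter is just an arbitrary operation that touches the broadcast value on the executors):

// Referencing the broadcast value inside a distributed operation makes each
// executor fetch the broadcast blocks on first access and cache them locally.
val touched = sc.parallelize(1 to 1000)
  .filter(_ => broadcast_map_fraud_locations4.value.nonEmpty)
  .count()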