I am trying to use Spark Streaming and Kafka to ingest and process messages received from a web server.
I am testing the consumer mentioned in https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md to take advantage of the extra features it offers.
As a first step, I am trying to use the example provided just to see how it plays out. However, I am having difficulties actually seeing the data in the payload.
Looking at the result of the following function:
ReceiverLauncher.launch
I can see it returns a collection of RDDs whose elements are of type:
MessageAndMetadata[Array[Byte]]
I am stuck at this point and don't know how to parse this and see the actual data. All the examples on the web that use the consumer that ships with Spark create an iterator object, go through it, and process the data. However, the returned object from this custom consumer doesn't give me any iterator interfaces to start with.
There is a getPayload() method on the messages in the RDD, but I don't know how to get the data out of it.
The questions I have are:
Is this consumer actually a good choice for a production environment? From the looks of it, the features it offers and the abstraction it provides seem very promising.
Has anybody ever tried it? Does anybody know how to get to the data?
Thanks in advance,
Moe
The byte array returned by getPayload() needs to be converted to a String, e.g.
new String(line.getPayload())
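For instance, a minimal sketch (assuming the library's classes live under the consumer.kafka package as in its README, and that ReceiverLauncher.launch hands you a DStream of MessageAndMetadata[Array[Byte]]) could look like this:

import java.nio.charset.StandardCharsets
import consumer.kafka.MessageAndMetadata
import org.apache.spark.streaming.dstream.DStream

// `stream` stands for the DStream returned by ReceiverLauncher.launch
def printPayloads(stream: DStream[MessageAndMetadata[Array[Byte]]]): Unit =
  stream.foreachRDD { rdd =>
    // getPayload() hands back the raw Kafka message bytes; decode them explicitly.
    val decoded = rdd.map(m => new String(m.getPayload, StandardCharsets.UTF_8))
    decoded.take(10).foreach(println) // print a small sample of each batch
  }

From there you can map the decoded strings into whatever domain objects you need, just as with the consumer that ships with Spark.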
I have a custom binary format for my messages in Kafka (protobuf), and I want to avoid spending processing time on deserializing messages I don't need.
My idea would be to somehow discard the messages I do not want in the value converter used for deserialization.
I'm trying to write a custom value converter that would only process certain messages based on some headers, so that I can avoid the cost of deserializing all of the messages.
Up to now I have a sort of filter transformation to discard those messages, but I wanted to avoid even processing them, i.e. really discard them in the value converter itself. If I understood correctly, the transformation always happens after the converters.
I tried to just return null from the converter, but that failed: the consumer crashed because the message becomes null. I was wondering if there is a way of doing this and, if so, whether there is a known example?
If not, I can of course just return an empty SchemaAndValue, but I was wondering if there was a nicer way, because like this I still need to return something and then filter those records out with a transformation.
EDIT: Based on the answer, which is what I was looking for, the easier way is to simply use ByteArrayConverter for the messages I'm not interested in.
Filtering is a type of processing, so a transform is the correct way to do this.
If you mean that you want to prevent deserialization, and you're using some custom binary format and filtering based on its content, then record headers may be a better way to exclude events instead. You can then use ByteArrayConverter as a pass-through.
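As a rough sketch of that header-based route (the class name and the "keep" header below are illustrative, not from the question): a custom single message transform can inspect the record headers and drop records, while ByteArrayConverter (org.apache.kafka.connect.converters.ByteArrayConverter) passes the value through as raw bytes so nothing gets deserialized.

import java.util
import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.transforms.Transformation

class DropWithoutHeader[R <: ConnectRecord[R]] extends Transformation[R] {
  private val headerKey = "keep" // hypothetical header marking records worth processing

  // Returning null from a transform drops the record (unlike returning null
  // from a converter, which crashes as described in the question).
  override def apply(record: R): R =
    if (record.headers().lastWithName(headerKey) != null) record else null.asInstanceOf[R]

  override def config(): ConfigDef = new ConfigDef()
  override def configure(configs: util.Map[String, _]): Unit = ()
  override def close(): Unit = ()
}

Newer Connect releases (2.6+) also ship a built-in Filter transform with predicates such as HasHeaderKey, which achieves the same thing without custom code.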
I have referred to this, but it is an old post, so I'm looking for a better solution if there is one.
I have an input topic that contains 'userActivity' data. Now I wish to gather different analytics based on userInterest, userSubscribedGroup, userChoice, etc., each produced to a distinct output topic from the same Kafka Streams application.
Could you help me achieve this? PS: This is my first time using Kafka Streams, so I'm unaware of any other alternatives.
EDIT:
It's possible that one record matches multiple criteria, in which case the same record should go to each of those output topics as well.
if(record1 matches criteria1) then... output to topic1;
if(record1 matches criteria2) then ... output to topic2;
and so on.
Note: I'm not looking for an if/else-if kind of solution.
For dynamically choosing which topic to send to at runtime, based on each record's key-value pairs, Apache Kafka 2.0 and later provide a feature called dynamic routing.
And this is an example of it: https://kafka-tutorials.confluent.io/dynamic-output-topic/confluent.html
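As a minimal sketch with the Kafka Streams Scala DSL (topic names and the routing criterion are placeholders, and in newer releases the Serdes import has moved to org.apache.kafka.streams.scala.serialization.Serdes):

import org.apache.kafka.streams.processor.{RecordContext, TopicNameExtractor}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

val builder = new StreamsBuilder
val activity = builder.stream[String, String]("userActivity")

// Pick the output topic per record based on its content.
val routeByContent = new TopicNameExtractor[String, String] {
  override def extract(key: String, value: String, ctx: RecordContext): String =
    if (value.contains("subscribedGroup")) "group-analytics-topic" else "interest-analytics-topic"
}

activity.to(routeByContent)

Note that the extractor returns exactly one topic per record; for the case where one record matches several criteria, you can instead (or additionally) attach one filter per criterion and write each filtered stream to its own topic, e.g. activity.filter((_, v) => v.contains("interest")).to("interest-analytics-topic").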
I have a stream of measurements keyed by an ID PCollection<KV<ID,Measurement>> and something like a changelog stream of additional information for that ID PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs a manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seem to work. I have considered the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream updates irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size that guarantees at least one information update per ID in each window.
Idea 2: CoGroupByKey with global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger which fires, e.g., every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This combination simply does not work: when I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is not really scalable either - at least if I'm not missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that perform the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the number of IDs and the size of the additional info. Thus, using a side input does not seem to work here either.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
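For reference, a rough sketch of the side-input variant (Beam's Java SDK used from Scala; Measurement and SomeIDInfo are collapsed to String here to keep the example self-contained, and in a streaming pipeline the info collection still needs a window/trigger before View.asMap, as discussed above):

import java.util.{Map => JMap}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo, View}
import org.apache.beam.sdk.values.{KV, PCollection, PCollectionView}

def enrich(measurements: PCollection[KV[String, String]],
           info: PCollection[KV[String, String]]): PCollection[KV[String, String]] = {
  // Materialize the (latest) info per ID as a map-valued side input.
  val infoView: PCollectionView[JMap[String, String]] =
    info.apply(View.asMap[String, String]())

  measurements.apply(
    ParDo.of(new DoFn[KV[String, String], KV[String, String]] {
      @ProcessElement
      def processElement(c: ProcessContext): Unit = {
        val id = c.element().getKey
        // Left join: fall back to an empty string when no info exists for this ID yet.
        val extra = Option(c.sideInput(infoView).get(id)).getOrElse("")
        c.output(KV.of(id, c.element().getValue + "," + extra))
      }
    }).withSideInputs(infoView))
}

The external-store variant looks much the same, except that the DoFn queries a key-value store (ideally behind a small cache after the GroupByKey) instead of calling sideInput().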
I'd like to read a CSV file with Flink, using Scala and the addSource and readCsvFile functions. I have not found any simple examples of that. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala and this is too complex for my purpose.
Given the definition StreamExecutionEnvironment.addSource(sourceFunction), should I just use readCsvFile as the sourceFunction?
After reading, I'd like to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
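A minimal sketch of that readTextFile() approach in Scala (the field names, types, and path are placeholders; header and error handling are omitted):

import org.apache.flink.streaming.api.scala._

case class Ride(id: String, timestamp: Long, count: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Read the file line by line, then parse each CSV row into a case class.
val rides: DataStream[Ride] = env
  .readTextFile("hdfs:///path/to/your_csv_file")
  .map { line =>
    val fields = line.split(",")
    Ride(fields(0), fields(1).toLong, fields(2).toInt)
  }

The resulting DataStream[Ride] can then be handed to CEP.pattern(...) for Complex Event Processing.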
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
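As a tiny illustration of the socket option (host and port are placeholders):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Unbounded source: every new line arriving on the socket becomes an event,
// which you can parse the same way as above before applying CEP.
val lines: DataStream[String] = env.socketTextStream("localhost", 9999)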
If you have a CSV file with 3 fields (String, Long, Integer),
then do the following:
val input = benv.readCsvFile[(String, Long, Integer)]("hdfs:///path/to/your_csv_file")
PS: I am using the Flink shell; that is why I have benv.
We are following CQRS architecture and using Jonathan Oliver's event-store version 3 for events. We want to create snapshot of the aggregate roots to improve performance.
I found an API (GetStreamsToSnapshot) which can be used for this. It gives all streams based on how far behind the head revision their snapshots are.
But I am not sure how to use the stream to create the snapshot as I do not know the aggregate type.
Please provide any input on how to create the snapshots.
As you have discovered, GetStreamsToSnapshot gives you a list of streams that are at least X revisions behind the head revision.
From there, it's a matter of loading up each stream. This is where you can append some kind of header information to the stream to determine what type of aggregate you're dealing with.
Many times I'm asked why I don't just store the aggregate type information directly in the EventStore and make it a first-class part of the API. The answer is that the EventStore doesn't care about aggregates, which are a DDD concept. All the EventStore cares about is streams and events.