Dynamic filters with Spark Streaming - MongoDB

I'm using Spark Streaming for the use case below:
I have a Kafka topic, data. From this topic I stream data in real time using Structured Streaming and apply some filters to it. If the number of rows returned after applying the filters is greater than 1, then the output is 1; otherwise it is 0, along with some other data from the query.
In simple terms, suppose I'm filtering the stream using:
df.filter($A < 10)
where "A", "<" and "10" are dynamic and comes from some database. In actual, these values comes from kafka topic which I'm consuming and updating those values in db. So the query is not static and will be updated after sometime.
Further, I'll have to apply some boolean algeric operators on the results of streams. For eg -
df.filter($A < 10) AND df.filter($B = 1) OR df.filter($C > 1)... and so on
Here, each atomic operation (like df.filter($A < 10)) returns either 0 or 1, as described above.
The final result is saved to MongoDB.
I want to know whether both problems can be solved using Structured Streaming. If not, can they be solved using RDDs?
Otherwise, can someone suggest any way to do this?

For the first case you can use a broadcast-variable-based approach, as described in this answer. I've also had good luck with a transient value that is periodically refetched on each executor, as described in the second part of this answer.
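To illustrate the second approach, here is a minimal Scala sketch of a per-executor cache that refetches the filter specification once it is older than a configurable TTL. FilterSpec, loadFilterSpecFromDb() and the hard-coded column A are hypothetical placeholders, and df is the streaming DataFrame from the question:

import org.apache.spark.sql.functions.udf

// Hypothetical shape of one atomic condition, e.g. ("<", 10) applied to column A.
case class FilterSpec(op: String, threshold: Double)

object FilterSpecCache {
  private val ttlMs = 60 * 1000L            // refresh at most once a minute
  @volatile private var cached: FilterSpec = _
  @volatile private var lastFetchMs = 0L

  def get(): FilterSpec = {
    val now = System.currentTimeMillis()
    if (cached == null || now - lastFetchMs > ttlMs) synchronized {
      if (cached == null || now - lastFetchMs > ttlMs) {
        cached = loadFilterSpecFromDb()     // hypothetical lookup in the database
        lastFetchMs = now
      }
    }
    cached
  }

  // Placeholder: in the real job this would query the database that the
  // other Kafka consumer keeps up to date.
  private def loadFilterSpecFromDb(): FilterSpec = FilterSpec("<", 10.0)
}

// The UDF runs on the executors, so each executor refreshes its own copy.
val dynamicPredicate = udf { (a: Double) =>
  val spec = FilterSpecCache.get()
  spec.op match {
    case "<" => a < spec.threshold
    case ">" => a > spec.threshold
    case "=" => a == spec.threshold
    case _   => false
  }
}

val filtered = df.filter(dynamicPredicate(df("A")))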
For the second case you would use a single filter() call that implements the complete set of conditions that cause a message to be included in the output stream.
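As a sketch of that idea, the dynamically fetched conditions can be parsed with expr() and combined into one Column predicate that is applied in a single filter() call; the condition strings below are placeholders for whatever is currently stored in the database, and df is the streaming DataFrame from the question:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.expr

// Each atomic condition arrives as a string such as "A < 10"; expr() turns
// it into a Column predicate that can be combined with && and ||.
val condA: Column = expr("A < 10")
val condB: Column = expr("B = 1")
val condC: Column = expr("C > 1")

// One filter() call implementing the whole Boolean expression
// df.filter($A < 10) AND df.filter($B = 1) OR df.filter($C > 1).
val filtered = df.filter(condA && condB || condC)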

Related

How should many Kafka streams update a domain model (a.k.a. materialized view)?

I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter, and updates come in at unspecified times. Is the following algorithm a good approach?
An update comes in; I check via get() what is stored in the materialized view, see that this is the initial one, so I enrich and save.
A second one comes in; get() shows that a partial update already exists, so I add the next piece of information.
... and I continue in the same style.
If there is a query/join, the stored object has a method, isValid(), that indicates whether the update is complete and that could be used in KafkaStreams#filter().
Could you please tell me whether this is a good plan? Is there any pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good; you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a .transform operator that allows you to access a KeyValueStore; inside its implementation you are free to decide whether your current aggregated value is valid or not.
You can therefore either send it downstream or return null and wait for more information.
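To make this concrete, here is a minimal Scala sketch of the transform-plus-state-store approach. The UserUpdate and UserView types, their merge and completeness logic, the serde and the topic names are invented for illustration and are not part of the original answer:

import org.apache.kafka.common.serialization.{Serde, Serdes}
import org.apache.kafka.streams.{KeyValue, StreamsBuilder}
import org.apache.kafka.streams.kstream.{Transformer, TransformerSupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

// Hypothetical domain types standing in for the real model.
final case class UserUpdate(field: String, value: String)
final case class UserView(fields: Map[String, String]) {
  def enrichWith(update: UserUpdate): UserView = copy(fields = fields + (update.field -> update.value))
  def isValid: Boolean = fields.contains("status") && fields.contains("email") // example completeness check
}
object UserView { val empty: UserView = UserView(Map.empty) }

// Assumed: a serde for UserView (e.g. JSON-based); its implementation is out of scope here.
def userViewSerde: Serde[UserView] = ???

val builder = new StreamsBuilder()

// State store holding the partially enriched view per key.
builder.addStateStore(
  Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("view-store"),
    Serdes.String(),
    userViewSerde))

class MergeTransformer extends Transformer[String, UserUpdate, KeyValue[String, UserView]] {
  private var store: KeyValueStore[String, UserView] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore("view-store").asInstanceOf[KeyValueStore[String, UserView]]
  }

  override def transform(key: String, update: UserUpdate): KeyValue[String, UserView] = {
    val merged = Option(store.get(key)).getOrElse(UserView.empty).enrichWith(update)
    store.put(key, merged)
    if (merged.isValid) KeyValue.pair(key, merged) // complete: send downstream
    else null                                      // incomplete: emit nothing and wait for more
  }

  override def close(): Unit = ()
}

// Default serdes for the update and view types are assumed to be configured.
builder
  .stream[String, UserUpdate]("update-topic")
  .transform(
    new TransformerSupplier[String, UserUpdate, KeyValue[String, UserView]] {
      override def get(): Transformer[String, UserUpdate, KeyValue[String, UserView]] = new MergeTransformer
    },
    "view-store")
  .to("materialized-view-topic")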

Re-hydrating tweets / extracting data from a partial stream

I have a Spark stream I am getting from Kafka (tweet IDs) and I am trying to re-hydrate additional data using tweepy/twarc/twython while streaming.
I tried to extract the IDs column from the streaming DataFrame so that I could iterate over it (I could not find a way to iterate over the stream directly) and look up the additional data using tweepy.
I tried the following:
ids = []

# attempt: gather the IDs by appending to a driver-side list from the foreach sink
def process_row(row):
    ids.append(row.tweet_id)

query = FlattenedDF.writeStream.foreach(process_row).start()
I also tried to convert it directly:
ids = FlattenedDF.select('tweet_id').collect()
I got the error: "Queries with streaming sources must be executed with writeStream.start()".
Is there any better way to do it?
Any help would be appreciated.
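One option not mentioned in the thread is foreachBatch (available since Spark 2.4, in PySpark as well), which hands each micro-batch over as an ordinary, non-streaming DataFrame on which collect() is allowed. A minimal sketch in Scala, reusing the FlattenedDF and tweet_id names from the question and leaving the actual lookup as a placeholder:

import org.apache.spark.sql.DataFrame

// Each micro-batch arrives here as a static DataFrame, so collect() no longer
// fails with "Queries with streaming sources must be executed with writeStream.start()".
val processBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  // assumes tweet_id is a string column
  val ids = batch.select("tweet_id").collect().map(_.getString(0))
  // look up / re-hydrate the additional data for `ids` here
}

val query = FlattenedDF.writeStream
  .foreachBatch(processBatch)
  .start()

query.awaitTermination()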

Ideal way to perform a lookup on a stream from a Kafka topic

I have the following use-case:
There is a stream of records on a Kafka topic, and I have a separate set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in that set of unique IDs. Basically, this should serve as a filter for my Kafka Streams app, i.e., only records of the Kafka topic whose IDs match my set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStream and KTable; they seem good for enrichment, but I don't need to enrich the data. As for state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
    valueToCheckInKTable = v.get(FIELD_NAME);
    if (kTable.containsKey(valueToCheckInKTable)) return record;
    else ignore;
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as the primary key. Note that the value must be non-null, otherwise it would be interpreted as a delete; if there is no actual value, just put any non-null dummy value into each record when you write the IDs into the id-topic. To load the full table on startup, you may want to provide a custom timestamp extractor that always returns 0, via the Consumed parameter of the table() operator: records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first, so the table is loaded before the stream.
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided ValueJoiner can simply return the stream-side (left) value unmodified and ignore the table-side (right) value.
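Putting the pieces together, a minimal Scala sketch of this approach; the topic names, the Record type and the assumption that default serdes are configured for it are illustrative only:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{Consumed, KStream, KTable, KeyValueMapper, ValueJoiner}
import org.apache.kafka.streams.processor.TimestampExtractor

// Hypothetical payload of the main topic; fieldName carries the ID to look up.
final case class Record(fieldName: String, payload: String)

val builder = new StreamsBuilder()

// Always return 0 so the id-topic records are processed first and the table
// is fully loaded before any stream record is joined against it.
val zeroTimestamps: TimestampExtractor = (record, partitionTime) => 0L

// IDs as keys; the value is a non-null dummy (a null value would mean delete).
val ids: KTable[String, String] =
  builder.table(
    "id-topic",
    Consumed.`with`(Serdes.String(), Serdes.String())
      .withTimestampExtractor(zeroTimestamps))

// Main stream; default serdes for Record are assumed to be configured.
val records: KStream[String, Record] = builder.stream[String, Record]("input-topic")

// Re-key by the field that has to be checked against the table.
val extractId: KeyValueMapper[String, Record, String] = (key, value) => value.fieldName

// No enrichment needed: the joiner just keeps the stream-side value.
val keepLeft: ValueJoiner[Record, String, Record] = (recordValue, idValue) => recordValue

// Inner join: records whose key is not present in the table are dropped,
// which is exactly the desired filter.
val filtered: KStream[String, Record] = records.selectKey(extractId).join(ids, keepLeft)

filtered.to("filtered-topic")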

Update a Scala DataFrame based on events

I'm getting into the Scala / Apache Spark world and I'm trying to understand how to create a "pipeline" that will generate a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event I have to insert or update a row in the DF.
Let's take a real example.
I would like to create a DataFrame that represents the state of the users present in my database (Postgres, Mongo, whatever).
When I say state, I mean the current state of the user (ACTIVE, INCOMPLETE, BLOCKED, etc.). These states change based on the user's activity, so I will receive JSON logs with a key like "status": "ACTIVE", and so on.
So, for example, I'm receiving logs from a Kafka topic. At some point I receive a log I'm interested in because it contains useful information about the user (the status, etc.).
I take this log and create a DF with this log in it.
Then I receive a second log, but this one was produced by the same user, so the row needs to be updated (if the status changed, of course!): no new row, just an update of the existing one. Third log, new user, new information, so it is stored as a new row in the existing DF, and so on.
At the end of this process/pipeline, I should have a DF with the information of all the users present in my DB and their "status", so that I can say "Oh, look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
This is the idea, but the process must be in real time.
So far, I've tried this using files, not connecting to a Kafka topic.
For instance, I've read files as follows:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generates a DF with 2 rows containing everything.
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
| AVzO9dq5S78GvkQU|dp_api-2017.06.22| 1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
In _source there are all the nested fields (the status I mentioned is in here!).
Then I selected some useful information, like:
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id |user_ip |aw |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU |111.11.11.82|null |DPAPI|
+--------------------+------------+------------------------------------+-----+
Again, the idea is to build this DF from the logs arriving on a Kafka topic and upsert each log into it.
I hope I explained it well. I don't want a "code" solution; I'd prefer hints or an example of how to achieve this result.
Thank you.
As you are looking for resources, I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a Stream with your Kafka content.
Then have a look at the possible transformations you can perform with Spark Streaming in the chapter "Transformations on DStreams" of the Spark Streaming Programming Guide.
Once you have transformed the stream in a way that lets you perform a final operation on it, have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think .foreachRDD in particular could be what you are looking for, as inside it you can run an operation (like checking whether a certain keyword is in your string and, based on that, making a database call) for each element of the stream.
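A minimal sketch of what that can look like with the Kafka 0-10 direct stream; the broker address, topic name and the parseUserStatus/upsertUserStatus helpers are placeholders for your own logic:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Hypothetical helpers: parse a JSON log into (userId, status) and upsert it into the DB.
def parseUserStatus(json: String): (String, String) = ???
def upsertUserStatus(userId: String, status: String): Unit = ???

val conf = new SparkConf().setAppName("user-status-pipeline")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "user-status-app",
  "auto.offset.reset"  -> "latest")

val logs = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("user-logs"), kafkaParams))

// One upsert per incoming JSON log: extract the user id and status and
// write them to the target store (Postgres, Mongo, ...).
logs.map(_.value).foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { json =>
      val (userId, status) = parseUserStatus(json)
      upsertUserStatus(userId, status)
    }
  }
}

ssc.start()
ssc.awaitTermination()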

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark Streaming pipeline that receives measurement data via Kafka. The data is serialized using Avro and can be of two types, EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin, in the variant that generates Scala case classes inheriting from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate info in the message key (currently I don't use the message key). Would this be a feasible way, or can I somehow get the schema from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?
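For what it's worth, a minimal Scala sketch of the key-based idea from the question, assuming the producer writes a type tag into the (string) message key; the tag values are made up for illustration:

import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader

// `key` is the string-deserialized Kafka message key carrying the type tag,
// `value` is the Avro-serialized payload from the stream.
def decode(key: String, value: Array[Byte]): AnyRef = {
  val decoder = DecoderFactory.get().binaryDecoder(value, null)
  key match {
    case "equidistant" =>
      new SpecificDatumReader[EquidistantData](EquidistantData.SCHEMA$).read(null, decoder)
    case "discrete" =>
      new SpecificDatumReader[DiscreteData](DiscreteData.SCHEMA$).read(null, decoder)
    case other =>
      throw new IllegalArgumentException(s"Unknown record type tag: $other")
  }
}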