I am new to Structured Streaming. I have a use case and I want to know the best approach to achieve it.
I have data coming in as a stream from Kafka like below:
{id: "abc", class: "x", student: "1" } -> at 4:55,
{id: "abc", class: "A", student: "1" } -> at 5:00,
{id: "abc", class: "A", student: "2" } -> at 5:05,
{id: "abc", class: "B", student: "1" } -> at 5:10,
{id: "abc", class: "A", student: "1" } -> at 5:15,
{id: "abc", class: "C", student: "1" } -> at 5:20
Now I want to group the records by class and send the results to a Kafka topic, which is read by another microservice that computes some metrics.
To achieve this, I can think of the sliding window concept: say every 5 minutes I group the last 15 minutes of data and send it to Kafka. The issue here is that there will be multiple overlapping groups: one from 4:55 to 5:10, a 2nd from 5:00 to 5:15, and a 3rd from 5:05 to 5:20. The 2nd one is the group I need, because it is complete and has all the records for class "A", so it can be used to compute the metrics for class A. If I send all 3 groups to Kafka, as in a regular streaming application, then the metrics computed from the 2nd group will be overwritten by the 3rd, which is not what I want.
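Roughly, this is the kind of sliding-window grouping I have in mind (a sketch only, not tested; the broker address, topic names, JSON schema and checkpoint path are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("class-grouping").getOrCreate()
import spark.implicits._

// Assumed value schema for {id, class, student}.
val schema = new StructType()
  .add("id", StringType)
  .add("class", StringType)
  .add("student", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"), $"timestamp")
  .select($"data.*", $"timestamp")

// 15-minute windows sliding every 5 minutes, grouped by class.
val grouped = events
  .withWatermark("timestamp", "15 minutes")
  .groupBy(window($"timestamp", "15 minutes", "5 minutes"), $"class")
  .agg(collect_list($"student").as("students"))

// Every overlapping window produces its own group, which is exactly the
// duplication problem described above.
grouped.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("topic", "grouped-by-class")              // placeholder topic
  .option("checkpointLocation", "/tmp/checkpoints/class-grouping")
  .outputMode("update")
  .start()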
To overcome this, I can think of creating a SQL table on Spark and storing the records from Kafka for some fixed amount of time, say 1 hour, and then every x minutes reading the records from the table that are older than the threshold (1 hour) and sending them to Kafka. But this feels like exactly what Spark provided the window concept for, so I am not sure if it is the right approach.
Is there any better way to achieve this? Please provide me some suggestions.
I have the following message structure:
{
"payload" {
"a": 1,
"b": {"X": 1, "Y":2}
},
"timestamp": 1659692671
}
I want to use SMTs to get the following structure:
{
"a": 1,
"b": {"X": 1, "Y":2},
"timestamp": 1659692671
}
When I use ExtractField$Value for payload, I cannot preserve the timestamp.
I cannot use Flatten because "b"'s structure should not be flattened.
Any ideas?
Thanks
Unfortunately, moving a field into another Struct isn't possible with the built-in transforms. You'd have to write your own, or use a stream-processing library before the data reaches the connector.
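If you do go the custom-transform route, a very rough, untested sketch of hoisting the payload fields while keeping the sibling timestamp might look like this (it assumes the value arrives as a Connect Struct with a schema; the class name is made up):

import java.util
import scala.jdk.CollectionConverters._
import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.data.{SchemaBuilder, Struct}
import org.apache.kafka.connect.transforms.Transformation

// Hypothetical transform: lifts every field of "payload" to the top level and
// keeps the top-level "timestamp" field alongside them.
class HoistPayload[R <: ConnectRecord[R]] extends Transformation[R] {

  override def apply(record: R): R = {
    val value   = record.value().asInstanceOf[Struct]
    val payload = value.getStruct("payload")

    // New value schema = all payload fields plus the top-level "timestamp".
    val builder = SchemaBuilder.struct()
    payload.schema().fields().asScala.foreach(f => builder.field(f.name(), f.schema()))
    builder.field("timestamp", value.schema().field("timestamp").schema())
    val newSchema = builder.build()

    val newValue = new Struct(newSchema)
    payload.schema().fields().asScala.foreach(f => newValue.put(f.name(), payload.get(f)))
    newValue.put("timestamp", value.get("timestamp"))

    record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(),
      record.key(), newSchema, newValue, record.timestamp())
  }

  override def config(): ConfigDef = new ConfigDef()
  override def configure(configs: util.Map[String, _]): Unit = ()
  override def close(): Unit = ()
}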
Note that Kafka records themselves have a timestamp, so do you really need the timestamp as a field in the value? One option is to extract the timestamp, then flatten, then add it back. However, this will override the record timestamp that the producer had set with that field's value.
I need to upload a bunch of objects to a REST API, and I want to aggregate them into batches when I send them as JSON. Unfortunately, the batch object needs to be in a specific JSON format, and I'm having difficulty creating the correct Data Flow in ADF.
The data looks something like this:
CustomerId   Name      Country
1            Alice     USA
2            Bob       CAN
3            Charlie   MEX
For example's sake, I need the data in batches of 2, and when making the REST API call, the JSON data should look like this:
Batch 1
{
"customers" : [
{
"name" : "Alice",
"country" : "USA"
},
{
"name" : "Bob",
"country" : "CAN"
}]
}
Batch 2
{
"customers" : [
{
"name" : "Charlie",
"country" : "MEX"
}]
}
Can someone help me understand how to write a Data Flow that does this?
This can be achieved with a Data Flow like the following.
Steps:
Create the sample source data.
Create a Derived Column transformation with two columns:
column_value_1: used later to generate the rowNumber column.
test: used to build the per-customer JSON object, like:
{
"name" : "Alice",
"country" : "USA"
}
Add a Window transformation to generate a rowNumber column.
In the Aggregate transformation, configure Group By with the expression below:
toInteger(divide(rowNumber,2))
Configure the Aggregates section with the expression below; this generates the JSON array of customers:
collect(test)
In the sink, exclude the unnecessary columns and remove the mapping of the group column.
Results
The REST API server side will receive a body like the Batch 1 and Batch 2 examples shown in the question.
This is probably a silly question, but I wonder if it's possible to join a KStream with a KTable/GlobalKTable on two fields? For example, let's say I have this object as the value of the records in the KStream:
{"gameId": 1, "awayTeamId": 1, "homeTeamId": 2}
I will have a KTable, backed by a topic, with records that look like this -
{"teamId": 1, "teamName": "Team 1"}, {"teamId": 2, "teamName": "Team 2"}
I want to join both the awayTeamId and homeTeamId against the KTable to enrich the message, so that the resulting record will look something like this...
{"gameId": 1, "awayTeamId": 1, "awayTeamName": "Team 1", "homeTeamId": 2, "homeTeamName": "Team 2"}
I'm guessing you could maybe branch the stream, with one branch handling the awayTeam join and another branch handling the homeTeam join, and then end up creating an entirely new record with a map or mapValues call that combines the two separate branches. I'm not sure if this is a viable method or if there are other avenues to pursue.
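For what it's worth, here is the kind of chained-join approach I'm wondering about, as a rough, untested sketch using the Kafka Streams Scala API; the topic names and case classes are made up, and Serdes for them are assumed to be implicitly in scope:

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.kstream.{GlobalKTable, KStream}

case class Game(gameId: Int, awayTeamId: Int, homeTeamId: Int)
case class Team(teamId: Int, teamName: String)
case class EnrichedGame(gameId: Int, awayTeamId: Int, awayTeamName: String,
                        homeTeamId: Int, homeTeamName: String)

// Assumes implicit Serdes for Int, Game, Team and EnrichedGame are in scope.
val builder = new StreamsBuilder()
val teams: GlobalKTable[Int, Team] = builder.globalTable[Int, Team]("teams")
val games: KStream[Int, Game]      = builder.stream[Int, Game]("games")

// First join resolves the away team name, second join resolves the home team name.
val enriched: KStream[Int, EnrichedGame] = games
  .join(teams)(
    (_, game) => game.awayTeamId,                   // lookup key for the away team
    (game, awayTeam) => (game, awayTeam.teamName))  // carry the away name along
  .join(teams)(
    (_, pair) => pair._1.homeTeamId,                // lookup key for the home team
    (pair, homeTeam) => EnrichedGame(
      pair._1.gameId, pair._1.awayTeamId, pair._2,
      pair._1.homeTeamId, homeTeam.teamName))

enriched.to("enriched-games")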
I have a job that runs on a daily basis. The purpose of this job is to correlate HTTP requests with their corresponding HTTP replies. This can be achieved because all HTTP requests & HTTP replies have a GUID that uniquely binds them.
The job deals with two DataFrames: one containing the requests, and one containing the replies. To correlate the requests with their replies, I am obviously doing an inner join based on that GUID.
The problem that I am running into is that a request that was captured on day X at 23:59:59 might see its reply captured on day X+1 at 00:00:01 (or vice-versa) which means that they will never get correlated together, neither on day X nor on day X+1.
Here is example code that illustrates what I mean:
val day1_requests = """[ { "id1": "guid_a", "val" : 1 }, { "id1": "guid_b", "val" : 3 }, { "id1": "guid_c", "val" : 5 }, { "id1": "guid_d", "val" : 7 } ]"""
val day1_replies = """[ { "id2": "guid_a", "val" : 2 }, { "id2": "guid_b", "val" : 4 }, { "id2": "guid_c", "val" : 6 }, { "id2": "guid_e", "val" : 10 } ]"""
val day2_requests = """[ { "id1": "guid_e", "val" : 9 }, { "id1": "guid_f", "val" : 11 }, { "id1": "guid_g", "val" : 13 }, { "id1": "guid_h", "val" : 15 } ]"""
val day2_replies = """[ { "id2": "guid_d", "val" : 8 }, { "id2": "guid_f", "val" : 12 }, { "id2": "guid_g", "val" : 14 }, { "id2": "guid_h", "val" : 16 } ]"""
val day1_df_requests = spark.read.json(spark.sparkContext.makeRDD(day1_requests :: Nil))
val day1_df_replies = spark.read.json(spark.sparkContext.makeRDD(day1_replies :: Nil))
val day2_df_requests = spark.read.json(spark.sparkContext.makeRDD(day2_requests :: Nil))
val day2_df_replies = spark.read.json(spark.sparkContext.makeRDD(day2_replies :: Nil))
day1_df_requests.show()
day1_df_replies.show()
day2_df_requests.show()
day2_df_replies.show()
day1_df_requests.join(day1_df_replies, day1_df_requests("id1") === day1_df_replies("id2")).show()
// guid_d from request stream is left over, as well as guid_e from reply stream.
//
// The following 'join' is done on the following day.
// I would like to carry 'guid_d' into day2_df_requests and 'guid_e' into day2_df_replies.
day2_df_requests.join(day2_df_replies, day2_df_requests("id1") === day2_df_replies("id2")).show()
I can see 2 solutions.
Solution#1 - custom carry over
In this solution, on day X, I would do a "full_outer" join instead of an inner-join, and I would persist into some storage the results that are missing one side or the other. On the next day X+1, I would load this extra data along with my "regular data" when doing my join.
An additional implementation detail is that my custom carry-over would have to discard "old carry-overs", otherwise they could pile up; i.e. it is possible that an HTTP request or HTTP reply from 10 days ago never sees its counterpart (maybe the app crashed, for instance, so an HTTP request was emitted but no reply).
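Using the toy DataFrames above, here is a rough, untested sketch of what I mean by solution #1; the carry-over paths and file format are placeholders, and ageing out old carry-overs is omitted:

import org.apache.spark.sql.functions.col

// Hypothetical carry-over locations written by the previous day's run.
val carryOverRequests = spark.read.json("/data/carryover/requests")
val carryOverReplies  = spark.read.json("/data/carryover/replies")

val requests = day2_df_requests.unionByName(carryOverRequests)
val replies  = day2_df_replies.unionByName(carryOverReplies)

val joined = requests.join(replies, requests("id1") === replies("id2"), "full_outer")

// Matched pairs go to the target storage.
joined.filter(col("id1").isNotNull && col("id2").isNotNull)
  .write.mode("overwrite").json("/data/correlated/dayX")

// Unmatched rows become the next run's carry-over (rotating the carry-over
// paths between runs is also omitted from this sketch).
joined.filter(col("id2").isNull).select(requests("*"))
  .write.mode("overwrite").json("/data/carryover/requests_next")
joined.filter(col("id1").isNull).select(replies("*"))
  .write.mode("overwrite").json("/data/carryover/replies_next")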
Solution#2 - guid folding
In this solution, I would make the assumption that my requests and replies are within a certain amount of time of one another (e.g. 5 minutes). Thus on day X+1, I would also load the last 5 minutes of data from day X and include that as part of my join. This way, I don't need extra storage like in solution #1. The disadvantage, however, is that the target storage must be able to deal with duplicate entries (for instance, if the target storage is a SQL table, the PK would have to be this GUID, and I would do an upsert instead of an insert).
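And a rough, untested sketch of solution #2, assuming the real records carry a capture_time timestamp column (which the toy data above does not have) and using a placeholder cutoff for the last 5 minutes of day X:

import java.sql.Timestamp
import org.apache.spark.sql.functions.col

// Placeholder value for "last 5 minutes of day X".
val dayXCutoff = Timestamp.valueOf("2022-08-04 23:55:00")

val requests = day1_df_requests.filter(col("capture_time") >= dayXCutoff)
  .unionByName(day2_df_requests)
val replies = day1_df_replies.filter(col("capture_time") >= dayXCutoff)
  .unionByName(day2_df_replies)

// Pairs already correlated on day X can show up again here, hence the need for
// an upsert keyed on the GUID in the target storage.
val correlated = requests.join(replies, requests("id1") === replies("id2"))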
Question
So my question is whether Spark provides functionality to automatically deal with a situation like that, thus not requiring either of my two solutions and thereby making things easier and more elegant?
Bonus question
Let's assume that I need to do the same type of correlation but with a stream of data (i.e. instead of a daily batch job that runs on a static set of data, I use Spark Streaming and data is processed on live streams of requests & replies).
In this scenario, a "full_outer" join is obviously inappropriate (https://dzone.com/articles/spark-structured-streaming-joins) and actually unnecessary since Spark Streaming takes care of that for us by having a sliding window for doing the join.
However, I am curious to know what happens if the job is stopped (or crashes) and is then resumed. Similarly to the batch-mode example I gave above, what if the job was interrupted after a request was consumed (and acknowledged) from the stream/queue but before its related reply was? Does Spark Streaming keep the state of its sliding window, so that resuming the job will be able to correlate as if the stream had never been interrupted?
P.S. backing up your answer with hyperlinks to reputable docs (like Apache's own) would be much appreciated.
Assume I have a Mongo collection as such:
The general schema: There are Categories, each Category has an array of Topics, and each Topic has a Rating.
[
{CategoryName: "Cat1", ..., Topics: [{TopicName: "T1", rating: 9999, ...},
{TopicName: "T2", rating: 42, ....}]},
{CategoryName: "Cat2", ... , Topics: [...]},
...
]
In my client-side Meteor code, I have two operations I'd like to execute smoothly, without any additional filtering: finding, and updating.
I'm imagining the find query as follows:
.find({CategoryName: "Cat1", Topics: [{TopicName: "T1"}]}).fetch()
This will, however, return the whole document - The result I want is only partial:
[{CategoryName: "Cat1", ..., Topics: [{TopicName: "T1", rating: 9999, ...}]}]
Similarly, with updating, I'd like a query somewhat as such:
.update({CategoryName: "Cat1", Topics: [{TopicName: "T1"}]}, {$set: {Topics: [{rating: infinityyy}]}})
To only update the rating of the topic T1, and not all topics of category Cat1.
Again, I'd like to avoid any filtering, as the rest of the data should not even be sent to the client in the first place.
Thanks!
You need to amend your query to the following:
Categories.find(
{ CategoryName: 'Cat1', 'Topics.TopicName': 'T1' },
{ fields: { 'Topics.$': 1 }}, // make sure you put any other fields you want in here too
).fetch()
What this query does is search for a Category that matches the name Cat1 and has an object with the TopicName equal to T1 inside the Topics array.
In the fields projection we are using the $ symbol to tell MongoDB to return the object that was found as part of the query, and not all the objects in the Topics array.
To update this nested object you need to use the same $ symbol:
Categories.update(
  { CategoryName: "Cat1", 'Topics.TopicName': 'T1' },
  { $set: { 'Topics.$.rating': 100 } },
);
Hope that helps