Dynamic Target Delta Table As Target For Spark Streaming - pyspark

We are processing a stream of web logs, essentially activities that users perform on the website. For each activity type, we have a separate activity Delta table.
We are exploring the best way to do streaming ingest. We have a Kafka stream set up where all the activities arrive in the following format, but depending on the activity, we need to route each event to a different target table.
{
  "activity_name": "Purchased",
  "data": {
    "product": "Soap",
    "amount": 1200
  }
}
Can you help with what is the best way to handle this scenario?

This is called multiplexing. Usually the solution is to use Structured Streaming with the .foreachBatch function as a sink, and then inside that function write the data for each of the possible values of activity_name.
Something like this (for example, as shown in this blog post):
activity_names = ["Purchased", ...]
app_name = "my_app_name"

def write_tables(df, epoch):
    # Cache the micro-batch, since it is filtered once per activity name below
    df.cache()
    for n in activity_names:
        df.filter(f"activity_name = '{n}'") \
            .write.format("delta").mode("append") \
            .option("txnVersion", epoch) \
            .option("txnAppId", app_name) \
            .save(...)
    df.unpersist()

stream_df.writeStream \
    .foreachBatch(write_tables) \
    .option("checkpointLocation", "some_path") \
    .start()
Please note that we're using idempotent writes for Delta tables (the txnVersion / txnAppId options) to avoid duplicates if a micro-batch is restarted in the middle of execution.
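For completeness, a minimal sketch of how the stream_df used above could be produced from Kafka. The broker address, topic name, and payload schema are assumptions based on the example event in the question, and a SparkSession named spark is assumed:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumed payload schema matching the example event; adjust to the real activity payloads
schema = StructType([
    StructField("activity_name", StringType()),
    StructField("data", StructType([
        StructField("product", StringType()),
        StructField("amount", LongType()),
    ])),
])

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
    .option("subscribe", "activities")                 # assumed topic name
    .load()
    # Kafka delivers raw bytes; parse the JSON value into typed columns
    .select(F.from_json(F.col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)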

Related

Spark Structured Streaming Foreach Batch to Write data to Mounted Blob Storage Container

I am receiving streaming data and want to write it from a Spark Databricks cluster to an Azure Blob Storage container.
To do this I have mounted my storage account and I am specifying the mount path in my streaming sink query.
Method 1
dataframe.writeStream \
    .format("text") \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", "/mnt/Checkpoint") \
    .option("path", "/mnt/Data") \
    .start()
Method 2
def process_row(df, epoch_id):
    try:
        df.write \
            .format("text") \
            .trigger(processingTime='10 seconds') \
            .option("path", "/mnt/Data") \
            .save()
    except Exception as error:
        print("Received an error ", error)

dataframe.writeStream.outputMode('append') \
    .foreachBatch(process_row) \
    .option("checkpointLocation", "/mnt/Checkpoint") \
    .start()
Using Method 1, I am able to write data to the blob container.
But using Method 2, no data gets written to blob storage, nor is any error displayed.
Method 2's query dashboard shows me something like this:
[Method 2 query dashboard screenshot]
Am I missing something in Method 2's code?
I think foreachBatch is used when you want to, say, process each batch in more than one way or write to more than one location, so if all you need is exception handling I would just put Method 1 inside your try-catch.
For curiosity's sake, if you want to use foreachBatch, maybe try keeping the write call entirely inside the function's scope, with no write method outside, so that the write is left to the foreachBatch function.
Something like this (not tested):
def process_row(df, epoch_id):
    try:
        df.writeStream \
            .format("text") \
            .trigger(processingTime='10 seconds') \
            .option("path", "/mnt/Data") \
            .option("checkpointLocation", "/mnt/Checkpoint") \
            .outputMode("append") \
            .start()
    except Exception as error:
        print("Received an error ", error)

dataframe.foreachBatch(process_row)
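For reference, a rough, untested sketch of the more common foreachBatch shape: the function receives a plain batch DataFrame and uses df.write (kept inside the try/except), while the trigger and checkpoint options stay on the outer writeStream. The paths and format are taken from the question:

def write_batch(df, epoch_id):
    try:
        # df here is an ordinary batch DataFrame, so a plain batch write is used
        df.write \
            .format("text") \
            .mode("append") \
            .save("/mnt/Data")
    except Exception as error:
        print("Received an error ", error)

dataframe.writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 seconds") \
    .option("checkpointLocation", "/mnt/Checkpoint") \
    .foreachBatch(write_batch) \
    .start()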

re-hydrating tweets / extracting data from partial stream

I have a Spark stream I am getting from Kafka (tweet ids) and I am trying to re-hydrate additional data using tweepy/twarc/twython while streaming.
I tried to extract the ids column from the streaming DataFrame so that I could iterate over it (I could not find a way to iterate over the stream directly) and look up the additional data using tweepy.
I tried the following:
ids = []

def process_row(row, ids):
    ids.append(row.tweet_id)

query = FlattenedDF.writeStream.foreach(process_row).start()
I also tried to convert it directly:
ids = FlattenedDF.select('tweet_id').collect()
and got the error: "Queries with streaming sources must be executed with writeStream.start".
Is there any better way to do it?
Any help would be appreciated.
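A hedged sketch of one way this is often handled with foreachBatch: each micro-batch arrives as a static DataFrame, so collect() is allowed there. The checkpoint path is a placeholder and hydrate_ids is a hypothetical helper wrapping the tweepy/twarc lookup, not a real API call:

def hydrate_batch(batch_df, epoch_id):
    # batch_df is a plain DataFrame here, so collect() works without the
    # "streaming sources must be executed with writeStream.start" error
    ids = [row.tweet_id for row in batch_df.select("tweet_id").collect()]
    if ids:
        hydrate_ids(ids)  # hypothetical helper that calls tweepy/twarc for these ids

query = (
    FlattenedDF.writeStream
    .foreachBatch(hydrate_batch)
    .option("checkpointLocation", "/tmp/hydration_checkpoint")  # placeholder path
    .start()
)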

Flink : DataStream to Table

Use case: Read protobuf messages from Kafka, deserialize them, apply some transformations (flatten out some columns), and write to DynamoDB.
Unfortunately, the Kafka Flink connector only supports the CSV, JSON and Avro formats, so I had to use the lower-level DataStream API.
Problem: If I can create a table out of the DataStream object, then I can accept a query to run on that table, which would make the transformation part seamless and generic. Is it possible to run a SQL query over a DataStream object?
If you have a DataStream of objects, then you can simply register the given DataStream as a table using StreamTableEnvironment.
This would look more or less like the following:
val myStream = ...
val env: StreamExecutionEnvironment = configureFlinkEnv(StreamExecutionEnvironment.getExecutionEnvironment)
val tEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
tEnv.registerDataStream("myTable", myStream, [Field expressions])
Then you should be able to query the dynamic table created from your DataStream.
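If a Python equivalent is useful, a rough PyFlink sketch of the same idea using the newer Table API names (assuming Flink 1.13+); the in-memory tuple stream is just a stand-in for the deserialized Kafka stream:

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Stand-in for the deserialized protobuf stream
ds = env.from_collection(
    [("a", 1), ("b", 2)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

# Register the stream as a view, then run SQL over it
table = t_env.from_data_stream(ds)
t_env.create_temporary_view("myTable", table)
result = t_env.sql_query("SELECT * FROM myTable")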

Simple reasoning operator Apache Flink

I am trying to implement a simple reasoning operator using Apache Flink in Scala. I can currently read data as a stream from a .csv file, but I cannot work out how to process RDF and OWL data.
Here is my code to load data from .csv:
val csvTableSource = CsvTableSource
    .builder
    .path("src/main/resources/data.stream")
    .field("subject", Types.STRING)
    .field("predicate", Types.STRING)
    .field("object", Types.STRING)
    .fieldDelimiter(";")
    .build()
Could anyone show me an example of loading this data with Flink as RDF and OWL? As I understand it, an RDF stream contains dynamic data, while the OWL part is static. I have to create a simple reasoning operator which I can query for information, e.g. who is a friend of a friend.
Any help will be appreciated.

Update scala DF based on events

I'm getting into the Scala / Apache Spark world and I'm trying to understand how to create a "pipeline" that will generate a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event I have to insert or update a row in the DF.
Let's take a real example.
I would like to create a DataFrame that represents the state of the users present in my database (Postgres, Mongo, whatever).
When I say state, I mean the current state of the user (ACTIVE, INCOMPLETE, BLOCKED, etc.). These states change based on the user's activity, so I will receive logs (JSON) with the key "status": "ACTIVE" and so on.
So, for example, I'm receiving logs from a Kafka topic. At some point I receive a log I'm interested in, because it contains useful information about the user (the status, etc.).
I take this log and I create a DF with this log in it.
Then I receive a 2nd log, but this one was produced by the same user, so the existing row needs to be updated (if the status changed, of course!) rather than a new row added. A third log comes from a new user with new information, so it is stored as a new row in the existing DF, and so on.
At the end of this process/pipeline, I should have a DF with the information on all the users present in my db and their "status", so that I can say "oh look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
This is the idea; the process must run in real time.
So far, I've tried this using files rather than connecting to a Kafka topic.
For instance, I've read files as follows:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generates a DF with 2 rows with everything inside.
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
| AVzO9dq5S78GvkQU|dp_api-2017.06.22| 1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
In _source there are all the nested fields (the status I mentioned is in here!).
Then I've selected some useful information like:
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id |user_ip |aw |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU |111.11.11.82|null |DPAPI|
+--------------------+------------+------------------------------------+-----+
Again, the idea is to build this DF from the logs arriving from a Kafka topic and upsert each log into this DF.
I hope I explained it well. I don't want a "code" solution; I'd prefer hints or examples of how to achieve this result.
Thank you.
As you are looking for resources I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a Stream with your Kafka content.
Then have a look at the possible transformations you can perform with Spark Streaming on it, in the chapter "Transformations on DStreams" in the Spark Streaming Programming Guide.
Once you have transformed the stream so that you can perform a final operation on it, have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think .foreachRDD in particular could be what you are looking for, as you can run an operation (like checking whether a certain keyword is in your string and, based on this, making a database call) for each micro-batch of the stream.
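Since you asked for hints rather than code, just a rough PySpark sketch of the foreachRDD shape the last paragraph describes (your pipeline is Scala, but the structure is the same); the socket source, field names and print are placeholders standing in for the Kafka stream and the real upsert logic:

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="user-status-demo")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Placeholder source; the real pipeline would build a Kafka DStream as shown
# in the Spark Streaming + Kafka Integration Guide linked above.
lines = ssc.socketTextStream("localhost", 9999)

def process_batch(rdd):
    # Each micro-batch arrives as a plain RDD, so it can be parsed and used to
    # upsert per-user status (here it is only collected and printed).
    for event in rdd.map(json.loads).collect():
        print(event.get("user_id"), event.get("status"))

lines.foreachRDD(process_batch)

ssc.start()
ssc.awaitTermination()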