I'm running into Scala,Apache Spark world and I'm trying to understand how to create a "pipeline" that will generate a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event I have to insert/update a row in the DF.
Let's make a real example.
I would like to create a DataFrame that will represent the state of the users present in my database(postgres,mongo whatever).
When i say state, I mean the current state of the user(ACTIVE,INCOMPLETE,BLOCKED, etc). This states change based on the users activity, so then I will receive logs(JSON) with key "status": "ACTIVE" and so on.
So for example, I'm receiving logs from a Kafka topic.. at some point I receive a log which I'm interested because it defines useful information about the user(the status etc..)
I take this log, and I create a DF with this log in it.
Then I receive the 2nd log, but this one was performed by the same user, so the row needs to be updated(if the status changed of course!) so no new row but update the existing one. Third log, new user, new information so store as a new row in the existing DF.. and so on.
At the end of this process/pipeline, I should have a DF with the information of all the users present in my db and their "status" so then I can say "oh look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
This is the idea.. the process must be in real time.
So far, I've tried this using files not connecting with a kafka topic.
For instance, I've red file as follow:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generats a DF with 2 rows with everything inside.
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
| AVzO9dq5S78GvkQU|dp_api-2017.06.22| 1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
in _source there are all the nested things(the status I mentioned is here!).
Then I've selected some useful information like
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id |user_ip |aw |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU |111.11.11.82|null |DPAPI|
+--------------------+------------+------------------------------------+-----+
again, the idea is to create this DF with the logs arriving from a kafka topic and upsert the log in this DF.
Hope I explained well, I don't want a "code" solution I'd prefer hints or example on how to achieve this result.
Thank you.
As you are looking for resources I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a Stream with your Kafka content.
Then have a look at the possible transformations you can perform with Spark Streaming on it in chapter "Transformations on DStreams" in the Spark Streaming Programming Guide
Once you have transformed the stream in a way that you can perform a final operation on it have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think especially .forEachRDD could be what you are looking for - as you can do an operation (like checking whether a certain key word is in your string and based on this do a database call) for each element of the stream.
Related
I have a materialized view that is updated from many streams. Every one enrich it partially. Order doesn't matter. Updates comes in not specified time. Is following algorithm is a good approach:
Update comes and I check what is stored in materialized view via get(), that this is an initial one so enrich and save.
Second comes and get() shows that partial update exist - add next information
... and I continue with same style
If there is a query/join, object that is stored has a method that shows that the update is not complete isValid() that could be used in KafkaStreams#filter().
Could you share please is this a good plan? Is there any pattern in Kafka streams world that handle this case?
Please advice.
Your plan looks good , you have the general idea, but you'll have to use the lower Kafka Stream API : Processor API.
There is a .transform operator that allow you to access a KeyValueStatestore, inside this operation implementation you are free to decide if you current aggregated value is valid or not.
Therefore send it downstream or returning null waiting for more information.
I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record of any unique key (userid, for example) is an insert. Afterwards any non null values for the same key are updates (generally requiring all fields to be part of the value, effectively going a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates to a common schema that you want information for, and filter event types that you want to ignore.
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.
Looking at the example for ClickHouseIO for Apache Beam the name of the output table is hard coded:
pipeline
.apply(...)
.apply(
ClickHouseIO.<POJO>write("jdbc:clickhouse:localhost:8123/default", "my_table"));
Is there a way to dynamically route a record to a table based on its content?
I.e. if the record contains table=1, it is routed to my_table_1, table=2 to my_table_2 etc.
Unfortunately the ClickHouseIO is still in development does not support this. The BigQueryIO does support Dynamic Destinations, so it is possible with Beam.
The limitation in the current ClickHouseIO is around transforming data to match the destination table schema. As a workaround, if your destination tables are known at pipeline creation time you could create a ClickHouseIO per table, then use the data to route to the correct instance of the IO.
You might want to file a feature request in the Beam bug tracker for this.
What I want to do is write an Order service project without using the traditional RDBMS(eg. mysql).After having read the docs from Kafka and confluent for several days, I think Kafka stream and state store could help me to implement this without using RDBMS.
But how to store the data that has a long list result in state store?Especially when I need to query result like using limit and offset?
eg:
I have a table, the columns are:
| userId | orderId | ... |
One user may have many order rows to store, so I need a query method to search the key:userId result in the state store with start and limit.But I can't find a method in the Kafka stream State Store interface.
Do I have to do this using an RDBMS? What is the standard method for me to implement an application like this?
I have streaming of data coming from SparkStreaming. Which i need to process and finally want to store the data in Cassandra. So, earlier i was trying to use SparkCassandra connector. But it doesn't give the access of SparkStreaming Context object on workers. So, I have to use separate cassandra-scala driver. Hence, i ended up with phantom. Now, my question is i have already defined the column family in the cassnandra. So, how do i do the select and update query from scala.
I have followed these documentation link1 but i don't understand why do we need to give the table definition at client (scala code) side. Why can't we just give Keyspace, ClusterPoints and ColumnFamily and be done with it.
object CustomConnector {
val hosts = Seq("IP1", "IP2")
val Connector = ContactPoints(hosts).keySpace("KEYSPACE_NAME")
}
realTimeAgg.foreachRDD{ x => if (x.toLocalIterator.nonEmpty) {
x.foreachPartition {
How to achieve select/insert in Cassandra table here using phantom
}
This is not yet possible using phantom, we are actively working on phantom-spark to allow you to do this, but at this stage in time this is still a few months away.
In the interim, you will have to rely on the spark cassandra connector and use the non type-safe API to achieve this. It's a more unfortunate setup, but in the very near future this will be resolved.