re-hydrating tweets / extracting data from partial stream - pyspark

I have a Spark stream coming from Kafka (tweet IDs), and I am trying to re-hydrate additional data using tweepy/twarc/twython while streaming.
I tried to extract the IDs column from the streaming DataFrame so that I can iterate over it (I could not find a way to iterate over the stream directly) and look up the additional data using tweepy.
I tried the following:
ids = []
def process_row(row,ids):
    ids.append(row.tweet_id)
query = FlattenedDF.writeStream.foreach(process_row).start()
I also tried to convert it directly:
ids = FlattenedDF.select('tweet_id').collect()
but got the error: "Queries with streaming sources must be executed with writeStream.start"
Is there a better way to do this?
Any help would be appreciated.
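One possible direction, sketched under assumptions rather than offered as a confirmed fix: collect() is a batch action, which is why it fails on a streaming DataFrame. Structured Streaming's foreachBatch (PySpark 2.4 and later) passes each micro-batch to a function as a static DataFrame, where collect() is allowed. The names chunked, make_batch_handler, lookup and sink below are hypothetical, and the chunk size of 100 is an assumption about how many IDs the lookup call accepts per request.

```python
from typing import Callable, Iterator, List


def chunked(ids: List[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def make_batch_handler(lookup: Callable[[List[int]], list],
                       sink: Callable[[int, list], None],
                       chunk_size: int = 100):
    """Build a function suitable for DataStreamWriter.foreachBatch.

    `lookup` (hypothetical) takes a list of tweet ids and returns hydrated
    records, e.g. a thin wrapper around a tweepy/twarc call.
    `sink` (hypothetical) stores the records produced for a batch id.
    """
    def handle_batch(batch_df, batch_id):
        # Inside foreachBatch, batch_df is a static DataFrame, so collect() works.
        ids = [row.tweet_id for row in batch_df.select("tweet_id").collect()]
        for group in chunked(ids, chunk_size):
            sink(batch_id, lookup(group))
    return handle_batch


# Wiring, assuming FlattenedDF is the streaming DataFrame from the question
# (needs a running Spark session, so it is left as a comment):
# query = (FlattenedDF.writeStream
#          .foreachBatch(make_batch_handler(my_lookup, my_sink))
#          .start())
```

Note that collect() pulls every ID of the micro-batch to the driver, which is only reasonable when micro-batches are small.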

Related

Flink : DataStream to Table

Use case: Read protobuf messages from Kafka, deserialize them, apply some transformations (flatten out some columns), and write to DynamoDB.
Unfortunately, the Kafka Flink connector only supports the CSV, JSON and Avro formats, so I had to use the lower-level APIs (DataStream).
Problem: If I can create a table out of the DataStream object, then I can accept a query to run on that table. It would make the transformation part seamless and generic. Is it possible to run a SQL query over a DataStream object?

If you have a DataStream of objects, you can simply register it as a Table using StreamTableEnvironment.
It would look more or less like this:
val myStream = ...
val env: StreamExecutionEnvironment = configureFlinkEnv(StreamExecutionEnvironment.getExecutionEnvironment)
val tEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
tEnv.registerDataStream("myTable", myStream, [Field expressions])
Then you should be able to query the dynamic table created from your DataStream.

Dynamic filters with spark-streaming

I'm using Spark Streaming for the following use case:
I have a Kafka topic, data. From this topic, I stream real-time data using Spark Structured Streaming and apply some filters to it. If the number of rows returned after applying the filters is greater than 1, the output is 1; otherwise the output is 0, along with some other data from the query.
In simple terms, suppose I'm filtering the stream using:
df.filter($A < 10)
where "A", "<" and "10" are dynamic and come from a database. In practice, these values come from a Kafka topic that I consume, and I update them in the database. So the query is not static and will be updated from time to time.
Further, I'll have to apply Boolean algebraic operators to the results of the streams. For example:
df.filter($A < 10) AND df.filter($B = 1) OR df.filter($C > 1)... and so on
Here, each atomic operation (like df.filter($A < 10)) returns either 0 or 1, as described above.
The final result is saved to MongoDB.
I want to know whether both problems can be solved using Spark Structured Streaming. If not, can they be solved using RDDs?
Otherwise, can someone suggest another way to do this?

For the first case you can use a broadcast variable based approach as described in this answer. I've also had good luck using a per-executor transient value that was periodically refetched in each executor as described in the second part of this answer.
For the second case, you would use a single filter() call that implements the complete set of conditions that cause a message to be included in the output stream.
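To illustrate the "single filter" idea, here is a minimal sketch in plain Python rather than Spark. It composes rule triples such as ("A", "<", 10), which could be reloaded from the database, into one predicate. The helper names (atom, all_of, any_of) and the sample rows are hypothetical, and the grouping (A AND B) OR C is an assumption about how the expression in the question is meant to be read. In Spark the same composition would be expressed with Column expressions passed to one filter() call.

```python
import operator
from typing import Any, Callable, Dict, Tuple

Row = Dict[str, Any]
Rule = Tuple[str, str, Any]  # (column, operator symbol, value)

# Operator symbols as they might be stored in the database (assumed set).
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge, "!=": operator.ne,
}


def atom(rule: Rule) -> Callable[[Row], bool]:
    """Turn one (column, symbol, value) rule into a row predicate."""
    column, symbol, value = rule
    compare = OPS[symbol]
    return lambda row: compare(row[column], value)


def all_of(*preds: Callable[[Row], bool]) -> Callable[[Row], bool]:
    """Logical AND of several predicates."""
    return lambda row: all(p(row) for p in preds)


def any_of(*preds: Callable[[Row], bool]) -> Callable[[Row], bool]:
    """Logical OR of several predicates."""
    return lambda row: any(p(row) for p in preds)


# (A < 10 AND B = 1) OR C > 1, built from data instead of hard-coded filters.
predicate = any_of(
    all_of(atom(("A", "<", 10)), atom(("B", "=", 1))),
    atom(("C", ">", 1)),
)

rows = [
    {"A": 5, "B": 1, "C": 0},
    {"A": 50, "B": 1, "C": 0},
    {"A": 50, "B": 0, "C": 2},
]
matched = [row for row in rows if predicate(row)]
```

Because the predicate is rebuilt from rule data, refreshing the rules (for example through a periodically refetched value) only requires calling the builders again.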

Update scala DF based on events

I'm getting into the Scala/Apache Spark world, and I'm trying to understand how to create a "pipeline" that generates a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event, I have to insert or update a row in the DF.
Let's take a real example.
I would like to create a DataFrame that represents the state of the users present in my database (Postgres, Mongo, whatever).
By state, I mean the current state of the user (ACTIVE, INCOMPLETE, BLOCKED, etc.). These states change based on the user's activity, so I will receive logs (JSON) with a key such as "status": "ACTIVE" and so on.
So, for example, I'm receiving logs from a Kafka topic. At some point I receive a log that I'm interested in because it carries useful information about the user (the status, etc.).
I take this log and create a DF containing it.
Then I receive a second log, but this one comes from the same user, so the row needs to be updated (if the status changed, of course!): no new row, just an update to the existing one. The third log comes from a new user with new information, so it is stored as a new row in the existing DF, and so on.
At the end of this process/pipeline, I should have a DF with the information for all the users present in my db and their "status", so that I can say "oh look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
That is the idea. The process must run in real time.
So far, I've tried this using files, without connecting to a Kafka topic.
For instance, I've read the files as follows:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generates a DF with 2 rows with everything inside.
+--------------------+-----------------+------+--------------------+-----+
|                 _id|           _index|_score|             _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|     1|[2017-06-22T08:40...|DPAPI|
|    AVzO9dq5S78GvkQU|dp_api-2017.06.22|     1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
All the nested fields are in _source (the status I mentioned is there!).
Then I selected some useful information, like this:
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id                 |user_ip     |aw                                  |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU    |111.11.11.82|null                                |DPAPI|
+--------------------+------------+------------------------------------+-----+
Again, the idea is to create this DF from the logs arriving from a Kafka topic and upsert each log into it.
I hope I explained this well. I don't want a "code" solution; I'd prefer hints or examples on how to achieve this result.
Thank you.

As you are looking for resources, I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a stream with your Kafka content.
Then have a look at the transformations you can perform on it with Spark Streaming, in the chapter "Transformations on DStreams" of the Spark Streaming Programming Guide.
Once you have transformed the stream so that you can perform a final operation on it, have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think foreachRDD in particular could be what you are looking for, as it lets you perform an operation for each element of the stream (such as checking whether a certain keyword is in your string and making a database call based on that).
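As a hint at what the per-batch operation could look like, here is a minimal sketch of the upsert logic in plain Python, independent of Spark. The dict stands in for whatever keeps the state (an external database, for instance), and the key name user_id is hypothetical; only "status" comes from the question.

```python
from collections import Counter
from typing import Dict, Iterable


def apply_events(state: Dict[str, str], events: Iterable[dict]) -> Dict[str, str]:
    """Upsert the latest status per user into `state` and return it."""
    for event in events:
        user = event.get("user_id")   # hypothetical key name
        status = event.get("status")  # key named in the question
        if user is None or status is None:
            continue  # skip logs that carry no user status
        state[user] = status  # insert a new user or overwrite the old status
    return state


def status_counts(state: Dict[str, str]) -> Counter:
    """Count how many users are currently in each status."""
    return Counter(state.values())


# Two batches of events: the second one updates user u1 and contains
# one log that is ignored because it has no user/status information.
state: Dict[str, str] = {}
apply_events(state, [{"user_id": "u1", "status": "INCOMPLETE"},
                     {"user_id": "u2", "status": "BLOCKED"}])
apply_events(state, [{"user_id": "u1", "status": "ACTIVE"},
                     {"other": "log"}])
counts = status_counts(state)
```

Each call to apply_events plays the role of one batch handed to the output operation: existing users are overwritten, new users are added, and the counts can be derived at any time.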

construct Jooq stream is too slow

I am using Scala.
I tried to fetch all the data from a table with about 4 million rows. I used a stream, and the code looks like this:
val stream: Stream[Record] = expression.stream().iterator().asScala.toStream
stream.map(println(_))
expression is a SelectFinalStep[Record] in jOOQ.
However, the first line is too slow. It takes minutes. Am I doing something wrong?

Use the Stream API directly
If you're using Scala 2.12, you don't have to transform the Java stream returned by expression.stream() to a Scala Iterator and then to a Scala Stream. Simply call:
expression.stream().forEach(println);
While jOOQ's ResultQuery.stream() method creates a lazy Java 8 Stream, which is discarded again after consumption, Scala's Stream keeps previously fetched records in memory for re-traversal. That's probably what's causing most performance issues, when fetching 4 million records.
A note on resources
Do note that expression.stream() returns a resourceful stream, keeping an open underlying ResultSet and PreparedStatement. Perhaps, it's a good idea to explicitly close the stream after consumption.
Optimise JDBC fetch size
Also, you might want to look into calling expression.fetchSize(), which calls through to JDBC's Statement.setFetchSize(). This allows for the JDBC driver to fetch batches of N rows. Some JDBC drivers default to a reasonable fetch size, others default to fetching all rows into memory prior to passing them to the client.

Another solution would be to fetch the records lazily and construct a Scala stream. For example:
def allRecords(): Stream[Record] = {
  val cur = expression.fetchLazy()
  def inner(): Stream[Record] = {
    if (cur.hasNext) {
      val next = cur.fetchOne
      next #:: inner()
    }
    else
      Stream.empty
  }
  inner()
}

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark-Streaming pipeline that gets measuring data via Kafka. This data was serialized using Avro. The data can be of two types - EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin. I use the variant that generates Scala case classes that inherit from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate information in the message key. Currently, I don't use the message key. Would this be a feasible way, or can I somehow get the schema from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?
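As background, plain Avro binary encoding does not embed the writer schema (only Avro container files do), so, assuming the producer writes plain binary encoding, the receiver needs some out-of-band hint. A minimal sketch of the message-key idea follows, in plain Python rather than Scala. The key values and function names are hypothetical, and the decoders are stand-ins for real deserialization with the matching schema.

```python
from typing import Callable, Dict, Optional

# Hypothetical key values agreed between producer and consumer.
EQUIDISTANT_KEY = b"equidistant"
DISCRETE_KEY = b"discrete"


def make_dispatcher(decoders: Dict[bytes, Callable[[bytes], object]]):
    """Return a function that picks a decoder based on the message key."""
    def dispatch(key: Optional[bytes], value: bytes):
        decoder = decoders.get(key)
        if decoder is None:
            raise ValueError(f"no decoder registered for message key {key!r}")
        return decoder(value)
    return dispatch


# Stand-in decoders; real ones would deserialize `value` with the matching
# schema (EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$ in the question).
dispatch = make_dispatcher({
    EQUIDISTANT_KEY: lambda payload: ("EquidistantData", payload),
    DISCRETE_KEY: lambda payload: ("DiscreteData", payload),
})
```

The separate-topics variant has the same shape: the lookup table would be keyed by topic name instead of message key.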
