Simple reasoning operator Apache Flink - scala

I am trying to implement simple reasoning operator using Apache Flink on scala. Now I can read data as a stream from a .csv file. But I cannot cope with RDF and OWL data processing.
Here is my code to load data from .csv:
val csvTableSource = CsvTableSource
.builder
.path("src/main/resources/data.stream")
.field("subject", Types.STRING)
.field("predicate", Types.STRING)
.field("object", Types.STRING)
.fieldDelimiter(";")
.build()
Could anyone show me an example to load this data with Flink using RDF and OWL? As I understood, an RDF stream contains dynamic data and the OWL is for static. I have to create a simple reasoning operator, which I can ask for information, e.g who is a friend of a friend.
Any help will be appreciated.

Related

Glue avro schema registry with flink and kafka for any object

I am trying to registry and serialize an abject with flink, kafka, glue and avro. I've seen this method which I'm trying.
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test= GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
kafkaTopic,
test,
properties);
My problem is that this system doesn't allow to include an object different than GenericRecord, the object that I want to send is another and is very big. So big that is too complex to transform to GenericRecord.
I don't find too much documentation. How can I send an object different than GenericRecord, or any way to include my object inside GenericRecord?
I'm not sure if I understand correctly, but basically the GlueSchemaRegistryAvroSerializationSchema has another method called forSpecific that accepts SpecificRecord. So, You can use avro generation plugin for Your build tool depending on what You use (e.g. for sbt here) that will generate classes from Your avro schema that can then be passed to forSpecific method.

Databricks consuming rest api

I'm new to Azure Databricks and Scala, i'm trying to consume HTTP REST API that's returning JSON, i went around the databricks docs but i don't see any Datasource that would work with rest api.Is there any library or tutorial on how to work with rest api in databricks. If i make multiple api calls (cause of pagination) it would be nice to get it done in parallel way (spark way).
I would be glad if you guys could point me if there is a Databricks or Spark way to consume REST API as i was shocked that there's no information in docs about api datasource.
Here is A Simple Implementation.
Basic Idea is spark.read.json can read an RDD.
So, just create an RDD from GET call and then read it as regular dataframe.
%spark
def get(url: String) = scala.io.Source.fromURL(url).mkString
val myUrl = "https://<abc>/api/v1/<xyz>"
val result = get(myUrl)
val jsonResponseStrip = result.toString().stripLineEnd
val jsonRdd = sc.parallelize(jsonResponseStrip :: Nil)
val jsonDf = spark.read.json(jsonRdd)
That's it.
It sounds to me like what you want, is to import into scala a library for making HTTP requests. I suggest the HTTP instead of a higher level REST interface, because the pagination may be handled in the REST library and may or may not support parallelism.
Managing with the lower level HTTP lets you de-couple pagination. Then you can use the parallelism mechanism of your choice.
There are a number of libraries out there, but recommending a specific one is out of scope.
If you do not want to import a library, you could have your scala notebook call upon another notebook running a language which has HTTP included in the standard library. This notebook would then return the data to your scala notebook.

re-hydrating tweets / extracting data from partial stream

I have a spark stream I am getting from Kafka (tweets ids) and I am trying to re-hydrate adittional data using tweepy\twarc\twython while streaming.
I tried to extract the ids column from the streaming DataFrame, so I'll be able to iterate over it (could not find a way to iterate the stream directly) and look up the additional data using tweepy.
tried the following:
ids= []
def process_row(row,ids):
ids.append(row.tweet_id)
query = FlattenedDF.writeStream.foreach(process_row).start()
I also tried to convert it directly:
ids = FlattenedDF.select('tweet_id').collect()
got the error : "Queries with streaming sources must be executed with writeStream.start"
is there any better way to do it?
any help would be appreciated.

Flink : DataStream to Table

Usecase: Read protobuf messages from Kafka, deserialize them, apply some transformation (flatten out some columns), and write to dynamodb.
Unfortunately, Kafka Flink Connector only supports - csv, json and avro formats. So, I had to use lower level APIs (datastream).
Problem: If I can create a table out of the datastream object, then I can accept a query to run on that table. It would make the transformation part seamless and generic. Is it possible to run a SQL query over datastream object?
If You have a DataStream of objects, then You can simply register given DataStream as Table using StreamTableEnvironment.
This would look more or less like below:
val myStream = ...
val env: StreamExecutionEnvironment = configureFlinkEnv(StreamExecutionEnvironment.getExecutionEnvironment)
val tEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
tEnv.registerDataStream("myTable", myStream, [Field expressions])
Then You should be able to query the dynamic table created from Your DataStream.

Update scala DF based on events

I'm running into Scala,Apache Spark world and I'm trying to understand how to create a "pipeline" that will generate a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event I have to insert/update a row in the DF.
Let's make a real example.
I would like to create a DataFrame that will represent the state of the users present in my database(postgres,mongo whatever).
When i say state, I mean the current state of the user(ACTIVE,INCOMPLETE,BLOCKED, etc). This states change based on the users activity, so then I will receive logs(JSON) with key "status": "ACTIVE" and so on.
So for example, I'm receiving logs from a Kafka topic.. at some point I receive a log which I'm interested because it defines useful information about the user(the status etc..)
I take this log, and I create a DF with this log in it.
Then I receive the 2nd log, but this one was performed by the same user, so the row needs to be updated(if the status changed of course!) so no new row but update the existing one. Third log, new user, new information so store as a new row in the existing DF.. and so on.
At the end of this process/pipeline, I should have a DF with the information of all the users present in my db and their "status" so then I can say "oh look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
This is the idea.. the process must be in real time.
So far, I've tried this using files not connecting with a kafka topic.
For instance, I've red file as follow:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generats a DF with 2 rows with everything inside.
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
| AVzO9dq5S78GvkQU|dp_api-2017.06.22| 1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
in _source there are all the nested things(the status I mentioned is here!).
Then I've selected some useful information like
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id |user_ip |aw |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU |111.11.11.82|null |DPAPI|
+--------------------+------------+------------------------------------+-----+
again, the idea is to create this DF with the logs arriving from a kafka topic and upsert the log in this DF.
Hope I explained well, I don't want a "code" solution I'd prefer hints or example on how to achieve this result.
Thank you.
As you are looking for resources I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a Stream with your Kafka content.
Then have a look at the possible transformations you can perform with Spark Streaming on it in chapter "Transformations on DStreams" in the Spark Streaming Programming Guide
Once you have transformed the stream in a way that you can perform a final operation on it have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think especially .forEachRDD could be what you are looking for - as you can do an operation (like checking whether a certain key word is in your string and based on this do a database call) for each element of the stream.