Spark streaming to Power BI - pyspark

As the title suggests, I want to see real-time data in Power BI. I have built the pipeline up to Spark Streaming, where I am processing the data, and now I want to push it into Power BI, preferably using Python.
As per my understanding, there are a few possible ways, listed below.
1. Store the streaming data in Azure Blob storage and set up a live connection to it in Power BI. I know on paper it sounds perfect, but is it possible to do it like this?
2. Create a Push dataset in Power BI, get the REST API URL, and POST requests to it from Spark Streaming. So my question here is: is it possible to send a POST request from Spark Streaming to Power BI? Google only tells me how to send a request from Spark, nothing regarding Spark Streaming.
3. Use the PubNub SDK. Is it possible? Basically, how do I send data from Spark Streaming to a PubNub subscriber?
4. Ingest the Spark stream into HBase and then establish a live connection to it from Power BI. Is that possible?
My target is to have minimum latency.
Any help would be much appreciated.

This may not be the best way to do it; I think the best way would be to write your own Structured Streaming integration. However, you can use PubNub. Here is some Python code to publish your dashboard data on a PubNub channel.
import requests
from urllib.parse import quote

def publishRecord(record):
    # PubNub's REST publish API takes a GET with the URL-encoded message in the path
    requests.get('http://pubsub.pubnub.com/publish/publish_key_here/subscribe_key_here/0/pubnub_channel_name/0/' + quote(str(record)))

rdd.foreach(publishRecord)
You can use TypeScript to subscribe to this same channel in your dashboard.
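If you do want the Push dataset route from your second option, here is a rough sketch of how the POST could look from Spark Streaming. The push URL is a placeholder (Power BI generates the real one when you create an API streaming dataset), and the row keys must match the schema you defined there:

import json
import requests

# Placeholder -- Power BI shows the real push URL on the dataset's API info page
PUSH_URL = 'https://api.powerbi.com/beta/your_workspace_id/datasets/your_dataset_id/rows?key=your_key'

def pushPartition(rows):
    # One POST per partition keeps the number of HTTP calls down,
    # which matters because Power BI throttles push datasets
    payload = [{'value': row} for row in rows]  # adjust keys to your dataset schema
    requests.post(PUSH_URL, data=json.dumps(payload),
                  headers={'Content-Type': 'application/json'})

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(pushPartition))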

Related

How to operate a Kafka cluster and a streaming application 24/7 on a budget?

I want to stream financial data (trades, order book) from an exchange's websocket endpoint and store that data somewhere to build up my own data history for backtesting purposes. Furthermore, I might want to analyze the data in real time.
I found the idea of an event-driven system very interesting, so I ended up building my own Dockerized Confluent Kafka cluster (with the Avro schema registry) and a Python producer that sends the streaming data into a Kafka topic. Then I set up a Faust app to stream-process the data and write it to a new Kafka topic, roughly along the lines of the sketch below.
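(The topic names and the Trade model here are simplified stand-ins, not my real code.)

import faust

# Simplified stand-in for the Faust app described above
app = faust.App('trades-app', broker='kafka://localhost:9092')

class Trade(faust.Record):
    symbol: str
    price: float
    size: float

raw_trades = app.topic('raw-trades', value_type=Trade)
clean_trades = app.topic('clean-trades', value_type=Trade)

@app.agent(raw_trades)
async def process(trades):
    async for trade in trades:
        # filter/enrich here, then republish to the processed topic
        await clean_trades.send(value=trade)

if __name__ == '__main__':
    app.main()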
It's working fine on my laptop, but now I'm wondering how I could put this into production. Obviously I cannot run it on my laptop, because I need this application to run 24/7 without interruption.
When I look at fully managed Kafka cloud solutions like Confluent, I find them quite expensive, especially since I'm not running a business; it's rather a private hobby project. And maybe I don't even need that kind of highly scalable, professional service.
What could be a cost-efficient approach for me to get my streaming and storage application to work?
Is there another Kafka cloud solution better matched to my needs?
Should I set up my own server? Maybe a Raspberry Pi?
Or should I use a different approach?
I'm sorry if my problem description isn't very specific; it reflects how overwhelmed I am by all these system architecture questions and cloud services.
Any advice and recommendations are appreciated!

Visualise data from a Kafka consumer

I'm new to Kafka. I have written a simple producer script that writes JSON data in the form of lat/long coordinates, and another consumer app that reads the lat/long data as it is produced. If I wish to plot the lat/long data using D3.js or Highcharts, how do I do it? Any suggestions or links would be greatly appreciated. I did my research but couldn't find any relevant tutorials.
D3.js is a front-end technology; Kafka is a backend one.
You need some mechanism to forward data from a Kafka consumer to a given browser. WebSockets are one option, and there are several resources out there about this.
For example, https://github.com/sulthan309/Live-Dashboard-using-Kafka-and-Spring-Websocket
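As a rough sketch of that idea in Python (the topic name, ports, and the aiokafka and websockets packages are all assumptions, and the handler signature assumes websockets >= 10):

import asyncio
from aiokafka import AIOKafkaConsumer
import websockets

clients = set()

async def handler(websocket):
    # Track each connected dashboard browser
    clients.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        clients.discard(websocket)

async def relay():
    # Forward every Kafka record to every connected browser
    consumer = AIOKafkaConsumer('latlong', bootstrap_servers='localhost:9092')
    await consumer.start()
    try:
        async for msg in consumer:
            for ws in set(clients):
                try:
                    await ws.send(msg.value.decode('utf-8'))
                except websockets.ConnectionClosed:
                    clients.discard(ws)
    finally:
        await consumer.stop()

async def main():
    async with websockets.serve(handler, 'localhost', 8765):
        await relay()

asyncio.run(main())

The D3.js or Highcharts page then opens a WebSocket to ws://localhost:8765 and redraws on every message.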
The alternative is to feed the data into a system like Druid, InfluxDB, Elasticsearch, or another storage system intended for time-series numerical data, from which you can use real BI tools to analyze it.

Which Azure service should I use for real-time streaming?

I'm trying to do real-time analytics with Azure. Going through the services, I have seen three that Azure provides: HDInsight (Kafka), Azure Stream Analytics, and Azure Event Hubs. Which of these services should I use?
I'm trying to stream data in real time, either from SQL Server, Twitter, or some other source, and store it in Azure SQL Data Warehouse or Data Lake.
To answer the high-level question: Event Hubs (including Event Hubs for Kafka) and HDInsight can both be used for data ingestion. The first is a serverless service, while the second provides a managed Kafka cluster.
Azure Stream Analytics focuses on data processing, transformation, and analytics. You write a SQL query that takes data from Event Hubs (or IoT Hub) and moves it to various sinks such as SQL, SQL Data Warehouse, Data Lake, etc.
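On the ingestion side, sending events to Event Hubs from Python is straightforward. A minimal sketch, assuming the azure-eventhub v5 package; the connection string and hub name are placeholders you copy from the Azure portal:

from azure.eventhub import EventHubProducerClient, EventData

# Placeholders -- copy the real values from the Azure portal
producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="<hub-name>")

# Batch events rather than sending them one at a time
batch = producer.create_batch()
batch.add(EventData('{"source": "twitter", "text": "hello"}'))
producer.send_batch(batch)
producer.close()

Stream Analytics can then read from that hub directly as a streaming input.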
To answer your particular question, you can look at this tutorial showing how to use Event Hubs and Stream Analytics to process Twitter data.
Also, you mentioned you want to take data from SQL Server: is it streaming data? Azure Stream Analytics supports SQL data only as reference data (slow-moving data used to enrich a stream). If you are instead looking to do ETL and move data from SQL Server to other places at a regular pace, Azure Data Factory or SQL Server Integration Services could be a good choice.
Let me know if this answers your question, I'll be happy to give you more info.
Jean-Sébastien
(Azure Stream Analytics)

Flafka (Http -> Flume->Kafka ->Spark Streaming)

I have a use case for real-time streaming: we will use Kafka (0.9) as the message buffer and Spark Streaming (1.6) for stream processing (HDP 2.4). We will receive ~80-90K events/sec over HTTP. Can you suggest a recommended architecture for ingesting the data into Kafka topics, which will then be consumed by Spark Streaming?
We are considering the Flafka architecture.
Is Flume listening on HTTP and forwarding to Kafka (Flafka) a good option for real-time streaming?
Please share any other possible approaches.
One approach could be Kafka Connect. Look for a source connector that fits your needs, or develop a custom one; a sketch of registering a connector through the Connect REST API follows.
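For illustration, registering a source connector with a Connect worker is a single REST call. In this sketch everything except the REST endpoint's default port is a placeholder; substitute whichever HTTP source connector you pick or build:

import requests

# Hypothetical connector config -- the connector class is a placeholder
connector = {
    "name": "http-to-kafka",
    "config": {
        "connector.class": "com.example.HttpSourceConnector",  # placeholder
        "tasks.max": "4",
        "kafka.topic": "events",
    },
}

# Kafka Connect exposes its REST API on port 8083 by default
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()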

How realtime data input to Druid?

I have an analytics server (for example, a click counter). I want to send data to Druid using some API. How should I do that?
Can I use it as a replacement for Google Analytics?
As se7entyse7en said:
"You can ingest your data to Kafka and then use Druid's Kafka firehose to ingest your data to Druid through real-time ingestion. After that you can interactively query Druid using its API."
It must be said that firehoses can only be set up on Druid real-time nodes.
Here is a tutorial on how to set up the Kafka firehose: Loading Streaming Data.
Besides the Kafka firehose, you can set up the other provided firehoses (Amazon S3, RabbitMQ, etc.) by including them, and you can even write your own firehose as an extension; an example is here. Here are all the Druid extensions.
It must be said that Druid is shifting real-time ingestion from real-time nodes to the Indexing Service, as explained here.
Right now the best practice is to run a Realtime Index Task on the Indexing Service, and then you can use Druid's API to send data to this task. You can use the API directly, but it is far easier to use Tranquility, a library that automatically creates a new Realtime Index Task for each new segment and routes your messages to the right task. You can also set the replication and sharding level, etc. Just run the Indexing Service, use Tranquility, and you can start sending your messages to Druid.
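If you run Tranquility in its server mode, sending an event is just an HTTP POST. A minimal sketch in Python (the port is Tranquility Server's default, while the datasource name and event fields are assumptions for illustration):

import json
import requests

# Assumes a Tranquility Server on its default port 8200, configured with
# a datasource named "clicks" -- both are illustrative assumptions
event = {"timestamp": "2016-01-01T00:00:00Z", "page": "/home", "clicks": 1}

resp = requests.post(
    "http://localhost:8200/v1/post/clicks",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()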
Considering your Druid is a 0.9.x version, the best approach is Tranquility. Its REST API is pretty solid and lets you control your data schema. See the druid.io quickstart page, in particular the "Load streaming data" section.
I am loading clickstream data for our website in real time and it has been working very well. So yes, you can replace Google Analytics with Druid (assuming you have the required infrastructure).