How can I connect Spark with Power BI, so that I can fetch all data directly from Spark using Python? I have seen several posts, but they don't meet my requirements.
I am looking for a way to read Kinesis data with Structured Streaming in Databricks. I am able to use a Spark cluster to continuously read streaming data. However, I now need a way to take those records as they arrive and perform CRUD operations on tables in Postgres or another DBMS. Is there a way to do this using only Databricks?
As I've read in Kafka: The Definitive Guide, Kafka Connect can simplify the task of loading CSV files into Kafka. But since we don't write any code for business logic with it (such as Python/Java code), what should I do if I want to read data from CSV, add data from several other sources to build a new message, or even add new data generated from system logs to that message, before loading it into Kafka? Is Kafka Connect still a good approach for this use case?
The source for this answer is this Stack Overflow thread: Kafka Connect- Modifying records before writing into HDFS sink
You have several options.
Single Message Transforms, great for light-weight changes as messages pass through Connect. Configuration-file-based, and extensible using the provided API if there's not an existing transform that does what you want. See the discussion here on when SMT is suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like.
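As an illustration of the first option, a Single Message Transform is declared in the connector's configuration. The fragment below is only a sketch: the alias addSource and the field name and value are made up for the example, and it assumes the record value is a struct or map. InsertField is one of the transforms that ships with Kafka Connect.

```properties
# Added to an existing connector's configuration.
# "addSource", "data_source" and "csv" are hypothetical names for this example.
transforms=addSource
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=csv
```

A transform like this sees one record at a time, so combining data from several sources into a new message is a better fit for KSQL or Kafka Streams.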
I am working on a POC to implement real-time analytics, where we have the following components.
Confluent Kafka: receives events from third-party services in Avro format (an event contains up to 40 fields). We are also using Kafka-Registry to deal with the different kinds of event formats.
I am trying to use MemSQL for analytics, for which I have to push events to a MemSQL table in a specific format.
I have gone through the MemSQL website, blogs, etc., but most of them suggest using the Spark MemSQL connector, in which you can transform the data coming from Confluent Kafka.
I have a few questions.
Can I use a simple Java/Go application in place of Spark?
Is there any utility provided by Confluent Kafka and MemSQL for this?
Thanks.
I recommend using MemSQL Pipelines. https://docs.memsql.com/memsql-pipelines/v6.0/kafka-pipeline-quickstart/
In current versions of MemSQL, you'll need to set up a transform, which is a small Go or Python script that reads in the Avro and outputs TSV. Instructions on how to do that are at https://docs.memsql.com/memsql-pipelines/v6.0/transforms/ but the tl;dr is that you need a script which does:
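while True:
    # Each record on stdin is prefixed with its size as an 8-byte integer.
    record_size = read_an_8_byte_int_from_stdin()
    avro_record = stdin.read(record_size)
    # AvroToTSV is a placeholder for your own Avro-to-TSV conversion.
    stdout.write(AvroToTSV(avro_record))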
Stay tuned for native Avro support in MemSQL.
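The loop above can be fleshed out roughly as follows. This is a minimal sketch, not MemSQL's reference implementation: it assumes the 8-byte size prefix is a little-endian signed integer, and the function names read_records, to_tsv_line and run_transform are made up for the example. Avro decoding is left to a decode callable that you supply (for example, one built on an Avro library together with your schema), since it depends on your setup.

```python
import io
import struct


def read_records(stream):
    """Yield raw record payloads from a binary stream of length-prefixed records."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of input, or a truncated size prefix
        # Assumption: the size prefix is a little-endian signed 64-bit integer.
        (size,) = struct.unpack("<q", header)
        payload = stream.read(size)
        if len(payload) < size:
            return  # truncated record; stop rather than emit partial data
        yield payload


def to_tsv_line(fields):
    """Join field values with tabs and terminate the row with a newline."""
    return "\t".join(str(field) for field in fields) + "\n"


def run_transform(stream_in, stream_out, decode):
    """Read framed records, decode each into a list of fields, write TSV rows.

    decode is a hypothetical callable: bytes -> list of field values.
    """
    for payload in read_records(stream_in):
        stream_out.write(to_tsv_line(decode(payload)).encode("utf-8"))
```

In a real transform script you would call run_transform(sys.stdin.buffer, sys.stdout.buffer, your_decoder) so that the bytes are read and written in binary mode.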
My team uses different databases, say mongodb and cassandra.
I need to know if it is possible to integrate a single spark cluster with both mongodb and cassandra clusters.
Or, in other words, is it possible to create dataframes from mongodb and cassandra in the same spark application?
Spark only sees DataFrames and RDDs. It doesn't really matter which database you're using, as long as a connector exists. You can make as many external connections as needed within a single SparkContext.
Any data sources that are read into those formats can be combined.
I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
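import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

// sc is an existing SparkContext; a batch is produced every 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)
dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}
ssc.start()
ssc.awaitTermination()
sc.stop()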
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have Spark Streaming read only the latest data, i.e., the data added after its previous read.
How can I achieve this? I tried to Google this but found very little documentation about it.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting spark streaming and cassandra and this question is about streaming only the latest data. BTW, streaming from cassandra IS possible by using the code I provided. However, it takes the entire table every time and not just the latest data.
Some low-level work is planned in Cassandra that will allow external systems (an indexer, a Spark stream, etc.) to be notified of new mutations arriving in Cassandra. See: https://issues.apache.org/jira/browse/CASSANDRA-8844
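Independently of the ticket above, one general way to read only rows added since the previous read is to keep a high-water mark and ask the source only for rows newer than it. The sketch below is not from this thread and does not use the Cassandra connector. It assumes each row carries a timestamp (a hypothetical created_at field) that the table can be filtered on efficiently, and it uses an in-memory list as a stand-in for the real query.

```python
from datetime import datetime


class IncrementalReader:
    """Returns only the rows that are newer than the previous poll."""

    def __init__(self, fetch_since, start):
        # fetch_since is a hypothetical callable: given a timestamp, it returns
        # the rows whose "created_at" is strictly greater than that timestamp.
        self.fetch_since = fetch_since
        self.watermark = start

    def poll(self):
        rows = self.fetch_since(self.watermark)
        if rows:
            # Advance the high-water mark to the newest row seen so far.
            self.watermark = max(row["created_at"] for row in rows)
        return rows


def make_in_memory_fetch(table):
    """Stand-in for a real query that filters on created_at > timestamp."""
    def fetch_since(timestamp):
        return [row for row in table if row["created_at"] > timestamp]
    return fetch_since
```

Each call to poll() would then correspond to one streaming batch. Rows written with a timestamp at or below the current mark are missed, so this only works when timestamps increase with insertion order.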