Async Checkpointing in Spark Structured Streaming using RocksDB - pyspark

I am currently exploring how to enable async checkpointing in Spark Structured Streaming, but I cannot find any way to do it. Databricks offers this feature in its own flavour of Spark.
I am on Spark Structured Streaming 3.3.1 and RocksDB 7.7.3.
Any suggestions?
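For context, below is a minimal sketch (in Java; the PySpark configuration key is the same) of how the RocksDB state store provider is enabled in open-source Spark 3.3.x. Checkpointing here is still synchronous; the Kafka source, topic name, and checkpoint path are placeholders.

    import org.apache.spark.sql.SparkSession;

    public class RocksDbStateStoreExample {
        public static void main(String[] args) throws Exception {
            // Enable the RocksDB-backed state store. Note: this only changes the
            // state store backend; checkpointing remains synchronous in
            // open-source Spark 3.3.1.
            SparkSession spark = SparkSession.builder()
                    .appName("rocksdb-state-store-demo")
                    .config("spark.sql.streaming.stateStore.providerClass",
                            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
                    .getOrCreate();

            // A placeholder streaming aggregation; source, topic, and paths are
            // assumptions for illustration only.
            spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "events")
                    .load()
                    .groupBy("key")
                    .count()
                    .writeStream()
                    .outputMode("update")
                    .option("checkpointLocation", "/tmp/checkpoints/events-agg")
                    .format("console")
                    .start()
                    .awaitTermination();
        }
    }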

We shared your question with the Speedb hive on Discord, and here is what we have for you, from Hilik, our co-founder and chief scientist:
"RocksDB currently does not have a mechanism for async checkpoints. The checkpoint is done by halting all I/O, flushing the memtables, and then using a hard link on the file system. Since this is a very disruptive operation, it is on our to-do list. If you are interested, please suggest a feature to the community and we will prioritize it according to the interest."
Hope this helps, let us know if you have any other questions.
You can join the discussions on Discord, where there is a thread about your question.

Related

How to do PCA with a Spark Streaming DataFrame

Just curious: how can we run Principal Component Analysis on streaming data in distributed mode? If we can, is it mathematically valid?
Has anyone done this before? Can you share your experience with it? Is there any API Spark provides for doing this in Spark Streaming mode?
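Spark does not ship an incremental PCA for streams, so anything here is a workaround. The sketch below (Java, Structured Streaming) refits PCA on every micro-batch inside foreachBatch; the rate source and column names are placeholders, and refitting per batch only summarises that batch, which is not mathematically equivalent to PCA over the whole stream.

    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class StreamingPcaSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("streaming-pca-sketch")
                    .getOrCreate();

            // Placeholder streaming source with numeric columns f1, f2, f3.
            Dataset<Row> stream = spark.readStream()
                    .format("rate") // built-in test source; replace with your real source
                    .load()
                    .selectExpr("cast(value as double) as f1",
                                "cast(value % 10 as double) as f2",
                                "cast(value % 100 as double) as f3");

            VectorAssembler assembler = new VectorAssembler()
                    .setInputCols(new String[]{"f1", "f2", "f3"})
                    .setOutputCol("features");

            stream.writeStream()
                    .outputMode("append")
                    .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
                        @Override
                        public void call(Dataset<Row> batch, Long batchId) {
                            if (batch.isEmpty()) {
                                return;
                            }
                            Dataset<Row> assembled = assembler.transform(batch);
                            // Re-fitting per micro-batch is NOT incremental PCA;
                            // it only summarises the current batch.
                            PCAModel pca = new PCA()
                                    .setInputCol("features")
                                    .setOutputCol("pca")
                                    .setK(2)
                                    .fit(assembled);
                            pca.transform(assembled).select("pca").show(5, false);
                        }
                    })
                    .start()
                    .awaitTermination();
        }
    }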

mongodb change streams java

Since this feature is relatively new (MongoDB 3.6), I found very few Java examples.
My questions:
1. What are the best practices for watching change streams?
2. Does it have to be a blocking call to watch the stream? (That would mean a thread per collection, which is less desirable.)
This is the example I encountered:
http://mongodb.github.io/mongo-java-driver/3.6/driver/tutorials/change-streams/
The blocking call is:
collection.watch().forEach(printBlock);
Thanks,
Rotem.
Change streams make a lot more sense when you look at them in the context of reactive streams. It took me a while to realize this concept has a much broader existence than just the MongoDB driver.
I recommend reviewing the article above and then looking at the example provided here. The two links helped clear things up, and provided insight on how to write code leveraging the reactive streams Mongo driver, which is non-blocking.
Use the Mongo reactive driver so that it will be non-blocking. We used this approach and have been running it in production for the last month with no issues.
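To make the non-blocking point concrete, here is a minimal sketch using the MongoDB Reactive Streams Java driver; the connection string, database, and collection names are placeholders, and a real application would use bounded demand and proper error handling rather than this bare Subscriber.

    import com.mongodb.client.model.changestream.ChangeStreamDocument;
    import com.mongodb.reactivestreams.client.MongoClient;
    import com.mongodb.reactivestreams.client.MongoClients;
    import com.mongodb.reactivestreams.client.MongoCollection;
    import org.bson.Document;
    import org.reactivestreams.Subscriber;
    import org.reactivestreams.Subscription;

    public class NonBlockingChangeStream {
        public static void main(String[] args) throws InterruptedException {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("mycollection");

            // Subscribing is non-blocking: events are delivered on the driver's
            // own threads, so no dedicated thread per collection is needed.
            collection.watch().subscribe(new Subscriber<ChangeStreamDocument<Document>>() {
                @Override
                public void onSubscribe(Subscription s) {
                    s.request(Long.MAX_VALUE); // unbounded demand, for this sketch only
                }

                @Override
                public void onNext(ChangeStreamDocument<Document> change) {
                    System.out.println("change: " + change);
                }

                @Override
                public void onError(Throwable t) {
                    t.printStackTrace();
                }

                @Override
                public void onComplete() {
                    System.out.println("stream completed");
                }
            });

            // Keep the demo JVM alive; a real application would not need this.
            Thread.sleep(60_000);
        }
    }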

Play Framework with Spark MLlib vs PredictionIO

Good morning,
Currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.
As we evolve, we'll be developing an intranet to address some of the company's needs and, for some time now, I've been considering Scala (and Play Framework) as the way to go.
This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. This is when I started to think about Spark ML and came across PredictionIO.
As we are shifting our skills towards data science, which will benefit and teach us/the company most:
build everything on top of Play and Spark, with both the platform and the machine learning in the same project
use Play and PredictionIO, where most of the groundwork is already prepared
I'm not trying to open an opinion-based question; rather, I want to learn from your experience / architectures / solutions.
Thank you
Both are good options: 1. use PredictionIO if you are new to ML; it is easy to start with, but it will limit you in the long run. 2. Use Spark if you have confidence in your data science and data engineering team; Spark has an excellent, easy-to-use API along with an extensive ML library. That said, to put things into production you will need some distributed Spark knowledge and experience, and it is tricky at times to make it efficient and reliable.
Here are the options:
Spark on Databricks cloud: expensive, but easy-to-use Spark with no data engineering required
PredictionIO: if you are certain that its ML can solve all your business cases
Spark on Google Dataproc: an easily managed cluster for about 60% less than AWS, though some engineering is still required
In summary: PredictionIO for a quick fix, and Spark for long-term data science / engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
PredictionIO uses Spark's MLlib for the majority of its engine templates.
I'm not sure why you're separating the two?
PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as deeplearning4j & H2O to name a few.
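For a sense of what "using MLlib directly" looks like for the client-clustering use case mentioned in the question, here is a minimal sketch; the input path, column names, and choice of k are all placeholders.

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ClientClusteringSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("client-clustering-sketch")
                    .getOrCreate();

            // Hypothetical client metrics table; path and columns are placeholders.
            Dataset<Row> clients = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("/data/clients.csv");

            Dataset<Row> features = new VectorAssembler()
                    .setInputCols(new String[]{"total_sales", "orders_per_month", "avg_basket"})
                    .setOutputCol("features")
                    .transform(clients);

            // Cluster clients into 5 segments with k-means.
            KMeansModel model = new KMeans().setK(5).setSeed(42L).fit(features);
            model.transform(features)
                    .select("client_id", "prediction")
                    .show(10);
        }
    }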

Is Hadoop File System Used by All NoSQL Frameworks?

I am new to Big Data; obviously most applications using NoSQL frameworks such as MongoDB, CouchDB, and Cassandra require access to huge amounts of data. My question is whether all these NoSQL tools use the Hadoop file system as their storage, or somehow use file systems of their own.
If they use the Hadoop file system, do they have an easy way to integrate with it?
Thanks
No, they do not use HDFS by default. Many of the NoSQL databases have been made to scale out well. That is, the data can be separated onto a bunch of regular non-HDFS machines and if configured correctly (in some cases this could be a big if) they will operate efficiently.
So they do not use HDFS for their scaling systems, but they can be integrated with Hadoop:
Documentation and Webinar about MongoDB and Hadoop.
Blog about CouchDB and Hadoop.
Documentation about Cassandra and Hadoop.

Apache Kafka on Github [closed]

Are there any good demo projects using Apache Kafka (version 0.8 preferred) on GitHub (or somewhere else)? We've been testing it with some toy projects, but I'd like to check out some bigger, real-world projects.
Toy projects are as good as demo projects. It is fun to write your own simple producers/consumers, and you can create your own demo project (while doing so you will learn a lot). Think of a problem where you need a large amount of, say, streaming data (think of the logs of a running application). Now have Kafka read those logs. Kafka is just a kind of message queue; until you write your consumers, I don't think there will be much fun in it. So for a real-world consumer, pick, say, Twitter Storm. Send all the log lines of your application to the brokers, where the Storm consumer (aka the Kafka Spout) picks up those lines and sends them to Bolts (Spout/Bolt are Storm terminology, similar to MapReduce but for real time).
This way you will have a fully fledged demo application.
Now the main question: how do you generate logs to feed to Kafka (for a demo project, if you don't have an application)? There are plenty of huge open-source data sets available; YouTube, Amazon, and Twitter all provide them. Just download one and think of an application. For example, consider the YouTube video logs (http://netsg.cs.sfu.ca/youtubedata/). Simulate them as if they were arriving online and feed them into Kafka. Let the Storm consumer (the Kafka Spout) pick up each log line from the Kafka broker and hand it to a Bolt, where, say, the Bolt just reads the line (does some analytics) and calculates the hottest/trending genres of videos watched in the last X minutes.
Writing all of this should not take much time. Enjoy!
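As a starting point for the "write your own simple producer" idea above, here is a minimal sketch using the old Kafka 0.8 producer API; the broker address, topic name, and log file path are placeholders.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class LogLineProducer {
        public static void main(String[] args) throws Exception {
            // Minimal Kafka 0.8 producer configuration; broker address is a placeholder.
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "1");

            Producer<String, String> producer = new Producer<>(new ProducerConfig(props));

            // Replay an existing log file as if the lines were arriving live.
            try (BufferedReader reader = new BufferedReader(new FileReader("/tmp/app.log"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    producer.send(new KeyedMessage<>("app-logs", line));
                    Thread.sleep(10); // crude simulation of a live stream
                }
            }
            producer.close();
        }
    }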
I have been using Kafka for quite some time. I am using the franz-kafka Node.js client to implement the PubSubHubbub spec.
I too didn't find any projects using Kafka, but you can ask me any questions you have and I will try to answer them.
Thanks
You can try this: https://github.com/wurstmeister/storm-kafka-0.8-plus
It uses the 0.8 build.