How to do PCA with Spark Streaming Dataframe - pyspark

Just curious: how can we run Principal Component Analysis on streaming data in distributed mode? If we can, is it mathematically valid?
Has anyone done this before? Can you share your experience with it? Is there any API Spark provides to do the same in Spark Streaming mode?
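For what it's worth, Spark does not ship an incremental/streaming PCA, so one common workaround is to fit an ordinary batch PCA on each micro-batch via foreachBatch. Below is a minimal pyspark sketch under that assumption; the schema, source path, and column names are made up for illustration, and each fit only describes that micro-batch rather than the whole stream, which is the main mathematical caveat.

```python
# Minimal sketch (not an official streaming PCA API): fit a batch PCA per
# micro-batch. Schema, path, and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA

spark = SparkSession.builder.appName("streaming-pca-sketch").getOrCreate()

stream_df = (spark.readStream
             .schema("f1 DOUBLE, f2 DOUBLE, f3 DOUBLE")  # hypothetical schema
             .csv("/tmp/incoming"))                       # hypothetical source path

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

def fit_pca_per_batch(batch_df, batch_id):
    if batch_df.rdd.isEmpty():
        return
    vectors = assembler.transform(batch_df)
    model = PCA(k=2, inputCol="features", outputCol="pca").fit(vectors)
    # Explained variance of the principal components for this micro-batch only
    print(batch_id, model.explainedVariance)

query = (stream_df.writeStream
         .foreachBatch(fit_pca_per_batch)
         .start())
query.awaitTermination()
```

A truly incremental PCA would need an online algorithm that updates the covariance estimate as data arrives; the sketch above simply recomputes from scratch on each batch.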

Related

Any pointers on how to build a data lineage solution across multiple applications?

We have a requirement to capture data lineage across multiple applications. These applications span a multi-technology stack ranging from PL/SQL to Java to Spark.
Any hints on how to proceed would be of great help.
Thanks
Anuj Mehra

Async Checkpointing in Spark Structured Streaming using RocksDB

I am currently exploring how to enable async checkpointing in Spark Structured Streaming, but I am not able to find any way to do it. Databricks offers this feature for its flavour of Spark.
Spark Structured Streaming 3.3.1 and RocksDB 7.7.3
Any suggestions?
We shared your question with the Speedb hive on Discord, and here is what we have for you, from Hilik, our co-founder and chief scientist:
"RocksDB currently does not have a mechanism for async checkpoints. The checkpoint is done by halting all the I/O, flushing the memtables, and then using a hard link on the file system. Since this is a very destructive operation, it is on our to-do list. If you are interested, please suggest a feature to the community and we will prioritize it according to the interest."
Hope this helps; let us know if you have any other questions.
Join the discussion on Discord, where there is a thread about your question.
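For reference, here is a minimal pyspark sketch of how the RocksDB state store provider is switched on in open-source Structured Streaming (available since Spark 3.2). This only changes where the streaming state lives; it does not make checkpointing asynchronous, which, as noted above, is currently a Databricks-specific capability. The query, sink, and checkpoint path are placeholders.

```python
# Sketch: enable the RocksDB state store in open-source Spark Structured
# Streaming. This does NOT provide async checkpointing.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rocksdb-statestore-sketch")
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
         .getOrCreate())

# Hypothetical stateful query; the checkpointLocation path is an example only.
counts = (spark.readStream.format("rate").load()
          .groupBy("value").count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/rocksdb-demo")
         .start())
```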

Is it possible to write a new SpreadsheetDocument to a write-only stream?

My understanding of the OpenXML SDK is that it offers both a DOM-oriented mode and a high-performance streaming SAX mode.
My goal is to write a spreadsheet directly to a network stream, and such a stream is write-only. I didn't get far at all.
SpreadsheetDocument.Create throws an exception when the stream to be written to does not support reading, writing, and seeking, which rules out streaming over a network.
Are there any options in the SDK that I'm overlooking that will enable this?
No.

Play Framework with Spark MLib vs PredictionIO

Good morning,
currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.
As we evolve, we'll be developing an intranet to answer some of the company's needs and, for some time now, I've been considering Scala (and the Play Framework) as the way to go.
This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. That is when I started thinking about Spark ML and came across PredictionIO.
As we are shifting our skills towards data science, what will benefit and teach us/the company most:
building everything on top of Play and Spark and having both the platform and the machine learning in the same project
using Play and PredictionIO, where most of the stuff is already prepared
I'm not trying to open an opinion-based question; rather, I want to learn from your experience / architectures / solutions.
Thank you
Both are good options:
1. Use PredictionIO if you are new to ML; it is easy to start with, but it will limit you in the long run.
2. Use Spark if you have confidence in your data science and data engineering team; Spark has an excellent and easy-to-use API along with an extensive ML library. That said, in order to put things into production you will need some distributed Spark knowledge and experience, and it can be tricky at times to make it efficient and reliable.
Here are some options:
Spark on Databricks cloud: expensive, but easy-to-use Spark with no data engineering required
PredictionIO: if you are certain that its ML can solve all your business cases
Spark on Google Dataproc: an easy managed cluster for about 60% less than AWS, though some engineering is still required
In summary: PredictionIO for a quick fix, and Spark for long-term data science / engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
PredictionIO uses Spark's MLlib for the majority of its engine templates.
I'm not sure why you're separating the two?
PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as Deeplearning4j and H2O, to name a few.
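For a sense of what "building directly on Spark" looks like for the client-clustering use case mentioned above, here is a minimal pyspark sketch with made-up data and column names; PredictionIO's engine templates ultimately drive similar MLlib calls under the hood.

```python
# Sketch: cluster clients with Spark ML KMeans. Data and column names are
# hypothetical examples, not part of either platform's API.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("client-clustering-sketch").getOrCreate()

clients = spark.createDataFrame(
    [(1, 120.0, 3.0), (2, 40.0, 1.0), (3, 300.0, 8.0)],
    ["client_id", "monthly_spend", "orders_per_month"])

features = VectorAssembler(
    inputCols=["monthly_spend", "orders_per_month"],
    outputCol="features").transform(clients)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("client_id", "prediction").show()
```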

Standard process for developing DataStage jobs

I have experience using Pentaho Kettle and Talend Data Integration for ETL jobs, and typically the high-level process for developing transformations is:
define source connections
define target connections
define transformation of data between source and target
What is the 'standard' high-level process for developing DataStage jobs? Is it similar to the process identified above?
Exactly the same. If you know the basic concepts of an ETL tool, they apply to all of the tools.
The three steps you listed are, however, very high level. Depending on what you're trying to do, that list can be increased quite dramatically.
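Since the same three high-level steps apply to code-based ETL tools as well, here is a minimal pyspark sketch that mirrors them; the paths, formats, and column names are hypothetical. In DataStage the equivalent would be configured through connector and transformer stages in the Designer rather than written as code.

```python
# Sketch of the same three high-level ETL steps in pyspark. Paths and
# column names are hypothetical examples, not a DataStage equivalent.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-three-steps-sketch").getOrCreate()

# 1. Define the source connection
orders = spark.read.option("header", True).csv("/data/source/orders.csv")

# 2. Define the transformation of data between source and target
daily_totals = (orders
                .withColumn("amount", F.col("amount").cast("double"))
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount")))

# 3. Define the target connection and write
daily_totals.write.mode("overwrite").parquet("/data/target/daily_totals")
```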