Spark 3.0 enables reading binary data using a new data source:
val df = spark.read.format("binaryFile").load("/path/to/data")
In previous Spark versions you could load such data using:
val rdd = sc.binaryFiles("/path/to/data")
Beyond having the option to access binary data through the high-level API (Dataset), are there any additional benefits or features that Spark 3.0 introduces with this data source?
I don't think there is any additional benefit, besides the fact that developers have more control over the data with the high-level API (DataFrame/Dataset) than with the low-level one (RDD), and they don't need to worry as much about performance, since the high-level API optimizes and manages that on its own.
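For illustration, a minimal sketch (the path and glob pattern are placeholders): the binaryFile source returns a DataFrame with a fixed schema of path, modificationTime, length and content columns, which you can then filter and select like any other DataFrame.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("binary-file-example").getOrCreate()

// Read binary files; pathGlobFilter optionally restricts which files are picked up.
val df = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/path/to/data")

df.printSchema()
// root
//  |-- path: string (nullable = true)
//  |-- modificationTime: timestamp (nullable = true)
//  |-- length: long (nullable = true)
//  |-- content: binary (nullable = true)

df.select("path", "length").show(truncate = false)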
Reference -
https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html
P.S. I don't think my answer qualifies as a formal answer. I originally wanted to add it as a comment only, but I'm unable to do so because I haven't yet earned the privilege to comment. :)
The fundamental problem is attempting to use Spark to generate data but then work with the data internally. I.e., I have a program that does a thing, and it generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes, and have them each contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside of the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
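A minimal sketch of that flatMap approach (the output path, seed count and rows-per-seed are placeholders, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-generator").getOrCreate()
import spark.implicits._

// Distributed "seed" dataset: one seed value per unit of work.
val seeds = spark.range(0, 1000).as[Long]

// flatMap each seed into as many generated rows as needed.
val generated = seeds.flatMap { seed =>
  (0 until 100).map(i => (seed, i, s"row-$seed-$i"))
}.toDF("seed", "index", "payload")

// Each worker writes its generated rows back to the underlying store.
generated.write.mode("overwrite").parquet("/path/to/output")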
I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring what the possibilities are. I am working with Spark 2.3.2.
A big (set of) question(s).
I have not implemented all aspects of this myself (yet), but this is my understanding, plus input from colleagues who implemented one aspect of it that I found compelling and logical. I note that there is not enough information out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices as per Databricks are applied and .cache is used, the Spark Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles and there is no program failure.
If standard coding practices as per Databricks are applied and caching is NOT used, the Spark Structured Streaming program will read the static source on every micro-batch, and all changes will be seen on subsequent processing cycles from then on.
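For context, a minimal sketch of such a streaming-static join (the paths, the shared "id" column and the file-based streaming source are assumptions made for brevity):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Static side: read once; add .cache() to get the "read only once" behaviour described above.
val staticDf = spark.read.parquet("/path/to/static/parquet")

// Streaming side: a file source is used here purely for illustration.
val streamDf = spark.readStream
  .schema(staticDf.schema)
  .parquet("/path/to/streaming/input")

// Join the stream against the static lookup data on a shared key column.
val joined = streamDf.join(staticDf, Seq("id"))

val query = joined.writeStream
  .format("console")
  .start()

query.awaitTermination()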
But, JOINing against a LARGE static source is not a good idea. If the dataset is large, use HBase or some other key-value store, with mapPartitions, whether the data is volatile or non-volatile. This is more difficult though. It was done by an airline company I worked at, and the data engineer/designer told me it was no easy task. Indeed, it is not that easy.
So, we can say that updates to the static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could make the static source a dataframe using var, and use a counter on the driver. As the micro batch physical plan is evaluated and genned every time, no issue with broadcast join aspects or optimization is my take. Whether this is the most elegant, is debatable - and is not my preference.
If your data is small enough, the alternative is to read it using a JOIN and thus perform the lookup via the primary key, augmented with some max value in a technical column that is added to the key to make it a compound primary key; the data is then updated in the background with a new set of data rather than overwritten. This is easiest in my view if you know the data is volatile and small. Versioning means others may still read the older data. That is why I state this; it may be a shared resource.
The final say for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In this case I would use a KV store as the lookup, with mapPartitions, as opposed to a JOIN (a rough sketch of that pattern follows). See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is an old but still applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience and the ability to see the forest for the trees.
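A rough Scala sketch of the mapPartitions lookup pattern against HBase (the "customers" table, "cf" column family, record shape and the in-memory input are placeholders; error handling is omitted):

import org.apache.spark.sql.SparkSession
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val spark = SparkSession.builder().appName("kv-lookup").getOrCreate()
import spark.implicits._

// Placeholder batch of (customerId, amount) records to enrich.
val records = Seq(("c1", 10.0), ("c2", 25.0)).toDS()

// One HBase connection per partition, not one per record.
val enriched = records.mapPartitions { iter =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("customers"))
  val looked = iter.map { case (id, amount) =>
    val row = table.get(new Get(Bytes.toBytes(id)))
    val name = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
    (id, name, amount)
  }.toList   // materialize before closing the connection
  table.close()
  connection.close()
  looked.iterator
}

enriched.show()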
Is it possible to use Spark libraries such as Spark.ml within a Beam pipeline?
From my understanding, you write your pipeline in "Beam syntax" and let Beam execute it on Spark, using Spark as a runner.
Hence, I don't see how you could use spark.ml within Beam.
But maybe I get something wrong here?
Has someone already tried to use it? If not, do other ML libraries exist for native usage in Beam (apart from TensorFlow Transform)?
Many Thanks,
Jonathan
Apache Beam unifies stream and batch data processing. It's portable, meaning SDKs can be written in any language and pipelines can be executed on any data processing framework with enough capabilities (see: runners). ML is not its main concern, so its programming model does not define any unified API to work with ML.
But that does not mean you cannot use it with ML libraries to preprocess the data needed by your ML model for training or inference; it is well suited to do that for you. Beam comes with a set of built-in IOs, which may help you get data from many sources.
Now that SpyGlass is no longer being maintained, what is the recommended way to access HBase using Scala/Scalding? A similar question was asked in 2013, but most of the suggested links are either dead or point to defunct projects. The only link that seems useful is to Apache Flink. Is that considered the best option nowadays? Are people still recommending SpyGlass for new projects even though it isn't being maintained? Performance (massively parallel) and testability are priorities.
Based on my experience writing data to Cassandra using the Flink Cassandra connector, I think the best way is to use Flink's built-in connectors. Since Flink 1.4.3 you can use the HBase Flink connector. See here.
I connect to HBase in Flink using Java. Just create the HBase Connection object in the open method and close it within the close method of a RichFunction (e.g. RichSinkFunction). These methods are called once per Flink slot.
I think you can do something like this in Scala too.
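A rough Scala sketch of that pattern (the "events" table name, "cf" column family and the (key, value) record type are placeholders, not from the answer):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

class HBaseSink extends RichSinkFunction[(String, String)] {

  @transient private var connection: Connection = _

  // Called once per Flink slot: create the (expensive) HBase connection here.
  override def open(parameters: Configuration): Unit = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  }

  override def invoke(record: (String, String)): Unit = {
    val table = connection.getTable(TableName.valueOf("events"))
    try {
      val put = new Put(Bytes.toBytes(record._1))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(record._2))
      table.put(put)
    } finally {
      table.close()
    }
  }

  // Called once per slot on shutdown: release the connection.
  override def close(): Unit = {
    if (connection != null) connection.close()
  }
}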
Depends on what you mean by "recommended", I guess.
DIY
Eel
If you just want to access data on HBase from a Scala application, you may want to have a look at Eel, which includes libraries to interact with many storage formats and systems in the Big Data landscape and is natively written in Scala.
You'll most likely be interested in using the eel-hbase module, which as of a few releases ago includes an HBaseSource class (as well as an HBaseSink). It's actually so recent that I just noticed the README still mentions HBase is not supported. There are no explicit examples with Hive, but sources and sinks work in similar ways.
Kite
Another alternative could be Kite, which also has a quite extensive set of examples you can draw inspiration from (including with HBase), but it looks like a less active project than Eel.
Big Data frameworks
If you want a framework that helps you instead of brewing your own solution with libraries, there are a couple of options below. Of course, you'll have to account for some learning curve.
Spark
Spark is a fairly mature project and the HBase project itself has built a connector for Spark 2.1.1 (Scaladocs here). Here is an introductory talk that may come to your help.
The general idea is that you could use this custom data source as suggested in this example:
sqlContext
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
This gives you access to HBase data through the Spark SQL API. Here is a short extract from the same example:
val df1 = withCatalog(cat1, conf1)
val df2 = withCatalog(cat2, conf2)
val s1 = df1.filter($"col0" <= "row120" && $"col0" > "row090").select("col0", "col2")
val s2 = df2.filter($"col0" <= "row150" && $"col0" > "row100").select("col0", "col5")
val result = s1.join(s2, Seq("col0"))
Performance considerations aside, as you can see, the language can feel pretty natural for data manipulation.
Flink
Two answers already dealt with Flink, so I won't add much more, except for a link to an example from the latest stable release at the time of writing (1.4.2) that you may be interested in having a look at.
Please advise on a persistent key-value store for Scala. Is it possible to use Scala Spark components to build such a store with good access times? (I am new to Spark and planning to use it.) Thanks!
Spark is a library used for data processing; the underlying datastore is normally HDFS (Hadoop). So there is a conceptual difference between what Spark is and what a data store is.
Since you are looking for a persistent key-value store, I would suggest Redis because it is easy to set up, relatively mature, and has a Scala client.
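A minimal, hedged sketch (using the Jedis Java client from Scala for brevity; pure-Scala clients also exist; it assumes a local Redis on the default port, and the key name is a placeholder):

import redis.clients.jedis.Jedis

object RedisExample extends App {
  // Connect to a Redis instance running locally on the default port.
  val jedis = new Jedis("localhost", 6379)
  try {
    jedis.set("user:42:name", "Ada")      // persist a key-value pair
    println(jedis.get("user:42:name"))    // read it back: "Ada"
  } finally {
    jedis.close()
  }
}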
However, you could also use any key-value store depending on your specific needs and what may already exist in your environment. Take a look at these websites for some guidance:
http://en.wikipedia.org/wiki/NoSQL#Key-value_stores
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores