safe downcasting in Apache Spark SQL - scala

We have well-settled source and target data, with Spark SQL in Scala used in between. In some cases the target schema is more restrictive than the source schema, but the business says the target schema is the more accurate one. At this point we cannot change the schemas. We could live with simple downcasting, but in case the business people are wrong we would like some sanity check: safe downcasts that never truncate data silently.
We use DataFrames; the source and target are parquet files, and at the very end we convert to strongly typed Datasets in Scala.
What would be the best approach to do this in a generic way? Is there anything that can be done while reading with a schema that would error out instead of truncating? Or should we have some sort of UDFs and validate as we load the data?
It seems like a problem that others must have faced, and I'm something of a novice with Spark. I'm just looking for an established practice so I don't have to reinvent the wheel.
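One generic approach is a round-trip check: cast each column to the target type, cast back to the source type, and fail if any value changed. The helper below is a sketch of that idea, not an established API; the name `safeDowncast` and the `require`-based failure policy are my own choices.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Cast every column of df to the target schema, but first count the rows
// where a round-trip cast (source -> target -> source) changes the value,
// i.e. where the downcast would be lossy, and fail loudly instead.
def safeDowncast(df: DataFrame, target: StructType): DataFrame = {
  target.fields.foreach { f =>
    val sourceType = df.schema(f.name).dataType
    // <=> is null-safe equality, so rows the cast turns into null also count
    val lossy = df.filter(
      !(col(f.name).cast(f.dataType).cast(sourceType) <=> col(f.name))
    ).count()
    require(lossy == 0L,
      s"Column ${f.name}: $lossy row(s) would lose data casting " +
      s"$sourceType -> ${f.dataType}")
  }
  df.select(target.fields.map(f => col(f.name).cast(f.dataType)): _*)
}
```

This triggers one pass over the data per checked column, which may matter for large inputs; the per-column filters could also be combined into a single aggregation if that becomes a bottleneck.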

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is that I want to use Spark to generate data and then work with that data internally. That is, I have a program that does a thing and generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes and have each of them contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and don't have much support anymore.
I have a feeling there has to be a way to do this - has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
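The seed-and-flatMap idea above can be sketched as follows; the row counts, output path, and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("generate").getOrCreate()
import spark.implicits._

// One seed row per unit of work; Spark distributes these across the workers.
val seeds = spark.range(0, 1000)

// Each worker expands its seeds into generated rows locally.
val generated = seeds.flatMap { s =>
  val seed = s.toLong
  (0 until 100).map(i => (seed, i, s"row-$seed-$i"))
}

// Each partition writes its own share directly to the underlying store.
generated.toDF("seed", "i", "value").write.parquet("/tmp/generated")
```

Note the memory caveat from above: the collection returned per seed is materialized on the worker, so keep the fan-out per seed row bounded.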

What happens to a Spark DataFrame used in Structured Streaming when its underlying data is updated at the source?

I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution, or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet, I am just exploring what the possibilities are. I am working with Spark 2.3.2
A big (set of) question(s).
I have not implemented all aspects of this myself (yet), but this is my understanding, plus one piece of info from colleagues who implemented one aspect, which I found compelling and logical. I note that there is not enough info out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard Databricks coding practices are applied and .cache is used, the Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles, and there is no program failure.
If standard Databricks coding practices are applied and caching is NOT used, the Structured Streaming program will re-read the static source on every loop, and all changes will be seen on subsequent processing cycles.
But JOINing against a LARGE static source is not a good idea. For a large dataset, use HBase or some other key-value store, looked up via mapPartitions, whether the data is volatile or non-volatile. This is more difficult, though. It was done at an airline company I worked at, and the data engineer and designer told me it was no easy task. Indeed, it is not that easy.
So, we can say that updates to the static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could make the static source a DataFrame declared with var, and use a counter on the driver. Since the micro-batch physical plan is evaluated and generated every time, my take is that there is no issue with broadcast-join aspects or optimization. Whether this is the most elegant approach is debatable - it is not my preference.
If your data is small enough, the alternative is to perform the lookup via a JOIN on the primary key augmented with some max value in a technical column added to the key, making it a compound primary key - the data is then updated in the background with a new set of rows rather than overwritten. This is easiest, in my view, if you know the data is volatile and small. Versioning also means others can still read older data; I mention this because the source may be a shared resource.
The final word from me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In that case I would use a KV store as the lookup, with mapPartitions rather than a JOIN. See https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc for some insights. This one is old but still an applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience to see the forest for the trees.
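The "var on the driver" idea mentioned above could look roughly like this. Note that foreachBatch requires Spark 2.4+, so on the asker's 2.3.2 the same refresh logic would need to live in a custom sink; `staticPath`, `outPath`, `streamingDf`, and the join key are all placeholders:

```scala
// Re-read the static parquet source from the driver once per day, so the
// daily background update is eventually picked up by the stream.
var staticDf = spark.read.parquet(staticPath)
var lastLoaded = System.currentTimeMillis()
val oneDayMs = 24L * 60 * 60 * 1000

val query = streamingDf.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    if (System.currentTimeMillis() - lastLoaded > oneDayMs) {
      staticDf = spark.read.parquet(staticPath)  // pick up the daily rewrite
      lastLoaded = System.currentTimeMillis()
    }
    batch.join(staticDf, "key")
      .write.mode("append").parquet(outPath)
  }
  .start()
```

Because the join runs inside foreachBatch against whatever `staticDf` currently points to, each micro-batch sees a consistent snapshot of the static side.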

Using Scala Pickling serialization In APACHE SPARK over KryoSerializer and JavaSerializer

While searching for the best serialization techniques for apache-spark, I found the link below:
https://github.com/scala/pickling#scalapickling
which states that serialization in Scala will be faster and automatic with this framework. Scala Pickling also has the advantages listed here: https://github.com/scala/pickling#what-makes-it-different
So I wanted to know whether Scala Pickling (PickleSerializer) can be used in apache-spark instead of KryoSerializer.
If yes, what changes need to be made? (An example would be helpful.)
If not, why not? (Please explain.)
Thanks in advance, and forgive me if I am wrong.
Note: I am using the Scala language to code my apache-spark (version 1.4.1) application.
I visited Databricks for a couple of months in 2014 to try and incorporate a PicklingSerializer into Spark somehow, but couldn't find a way to include type information needed by scala/pickling into Spark without changing interfaces in Spark. At the time, it was a no-go to change interfaces in Spark. E.g., RDDs would need to include Pickler[T] type information into its interface in order for the generation mechanism in scala/pickling to kick in.
All of that changed, though, with Spark 2.0.0. If you use Datasets or DataFrames, you get so-called Encoders. These are even more specialized than scala/pickling.
Use Datasets in Spark 2.x. They are much more performant on the serialization front than plain RDDs.
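A minimal illustration of what Encoders give you in Spark 2.x; the `Person` case class is made up for the example:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder
  .appName("encoders")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._  // derives an Encoder[Person] at compile time

// The Dataset stores rows in Spark's compact internal binary format:
// no Java serialization and no Kryo registration needed for case classes.
val ds = spark.createDataset(Seq(Person("ann", 35), Person("bob", 17)))
ds.filter(_.age >= 18).show()
```

This is the type-information mechanism the answer alludes to: the encoder is resolved statically from the `Person` type, much like a `Pickler[T]` would have been, but built into the Dataset API.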

replacing sql database using akka-persistence - why is it not happening?

There are ways to replace SQL databases in Haskell and Clojure:
http://www.datomic.com/ (Clojure)
https://github.com/dmbarbour/haskell-vcache
https://hackage.haskell.org/package/acid-state
However, I cannot find a library for doing this in Scala using akka-persistence, and I wonder why.
I heard that https://www.querki.net/ is doing something similar (https://github.com/jducoeur/Querki), but it is not a copyleft library (unlike acid-state for Haskell).
I wonder if I am looking at this from the wrong angle. Why do other languages have these solutions while Scala does not seem to? Maybe there is a fundamental reason for that. Am I missing something?
The libraries you mention do quite different things:
akka-persistence stores the state of an actor, for the case where an actor keeps internal state. This is quite specialized.
acid-state serializes Haskell data to disk.
Datomic is a system for recording temporal data in a way that does not destroy the original data.
Object stores work well with dynamic languages like Clojure and Python, since they deal with dynamic data that can be serialized to disk.
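To see why akka-persistence is specialized rather than a database replacement, here is a minimal event-sourced actor using the classic API; the `Counter`/`Incremented` names are made up for illustration:

```scala
import akka.persistence.PersistentActor

// Events, not state, are what get written to the journal.
case class Incremented(by: Int)

class Counter extends PersistentActor {
  override def persistenceId: String = "counter-1"
  private var total = 0

  override def receiveCommand: Receive = {
    case n: Int =>
      // append the event to this actor's journal, then apply it
      persist(Incremented(n)) { evt => total += evt.by }
    case "get" => sender() ! total
  }

  override def receiveRecover: Receive = {
    // on restart, replay the journal to rebuild the in-memory state
    case Incremented(by) => total += by
  }
}
```

Persistence is scoped to a single actor's event stream (one `persistenceId`), with no cross-entity queries or transactions, which is why it doesn't map onto the SQL-replacement role the question has in mind.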
I found it much nicer to work with MongoDB in Python than in Scala.
When the NoSQL movement started there was initial excitement, but after using these systems some people realized they were giving up good properties that databases have.
Datomic is an interesting project with new ideas. There is a Scala clone of it; I'm not sure how stable it is:
https://github.com/dwhjames/datomisca

MapReduce implementation in Scala

I'd like to find a good and robust MapReduce framework that can be utilized from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
UPD 5 oct. 11
There is also the Scoobi framework, which has awesome expressiveness.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark:
http://spark-project.org/
It gives you the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive MapReduce operations.
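As a taste of the API, here is the classic word count on the RDD interface; the input and output paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("wordcount").setMaster("local[*]"))

val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))     // split lines into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)           // sum the counts per word
  .cache()                      // keep the result in memory for reuse

counts.saveAsTextFile("counts")
```

The `.cache()` call is the in-memory aspect mentioned above: a second action over `counts` reuses the materialized result instead of re-reading from disk.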
You may be interested in scouchdb, a Scala interface to using CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also an effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors - e.g. joining a key as a String with a key as a Long - lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix pipes: your code (in any language) just has to read records from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure it as the combiner as well).
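For example, a Hadoop Streaming mapper in Scala is just a stdin-to-stdout program; the word-count emitter below is a hypothetical illustration:

```scala
// Hadoop Streaming mapper: read lines from stdin, emit "word<TAB>1" pairs
// on stdout. The framework sorts these by key before a reducer reads them
// back in the same line-oriented way.
object WordCountMapper extends App {
  scala.io.Source.stdin.getLines()
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)
    .foreach(word => println(s"$word\t1"))
}
```

The compiled program would then be passed to the streaming jar via its `-mapper` option, with a similar stdin-reading program as the reducer.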
I've added a MapReduce implementation using Hadoop on GitHub, with a few test cases, here: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.