Using Scala Pickling serialization in Apache Spark over KryoSerializer and JavaSerializer - scala

While searching for the best serialization techniques for apache-spark, I found the link below:
https://github.com/scala/pickling#scalapickling
which states that serialization in Scala will be faster and automatic with this framework.
Scala Pickling also has the following advantages (ref - https://github.com/scala/pickling#what-makes-it-different).
So, I wanted to know whether Scala Pickling (PickleSerializer) can be used in apache-spark instead of KryoSerializer.
If yes, what changes need to be made? (An example would be helpful.)
If not, why not? (Please explain.)
Thanks in advance, and forgive me if I am wrong.
Note: I am using Scala to write my apache-spark (version 1.4.1) application.

I visited Databricks for a couple of months in 2014 to try to incorporate a PicklingSerializer into Spark somehow, but couldn't find a way to include the type information needed by scala/pickling into Spark without changing Spark's interfaces. At the time, changing interfaces in Spark was a no-go. E.g., RDDs would need to include Pickler[T] type information in their interface in order for the generation mechanism in scala/pickling to kick in.
All of that changed with Spark 2.0.0, though. If you use Datasets or DataFrames, you get so-called Encoders. These are even more specialized than scala/pickling.
Use Datasets in Spark 2.x. They're much more performant on the serialization front than plain RDDs.
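For illustration, a minimal sketch of the Dataset/Encoder route in Spark 2.x (the Person case class and field names are just made-up examples):

import org.apache.spark.sql.SparkSession

// Hypothetical record type; its Encoder is derived automatically via
// spark.implicits, so no serializer registration (Kryo or otherwise) is needed.
case class Person(name: String, age: Int)

object EncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-example").getOrCreate()
    import spark.implicits._

    // The Encoder serializes Person into Spark's compact internal binary format,
    // which is what makes Datasets cheaper on the serialization front than RDDs.
    val people = Seq(Person("Ann", 31), Person("Bob", 27)).toDS()
    people.filter(_.age > 30).show()

    spark.stop()
  }
}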

Related

Can Cascading be rewritten/replaced with Apache Spark & Scala? Is it more optimal?

I have to replace map-reduce code written in Pig and Java with Apache Spark & Scala as much as possible, and reuse or find an alternative where that is not possible.
I can find most of the Pig conversions to Spark. Now I have encountered Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how the plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are my few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Considering RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should this be used to convert the Java code to Scala code, which would remove the Java source code dependency? Will this be more optimal?
Cascading works on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there any limitations or special features that can only be performed by one or the other?
I am very new to the Big Data segment and not yet comfortable with the concepts and comparisons of all the Big Data related terminologies: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on this Big Data responsibility with my new job profile and have minimal senior knowledge/experience. Please provide an explanatory answer if possible. Thanks.

safe downcasting in Apache Spark SQL

We have some well-settled source and target data, and Spark SQL in Scala is used in between. There are cases where the schema of the target is more restrictive than the one of the source, but the business says the target schema is more accurate. At this point we cannot change the schemas. We could live with simply casting down, but just in case the business people are wrong we would like some sanity check and safe downcasts that don't truncate data silently.
We use DataFrames; the source and target are Parquet files, and at the very end we convert to strongly typed Datasets in Scala.
What would be the best approach to get this done in a generic way? Is there anything that can be done while reading with a schema that would error out rather than truncate? Or should we have some sort of UDFs and validate as we load the data?
It seems like a problem that others must have faced, and I'm something of a novice with Spark. I'm just looking for some sort of established practice so I don't have to re-invent the wheel.
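One generic way to sanity-check a downcast (a rough sketch, not from the thread; the lossyRows helper and column names are hypothetical) is to cast down, cast back, and flag values that don't survive the round trip:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: cast a column down to `targetType`, cast it back to its
// original type, and return the rows where the value was lost or changed.
def lossyRows(df: DataFrame, colName: String, targetType: String): DataFrame = {
  val original  = col(colName)
  val roundTrip = col(colName).cast(targetType).cast(df.schema(colName).dataType)
  df.filter(original.isNotNull && (roundTrip.isNull || roundTrip =!= original))
}

// Usage sketch: fail fast if the (hypothetical) "amount" column does not fit in an int.
// val bad = lossyRows(sourceDf, "amount", "int")
// if (bad.limit(1).count() > 0) sys.error("Lossy downcast detected for column 'amount'")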

Recommended way to access HBase using Scala

Now that SpyGlass is no longer being maintained, what is the recommended way to access HBase using Scala/Scalding? A similar question was asked in 2013, but most of the suggested links are either dead or point to defunct projects. The only link that seems useful is to Apache Flink. Is that considered the best option nowadays? Are people still recommending SpyGlass for new projects even though it isn't being maintained? Performance (massively parallel) and testability are priorities.
Based on my experience writing data to Cassandra using the Flink Cassandra connector, I think the best way is to use Flink's built-in connectors. Since Flink 1.4.3 you can use the HBase Flink connector. See here.
I connect to HBase in Flink using Java. Just create the HBase Connection object in the open method and close it within the close method of a RichFunction (e.g. a RichSinkFunction). These methods are called once by each Flink slot.
I think you can do something like this in Scala too.
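For illustration, a rough Scala sketch of that open/close pattern against the Flink 1.4-era RichSinkFunction API; the table name and column family below are placeholders:

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

// Open the HBase connection once per Flink slot and close it on shutdown,
// as described above. Key/value pairs are written to a placeholder column family.
class HBaseSink(tableName: String) extends RichSinkFunction[(String, String)] {
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(parameters: Configuration): Unit = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf(tableName))
  }

  override def invoke(value: (String, String)): Unit = {
    val put = new Put(Bytes.toBytes(value._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value._2))
    table.put(put)
  }

  override def close(): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}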
Depends on what you mean by "recommended", I guess.
DIY
Eel
If you just want to access data on HBase from a Scala application, you may want to have a look at Eel, which includes libraries to interact with many storage formats and systems in the Big Data landscape and is natively written in Scala.
You'll most likely be interested in using the eel-hbase module, which as of a few releases ago includes an HBaseSource class (as well as an HBaseSink). It's actually so recent that I just noticed the README still mentions that HBase is not supported. There are no explicit examples with Hive, but sources and sinks work in similar ways.
Kite
Another alternative could be Kite, which also has quite an extensive set of examples you can draw inspiration from (including with HBase), but it looks like a less active project than Eel.
Big Data frameworks
If you want a framework that helps you instead of brewing your own solution with libraries, there are options here too. Of course you'll have to account for some learning curve.
Spark
Spark is a fairly mature project, and the HBase project itself has built a connector for Spark 2.1.1 (Scaladocs here). Here is an introductory talk that might help.
The general idea is that you could use this custom data source as suggested in this example:
sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
This gives you access to HBase data through the Spark SQL API. Here is a short extract from the same example:
val df1 = withCatalog(cat1, conf1)
val df2 = withCatalog(cat2, conf2)
val s1 = df1.filter($"col0" <= "row120" && $"col0" > "row090").select("col0", "col2")
val s2 = df2.filter($"col0" <= "row150" && $"col0" > "row100").select("col0", "col5")
val result = s1.join(s2, Seq("col0"))
Performance considerations aside, as you can see, the language can feel pretty natural for data manipulation.
Flink
Two answers already dealt with Flink, so I won't add much more, except for a link to an example from the latest stable release at the time of writing (1.4.2) that you may be interested in having a look at.

what are the options for hadoop on scala

We are starting a big-data analytics project and we are considering adopting Scala (the Typesafe stack). I would like to know about the various Scala APIs/projects that are available for writing Hadoop map-reduce programs.
Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is also meant to be very compatible with the standard Scala collections API. Just as you can call flatMap, map, or groupBy on normal collections, you can do the same on Scalding Pipes, which you can imagine as a distributed List of tuples (see the short sketch after the list below). There's also a typed version of the API which provides stronger type-safety guarantees. I haven't used Scoobi, but its API seems similar.
Additionally, there are a few other benefits:
Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
It has several active contributors both inside and outside Twitter that are committed to making it great.
It is interoperable with your existing Cascading jobs.
In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
It provides a robust Matrix Library.
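As an illustration of that collections-like feel, here is a minimal word-count sketch using Scalding's typed API (the --input and --output argument names are assumptions):

import com.twitter.scalding._

// Minimal typed-API word count: read lines, split into words,
// group by word, count each group, and write tab-separated results.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(line => line.toLowerCase.split("\\s+").filter(_.nonEmpty))
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}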
I've had success with Scoobi. It's straightforward to use, strongly typed, hides most of the Hadoop mess (by doing things like automatically serializing your objects for you), and is totally Scala. One of the things I like about its API is that the designers wanted the Scoobi collections to feel just like the standard Scala collections, so you actually use them in much the same way, except that operations run on Hadoop instead of locally. This actually makes it pretty easy to switch between Scoobi collections and Scala collections while you're developing and testing.
I've also used Scrunch, which is built on top of the Java-based Crunch. I haven't used it in a while, but it's now part of Apache.
Twitter is investing a lot of effort into Scalding, including a nice Matrix library that could be used for various machine learning tasks. I need to give Scoobi a try, too.
For completeness, if you're not wedded to MapReduce, have a look at the Spark project. It performs far better in many scenarios, including in their port of Hive to Spark, appropriately called Shark. As a frequent Hive user, I'm excited about that one.
The first two I would likely investigate are Scalding (which builds on top of Cascading) and Scoobi. I have not used either, but Scalding, in particular, looks like it provides a really nice API.
Another option is Stratosphere. It offers a Scala API that converts the Scala types to Stratosphere's internal data types.
The API is quite similar to Scalding, but Stratosphere natively supports advanced data flows (so you don't have to chain MapReduce jobs). You will get much better performance with Stratosphere than with Scalding.
Stratosphere does not run on Hadoop MapReduce but on Hadoop YARN, so you can use your existing YARN cluster.
This is the word count example in Stratosphere (with the Scala API):
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

MapReduce implementation in Scala

I'd like to find a good and robust MapReduce framework to be utilized from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011):
There is also the Scoobi framework, which has awesome expressiveness.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark:
http://spark-project.org/
You get the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive MapReduce operations.
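A minimal sketch of what that looks like with the classic RDD API (the input path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Cache an RDD in memory once and reuse it across several actions,
// instead of re-reading the input from disk each time.
object InMemoryWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    val words = sc.textFile("hdfs:///path/to/input") // placeholder path
      .flatMap(_.split("\\s+"))
      .cache()

    val counts = words.map((_, 1)).reduceByKey(_ + _)
    println(s"distinct words: ${counts.count()}, total words: ${words.count()}")
    sc.stop()
  }
}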
You may be interested in scouchdb, a Scala interface to CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading works on Cascading's tuples (basically a list of field values with or without a separate schema), which means that type errors, e.g. joining a key as a String with a key as a Long, lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix streams: your code (in any language) just has to be able to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
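For example, a tiny Hadoop Streaming mapper could look like this in Scala (a sketch that emits tab-delimited (word, 1) pairs):

// Read lines from stdin and write tab-delimited (word, count) pairs to stdout,
// which is all that Hadoop Streaming requires of a mapper.
object StreamingMapper {
  def main(args: Array[String]): Unit = {
    for (line <- scala.io.Source.stdin.getLines(); word <- line.split("\\s+") if word.nonEmpty)
      println(s"$word\t1")
  }
}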
I've added a MapReduce implementation using Hadoop with a few test cases on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.