I have used SciSpark for reading numerous NetCDF files in Spark 2.x. However, I am unable to compile it for Spark 3.x (it fails with various type errors such as "cannot be applied to (Array[Int])", "found: Array[Long], required: Array[Int]", "cannot be applied to (AnyVal)", etc.), and I cannot even conclude whether it is compatible with Spark 3. Could anyone please advise whether:
SciSpark is compatible with Spark 3, and
there is any alternative to SciSpark for Spark 3.x?
It looks like https://github.com/SciSpark/SciSpark hasn't been updated to work with Spark 3, since the last commit was four years ago.
If you already have a Spark cluster, you can use Apache Sedona to read NetCDF files. Documentation about NetCDF files is sparse, but you can ask questions on their mailing list.
I have to migrate MapReduce code written in Pig and Java to Apache Spark & Scala as far as possible, and reuse or find an alternative where that is not possible.
I can find conversions to Spark for most of the Pig code. Now I have encountered Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how its plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are a few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Assuming RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should it be used to convert the Java code to Scala, removing the Java source code dependency? Will this be more optimal?
Cascading works on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there limitations or special features that can only be achieved by one or the other?
I am very new to the Big Data space, and not yet familiar with the concepts and comparisons behind all the related terminology: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on this Big Data responsibility with my new job profile and with minimal senior knowledge/experience to draw on. Please provide an explanatory answer if possible. Thanks.
Dataset gives better performance than DataFrame. Dataset provides Encoders and is type-safe, yet DataFrame is still in use. Is there any particular scenario in which only DataFrame is used, or is there any function that works on DataFrame but does not work on Dataset?
DataFrame is actually a Dataset[Row].
It also has many tools and functions associated with it that enable working with Row, as opposed to a generic Dataset[SomeClass].
This gives DataFrame the immediate advantage of being able to use these tools and functions without having to write them yourself.
DataFrame actually enjoys better performance than Dataset. The reason for this is that Spark can understand the internals of the built-in functions associated with DataFrame, and this enables Catalyst optimization (rearranging and changing the execution tree) as well as whole-stage code generation, which avoids a lot of the virtual function calls.
Furthermore, when writing Dataset functions, the relevant object type (e.g. a case class) needs to be constructed (which includes copying). This can be an overhead, depending on the usage.
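To illustrate the contrast, here is a minimal sketch (the Event case class and column names are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical record type used only for this example.
case class Event(id: Long, score: Double)

val spark = SparkSession.builder().appName("df-vs-ds-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Event(1L, 0.5), Event(2L, 1.5)).toDS()

// Typed Dataset transformation: the lambda is opaque to Catalyst, and each
// row has to be deserialized into an Event object before the predicate runs.
val typed = ds.filter(e => e.score > 1.0)

// Untyped DataFrame-style transformation: a built-in column expression that
// Catalyst understands, so it can rearrange the plan and use codegen.
val untyped = ds.toDF().filter(col("score") > 1.0)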
Another advantage of DataFrame is that its schema is set at run time rather than at compile time. This means that if you read, for example, from a Parquet file, the schema is determined by the content of the file. This makes it possible to handle dynamic cases (e.g. generic ETL).
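For example (a short sketch, assuming a SparkSession named spark is in scope and the file path is hypothetical):

// The schema is discovered at run time from the Parquet metadata, so the same
// code can process files whose columns are not known at compile time.
val df = spark.read.parquet("/data/events.parquet")
df.printSchema()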
There are probably more reasons and advantages but I think those are the important ones.
Now that SpyGlass is no longer being maintained, what is the recommended way to access HBase using Scala/Scalding? A similar question was asked in 2013, but most of the suggested links are either dead or point to defunct projects. The only link that seems useful is to Apache Flink. Is that considered the best option nowadays? Are people still recommending SpyGlass for new projects even though it isn't being maintained? Performance (massively parallel) and testability are priorities.
Based on my experience writing data to Cassandra using the Flink Cassandra connector, I think the best way is to use Flink's built-in connectors. Since Flink 1.4.3 you can use the HBase Flink connector. See here
I connect to HBase in Flink using Java. Just create the HBase Connection object in the open method and close it within the close method of a RichFunction (e.g. a RichSinkFunction). These methods are called once by each Flink slot.
I think you can do something like this in Scala too.
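As a rough sketch of that pattern in Scala (the table name, column family, and the (rowKey, value) element type below are made up for illustration, and the exact class names may vary across Flink and HBase client versions):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical sink that writes (rowKey, value) pairs to an HBase table.
class HBaseSink(tableName: String, columnFamily: String) extends RichSinkFunction[(String, String)] {
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  // Called once per Flink slot: open the HBase connection here.
  override def open(parameters: Configuration): Unit = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf(tableName))
  }

  // Called for every element of the stream.
  override def invoke(value: (String, String)): Unit = {
    val put = new Put(Bytes.toBytes(value._1))
    put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("value"), Bytes.toBytes(value._2))
    table.put(put)
  }

  // Called once when the task shuts down: close the connection here.
  override def close(): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}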
It depends on what you mean by "recommended", I guess.
DIY
Eel
If you just want to access data on HBase from a Scala application, you may want to have a look at Eel, which includes libraries to interact with many storage formats and systems in the Big Data landscape and is natively written in Scala.
You'll most likely be interested in using the eel-hbase module, which as of a few releases ago includes an HBaseSource class (as well as an HBaseSink). It's actually so recent that I just noticed the README still mentions that HBase is not supported. There are no explicit examples with Hive, but sources and sinks work in similar ways.
Kite
Another alternative could be Kite, which also has a quite extensive set of examples you can draw inspiration from (including with HBase), but it looks like a less active project than Eel.
Big Data frameworks
If, instead of brewing your own solution with libraries, you want a framework that helps you, the following are options. Of course, you'll have to account for some learning curve.
Spark
Spark is a fairly mature project and the HBase project itself has built a connector for Spark 2.1.1 (Scaladocs here). Here is an introductory talk that may help you.
The general idea is that you could use this custom data source as suggested in this example:
sqlContext
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
This gives you access to HBase data through the Spark SQL API. Here is a short extract from the same example:
val df1 = withCatalog(cat1, conf1)
val df2 = withCatalog(cat2, conf2)
val s1 = df1.filter($"col0" <= "row120" && $"col0" > "row090").select("col0", "col2")
val s2 = df2.filter($"col0" <= "row150" && $"col0" > "row100").select("col0", "col5")
val result = s1.join(s2, Seq("col0"))
Performance considerations aside, as you can see, the language can feel pretty natural for data manipulation.
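For context, the withCatalog helper used in that extract is, roughly, the data source read shown earlier wrapped in a small function. This is a sketch based on that snippet rather than the exact code from the linked example:

// Imports as used by the hbase connector data source referenced above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// Builds a DataFrame from an HBase table catalog definition plus an
// HBase configuration file path.
def withCatalog(cat: String, conf: String): DataFrame =
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()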
Flink
Two answers already dealt with Flink, so I won't add much more, except for a link to an example from the latest stable release at the time of writing (1.4.2) that you may be interested in having a look at.
While searching for the best serialization techniques for Apache Spark, I found the link below:
https://github.com/scala/pickling#scalapickling
which states that serialization in Scala will be faster and automatic with this framework.
And Scala Pickling has the following advantages (Ref - https://github.com/scala/pickling#what-makes-it-different).
So I wanted to know whether this Scala Pickling (PickleSerializer) can be used in Apache Spark instead of KryoSerializer.
If yes, what changes need to be made? (An example would be helpful.)
If not, why not? (Please explain.)
Thanks in advance. And forgive me if I am wrong.
Note: I am using Scala to write my Apache Spark (version 1.4.1) application.
I visited Databricks for a couple of months in 2014 to try and incorporate a PicklingSerializer into Spark somehow, but couldn't find a way to include the type information needed by scala/pickling into Spark without changing interfaces in Spark. At the time, it was a no-go to change interfaces in Spark. E.g., RDDs would need to include Pickler[T] type information in their interface in order for the generation mechanism in scala/pickling to kick in.
All of that changed though with Spark 2.0.0. If you use Datasets or DataFrames, you get so-called Encoders. This is even more specialized than scala/pickling.
Use Datasets in Spark 2.x. They are much more performant on the serialization front than plain RDDs.
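As a minimal sketch of what that looks like in Spark 2.x (the Record case class and values are made up for illustration):

import org.apache.spark.sql.SparkSession

// Hypothetical record type; an Encoder for it is derived automatically.
case class Record(id: Long, name: String)

val spark = SparkSession.builder().appName("encoders-sketch").getOrCreate()
import spark.implicits._  // brings the implicit Encoders into scope

// The Dataset is serialized via its Encoder into Spark's internal binary
// format, instead of relying on Java/Kryo serialization as plain RDDs do.
val ds = Seq(Record(1L, "a"), Record(2L, "b")).toDS()
ds.filter(_.id > 1L).show()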
I'm using pyspark 1.6.0.
I have existing pyspark code to read binary data files from an AWS S3 bucket. Other Spark/Python code then parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.
In PYSPARK I read the binary file using:
sc.binaryFiles("s3n://.......")
This is working great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent pyspark streaming API to read binary files as a stream (hopefully with the filename too, if possible).
I tried:
binaryRecordsStream(directory, recordLength)
but I couldn't get this working...
Can anyone shed some light on how pyspark streaming can read a binary data file?
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java, but not in Python, as noted here in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API.
I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API is only for fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
    sc.fileStream("s3n://<bucket>/<folder>",
        String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PYSPARK.