Spark Scala coding standards

I am reaching out to the community to understand the impact of coding in certain ways in scala for Spark. I received some review comments that I feel need discussion. Coming from a traditional Java and OOP background, I am writing my opinion and questions here. I would appreciate if you could chime in with your wisdom. I am in a Spark 1.3.0 environment.
1. Use of for loops: Is it against the rules to use for loops?
There are distributed data structures like RDDs and DataFrames in Spark. We should not be collecting and using for loops on them, as the computation will end up happening on the driver node alone. This will have adverse effects, especially if the data is large.
But if I have a utility map that stores parameters for the job, it is fine to use a for loop on it if desired. Using a for loop or a map on the iterable is a coding choice. It is important to understand that this map is different from map on a distributed data structure: it will still run on the driver node alone.
2. Use of var vs val
val is an immutable reference to an object and var is a mutable reference. Consider the example below:
val driverDf = {
var df = dataLoader.loadDriverInput()
df = df.sqlContext.createDataFrame(df.rdd, df.schema)
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
Even though we have used var for the df, driverDf is an immutable reference to the originally created data frame. This kind of use for var is perfectly fine.
Similarly the following is also fine.
var driverDf = dataLoader.loadDriverInput()
driverDf = applyTransformations(driverDf)
def applyTransformations(driverDf: DataFrame) = {...}
Are there any generic rules that say vars cannot be used in a Spark environment?
3. Use of if-else vs case, not throw exceptions
Is it against standard practices to not throw exceptions or not to use if-else?
4. Use of hive context vs sql context
Are there any performance implications of using SQLContext vs HiveContext (I know HiveContext extends SQLContext) for the underlying Hive tables?
Is it against standards to create multiple HiveContexts in the program? My job iterates through a part of a whole data frame of values every time. The whole data frame is cached in one HiveContext. Each iteration's data frame is created from the whole data using a new HiveContext and cached. This cache is purged at the end of the iteration. This approach gave me performance improvements in Spark 1.3.0. Is this approach breaking any standards?
I appreciate the responses.

Regarding loops, as you correctly mentioned, you should prefer RDD map to perform operations in parallel and on multiple nodes. For smaller iterables, you can go with a for loop. Again, it comes down to the driver memory and the time it takes to iterate.
For smaller sets of around 100 elements, the distributed way of handling will incur unnecessary network usage rather than giving a performance boost.
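To make the driver-side case concrete, here is a minimal sketch in plain Scala (no Spark involved). The parameter names and paths are made up for illustration; the point is only that a for comprehension and a map over a small local collection give the same result and both run on the driver.

```scala
object DriverSideLoop {
  // Hypothetical job parameters: a small, local Scala Map that lives on the driver.
  val jobParams: Map[String, String] = Map(
    "inputPath"  -> "/data/in",
    "outputPath" -> "/data/out"
  )

  // Style 1: a for comprehension over the local collection.
  def describeWithFor(params: Map[String, String]): List[String] =
    (for ((key, value) <- params.toList) yield s"$key=$value").sorted

  // Style 2: map over the same local collection; the result is identical.
  def describeWithMap(params: Map[String, String]): List[String] =
    params.toList.map { case (key, value) => s"$key=$value" }.sorted
}
```

Neither version touches an RDD or DataFrame, so the choice between them is purely about readability.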
val or var is a choice at the Scala level rather than the Spark level. I have never heard of such a rule. It depends on your requirements.
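As a rough illustration that this is a Scala-level style choice, the sketch below does the same work once with a reassigned var and once with vals only. It uses a List[Int] as a stand-in for a DataFrame so it runs without Spark, and the two transformation steps are hypothetical.

```scala
object ValVsVar {
  // Stand-in for a DataFrame so the sketch runs without Spark.
  type Data = List[Int]

  // Hypothetical transformation steps.
  val dropNegatives: Data => Data = _.filter(_ >= 0)
  val double: Data => Data = _.map(_ * 2)

  // Style 1: a local var that is reassigned after each step.
  def withVar(input: Data): Data = {
    var current = input
    current = dropNegatives(current)
    current = double(current)
    current
  }

  // Style 2: vals only; the steps are folded over the input in order.
  def withVal(input: Data): Data =
    List(dropNegatives, double).foldLeft(input)((data, step) => step(data))
}
```

Both functions return the same value; the var never escapes the method, so callers cannot observe the mutation either way.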
Not sure what you asked. The only major negative of using if-else is that it becomes cumbersome, especially when handling nested if-else. Apart from that, all should be fine. An exception can be thrown based on a condition. I see that as one of many ways to handle issues in an otherwise happy-path flow.
As mentioned here, the compiler generates more bytecode for match..case than for a simple if. So it comes down to a simple condition check (if) versus code readability for a complex condition check (match).
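A small sketch of that trade-off, in plain Scala. The function names and thresholds are invented for illustration; the last function shows one common way to surface an error condition as a value with Try rather than letting the exception propagate.

```scala
import scala.util.{Failure, Success, Try}

object Branching {
  // Simple two-way check: if-else is the shortest form.
  def sign(n: Int): String =
    if (n < 0) "negative" else "non-negative"

  // Several cases with guards: match tends to read better than nested if-else.
  def bucket(n: Int): String = n match {
    case 0           => "zero"
    case x if x < 0  => "negative"
    case x if x < 10 => "small"
    case _           => "large"
  }

  // A failed condition becomes a Failure value instead of a thrown exception.
  def parsePositive(s: String): Try[Int] =
    Try(s.toInt).flatMap { n =>
      if (n > 0) Success(n)
      else Failure(new IllegalArgumentException(s"expected a positive number, got $n"))
    }
}
```

Which form to use is a readability decision; nothing here is specific to Spark.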
HiveContext gives the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. Please note that in Spark 2.0, both HiveContext and SQLContext are replaced by SparkSession.

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is attempting to use spark to generate data but then work with the data internally. I.e., I have a program that does a thing, and it generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes, and have them each contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside of the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
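The seed-and-flatMap idea can be sketched roughly as follows, assuming Spark 2.0+ with a SparkSession available. GeneratedRow, the random value, and the parameter names are hypothetical placeholders for whatever the real generator produces.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object SeedGenerator {
  // Hypothetical row type for the generated data.
  final case class GeneratedRow(seed: Long, index: Int, value: Double)

  // Each seed id expands into `rowsPerSeed` rows on whichever executor holds it.
  def generate(spark: SparkSession, seeds: Long, rowsPerSeed: Int): Dataset[GeneratedRow] = {
    import spark.implicits._
    spark.range(0, seeds) // distributed "seed" dataset with ids 0 until seeds
      .flatMap { id =>
        val seed = id.longValue()
        val rng = new scala.util.Random(seed) // deterministic per seed
        (0 until rowsPerSeed).map(i => GeneratedRow(seed, i, rng.nextDouble()))
      }
  }
}
```

The resulting Dataset could then be written out with the usual writer API (for example `.write.parquet(...)`), so the executors store the rows directly and nothing has to pass through the driver.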

scala rapids using an opaque UDF for a single column dataframe that produces another column

I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The primary contention in being able to use GPU appears to arise from the blackbox nature of UDFs. An automatic solution would be the Scala UDF compiler. But it won't work with cases where there are loops.
Doubt: Would I be able to get GPU contribution if my dataframe has only one column and produces another column, as this is a trivial case? If so, at least in some cases, even with no change in Spark code, the GPU performance benefit can be attained, even in cases where the size of the data is much higher than GPU memory. This would be great, as sometimes it would be easy to simply merge all columns into one, making a single column of WrappedArray using concat_ws that a UDF can simply convert into an Array. For all practical purposes, the data is then already in columnar fashion for the GPU, and only negligible overhead for the row (on CPU) to column (on GPU) conversion is incurred. The case I am referring to would look like:
val newDf = df.withColumn(colB, opaqueUdf(col("colA")))
Resources: I tried to find good sources/examples to learn a Spark-based approach for using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource/tutorial that gives some sample examples of converting Spark UDFs to make them RAPIDS compatible?
Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there isn't a direct expression that we can translate them to.
It all depends on how heavy opaqueUdf is, and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it's doing many arithmetic or string operations successively on that column). I am not sure why you want to "merge all columns into one", so can you clarify why you want to do that? On the conversion to Array, is that the purpose of the UDF, or are you wanting to take in N columns -> perform some operation likely involving loops -> produce an Array?
Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are Java or Scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF document also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way. Please take a look there as well.

What happens to a Spark DataFrame used in Structured Streaming when its underlying data is updated at the source?

I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring what the possibilities are. I am working with Spark 2.3.2.
A big (set of) question(s).
I have not implemented all aspects myself (yet), but this is my understanding and one set of info from colleagues who performed an aspect that I found compelling and also logical. I note that there is not enough info out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices as per Databricks are applied and .cache is used, the Spark Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles, and the program does not fail.
If standard coding practices as per Databricks are applied and caching is NOT used, the Spark Structured Streaming program will read the static source on every cycle, and all changes will be seen on subsequent processing cycles from then on.
But JOINing with LARGE static sources is not a good idea. If the dataset is large, use HBase or some other key-value store with mapPartitions, whether the data is volatile or non-volatile. This is more difficult, though. It was done by an airline company I worked at, and it was no easy task, the data engineer and designer told me. Indeed, it is not that easy.
So, we can say that updates to static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could make the static source a DataFrame held in a var, and use a counter on the driver. As the micro-batch physical plan is evaluated and generated every time, there is no issue with broadcast join aspects or optimization, in my view. Whether this is the most elegant approach is debatable, and it is not my preference.
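One way to picture the "var plus a counter on the driver" idea is a small driver-side holder that reloads its value once it is older than some maximum age. This is only a sketch under assumptions: the class name, the `load` function, and the `now` clock hook are all hypothetical, and how the refreshed value is wired into the running streaming query (for example by rebuilding or restarting the query) is outside its scope.

```scala
// Driver-side holder that reloads its value once it is older than maxAgeMillis.
// `load` and `now` are hypothetical hooks: in a real job `load` might be
// () => spark.read.parquet(path) and `now` might be () => System.currentTimeMillis().
final class Refreshing[T](load: () => T, maxAgeMillis: Long, now: () => Long) {
  private var loadedAt: Long = now()
  private var current: T = load()

  // Returns the held value, reloading it first if it has expired.
  def get: T = synchronized {
    if (now() - loadedAt >= maxAgeMillis) {
      current = load()
      loadedAt = now()
    }
    current
  }
}
```

Passing the clock in as a function keeps the holder easy to test; with a daily refresh, maxAgeMillis would be 24 hours expressed in milliseconds.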
If your data is small enough, the alternative is to read using a JOIN and thus perform the lookup via the primary key augmented with some max value in a technical column that is added to the key to make the primary key a compound primary key, with the data updated in the background as a new set of data rather than overwritten. This is the easiest in my view if you know the data is volatile and small. Versioning means others may still read older data. That is why I state this; it may be a shared resource.
The final say for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In this case I would use a KV store as a lookup using mapPartitions as opposed to JOIN. See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is an old but still applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience and the ability to see the forest for the trees.

Map on dataframe takes too long [duplicate]

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried to put cache() with the map call but that still doesn't do the trick. My map method actually uploads results to HDFS. So, it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
TL;DR:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and has clean semantics. What is also important is that, unlike map, it doesn't imply referential transparency.
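A minimal sketch of the difference, assuming Spark 2.0+ and a local SparkSession. An accumulator stands in for the real side effect (the HDFS upload in the question), and the object and accumulator names are made up.

```scala
import org.apache.spark.sql.SparkSession

object ForceExecution {
  // Returns (side effects seen after map alone, side effects seen after foreach).
  def demo(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val touched = sc.longAccumulator("touched") // stand-in for a real side effect
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Transformation only: no action ever consumes `mapped`, so nothing runs.
    val mapped = rdd.map { x => touched.add(1); x * 2 }
    val afterMap = touched.value.longValue()

    // foreach is an action: it triggers a job and exists for side effects.
    rdd.foreach(_ => touched.add(1))
    (afterMap, touched.value.longValue())
  }
}
```

For side effects that need per-partition setup, such as opening one connection or file per partition, foreachPartition follows the same pattern.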

Convert a JavaPairRDD into list without collect() [duplicate]

We know that if we need to convert an RDD to a list, then we should use collect(). But this function puts a lot of stress on the driver (as it brings all the data from the different executors to the driver), which causes performance degradation or worse (the whole application may fail).
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
Basically, in the current scenario where we deal with huge amounts of data in batch or stream processing, APIs like collect() and collectAsMap() have become completely useless in a real project with a real amount of data. We can use them in demo code, but that is all these APIs are good for. So why have an API which we cannot even use (or am I missing something)?
Can there be a better way to achieve the same result through some other method, or can we implement collect() and collectAsMap() in a more effective way other than just calling
List<String> myList= RDD.collect.toList (which affects performance)
I searched Google but could not find anything effective. Please help if someone has a better approach.
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.
Well, technically you could implement List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.
So why to have an API which we can not even use (Or am I missing something).
collect is intended for cases where the large RDDs are only inputs or intermediate results, and the output is small enough to fit on the driver. If that's not your case, use foreach or other actions instead.
As you want to collect the data in a Java collection, the data has to be collected on a single JVM, because Java collections are not distributed. There is no way to get all the data into a collection without fetching the data. The interpretation of the problem space is wrong.
collect and similar are not meant to be used in normal spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
You need to keep your data inside the RDD, and use RDD transformations and actions without ever taking the data out. Methods like collect, which pull your data out of Spark and onto your driver, defeat the purpose and undo any advantage that Spark might be providing, since you are now processing all of your data on a single machine anyway.
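For illustration, here are a few standard RDD calls that bring back less than collect does. The object and helper names are hypothetical; take, fold, and toLocalIterator are regular RDD methods.

```scala
import org.apache.spark.rdd.RDD

object AvoidCollect {
  // Bring back only a bounded sample instead of the whole RDD.
  def preview(rdd: RDD[String], n: Int): Array[String] = rdd.take(n)

  // Aggregate on the executors; only a single number reaches the driver.
  def totalLength(rdd: RDD[String]): Long =
    rdd.map(_.length.toLong).fold(0L)(_ + _)

  // Stream to the driver one partition at a time rather than all at once.
  def longestOnDriver(rdd: RDD[String]): Option[String] = {
    val it = rdd.toLocalIterator
    if (it.hasNext) Some(it.maxBy(_.length)) else None
  }
}
```

toLocalIterator still moves every element through the driver, only with memory bounded by the largest partition, so it suits a sequential pass rather than building a full java.util collection.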