Best practice in Spark to filter a dataframe, execute different actions on the resulting dataframes and then union the new dataframes back - scala

Since I am new to Spark, I would like to ask about a pattern that I am using but don't know whether it is bad practice: splitting a dataframe in two based on a filter, executing different actions on each part, and then unioning them back together.
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering it twice like this is bad practice and whether I should use some other Spark pattern I may not be aware of. In other cases I even need to apply 3 or 4 filters, run different actions on the individual dataframes, and then union them all back.
Kind regards,

There are several key points to take into account when you start to use Spark to process large amounts of data, in order to analyze your performance:
Spark parallelism depends on the number of partitions in your distributed in-memory representations (RDDs or Dataframes). That means the processing (Spark actions) will be executed in parallel across the cluster. But note that there are two main kinds of transformations: narrow transformations and wide transformations. The former are operations that execute without a shuffle, so the data doesn't need to be reallocated to different partitions, avoiding data transfer among workers. The latter do require a shuffle: for example, if you want to perform a distinct by a specific key, Spark must reorganize the data across partitions in order to detect the duplicates. Take a look at the documentation.
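As a minimal illustration (assuming a hypothetical dataframe df with a boolean column flag), a filter is a narrow transformation that runs partition by partition, while distinct is a wide one that needs a shuffle, because equal rows may live in different partitions:
import org.apache.spark.sql.functions.col

val filtered = df.filter(col("flag") === true) // narrow: each partition is filtered independently, no shuffle
val deduped = filtered.distinct()              // wide: requires a shuffle to bring duplicate rows together
deduped.explain()                              // the physical plan shows an Exchange (shuffle) introduced by distinct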
Regarding doing more or fewer filter transformations:
Spark is based on a lazy evaluation model, meaning that the transformations you execute on a dataframe are not run until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So if you have five or six filter operations, it will never traverse the dataframe six times (in contrast to some other dataframe frameworks); the optimizer will collapse your filter operations into a single one. Here are some details.
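For example (a small sketch, assuming a dataframe df with numeric columns a and b), chaining several filters still produces a single combined predicate in the physical plan:
import org.apache.spark.sql.functions.col

val result = df
  .filter(col("a") > 0)
  .filter(col("b") < 100)
  .filter(col("a") < 1000)

result.explain() // one Filter node with the three predicates AND-ed together, not three passes over the data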
So keep in mind that Spark is a distributed in-memory data processor, and it is a must to know these details, because you can spawn hundreds of cores over hundreds of GBs.

The efficiency of this approach depends heavily on how much you can reduce the overlap between the data files scanned by the two splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe in two by filtering on a partitioned column, each dataframe will scan only its corresponding portion of the data. In this case, your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format that supports filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are spread across many files, the pushdown will be inefficient, since data is skipped on a per-file basis and a file that contains values for both splits will be scanned twice. Writing the data sorted by the filtered column can improve the efficiency of the pushdown (on read) by gathering the same values into fewer files.
As long as you manage to split your dataframe using the above techniques and minimize the amount of overlap between the splits, this approach will be more efficient (see the sketch below).
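A minimal sketch of both techniques, assuming a SparkSession named spark, a hypothetical flag column used for the split, and illustrative output paths:
import org.apache.spark.sql.functions.col

// Partitioning: each split reads only its own partition directory (partition pruning)
df.write.partitionBy("flag").parquet("/tmp/data_partitioned")
val dfTrue  = spark.read.parquet("/tmp/data_partitioned").filter(col("flag") === true)
val dfFalse = spark.read.parquet("/tmp/data_partitioned").filter(col("flag") === false)

// Predicate pushdown: a global sort on the filtered column clusters equal values into fewer files,
// so the Parquet min/max statistics let whole files be skipped on read
df.sort("flag").write.parquet("/tmp/data_sorted")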

Related

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the core ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). Now I know it isn't actually computed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
...and so on, each one would take a long time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that I could run these new actions as quickly as on the original sets? (First I thought of 'collect', but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join 2 DataFrames, Spark doesn't do any work, because it evaluates lazily and only runs when an action is called on the resulting dataframe. Actions are show, count, print, etc.
Now, when show or count is called on nDF, Spark evaluates the resulting dataframe every time, i.e. once when you call show, again when count is called, and so on. This means that internally it performs the map/reduce work every time an action is called on the resulting dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call count/show, it will evaluate nDF once and store the resulting dataframe in memory, so subsequent actions will be faster.
However, the first evaluation might be a little slower, and you will also need a little more executor memory.
If the memory available to you is sufficient for the size of your dataset, what you're probably looking for is df.cache(). If your dataset is too big for that, consider using df.persist(), as it allows different levels of persistence.
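A small sketch of the idea, reusing a and b from above (the join key "id" and the storage level are just illustrative choices):
import org.apache.spark.storage.StorageLevel

val nDF = a.join(b, Seq("id")).persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if it doesn't fit in memory
nDF.count()     // first action: computes the join and fills the cache
nDF.show()      // subsequent actions reuse the cached data
nDF.unpersist() // free the memory when you're done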
Hope this is what you're looking for. Cheers

Spark: Is it possible to load an RDD from multiple files in different formats?

I have a heterogeneously-formatted input of files, in batch mode.
I want to run a batch over a number of files. These files are of different formats, and they will have different mappings to normalize data (e.g. extract fields with different schema names or positions in the records, to a standard naming).
Given the tabular nature of the data, I'm considering using Dataframes (cannot use datasets due to the Spark version I'm bound to).
In order to apply different extraction logic to each file - does each file need to be loaded into a separate dataframe, then have the extraction logic applied (extraction being a process that differs per file type, configured in terms of e.g. CSV/JSON/XML, the position of the fields to select (CSV), the name of the field to select (JSON), etc.), and then the datasets joined?
That would force me to iterate over the files, act on each dataframe separately, and join the dataframes afterwards, instead of applying the same (configurable) logic to all of them.
I know I could do it with RDDs, i.e. loading all files into an RDD, emitting a PairRDD[fileId, record], and then running a map where you look up the fileId to get the configuration for that record, which tells you which logic to apply.
I'd rather use Dataframes, for all the niceties they offer over raw RDDs in terms of performance, simplicity and parsing.
Is there a better way to use Dataframes to address this problem than the one already explained? Any suggestions or misconceptions I may have?
I'm using Scala, though it should not matter to this problem.
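For what it's worth, a minimal sketch of the per-file-type approach described above, assuming a Spark 2.x-style SparkSession (on 1.x the same idea works with sqlContext.read and unionAll); the FileConfig case class and the field names are purely illustrative:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// hypothetical per-file configuration: location, format, and how its fields map to the standard names
case class FileConfig(path: String, format: String, idField: String, valueField: String)

def loadNormalized(spark: SparkSession, cfg: FileConfig): DataFrame = {
  val raw = cfg.format match {
    case "csv"  => spark.read.option("header", "true").csv(cfg.path)
    case "json" => spark.read.json(cfg.path)
  }
  // normalize to a standard schema so the dataframes can be combined afterwards
  raw.select(col(cfg.idField).as("id"), col(cfg.valueField).as("value"))
}

// load each file with its own extraction logic, then combine everything into one dataframe
val configs: Seq[FileConfig] = Seq(/* ... one entry per input file ... */)
val normalized = configs.map(loadNormalized(spark, _)).reduce(_ union _)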

Spark: understanding the DAG and forcing transformations

Hello stackoverflow community.
I'm asking for your help in understanding whether my thoughts are correct or I'm missing some points in my Spark job.
I currently have two rdds that I want to subtract.
Both the rdds are built as different transformations on the same father RDD.
First of all, the father RDD is cached after it is obtained:
val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache
Then the two rdds are transformed.
One is (pseudocode):
son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct
The other one is :
son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct
The two sons are then subtracted to obtain only the keys in son1:
son1.subtract(son2)
I would expect the sequence of the transformations to be the following:
repartition
mapPartitions
caching
Then, starting from the cached data, the filter, map and distinct on both rdds, and finally the subtract.
This is not happening: what I see is the two distinct operations running in parallel, apparently not exploiting the benefits of caching (there are no skipped tasks), and taking almost the same computation time.
Below is the image of the DAG taken from the Spark UI.
Do you have any suggestions for me?
You are correct in your observations. Transformations on RDDs are lazy, so caching will happen after the first time the RDD is actually computed.
If you call an action on your parent RDD, it should be computed and cached. Then your subsequent operations will operate on the cached data.
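In other words, a small sketch of forcing the cache to be populated before deriving the two sons (the extra count is the only addition to the code from the question):
val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache()
fatherRdd.count()   // action: materializes the RDD and fills the cache

val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct()
val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct()
son1.subtract(son2) // both sons now read the cached father (skipped stages in the UI)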

Apache Spark: Why are many small dataframes so much slower than few big ones?

So I am having the following scenario:
I have a service where I do some calculations on parameters in a dataframe, for example the describe() operation. I get the parameters via an HTTP POST (Array[String] + schema) and read them in via the read.json function on the SQL context.
I can either get it as one big dataframe with 10,000 parameters or as 10,000 small dataframes with just one parameter each. Each has around 12,000 rows with timestamps.
In the end I need to collect the dataframe(s) to send them to a different service for further calculations. It would be easier to do it parameter-wise, because of the way the input is created.
But I found out that doing the collecting/converting to JSON on the many small dataframes is way more expensive than on the one huge dataframe.
The big dataframe takes about 6 seconds, while all the small ones together take at least 20 seconds. For a single input this does not seem so important, but I want to do it on at least 3,000 of these 10,000-parameter inputs.
Why is that? It does not seem to be a difference in the calculation, but rather the difference between collecting once vs many times.
When you call collect(), Spark must submit a job and send the data to one node (the driver).
Let's consider 10 DataFrames of n elements and one with 10n elements.
One big DataFrame -> one collect() -> 10n elements are sent, one execution plan and one job created
10 DataFrames -> 10 * collect() -> 10 * n elements are sent, 10 execution plans need to be generated and 10 jobs submitted.
Of course it also depends on the hardware and network; i.e. if a small DataFrame fits on one node, collecting it may be faster than sending data over the network.
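A minimal sketch of where the overhead comes from (bigDf, smallDfs and the column name group are hypothetical stand-ins for the 10,000 parameters):
import org.apache.spark.sql.functions.count

// one job: a single execution plan and a single collect over the big dataframe
val oneBig = bigDf.groupBy("group").agg(count("*")).collect()

// 10,000 jobs: an execution plan, a job submission and a collect per small dataframe
val manySmall = smallDfs.map(df => df.agg(count("*")).collect())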

Custom scalding tap (or Spark equivalent)

I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, with a custom file format.
What I would like to do is more or less the following:
start from a distributed list of records, such as a Scalding pipe or similar
group items by some computed function
make it so that items belonging to the same group reside on the same server
on each group, apply a transformation - which involves sorting - and write the result to disk. In fact I need to write a bunch of MapFiles - which are essentially sorted SequenceFiles, plus an index.
I would like to implement the above with Scalding, but I am not sure how to do the last step.
While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
I recognize it is a bad idea to sort very large data, which is why, even on a single server, I plan to split the data into chunks.
Is there any way to do something like that with Scalding? Possibly I would be OK with using Cascading directly, or really any other pipeline framework, such as Spark.
Using Scalding (and the underlying Map/Reduce) you will need to use the TotalOrderPartitioner, which does pre-sampling to create appropriate buckets/splits of the input data.
Using Spark will speed things up thanks to faster access paths to the data on disk. However, it will still require shuffles to disk/HDFS, so it will not be orders of magnitude better.
In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:
import org.apache.spark.RangePartitioner

val allData = sc.hadoopRdd(paths) // load a pair RDD of (key, record); shorthand for e.g. sc.sequenceFile or sc.hadoopRDD(...)
val partitionedRdd = allData.partitionBy(new RangePartitioner(numPartitions, allData))
val groupedRdd = partitionedRdd.groupByKey()
// apply further transforms..