Apache Spark: Why are many small dataframes so much slower than few big ones? - scala

So I am having the following scenario:
I have a service where I do some calculation on parameters in a dataframe. For example I am doing the describe() operation. I got the parameters via and http-post (Array[String]+schema) and read them in via read.json function on the sql context.
I can either get it in one one big dataframe with 10.000 parameters or in 10.000 small dataframes with just one parameter. Each having around 12.000 rows with timestamps.
In the end I need to collect the dataframe(s) to send it to a different service for further calculations. It would be easier to to it parameter wise, because of the way the input is created.
But I figured out, that doing the collecting/converting to json on the many small dataframes is way more expensive than on the one huge dataframe.
For the big dataframe taking about 6 seconds and all small ones at least 20 seconds. For one this does not seem to be so important but I want to do it on at least 3000 of these 10.000 parameter inputs.
Why is that so? It does not seem to be the difference in calculation, but the difference in collecting it once vs many times.

When you call collect(), Spark must submit job and send data to one node.
Let's consider 10 DataFrames of n elements and one with 10n elements.
One big collect() -> 10n data is sent, one execution plan and one job created
10 DataFrames -> 10*collect() -> 10*n data is sent, 10 execution plans needs to be generated and 10 jobs submited.
Of course it depends also on hardware and network, i.e. if you can have small DataFrame on one node then it may be faster than sending over network.

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark I would like to ask a question about a pattern that I am using in Spark but don't know if it's a bad practice ( splitting a dataframe in two based on a filter, execute different actions on them and then joining them back ).
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has milions of rows I am curious if this filtering twice is a bad practice and I should use some other pattern in Spark which I may not be aware of. In other cases I even need to do 3,4 filters and then apply different actions to individual data frames and then union them all back.
Kind regards,
There are several key points to take into account when you start to use Spark to process big amounts of data in order to analyze our performance:
Spark parallelism depends of the number of partitions that you have in your distributed memory representations(RDD or Dataframes). That means that the process(Spark actions) will be executed in parallel across the cluster. But note that there are two main different kind of transformations: Narrow transformations and wide transformations. The former represent operations that will be executed without shuffle, so the data donĀ“t need to be reallocated in different partitions thus avoiding data transfer among workers. Consider that if you what to perform a distinct by a specific key Spark must reorganize the data in order to detect the duplicates. Take a look to the doc.
Regarding doing more or less filter transformations:
Spark is based on a lazy evaluation model, it means that all the transformations that you executes on a dataframe are not going to be executed unless you call an action, for example a write operation. And the Spark optimizer evaluates your transformations in order to create an optimized execution plan. So, if you have five or six filter operations it will never traverse the dataframe six times(in contrast to other dataframe frameworks). The optimizer will take your filtering operations and will create one. Here some details.
So have in mind that Spark is a distributed in memory data processor and it is a must to know these details because you can spawn hundreds of cores over hundred of Gbs.
The efficiency of this approach highly depends on the ability to reduce the amount of the overlapped data files that are scanned by both the splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe into 2 based on a partitioned column filtering, each dataframe will scan only the corresponding portion of the data. In this case, your approach will be perform really well as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet for example) allows reading only the files that contains records with values matching the condition. In case that the values of the filtered column are distributed across many files, the filter pushdown will be inefficient since the data is skipped on a file basis and if a certain file contains values for both the splits, it will be scanned twice. Writing the data sorted by the filtered column might improve the efficiency of the filter pushdown (on read) by gathering the same values into a fewer amount of files.
As long as you manage to split your dataframe, using the above techniques, and minimize the amount of the overlap between the splits, this approach will be more efficient.

Spark Scala - processing different child dataframes parallely in bulk

I am working on a fraudulent transaction detection project which makes use of spark and primarily uses rule-based approach to risk score incoming transactions. For this rule based approach, several maps are created from the historical data to represent the various patterns in transactions and these are then used later while scoring the transaction. Due to rapid increase in data size, we are now modifying code to generate these maps at each account level.
earlier code was for eg.
createProfile(clientdata)
but now it becomes
accountList.map(account=>createProfile(clientData.filter(s"""account=${account}""")))
Using this approach , the profiles are generated but since this operations are happening sequentially , hence it doesn't seem to be feasible.
Also, createProfile function itself is making use of dataframes, sparkContext/SparkSessions hence, this is leading to the issueof not able to send these tasks to worker nodes as based on my understanding only driver can access the dataframes and sparkSession/sparkContext. Hence , the following code is not working
import sparkSession.implicit._
val accountListRdd=accountList.toSeq.toDF("accountNumber")
accountList.rdd.map(accountrow=>createProfile(clientData.filter(s"""account=${accountrow.get(0).toString}""")))
The above code is not working but represents the logic for the desired output behaviour.
Another approach, i am looking at is using multithreading at driver level using scala Future .But even in this scenario , there are many jvm objects being created in a single createProfile function call , so by increasing threads , even if this approach works , it can lead to a lot jvm objects, which itself canlead to garbage collection and memory overhead issues.
just to put timing perspective, createProfile takes about 10 min on average for a single account and we have 3000 accounts , so sequentially it will take many days. With multi threading even if we achieve a factor of 10 , it will take many days. So we need parallelism in the order of 100s .
One of things that could have worked in case it existed was ..lets say if there is Something like a spark groupBy within a groupBY kind of operation, where at first level we can group By "account" and then do other operations
(currently issue is UDF won't be able to handle the kind of operations we want to perform)
Another solution if practically possible is the way SPark Streaming works-
it has a forEachRDD method and also spark.streaming.concurrentjobs parameter which allows processing of multiple RDDs in parallel . I am not sure how it works but maybe that kind of implementation may help.
Above is the problem description and my current views on it.
Please let me know if anyone has any idea about this! Also ,I will prefer a logical change rather than suggestion of different technology

pyspark dataframe writing results

I am working on a project where i need to read 12 files average file size is 3 gb. I read them with RDD and create dataframe with spark.createDataFrame. Now I need to process 30 Sql queries on the dataframe most them need output of previous one like depend on each other so i save all my intermediate state in dataframe and create temp view for that dataframe.
The program takes only 2 minutes for execute part but the problem is while writing them to csv file or show the results or calling count() function takes too much time. I have tries re-partition thing but still it is taking to much time.
1.What could be the solution?
2.Why it is taking too much time to write even all processing taking small amount of time?
I solved above problem with persist and cache in pyspark.
Spark is a lazy programming language. Two types of Spark RDD operations are- Transformations and Actions. A Transformation is a function that produces new RDD from the existing RDDs but when we want to work with the actual dataset, at that point Action is performed. When the action is triggered after the result, new RDD is not formed like transformation.
Every time i do some operation it was just transforming, so if i call that particular dataframe it will it parent query every time since spark is lazy,so adding persist stopped calling parent query multiple time. It saved lots of processing time.

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know it's some of the ideas behind the Spark Framework. But I couldn't answer it to myself by reading different tutorials.. (maybe the wrong ones).
I joined two DataFrames to a new one (nDF). Now I know, it's not yet proceeded, as long I say show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
..and so on, it would each time take a long time, since the original DataFrames are big. Couldn't I bring/copy the data to this new one. So I could solve these new actions as quick as on the original sets? (First I thought it's 'collect', but it only returns a Array, no DataFrame)
This is a classic scenario. When you join 2 Dataframes spark doesn't do any operation as it evaluates lazily when an action called on the resulting dataframe . Action mean show, count, print etc.
Now when show, count is being called on nDF, spark is evaluating the resultant dataframe every time i.e once when you called show, then when count is being called and so on. This means internally it is performing map/reduce every time an action is called on the resultant dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call the count/show it will evaluate the nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However the fist evaluation might be little slower also you need to using little more executor memory.
If the memory available to you is good with respect to the size of your dataset, what you're probably looking for is df.cache(). If the size of your dataset is too much, consider using df.persist() as it allows different levels of persistence.
Hope this is what you're looking for. Cheers

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.