How to split a large data frame and use the smaller parts to do multiple broadcast joins in Spark? - scala

Let's say we have two very large data frames - A and B. Now, I understand that if I use the same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join might be faster with reduced shuffling (the only shuffling will happen when the partitioner is applied to A and B).
I wanted to try something different though - I want to try a broadcast join, like so: say B is smaller than A, so we pick B to broadcast, but B is still a very big data frame. So what we want to do is make multiple data frames out of B and then broadcast each one to be joined against A.
Has anyone tried this?
To split one data frame into many, the only method I see is randomSplit, but that doesn't look like a great option.
Any other better way to accomplish this task?
Thanks!

Has anyone tried this?
Yes, someone has already tried that - in particular, GoDataDriven. You can find the details below:
presentation - https://databricks.com/session/working-skewed-data-iterative-broadcast
code - https://github.com/godatadriven/iterative-broadcast-join
They claim pretty good results for skewed data; however, there are three problems you have to consider before doing this yourself:
There is no split in Spark. You have to filter the data multiple times or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?) to imitate "splitting".
The huge advantage of broadcast is the reduction in the amount of transferred data. If the broadcast data is large, the amount of data to be transferred can actually increase significantly (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark).
Each "join" increases the complexity of the execution plan, and with a long series of transformations things can get really slow on the driver side.
randomSplit method but that doesn't look so great an option.
It is actually not a bad one.
Any other better way to accomplish this task?
You may try to filter by partition id.
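For illustration, a rough sketch of that idea in PySpark syntax (the Scala DataFrame API is nearly identical); dfA, dfB, and the join column "key" are placeholders, not taken from the question:
from functools import reduce
from pyspark.sql import functions as F

# Tag each row of B with the id of the partition it lives in.
n_chunks = dfB.rdd.getNumPartitions()
chunked = dfB.withColumn("_pid", F.spark_partition_id()).cache()

pieces = []
for i in range(n_chunks):
    # Filter out one chunk of B and broadcast just that chunk.
    b_piece = chunked.where(F.col("_pid") == i).drop("_pid")
    # broadcast() is only a hint; each chunk still has to fit in memory.
    pieces.append(dfA.join(F.broadcast(b_piece), on="key", how="inner"))

# For an inner join, the union of the per-chunk joins equals the full join.
result = reduce(lambda x, y: x.unionByName(y), pieces)
Note that, as mentioned above, this still filters B once per chunk unless chunked is cached, and each extra join grows the execution plan.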

Related

What is the correct way to organize the ptransforms in a beam pipeline?

I'm developing a pipeline that reads data from Kafka.
The source Kafka topic is quite big in terms of traffic: there are 10k messages inserted per second and each message is around 200 kB.
I need to filter the data in order to apply the transformations I need, but something I'm not sure about is whether there is an order in which I should apply the filter and window functions. Would
read->window->filter->transform->write
be more efficient than
read->filter->window->transform->write
or would both options give the same performance?
I know that Beam is just a model that only tells you the what and not the how, and that the runner optimizes the pipeline, but I just want to be sure I got it right.
Thanks
If there is substantial filtering, windowing after the filter will technically reduce the amount of work performed, though that saved work is cheap enough that I doubt it'd make a measurable difference. (Presumably the runner could notice that the filter does not observe the assigned window and lift it in that case, but as mentioned it's unclear if there are really savings to be gained here...)
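For concreteness, a minimal sketch of the filter-before-window ordering with the Beam Python SDK; the Kafka configuration, topic name, and keep_interesting predicate are made-up placeholders (and ReadFromKafka is a cross-language transform, so it needs the usual expansion-service setup, not shown here):
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms import window

def keep_interesting(msg):
    # msg is a (key, value) pair of bytes; replace with the real filter condition
    return b"interesting" in msg[1]

with beam.Pipeline() as p:
    (p
     | "Read" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "broker:9092"},
         topics=["input-topic"])
     | "Filter" >> beam.Filter(keep_interesting)             # drop records early...
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # ...then assign 60s windows
     | "Transform" >> beam.Map(lambda kv: kv)                # placeholder transform
     | "Write" >> beam.Map(print))                           # placeholder sink
Either ordering is correct; filtering first simply means fewer elements get a window assigned.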

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a single, very complicated-looking query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res:select from trades;
res:aj[`ccy`arrivalTime;
       res;
       select ccy:currency, arrivalTime:dateTime, fxRate from usdRates];
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for the time after / reverse aj is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In that case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

Efficient way to check if there are NA's in pyspark

I have a pyspark dataframe named df. I want to know whether its columns contain NA's; I don't care if it is just one row or all of them. The problem is that my current way to know if there are NA's is this one:
from pyspark.sql import functions as F

if df.where(F.isnull('column_name')).count() >= 1:
    print("There are nulls")
else:
    print("Yey! No nulls")
The issue I see here, is that I need to compute the number of nulls in the whole column, and that is a huge amount of time wasted, because I want the process to stop when it finds the first null.
I thought about the solution below, but I am not sure it works (I work on a cluster shared with a lot of other people, so execution time depends on the other jobs running in the cluster and I can't compare the two approaches under equal conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help? Are there more efficient ways to achieve this?
There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you're planning on running this query multiple times, with the answer changing each time, you should be aware (I don't mean to imply that you aren't) that if the answer is "there are no null values in the entire dataframe", then you will have to scan the entire dataframe to know this, and there isn't a fast way to do that. If you need this kind of information frequently and the answer can frequently be "no", you'll almost certainly want to persist this kind of information somewhere, and update it whenever you insert a record that might have null values by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case Spark used wide transformation and actually applies LocalLimit on each partition and shuffles partial results to perform GlobalLimit.
In the take case Spark used narrow transformation and evaluated LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
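Putting the isnull filter, the select(F.lit(True)) projection, and head() together, a small sketch ("column_name" is a placeholder for whichever column you care about):
from pyspark.sql import functions as F

# head(1) returns a list with at most one Row; an empty list means no null was found.
has_nulls = len(
    df.where(F.isnull("column_name"))   # keep only rows where the column is null
      .select(F.lit(True))              # drop the payload columns before collecting
      .head(1)
) > 0

print("There are nulls" if has_nulls else "Yey! No nulls")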
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.

Apache Spark Handling Skewed Data

I have two tables I would like to join together. One of them has a very bad skew of data. This is causing my spark job to not run in parallel as a majority of the work is done on one partition.
I have heard and read and tried to implement salting my keys to increase the distribution.
The technique at 12:45 in https://www.youtube.com/watch?v=WyfHUNnMutg is exactly what I would like to do.
Any help or tips would be appreciated. Thanks!
Yes, you should use salted keys on the larger table (via randomization) and then replicate the smaller one / cartesian-join it against the new salted one.
Here are a couple of suggestions:
Tresata skew join RDD https://github.com/tresata/spark-skewjoin
python skew join:
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
The tresata library looks like this:
import com.tresata.spark.skewjoin.Dsl._  // for the implicits
// the skewJoin() method is pulled in by the implicits
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
  DefaultSkewReplication(1)).sortByKey(true).collect.toList
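If you prefer to hand-roll the salting rather than pull in a library, here is a rough PySpark-flavoured sketch of the salt-and-replicate idea described above (big_df, small_df, the join column "key", N_SALTS, and the spark session are all placeholders):
from pyspark.sql import functions as F

N_SALTS = 16  # tune to how badly the hot keys are skewed

# Salt the large, skewed side: spread each key over N_SALTS random sub-keys.
big_salted = big_df.withColumn(
    "salted_key",
    F.concat(F.col("key"), F.lit("_"),
             (F.rand() * N_SALTS).cast("int").cast("string")))

# Replicate the small side: emit every row once per possible salt value.
salts = spark.range(N_SALTS).withColumnRenamed("id", "salt")
small_replicated = small_df.crossJoin(salts).withColumn(
    "salted_key",
    F.concat(F.col("key"), F.lit("_"), F.col("salt").cast("string")))

# Join on the salted key; the hot keys are now spread over N_SALTS partitions.
joined = big_salted.join(small_replicated, on="salted_key", how="inner")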

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since we can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
This implies that you can get the stddev of the combined dataset AB as sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
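In code, that means each dataset only has to keep (count, sum, sum of squares) per column, and any selection of datasets can be combined cheaply; a small illustrative sketch (not tied to any particular storage engine):
import math

def combine_stats(parts):
    """parts: iterable of (count, total, total_of_squares), one tuple per dataset."""
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    sq = sum(p[2] for p in parts)
    mean = s / n
    # stddev(X) = sqrt(E[X^2] - (E[X])^2) over the combined datasets
    return mean, math.sqrt(sq / n - mean * mean)

# e.g. dataset A = [1, 2, 3] and dataset B = [4, 5]:
print(combine_stats([(3, 6.0, 14.0), (2, 9.0, 41.0)]))  # (3.0, ~1.414)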
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, and another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, on which you can compute avg. etc. quite easily.
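To make that layout concrete, here is a rough sketch of reading one such per-column table with the happybase Python client; the table name, column family, dataset qualifiers, and row key are all made up for illustration:
import math
import happybase  # Thrift-based HBase client; any client would do

connection = happybase.Connection("hbase-host")
price_table = connection.table("price")  # one HBase table per original data column

# One HBase row per key; each qualifier holds that key's value in one dataset.
row = price_table.row(b"key-000001")     # e.g. {b"d:dataset_0017": b"12.5", ...}
wanted = {b"d:dataset_0017", b"d:dataset_0042"}  # datasets the user selected
values = [float(v) for q, v in row.items() if q in wanted]

mean = sum(values) / len(values)
stddev = math.sqrt(sum(v * v for v in values) / len(values) - mean * mean)
print(mean, stddev)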
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like your data set requires joins, you should be fine. You can have the indexes set up to find the data sets, and then either do the math in SQL or in the application.