Scala - Spark Repartition not giving expected results

I want to repartition my Spark dataframe based on a column X. Say column X has 3 distinct values (X1, X2, X3); the number of distinct values can vary.
I want each partition to contain records with only one value of X, i.e. I want 3 partitions: one with records where X=X1, another with X=X2 and the last with X=X3.
I get the unique values of X from the dataframe with the query
val uniqueList = DF.select("X").distinct().map(x => x(0).toString).collect()
which correctly gives the list of unique values.
And to repartition I am doing
DF = DF.repartition(uniqueList.length, col("X"))
However, my partitions in DF are not coming out as expected. The data is not distributed correctly: one partition is empty, the second contains records with X1, and the third has records with both X2 and X3.
Can someone please help me figure out what I am missing?
EDIT:
My column X can have a varying number of unique values; it could have 3 or 3000.
If I do the following
DF = DF.repartition(col("X"))
I will only get 200 partitions, since that is the default value of spark.sql.shuffle.partitions. That is why I am passing the number of partitions explicitly.
If there are 3000 unique values of X then I want to repartition my DF in such a way that there are 3000 partitions and each partition contains records for one particular value of X, so that I can run mapPartitions and process each partition in parallel.

Repartitioning is based on hash partitioning (take the hash code of the partitioning key modulo the number of partitions), so whether each partition ends up with only one value is purely a matter of chance.
If you can map each partitioning key to a unique Int in the range zero to (number of unique values - 1), then, since the hash code of an Int in Scala is that integer, having at least as many partitions as there are unique values ensures that no partition holds multiple distinct partitioning-key values.
That said, coming up with the assignment of values to such Ints is inherently not parallelizable and requires either a sequential scan or knowing the distinct values ahead of time.
Probabilistically, the chance that a particular value hashes into a particular one of n partitions is 1/n. As n increases relative to the number of distinct values, the chance that no partition has more than one distinct value increases (at the limit, if you could have 2^32 partitions, nearly all of them would be empty, though an actual hash collision would still put multiple distinct values into one partition). So if you can tolerate empty partitions, choosing a number of partitions sufficiently greater than the number of distinct values reduces the chance of a suboptimal result.
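For illustration, a minimal sketch of that idea (identifiers like indexOf are mine, not from the question): assign each distinct X a consecutive Int index and partition an RDD by it. The RDD HashPartitioner uses key.hashCode % numPartitions, and an Int's hash code is the Int itself, so each index lands in its own partition.
import org.apache.spark.HashPartitioner

// Assumes DF is the original DataFrame and X holds string-like values, as in the question.
val distinctX = DF.select("X").distinct().collect().map(_.get(0).toString)
val indexOf   = distinctX.zipWithIndex.toMap               // X value -> 0 .. n-1

val partitionedRdd = DF.rdd
  .map(row => (indexOf(row.getAs[Any]("X").toString), row)) // key each row by its index
  .partitionBy(new HashPartitioner(distinctX.length))       // one distinct X per partition
  .values
From there you can run mapPartitions on partitionedRdd. Note this sketch drops to the RDD API, where HashPartitioner uses the key's hashCode; DataFrame.repartition applies its own hash of the column values, so the same guarantee does not automatically carry over there.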

By any chance, does your column X contain null values? If so, Spark tries to create one partition for them. Since you are also passing the number of partitions as an int, Spark may be squeezing X2 and X3 together. So you can try two things: just give the column name for repartitioning (you may still get one extra partition), or remove the null values from X, if they exist.

Does this work?
val repartitionedDF = DF.repartition(col("X"))
Here's an example I blogged about
Data:
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
Code:
df
.repartition(col("country"))
.write
.partitionBy("country")
.parquet(outputPath)
Filesystem output:
partitioned_lake1/
country=Argentina/
part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
country=China/
part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
country=Russia/
part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet

Related

How to keep partitions created before, for group by

I have a DataFrame, which I repartitioned with repartitionByRange or custom-partitioner.
Now I'm running a Group By on this DataFrame, but the GroupBy causes a new repartition: hash partitioning on the columns selected to group by.
I was wondering how you can keep the originally created partitions for the GroupBy. Of course the data should be partitioned by the selected grouping columns, but how can I specify a different hashing strategy for the group by?

Apache Spark, range-joins, data skew and performance

I have the following Apache Spark SQL join predicate:
t1.field1 = t2.field1 and t2.start_date <= t1.event_date and t1.event_date < t2.end_date
data:
The t1 DataFrame has over 50 million rows.
The t2 DataFrame has over 2 million rows.
Almost all t1.field1 values in the t1 DataFrame are the same (null).
Right now the Spark cluster hangs for more than 10 minutes on a single task performing this join, because of data skew. Only one worker, running one task, is busy at that point; the other 9 workers are idle. How can I improve this join so that the load is distributed from this one task across the whole Spark cluster?
I am assuming you are doing an inner join.
The following steps can be followed to optimise the join:
1. Before joining, filter t1 and t2 on the smallest or largest start_date, event_date and end_date. This reduces the number of rows.
2. Check whether t2 has null values for field1; if not, t1 can be filtered with a notNull condition before the join, which reduces the size of t1.
3. If your job gets only a few of the available executors, you have too few partitions. Simply repartition the dataset, choosing an optimal number so that you don't end up with too many or too few partitions.
4. You can check whether the partitioning is balanced (no skew) by looking at task execution times; they should be similar.
5. If the smaller dataset fits in executor memory, a broadcast join can be used (a sketch follows below).
You might like to read - https://github.com/vaquarkhan/Apache-Kafka-poc-and-notes/wiki/Apache-Spark-Join-guidelines-and-Performance-tuning
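A hedged sketch of the broadcast-join suggestion from step 5, assuming t2 is small enough to fit in executor memory (the column names are taken from the question):
import org.apache.spark.sql.functions.broadcast

val joined = t1.join(
  broadcast(t2),                         // ship the smaller table to every executor
  t1("field1") === t2("field1") &&
    t2("start_date") <= t1("event_date") &&
    t1("event_date") <  t2("end_date"))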
If almost all the rows in t1 have t1.field1 = null, and the event_date column is numeric (or you convert it to a timestamp), you can first use Apache DataFu to do a ranged join, and filter out the rows in which t1.field1 != t2.field1 afterwards.
The range join code would look like this:
t1.joinWithRange("event_date", t2, "start_date", "end_date", 10)
The last argument - 10 - is the decrease factor. This does bucketing, as Raphael Roth suggested in his answer.
You can see an example of such a ranged join in the blog post introducing DataFu-Spark.
Full disclosure - I am a member of DataFu and wrote the blog post.
I assume Spark already pushes down the not-null filter on t1.field1; you can verify this in the explain plan.
I would rather experiment with creating an additional attribute which can be used as an equi-join condition, e.g. by bucketing. For example, you could create a month attribute. To do this, you would need to enumerate the months in t2, which is usually done with UDFs. See this SO question for an example: How to improve broadcast Join speed with between condition in Spark
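As a rough sketch of that month-bucketing idea (assuming the date columns are actual date types and Spark >= 2.4 for sequence(); the names t1m, t2m and joined are mine):
import org.apache.spark.sql.functions._

// Bucket t1 by the month of its event date.
val t1m = t1.withColumn("month", trunc(col("event_date"), "month"))

// Explode each t2 interval into one row per month it covers.
val t2m = t2.withColumn("month",
  explode(sequence(trunc(col("start_date"), "month"),
                   trunc(col("end_date"), "month"),
                   expr("interval 1 month"))))

// Equi-join on the month bucket, keeping the original range predicate.
val joined = t1m.join(t2m,
  t1m("month") === t2m("month") &&
    t1m("field1") === t2m("field1") &&
    t2m("start_date") <= t1m("event_date") &&
    t1m("event_date") <  t2m("end_date"))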

Create sample value for failure records spark

I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate that the length of every column value equals 100. Based on the validation I create status columns a_status, b_status, c_status with values 5 (success) and 10 (failure). For failures I need to update a count and create new columns a_sample, b_sample, c_sample, each containing up to 5 sample failure values separated by ",". To create the sample columns I tried this:
df = df.select(df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5)))
      .alias(x + "_sample")).toList: _*)
The getSample method just takes the array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size. However, when the number of columns is > 200 and the data has > 1 million rows, it has a huge performance impact. Is there an alternative approach?
While the details of your problem statement are unclear, you can break up the task into two parts:
1. Transform the data into a format where you identify the different types of rows you need to sample.
2. Collect a sample by row type.
The industry jargon for "row type" is stratum/strata, and the way to do (2) without collecting data to the driver (which you don't want to do when the data is large) is stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it works with fractions rather than exact row counts. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter unneeded rows, e.g., using the row_number() window function followed by a filter 'row_num < n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
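For example, a rough illustration of the sampleBy / oversample-then-trim idea for a single column (the column and status names follow the question; the fraction and seed are arbitrary):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Stratified sample: take roughly 1% of the rows whose a_status is 10 (failure).
val sampled = df.stat.sampleBy("a_status", Map(10 -> 0.01), seed = 42L)

// Trim to at most 5 failure samples using a window + row_number.
val w = Window.partitionBy(col("a_status")).orderBy(rand(42L))
val fiveSamples = sampled
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") <= 5)
  .drop("row_num")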

SparkSQL PostgresQL Dataframe partitions

I have a very simple setup of SparkSQL connecting to a Postgres DB, and I'm trying to get a DataFrame from a table with a given number X of partitions (let's say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
options.put("dbtable", "select ID, OTHER from TABLE limit 1000");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
From what I can understand, lowerBound/upperBound are the parameters used to fine-tune this. SparkSQL's documentation (Spark 1.4.0 - spark-sql_2.11) says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
Is the stride the frequency (number of elements returned per query) with which Spark will query the DB for each executor (partition)?
If not, what is the purpose of these parameters, what do they depend on, and how can I balance my DataFrame partitions in a stable way (not asking for all partitions to contain the same number of elements, just that there is some equilibrium; for example, with 2 partitions and 100 elements, 55/45, 60/40 or even 65/35 would do)?
I can't seem to find a clear answer to these questions and was wondering if some of you could clear these points up for me, because right now this is affecting my cluster performance when processing X million rows and all the heavy lifting goes to one single executor.
Cheers and thanks for your time.
Essentially the lower and upper bound and the number of partitions are used to calculate the increment or split for each parallel task.
Let's say the table has partition column "year", and has data from 2006 to 2016.
If you define the number of partitions as 10, with lower bound 2006 and higher bound 2016, you will have each task fetching data for its own year - the ideal case.
Even if you incorrectly specify the lower and/or upper bound, e.g. set lower = 0 and upper = 2016, there will be skew in the data transfer, but you will not "lose" or fail to retrieve any data, because:
The first task will fetch data for year < 0.
The second task will fetch data for year between 0 and 2016/10.
The third task will fetch data for year between 2016/10 and 2*2016/10.
...
And the last task will have a where condition with year->2016.
T.
Lower and upper bounds are indeed used against the partitioning column; refer to this code (the current version at the time of writing):
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
Function columnPartition contains the code for the partitioning logic and the use of lower / upper bound.
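For intuition, here is a simplified Scala sketch of what that partitioning logic computes (not the actual Spark source), applied to the values from the question:
// lowerBound/upperBound/numPartitions from the question's options map.
val lowerBound    = 100L
val upperBound    = 500L
val numPartitions = 2
val stride        = (upperBound - lowerBound) / numPartitions   // 200

val predicates = (0 until numPartitions).map { i =>
  val start = lowerBound + i * stride
  if (i == 0)                      s"ID < ${start + stride} OR ID IS NULL"
  else if (i == numPartitions - 1) s"ID >= $start"
  else                             s"ID >= $start AND ID < ${start + stride}"
}
// predicates: Vector("ID < 300 OR ID IS NULL", "ID >= 300")
// The bounds only shape these WHERE clauses; rows with ID outside [100, 500] still land in the
// first or last partition, which is why a skewed ID distribution skews the partition sizes.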
lowerBound and upperBound do what the previous answers describe. A follow-up question is how to balance the data across partitions without looking at the min/max values, or when your data is heavily skewed.
If your database supports a "hash" function, it can do the trick:
partitionColumn = "hash(column_name)%num_partitions"
numPartitions = 10 // whatever you want
lowerBound = 0
upperBound = numPartitions
This will work as long as the modulo operation returns a uniform distribution over [0, numPartitions).
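A hedged example of wiring that up against Postgres: the hash column is computed in a dbtable subquery, and hashtext is a Postgres-specific function, so substitute whatever hash your database offers. DB_URL is reused from the question; everything else is illustrative.
val numPartitions = 10
val df = sqlContext.read.format("jdbc")
  .option("url", DB_URL)
  .option("dbtable", s"(select t.*, abs(hashtext(ID::text)) % $numPartitions as part from TABLE t) as tmp")
  .option("partitionColumn", "part")
  .option("lowerBound", "0")
  .option("upperBound", numPartitions.toString)
  .option("numPartitions", numPartitions.toString)
  .load()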

Ordered Data with Cassandra RandomPartitioner

I have about a billion pieces of data that I would like to store in Cassandra. The data items are ordered by time, and one of the main queries I'll be doing is to find the items between two time ranges, in order. I'd really prefer to use the RandomParititioner, if at all possible. Is there a way to do this in Cassandra?
At first, since I'm coming from SQL, I assumed I should create each event as a row, but then it occurred to me that I was thinking about it the wrong way and I should really use columns. Columns in Cassandra seem to be ordered, but I'm confused as to just how ordered they are. If I use a time as the column name, is there a way for me to get all of the columns from one time to another in order?
Another thing I looked at was the 0.7 feature of secondary indices, but I've had trouble finding documentation for whether I can use these to view a range of things in order.
All I want is the Cassandra equivalent of this SQL: "Select * from Stuff where date > X and date < Y order by date asc". How can I do this?
The partitioner only affects the distribution of keys around the ring, not the order of columns within a key. Columns are always ordered according to the Column Comparator defined for the column family.
You can call get_slice with a SlicePredicate that specifies a SliceRange to get all the columns of a key within a range.
To model your data, you can create 1 row for each day (or suitable time shard) and have a column for each piece of data. Something like,
"yyyy-mm-dd" : { #key, one for each day
timeStampMillis1:dataid1 : "value1" # one column for each piece of data
timeStampMillis2:dataid2 : "value2"
timeStampMillis3:dataid3 : "value3"
}
The column names should be binary, using the binary comparator. The first 8 bytes are the timestamp, while the rest of the bytes are the id of the data.
Assuming X and Y are on the same day, to find all items between X and Y, do a get_slice on the day key, with a SlicePredicate whose SliceRange specifies a start of X and a finish of Y+1. Both start and finish are byte arrays of 8 bytes.
To find data over multiple days, read from multiple keys.
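A rough sketch of that get_slice call through the Thrift client of that era (Cassandra 0.7); the column-family name, day key, timestamps x and y, and the open client connection are assumptions for illustration:
import java.nio.ByteBuffer
import org.apache.cassandra.thrift.{ColumnParent, ConsistencyLevel, SlicePredicate, SliceRange}

// 8-byte big-endian timestamp, matching the column-name prefix described above.
def tsBytes(ts: Long): ByteBuffer = {
  val b = ByteBuffer.allocate(8)
  b.putLong(ts)
  b.rewind()
  b
}

val dayKey    = ByteBuffer.wrap("2011-03-05".getBytes("UTF-8"))    // the day row key
val range     = new SliceRange(tsBytes(x), tsBytes(y + 1), false, Int.MaxValue)
val predicate = new SlicePredicate().setSlice_range(range)

// client is an already-connected Cassandra.Client (Thrift) instance.
val columns = client.get_slice(dayKey, new ColumnParent("Stuff"), predicate, ConsistencyLevel.QUORUM)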