How to keep previously created partitions for groupBy - Scala

I have a DataFrame which I repartitioned with repartitionByRange or a custom partitioner.
Now I'm running a groupBy on this DataFrame, but the groupBy causes a new repartition - hash partitioning on the columns selected to group by.
I was wondering how you can keep the originally created partitions for the groupBy. Of course the data should be partitioned by the columns selected to group by, but how can I specify a different hashing strategy for the groupBy?
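For context, a minimal sketch of the setup described above (the toy data and names are mine, not from the question); calling explain() on the grouped result shows whether Spark keeps the existing distribution or inserts a new Exchange hashpartitioning step for the aggregation:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("partition-reuse-check").master("local[*]").getOrCreate()
import spark.implicits._

// toy data standing in for the real DataFrame
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// range-partition on the column we later group by
val ranged = df.repartitionByRange(4, col("key"))

// the physical plan reveals whether a new Exchange is added for the groupBy
ranged.groupBy("key").agg(sum("value")).explain()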

Related

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and an external table, like so (pseudocode):
create view vwPageViews as
select * from PageViews
union all
select * from PageViewsHistory;
PageViews has data for the last 2 years. The external table has data older than 2 years.
If a user selects from the view with filters for the last 6 months, how does RS Spectrum handle it - does it read the entire external table even though none of its rows will be returned (and accordingly cost us money for all of it)? (Assuming the S3 files are Parquet based.)
e.g.
select * from vwPageViews where MyDate >= '01/01/2021'
What's the best approach for querying both the recent and the historical (cold) data using RS and Spectrum? Thanks!
How this will behave on Spectrum depends on whether or not you have provided partitions for the data in S3. Without partitions (and a WHERE clause on the partition), the Spectrum engines will have to read every file in S3 to determine whether the needed data is in any of them. The cost of this depends on the number and size of the files AND on their format. (CSV is more expensive than Parquet, for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This excludes files from being read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as it will likely be less granular than the date or timestamp you are using in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level WHERE clause, you will need two parts to the WHERE clause - WHERE date_col >= '2015-07-12' AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want to have the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, with their count and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
import org.apache.spark.sql.functions.{col, count, max}

val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The code above calculates the count of received timestamps for each combination of the groupBy columns, as well as the maximum timestamp for that combination.
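If you want the column named last_seen as in the question, a minimal follow-up sketch (just a rename of the aggregate above):
// same aggregation result, with the name asked for in the question
val lastSeenDf = mydf.withColumnRenamed("max_received_time", "last_seen")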

Scala - Spark Repartition not giving expected results

I want to repartition my Spark dataframe based on a column X. Say column X has 3 distinct values (X1, X2, X3). The number of distinct values could vary.
I want each partition to contain records with only one value of X, i.e. I want 3 partitions: one with records where X=X1, another with X=X2 and the last with X=X3.
I get the unique values of X from the dataframe with the query
val uniqueList = DF.select("X").distinct().map(x => x(0).toString).collect()
which gives the list of unique values correctly.
And to repartition I am doing
DF = DF.repartition(uniqueList.length, col("X"))
However, the partitions in DF are not coming out as expected. The data is not distributed correctly: one partition is empty, the second contains records with X1, and the third partition has records with both X2 and X3.
Can someone please help me figure out what I am missing?
EDIT:
My column X could have a varying number of unique values - it could have 3 or 3000 unique values.
If I do the following
DF = DF.repartition(col("X"))
I will only get 200 partitions, as that is the default value of spark.sql.shuffle.partitions. That is why I am passing the number of partitions explicitly.
If there are 3000 unique values of X, then I want to repartition my DF in such a way that there are 3000 partitions and each partition contains the records for one particular value of X, so that I can run mapPartitions and process each partition in parallel.
Repartitioning is based on hash partitioning (take the hash code of the partitioning key modulo the number of partitions), so whether each partition ends up with only one value is purely a matter of chance.
If you can map each partitioning key to a unique Int in the range zero to (number of unique values - 1), then, since the hash code of an Int in Scala is that integer, this would ensure that, as long as there are at least as many partitions as there are unique values, no partition has multiple distinct partitioning key values.
That said, coming up with the assignment of values to such Ints is inherently not parallelizable and requires either a sequential scan or knowing the distinct values ahead of time.
Probabilistically, the chance that a particular value hashes into a particular one of n partitions is 1/n. As n increases relative to the number of distinct values, the chance that no partition has more than one distinct value increases (at the limit, if you could have 2^32 partitions, nearly all of them would be empty, though an actual hash collision would still guarantee multiple distinct values in a partition). So if you can tolerate empty partitions, choosing a number of partitions sufficiently greater than the number of distinct values reduces the chance of a sub-ideal result.
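One concrete way to get the same guarantee (a sketch, using an explicit custom Partitioner at the RDD level rather than relying on Int hash codes; DF and uniqueList are the names from the question):
import org.apache.spark.Partitioner
import org.apache.spark.sql.Row

// index each distinct value of X: value -> 0 .. n-1
val indexOf: Map[String, Int] = uniqueList.zipWithIndex.toMap

// a partitioner that sends each key straight to its own partition
class ExactPartitioner(n: Int, index: Map[String, Int]) extends Partitioner {
  override def numPartitions: Int = n
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

// key the rows by X and repartition with the exact partitioner
val partitionedRdd = DF.rdd
  .keyBy(row => row.getAs[String]("X"))
  .partitionBy(new ExactPartitioner(uniqueList.length, indexOf))
  .values
You can then run mapPartitions on partitionedRdd; to get a DataFrame back you would need to re-apply the original schema with spark.createDataFrame.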
By any chance, does your column X contain null values? Then Spark tries to create one partition for these. Since you are also giving the number of partitions as an int, Spark may be squishing X2 and X3 together. So you can try two things - just give the column name for repartitioning (you may still get one partition extra), or try removing the null values from X, if they exist.
Does this work?
val repartitionedDF = DF.repartition(col("X"))
Here's an example I blogged about
Data:
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
Code:
df
.repartition(col("country"))
.write
.partitionBy("country")
.parquet(outputPath)
Filesystem output:
partitioned_lake1/
country=Argentina/
part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
country=China/
part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
country=Russia/
part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet

Apache Spark, range-joins, data skew and performance

I have the following Apache Spark SQL join predicate:
t1.field1 = t2.field1 and t2.start_date <= t1.event_date and t1.event_date < t2.end_date
data:
the t1 DataFrame has over 50 million rows
the t2 DataFrame has over 2 million rows
almost all t1.field1 values in the t1 DataFrame are the same (null).
Right now the Spark cluster hangs for more than 10 minutes on a single task while performing this join, because of data skew. Only one worker, and one task on that worker, is doing anything at this point; the other 9 workers are idle. How can I improve this join so that the load is spread from this one task across the whole Spark cluster?
I am assuming you are doing an inner join.
The steps below can be followed to optimise the join (a sketch of steps 2 and 4 follows this list):
1. Before joining, filter t1 and t2 on the smallest or largest start_date, event_date and end_date. It will reduce the number of rows.
2. Check whether the t2 dataset has null values for field1; if it does not, the t1 dataset can be filtered on a notNull condition before the join. It will reduce the size of t1.
3. If your job is only getting a few of the available executors, you have too few partitions. Simply repartition the dataset, and pick a sensible number so that you don't end up with a very large number of partitions, or vice versa. You can check that the partitioning is even (no skew) by looking at the task execution times; they should be similar.
4. If the smaller dataset can fit in executor memory, a broadcast join can be used.
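A rough sketch of steps 2 and 4, assuming t1 and t2 are DataFrames with the columns from the question (whether the broadcast is feasible depends on t2 actually fitting in executor memory):
import org.apache.spark.sql.functions.{broadcast, col}

// step 2: drop the skewed null keys from t1 before the join
val t1NotNull = t1.filter(col("field1").isNotNull)

// step 4: broadcast the smaller side so t1 does not need to be shuffled;
// only viable if t2 fits in executor memory
val joined = t1NotNull.join(
  broadcast(t2),
  t1NotNull("field1") === t2("field1") &&
    t2("start_date") <= t1NotNull("event_date") &&
    t1NotNull("event_date") < t2("end_date")
)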
You might like to read - https://github.com/vaquarkhan/Apache-Kafka-poc-and-notes/wiki/Apache-Spark-Join-guidelines-and-Performance-tuning
If almost all the rows in t1 have t1.field1 = null, and the event_date column is numeric (or you convert it to a timestamp), you can first use Apache DataFu to do a ranged join, and filter out the rows in which t1.field1 != t2.field1 afterwards.
The range join code would look like this:
t1.joinWithRange("event_date", t2, "start_date", "end_date", 10)
The last argument - 10 - is the decrease factor. This does bucketing, as Raphael Roth suggested in his answer.
You can see an example of such a ranged join in the blog post introducing DataFu-Spark.
Full disclosure - I am a member of DataFu and wrote the blog post.
I assume Spark already pushes a not-null filter on t1.field1; you can verify this in the explain plan.
I would rather experiment with creating an additional attribute that can be used as an equi-join condition, e.g. by bucketing. For example, you could create a month attribute. To do this, you would need to enumerate the months in t2, which is usually done using UDFs. See this SO question for an example: How to improve broadcast Join speed with between condition in Spark
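A rough sketch of that bucketing idea (my own take: the month column is generated with the built-in sequence/explode functions rather than a UDF, and the column names are those from the question):
import org.apache.spark.sql.functions._

// t2: one row per month covered by [start_date, end_date], so that
// month can be used as an equi-join key
val t2Bucketed = t2.withColumn(
  "month",
  explode(sequence(
    trunc(col("start_date"), "month"),
    trunc(col("end_date"), "month"),
    expr("interval 1 month")))
)

// t1: the month the event falls into
val t1Bucketed = t1.withColumn("month", trunc(col("event_date"), "month"))

// equi-join on (field1, month), then re-apply the original range predicate
val joined = t1Bucketed
  .join(t2Bucketed, Seq("field1", "month"))
  .where(col("start_date") <= col("event_date") && col("event_date") < col("end_date"))
Note that sequence with an interval step requires Spark 2.4 or later, and this assumes start_date, end_date and event_date are date columns.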

How to do it in Spark, i.e. iterate over groups and save each group as a file, one at a time?

I have huge data, which is accumulated each year, quarter-wise.
This data is a bit skewed. When I try to get all the data into one dataframe by repartitioning it on ("year", "quarter"), it shuffles a lot of data and spills to disk, which makes my job slow; moreover, only one executor is doing the work 80% of the time.
Hence I decided to
1) get the distinct groups of the dataframe, grouping by year and quarter;
2) iterate/loop over these distinct groups:
   - fetch the data where the year (and quarter) match those of the group,
   - save this dataframe/group as a parquet file,
   - continue the iteration.
In Java we can use a for loop over the groups,
but how do we do it in Spark with Scala?
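A minimal sketch of the loop described above (the column names year and quarter and the output path are assumptions on my part); note that df.write.partitionBy("year", "quarter") would produce a similar one-directory-per-group layout in a single pass, as in the partitionBy example earlier on this page:
import org.apache.spark.sql.functions.col

// 1) distinct (year, quarter) groups, collected to the driver
val groups = df.select("year", "quarter").distinct().collect()

// 2) loop over the groups, filter the data for each one and save it as parquet
groups.foreach { row =>
  val year    = row.get(0)
  val quarter = row.get(1)

  df.filter(col("year") === year && col("quarter") === quarter)
    .write
    .mode("overwrite")
    .parquet(s"/output/path/year=$year/quarter=$quarter")   // hypothetical output path
}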