I'm trying to get the Frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python Pandas. By frequency I mean, the highest occurring value in a table column (such as rank 1 value, rank 2, rank 3 etc. In the expected output, 1 has occurred 9 times in column a, so it has topmost frequency.
I'm using Spark SQL but it is not working out, may be because of the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is with query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName | column | min | max | frequency1 |
_____________+_________+______+_______+____________+
ParquetRDD | a | 1 | 30 | 9 |
_____________+_________+______+_______+____________+
ParquetRDD | b | 2 | 21 | 5 |
How do I solve this ? please help.
I could solve the issue with below with using count($field) instead of approx_count_distinct($field). Then I used Rank analytical function to get the first rank of value. It worked.
Related
I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?
My Python Transforms code looks like this:
from transforms.api import Input, Output, transform
#transform(
...
)
def my_compute_function(...):
...
df = df.join(df_2, ["joint_col"])
...
Theory
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include but are not limited to: joins, windows, groupBys.
These operations result in data movement across your Executors based upon the found values inside the DataFrames used. This means that when a used DataFrame has many repeated values on the column dictating the exchange, those rows all end up in the same task, thus increasing its size.
Example
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_1 | 2 |
| key_2 | 1 |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_2 | 2 |
| key_3 | 1 |
These DataFrames when joined together on col_1 will have the following data distributed across the executors:
Task 1:
Receives: 5 rows of key_1 from df1
Receives: 4 rows of key_1 from df2
Total Input: 9 rows of data sent to task_1
Result: 5 * 4 = 20 rows of output data
Task 2:
Receives: 1 row of key_2 from df1
Receives: 1 row of key_2 from df2
Total Input: 2 rows of data sent to task_2
Result: 1 * 1 = 1 rows of output data
Task 3:
Receives: 1 row of key_3 from df2
Total Input: 1 rows of data sent to task_3
Result: 1 * 0 = 0 rows of output data (missed key; no key found in df1)
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
Identification
The question now becomes how we identify that key_1 is the culprit of the skew since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual counts per key of the joint column. This means we can:
Aggregate each side of the join on the joint key and count the rows per key
Multiply the counts of each side of the join to determine the output row counts
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
Add df1 as input to a first path
Add Pivot Table board, using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, resulting in an output from the path which is only df1_col_1 and df1_COUNT.
Add df2 as input to a second path
Add Pivot Table board, again using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, resulting in an output from the path which is only df2_col_1 and df2_COUNT.
Create a third path, using the result of the first path (df1_col_1 and df1_COUNT1)
Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_col_1). Ensure the join type is Full join
Add all columns from the right side (you don't need to add a prefix, all the columns are unique
Configure the join board to join on df1_col_1 equals df2_col_1
Add an Expression board to create a new column, output_row_count which multiplies the two COUNT columns together
Add a Sort board that sorts on output_row_count descending
If you now preview the resultant data, you will have a sorted list of keys from both sides of the join that are causing the skew
I am pre-processing data for machine learning inputs, a target value column, call it "price" has many outliers, and rather than winsorizing price over the whole set I want to winsorize within groups labeled by "product_category". There are other features, product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups. Nor does is support window partitions.
df_data
.groupBy("product_category")
.approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute say the p01 and p99 within groups of a spark dataframe, for the purpose of replacing values beyond that range, ie winsorizing?
My dataset schema can be imagined like this, and its over 20MM rows with appx 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+--------+---------+
| item | product_category | price | pr_winz |
+---------+------------------+--------+---------+
| I000001 | XX11 | 1.99 | 5.00 |
| I000002 | XX11 | 59.99 | 59.99 |
| I000003 | XX11 |1359.00 | 850.00 |
+---------+------------------+--------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category
Here is what I came up with, after struggling with the documentation (there are two functions approx_percentile and percentile_approx that apparently do the same thing).
I was not able to figure out how to implement this except as a spark sql expression, not sure exactly why grouping only works there. I suspect because its part of Hive?
Spark DataFrame Winsorizor
Tested on DF in 10 to 100MM rows range
// Winsorize function, groupable by columns list
// low/hi element of [0,1]
// precision: integer in [1, 1E7-ish], in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
df.createOrReplaceTempView("df_table")
spark.sql(s"""
select distinct
*
, percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
, percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
from df_table
""")
.withColumn(winz_col + "_winz", expr(s"""
case when $winz_col <= p_low then p_low
when $winz_col >= p_hi then p_hi
else $winz_col end"""))
.drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by 'product_category' column.
val df_winsorized = grouped_winzo(
df_data
, "price"
, "product_category"
, 0.01, 0.99, 1000)
I'm preprocessing my data(2000K+ rows), and want to count the duplicated columns in a spark dataframe, for example:
id | col1 | col2 | col3 | col4 |
----+--------+-------+-------+-------+
1 | 3 | 999 | 4 | 999 |
2 | 2 | 888 | 5 | 888 |
3 | 1 | 777 | 6 | 777 |
In this case, the col2 and col4's values are the same, which is my interest, so let the count +1.
I had tried toPandas(), transpose, and then duplicateDrop() in pyspark, but it's too slow.
Is there any function could solve this?
Any idea will be appreciate, thank you.
So you want to count the number of duplicate values based on the columns col2 and col4? This should do the trick below.
val dfWithDupCount = df.withColumn("isDup", when($"col2" === "col4", 1).otherwise(0))
This will create a new dataframe with a new boolean column saying that if col2 is equal to col4, then enter the value 1 otherwise 0.
To find the total number of rows, all you need to do is do a group by based on isDup and count.
import org.apache.spark.sql.functions._
val groupped = df.groupBy("isDup").agg(sum("isDup")).toDF()
display(groupped)
Apologies if I misunderstood you. You could probably use the same solution if you were trying to match any of the columns together, but that would require nested when statements.
I have a data frame in scala spark as
category | score |
A | 0.2
A | 0.3
A | 0.3
B | 0.9
B | 0.8
B | 1
I would like to
add a row id column as
category | score | row-id
A | 0.2 | 0
A | 0.3 | 1
A | 0.3 | 2
B | 0.9 | 0
B | 0.8 | 1
B | 1 | 2
Basically I want the row id to be monotonically increasing for each distinct value in column category. I already have a sorted dataframe so all the rows with same category are grouped together. However, I still don't know how to generate the row_id that restarts when a new category appears. Please help!
This is a good use case for Window aggregation functions
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number.over(window))
Window functions work kind of like groupBy except that instead of each group returning a single value, each row in each group returns a single value. In this case the value is the row's position within the group of rows of the same category. Also, if this is the effect that you are trying to achieve, then you don't need to have pre-sorted the column category beforehand.
Here's the DataFrame:
id | sector | balance
---------------------------
1 | restaurant | 20000
2 | restaurant | 20000
3 | auto | 10000
4 | auto | 10000
5 | auto | 10000
How to find the count of each sector type and remove the records with sector type count below a specific LIMIT?
The following:
dataFrame.groupBy(columnName).count()
gives me the number of times a value appears in that column.
How to do it in Spark and Scala using DataFrame API?
You can use SQL Window to do so.
import org.apache.spark.sql.expressions.Window
yourDf.withColumn("count", count("*")
.over(Window.partitionBy($"colName")))
.where($"count">2)
// .drop($"count") // if you don't want to keep count column
.show()
For your given dataframe
import org.apache.spark.sql.expressions.Window
dataFrame.withColumn("count", count("*")
.over(Window.partitionBy($"sector")))
.where($"count">2)
.show()
You should see results like this:
id | sector | balance | count
------------------------------
3 | auto | 10000 | 3
4 | auto | 10000 | 3
5 | auto | 10000 | 3
Don't know if it is the best way. But this worked for me.
def getRecordsWithColumnFrequnecyLessThanLimit(dataFrame: DataFrame, columnName: String, limit: Integer): DataFrame = {
val g = dataFrame.groupBy(columnName)
.count()
.filter("count<" + limit)
.select(columnName)
.rdd
.map(r => r(0)).collect()
dataFrame.filter(dataFrame(columnName) isin (g:_*))
}
Since it's a dataframe you can use SQL query like
select sector, count(1)
from TABLE
group by sector
having count(1) >= LIMIT