Let's say I have a table:
name | val
----------
a    | 10
b    | 10
c    | 20
d    | 30
and I also have a user input of 25. How do I select rows, in ascending order of val, until the running sum of val first reaches or goes over the user input? For a user input of 25 I would get back the first three rows, which give me a total of 40. The equivalent code for what I'm trying to do is
total = 0
user_input = 25
while total < user_input and rows_by_val_asc_iterator.has_next():
    row = rows_by_val_asc_iterator.next()
    total = total + row.val
You could do this using something called a "window function", which lets you compute a running total. You then keep rows until the running total exceeds your desired value.
For example, try something like this:
select name, val
from (
    select name, val, sum(val) over (order by val, name) as total
    from vals
) as t
where total - val < 25
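To see why this works: the subquery computes a running total ordered by val, so total - val is the sum of all the rows before the current one. A row is kept as long as that preceding sum is still below the user input (25 here), which means the row that finally pushes the sum over the limit is still included. For the sample data the subquery produces:

name | val | total
a    | 10  | 10
b    | 10  | 20
c    | 20  | 40
d    | 30  | 70

Rows a, b and c satisfy total - val < 25 (0, 10 and 20 respectively), while d does not (40), so the query returns the first three rows with a combined val of 40.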
I have the following df:
index  initial_range  final_range
1      1000000        5999999
2      6000000        6299999
3      6300000        6399999
4      6400000        6499999
5      6600000        6699999
6      6700000        6749999
7      6750000        6799999
8      7000000        7399999
9      7600000        7699999
10     7700000        7749999
11     7750000        7799999
12     6500000        6549999
Note that the 'initial_range' and 'final_range' fields define coverage intervals.
When we compare row index 1 with row index 2, the interval in index 1 ends at 5999999 and the interval in index 2 starts at 6000000, i.e. the 'initial_range' of index 2 is exactly sequence+1 of the previous 'final_range'. I need to group these cases and return the following df:
index  initial_range  final_range  grouping
1      1000000        5999999      1000000-6549999
2      6000000        6299999      1000000-6549999
3      6300000        6399999      1000000-6549999
4      6400000        6499999      1000000-6549999
5      6600000        6699999      6600000-6799999
6      6700000        6749999      6600000-6799999
7      6750000        6799999      6600000-6799999
8      7000000        7399999      7000000-7399999
9      7600000        7699999      7600000-7799999
10     7700000        7749999      7600000-7799999
11     7750000        7799999      7600000-7799999
12     6500000        6549999      1000000-6549999
Note that the grouping field holds the new coverage ranges, built as min(initial_range)-max(final_range) over each run of rows whose intervals continue each other, until the sequence is broken.
Some details:
Between index 4 and index 5 the sequence+1 is broken, so the 'grouping' changes. In other words, every time the sequence is broken a new grouping needs to be started.
In index 12 the grouping 1000000-6549999 appears again, because 6500000 is the number right after 6499999 in index 4.
I tried this code:
comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
But the sequence logic didn't work.
Can anyone help me?
Well, this was a tough one; here is my answer.
First of all, I am using a UDF, so expect the performance to be somewhat worse than with built-in functions.
import copy
import pyspark.sql.functions as F
from pyspark.sql.types import *

rn = 0

def check_vals(x, y):
    global rn
    if (y is not None) and (int(x) + 1) == int(y):
        # The next row continues the sequence, so this row keeps the current group number
        return rn + 1
    else:
        # The sequence breaks after this row: remember the current group number...
        res = copy.copy(rn)
        # ...bump the counter so that the next row starts a new group...
        rn += 1
        # ...and still return the old group number for this row
        return res + 1

rn_udf = F.udf(lambda x, y: check_vals(x, y), IntegerType())
Next,
from pyspark.sql.window import Window
# We want to check the final_range values according to the initial_value
w = Window().orderBy(F.col('initial_range'))
# First of all take the next row values of initial range in a column called nextRange so that we can compare
# Check if the final_range+1 == nextRange, if yes use rn value, if not then use rn and increment it for the next iteration.
# Now find the max and min values in the partition created by the check_1 column.
# Concat min and max values
# order it by ID to get the initial ordering, I have to cast it to integer but you might not need it
# drop all calculated values
df.withColumn('nextRange', F.lead('initial_range').over(w)) \
  .withColumn('check_1', rn_udf("final_range", "nextRange")) \
  .withColumn('min_val', F.min("initial_range").over(Window.partitionBy("check_1"))) \
  .withColumn('max_val', F.max("final_range").over(Window.partitionBy("check_1"))) \
  .withColumn('range', F.concat("min_val", F.lit("-"), "max_val")) \
  .orderBy(F.col("ID").cast(IntegerType())) \
  .drop("nextRange", "check_1", "min_val", "max_val") \
  .show(truncate=False)
Output:
+---+-------------+-----------+---------------+
|ID |initial_range|final_range|range |
+---+-------------+-----------+---------------+
|1 |1000000 |5999999 |1000000-6549999|
|2 |6000000 |6299999 |1000000-6549999|
|3 |6300000 |6399999 |1000000-6549999|
|4 |6400000 |6499999 |1000000-6549999|
|5 |6600000 |6699999 |6600000-6799999|
|6 |6700000 |6749999 |6600000-6799999|
|7 |6750000 |6799999 |6600000-6799999|
|8 |7000000 |7399999 |7000000-7399999|
|9 |7600000 |7699999 |7600000-7799999|
|10 |7700000 |7749999 |7600000-7799999|
|11 |7750000 |7799999 |7600000-7799999|
|12 |6500000 |6549999 |1000000-6549999|
+---+-------------+-----------+---------------+
I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?
My Python Transforms code looks like this:
from transforms.api import Input, Output, transform
@transform(
...
)
def my_compute_function(...):
...
df = df.join(df_2, ["joint_col"])
...
Theory
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include but are not limited to: joins, windows, groupBys.
These operations move data across your Executors based on the values found in the DataFrames being exchanged. When a DataFrame has many repeated values in the column dictating the exchange, all of those rows end up in the same task, which increases its size.
Example
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_1 | 2 |
| key_2 | 1 |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_2 | 2 |
| key_3 | 1 |
When these DataFrames are joined together on col_1, the data will be distributed across the executors as follows:
Task 1:
Receives: 5 rows of key_1 from df1
Receives: 4 rows of key_1 from df2
Total Input: 9 rows of data sent to task_1
Result: 5 * 4 = 20 rows of output data
Task 2:
Receives: 1 row of key_2 from df1
Receives: 1 row of key_2 from df2
Total Input: 2 rows of data sent to task_2
Result: 1 * 1 = 1 row of output data
Task 3:
Receives: 1 row of key_3 from df2
Total Input: 1 row of data sent to task_3
Result: 1 * 0 = 0 rows of output data (no matching key found in df1)
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
Identification
The question now becomes how we identify that key_1 is the culprit of the skew since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual counts per key of the joint column. This means we can:
Aggregate each side of the join on the joint key and count the rows per key
Multiply the counts of each side of the join to determine the output row counts
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
Add df1 as input to a first path
Add Pivot Table board, using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, resulting in an output from the path which is only df1_col_1 and df1_COUNT.
Add df2 as input to a second path
Add Pivot Table board, again using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, resulting in an output from the path which is only df2_col_1 and df2_COUNT.
Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)
Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join
Add all columns from the right side (you don't need to add a prefix; all the columns are unique)
Configure the join board to join on df1_col_1 equals df2_col_1
Add an Expression board to create a new column, output_row_count which multiplies the two COUNT columns together
Add a Sort board that sorts on output_row_count descending
If you now preview the resultant data, you will have a sorted list of keys from both sides of the join that are causing the skew
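If you prefer to do the same check in code rather than in Contour, a minimal PySpark sketch of the idea looks like the following. It assumes the DataFrames and join column from the snippet above (df, df_2 and joint_col); swap in your real names and run it as a throwaway analysis.

import pyspark.sql.functions as F

# Count the rows per join key on each side of the join
left_counts = df.groupBy("joint_col").agg(F.count("*").alias("df_count"))
right_counts = df_2.groupBy("joint_col").agg(F.count("*").alias("df_2_count"))

# Full outer join the two count tables and multiply the counts to estimate
# how many output rows each key will produce
skew_report = (
    left_counts
    .join(right_counts, ["joint_col"], "full_outer")
    .withColumn(
        "output_row_count",
        F.coalesce("df_count", F.lit(0)) * F.coalesce("df_2_count", F.lit(0)),
    )
    .orderBy(F.col("output_row_count").desc())
)

# The keys at the top of this listing are the ones driving the skew
skew_report.show()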
I am pre-processing data for machine learning inputs. A target value column, call it "price", has many outliers, and rather than winsorizing price over the whole set I want to winsorize it within groups labeled by "product_category". There are other features; product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups, nor does it support window partitions.
df_data
.groupBy("product_category")
.approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute, say, the p01 and p99 within groups of a Spark dataframe, for the purpose of replacing values beyond that range, i.e. winsorizing?
My dataset schema can be imagined like this. It's over 20MM rows with approx. 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+--------+---------+
| item | product_category | price | pr_winz |
+---------+------------------+--------+---------+
| I000001 | XX11 | 1.99 | 5.00 |
| I000002 | XX11 | 59.99 | 59.99 |
| I000003 | XX11 |1359.00 | 850.00 |
+---------+------------------+--------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category
Here is what I came up with, after struggling with the documentation (there are two functions approx_percentile and percentile_approx that apparently do the same thing).
I was not able to figure out how to implement this except as a Spark SQL expression; I'm not sure exactly why grouping only works there. I suspect it's because it's part of Hive?
Spark DataFrame Winsorizer
Tested on DataFrames in the 10 to 100MM row range
// Winsorize function, groupable by columns list
// low/hi element of [0,1]
// precision: integer in [1, 1E7-ish], in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
import org.apache.spark.sql.functions.expr

def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
  df.createOrReplaceTempView("df_table")

  spark.sql(s"""
    select distinct
      *
      , percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
      , percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
    from df_table
  """)
  .withColumn(winz_col + "_winz", expr(s"""
    case when $winz_col <= p_low then p_low
         when $winz_col >= p_hi then p_hi
         else $winz_col end"""))
  .drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by 'product_category' column.
val df_winsorized = grouped_winzo(
  df_data
  , "price"
  , "product_category"
  , 0.01, 0.99, 1000)
I'd like to be able to select rows in batches from a very large keyed table being stored remotely on disk. As a toy example to test my function I set up the following tables t and nt...
t:([sym:110?`A`aa`Abc`B`bb`Bac];px:110?10f;id:1+til 110)
nt:0#t
I select from the table only the records that begin with the character "A", count those records, divide the count by the number of rows I would like to fetch for each function call (10), and round that up to the nearest whole number...
aRec:select from t where sym like "A*"
counter:count aRec
divy:counter%10
divyUP:ceiling divy
Next I set an idx variable to 0 and write an if statement as the parameterized function. This checks if idx equals divyUP. If not, then it should select 10 rows of aRec starting at row x, upsert those to the nt table, increment the function argument, x, by 10, and increment the idx variable by 1. Once the idx variable and divyUP are equal it should exit the function...
idx:0
batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
However when I call the function it returns a type error...
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
^
I've tried using it with sublist too, though I get the same result...
batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
^
However, issuing either of those commands outside of the function returns the expected results...
q)select[0 10] from aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
q)0 10 sublist aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
The issue is that in your example, select[] and sublist require a list as input, but your input is not a list. The reason is that when one of the items is a variable, the whole thing is no longer a simple list, meaning a blank (space) cannot be used to separate the values. In this case, a semicolon is required.
q) x:2
q) (1;x) / (1 2)
Select command: Change input to (x;10) to make it work.
q) t:([]id:1 2 3; v: 3 4 5)
q) {select[(x;2)] from t} 1
id v
----
2 4
3 5
Another alternative is to use the 'i' (index) column:
q) {select from t where i within x + 0 2} 1
Sublist command: convert the left input of the sublist function to a list, i.e. (x;10).
q) {(x;2) sublist t}1
You can't use the select[] form with a variable input like that; instead you can use a functional select, shown in https://code.kx.com/q4m3/9_Queries_q-sql/#912-functional-forms, where you pass the rows you want as the 5th argument.
Hope this helps!
I'm trying to get the frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python pandas. By frequency I mean the highest-occurring values in a table column (the rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 has occurred 9 times in column a, so it has the topmost frequency.
I'm using Spark SQL but it is not working out, maybe because the reduce operation I have written is wrong.
Pandas example:
value_counts().index[1]
Current code in Spark:
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
Expected output:
TableName  | column | min | max | frequency1 |
-----------+--------+-----+-----+------------+
ParquetRDD | a      | 1   | 30  | 9          |
ParquetRDD | b      | 2   | 21  | 5          |
How do I solve this? Please help.
I could solve the issue by using count($field) instead of approx_count_distinct($field). Then I used the rank analytic function to pick the count of the rank-1 (most frequent) value. It worked.
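The exact query wasn't posted, so here is a minimal sketch of that approach for a single column (column a and table peopleRDDtable are taken from the question). It is written as a PySpark spark.sql call purely for illustration; the embedded SQL is the same from Scala, and you would loop over schema.fieldNames as before to build one query per column.

# Sketch only: count() per value plus rank() over the counts, keeping the rank-1 count as frequency1
freq_df = spark.sql("""
    SELECT 'ParquetRDD' AS TableName,
           'a'          AS column,
           MIN(value)   AS min,
           MAX(value)   AS max,
           MAX(CASE WHEN rnk = 1 THEN number_cnt END) AS frequency1
    FROM (
        SELECT value,
               number_cnt,
               RANK() OVER (ORDER BY number_cnt DESC) AS rnk
        FROM (
            SELECT a AS value,
                   COUNT(a) AS number_cnt
            FROM peopleRDDtable
            GROUP BY a
        ) AS counts
    ) AS ranked
""")
freq_df.show()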