How to count the number of values in a column in a dataframe based on the values in the other dataframe - pyspark

I have two dataframes. The first one is a raw dataframe, so its item_value column holds all of the item values. The other dataframe has columns named min, avg, and max, which hold the min, avg, and max values specified for the items in the first dataframe. I want to count the number of item values in the first dataframe based on the aggregate values specified in the second dataframe.
The first dataframe looks like this:
item_name  item_value
A          1.4
A          2.1
B          3.0
A          2.8
B          4.5
B          1.1
The second dataframe looks like this:
item_name  min  avg  max
A          1.1  2    2.7
B          2.1  3    4.0
I want to count the number of item values that are greater than the defined min, avg, and max values in the second dataframe.
So the result I want is:
item_name  min  avg  max
A          3    2    1
B          2    1    1
Any help would be much appreciated
*please forgive my grammar

If you don't mind a SQL implementation, you can try:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
sql = """
select df2.item_name,
sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2 join df1 on df2.item_name=df1.item_name
group by df2.item_name
"""
df = spark.sql(sql)
df.show()
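For reference, the same aggregation can be written with the DataFrame API; this is a minimal sketch assuming the dataframes are named df1 and df2 as above:

from pyspark.sql import functions as F

# Join each raw value to its item's thresholds, then count the values
# that exceed each threshold per item_name.
result = (
    df1.join(df2, "item_name")
       .groupBy("item_name")
       .agg(
           F.sum(F.when(F.col("item_value") > F.col("min"), 1).otherwise(0)).alias("min"),
           F.sum(F.when(F.col("item_value") > F.col("avg"), 1).otherwise(0)).alias("avg"),
           F.sum(F.when(F.col("item_value") > F.col("max"), 1).otherwise(0)).alias("max"),
       )
)
result.show()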

Related

PySpark subtract last row from first row in a group

I want to use a window function to partition by ID and have the last row of each group subtracted from the first row, creating a separate column with the result. What is the cleanest way to achieve that?
ID col1
1 1
1 2
1 4
2 1
2 1
2 6
3 5
3 5
3 7
Desired output:
ID col1 col2
1 1 3
1 2 3
1 4 3
2 1 5
2 1 5
2 6 5
3 5 2
3 5 2
3 7 2
Code below:
from pyspark.sql import Window
from pyspark.sql.functions import first, last
w = Window.partitionBy('ID').orderBy('col1').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('out', last('col1').over(w) - first('col1').over(w)).show()
Sounds like you’re defining the “first” row as the row with the minimum value of col1 in the group, and the “last” row as the row with the maximum value of col1 in the group. To compute them, you can use the MIN and MAX window functions:
SELECT
ID,
col1,
(MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
FROM
...
If you’re defining “first” and “last” row somehow differently (e.g., in terms of some timestamp), you can use the more general FIRST_VALUE and LAST_VALUE window functions:
SELECT
ID,
col1,
(LAST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
-
(FIRST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
AS col2
FROM
...
The two snippets above are equivalent, but the latter is more general: you can specify ordering by a different column and/or you can modify the window specification.
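In PySpark, the simpler min/max variant is a one-liner over an unordered window; a minimal sketch, assuming the dataframe is named df as in the question:

from pyspark.sql import Window
from pyspark.sql import functions as F

# min/max over the whole ID partition need no ordering or frame clause.
w = Window.partitionBy("ID")
df.withColumn("col2", F.max("col1").over(w) - F.min("col1").over(w)).show()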

Overriding values in dataframe while joining 2 dataframes

In the example below, I would like to override the values in Spark Dataframe A with the corresponding values in Dataframe B (if they exist). Is there a way to do it using Spark (Scala)?
Dataframe A
ID Name Age
1 Paul 30
2 Sean 35
3 Rob 25
Dataframe B
ID Name Age
1 Paul 40
Result
ID Name Age
1 Paul 40
2 Sean 35
3 Rob 25
The combined use of a left join and coalesce should do the trick, something like:
import org.apache.spark.sql.functions.coalesce

dfA
  .join(dfB, "ID", "left")
  .select(
    dfA.col("ID"),
    dfA.col("Name"),
    coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
  )
Explanation: For a specific ID some_id, there are two cases:
- If dfB does not contain some_id, the left join produces null for dfB.col("Age"), and coalesce returns the first non-null value of the expressions passed to it, i.e. the value of dfA.col("Age").
- If dfB contains some_id, the value from dfB.col("Age") is used.
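The question asks for Scala, but for anyone landing here from PySpark, the same left join plus coalesce can be sketched as follows (assuming the dataframes are named dfA and dfB):

from pyspark.sql import functions as F

# Left join on ID and prefer dfB's Age when it exists, otherwise keep dfA's.
a, b = dfA.alias("a"), dfB.alias("b")
result = (
    a.join(b, F.col("a.ID") == F.col("b.ID"), "left")
     .select(
         F.col("a.ID").alias("ID"),
         F.col("a.Name").alias("Name"),
         F.coalesce(F.col("b.Age"), F.col("a.Age")).alias("Age"),
     )
)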

What will be the SQL to get the results by applying group by on specific matching string of column value?

What SQL will get the results by applying a GROUP BY on a specific matching string within a column value?
Table: results (this is not a physical table but the result of specific queries)
rname count1 count2
Avg-1 2 2
Avg-1 1 1
Avg-2 2 2
Avg-1 1 1
Zen-3 2 2
Zen/D 2 1
QA/C 3 1
QA/D 2 1
The expected output is:
rname count1 count2
Avg 6 6
Zen 4 3
QA 5 2
In the expected output, count1 is the sum of all count1 values whose rname matches the string 'Avg', 'Zen', or 'QA' respectively; the same goes for count2.
What would the SQL be? Can you please give me some direction?
demo:db<>fiddle
SELECT
(regexp_split_to_array(rname,'[-/]'))[1],
SUM(count1) AS count1,
SUM(count2) AS count2
FROM
mytable
GROUP BY 1
regexp_split_to_array(rname, '[-/]') splits the rname value at the - or / character. Taking the first element ([1]) gives you Avg, Zen, or QA.
Group by this result (using the column index 1).
SUM up the values.
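If you ever need the same grouping on a Spark DataFrame rather than in Postgres, a sketch of the equivalent (assuming the result set is loaded into a dataframe named df):

from pyspark.sql import functions as F

# Split rname at '-' or '/', keep the first piece, and sum the counts per group.
result = (
    df.groupBy(F.split("rname", "[-/]").getItem(0).alias("rname"))
      .agg(F.sum("count1").alias("count1"),
           F.sum("count2").alias("count2"))
)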

Retain only 3 highest positive and negative records in a table

I am new to databases and postgres as such.
I have a table called names which has 2 columns, name and value, and which gets updated every x seconds with new name/value pairs. My requirement is to retain only 3 positive and 3 negative values at any point in time, deleting the rest of the rows during each table update.
I use the following query to delete the old rows and retain the 3 positive and 3 negative values ordered by value.
delete from names
using (select *,
row_number() over (partition by value > 0, value < 0 order by value desc) as rn
from names ) w
where w.rn >=3
I am skeptical about using a condition like value > 0 in a partition clause. Is this approach correct?
For example,
A table like this prior to delete :
name | value
--------------
test | 10
test1 | 11
test1 | 12
test1 | 13
test4 | -1
test4 | -2
My table after delete should look like :
name | value
--------------
test1 | 13
test1 | 12
test1 | 11
test4 | -1
test4 | -2
demo:db<>fiddle
This works generally as expected: value > 0 clusters the values into all numbers > 0 and all numbers <= 0, and the ORDER BY value orders these two groups as expected as well.
So, the only things I would change:
row_number() over (partition by value >= 0 order by value desc)
- Remove , value < 0 (why group the positive values into "negative" and "other"? There are no negative numbers in your positive group and vice versa).
- Change value > 0 to value >= 0 to ignore the 0 for as long as possible.
For deleting, if you want to keep the top 3 values in each direction:
- You should change w.rn >= 3 into w.rn > 3 (so the 3rd element is kept as well).
- You need to connect the subquery with the table's records. In real cases you should use id columns for that; in your example you could take the value column: where n.value = w.value AND w.rn > 3
So, finally:
delete from names n
using (select *,
row_number() over (partition by value >= 0 order by value desc) as rn
from names ) w
where n.value = w.value AND w.rn > 3
If it's not a hard requirement to delete the other rows, you could instead select only the rows you're interested in:
WITH largest AS (
SELECT name, value
FROM names
ORDER BY value DESC
LIMIT 3),
smallest AS (
SELECT name, value
FROM names
ORDER BY value ASC
LIMIT 3)
SELECT * FROM largest
UNION
SELECT * FROM smallest
ORDER BY value DESC
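If the same requirement ever comes up on a Spark DataFrame instead of a Postgres table, the keep-top-3-per-sign logic can be sketched with the same window idea (assuming a dataframe named df with a value column):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within the non-negative and negative groups, then keep the top 3 of each.
w = Window.partitionBy(F.col("value") >= 0).orderBy(F.col("value").desc())
kept = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 3)
      .drop("rn")
)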

Remove duplicates based on only 1 column

My data is in the following format:
rep_id user_id other non-duplicated data
1 1 ...
1 2 ...
2 3 ...
3 4 ...
3 5 ...
I am trying to create a deduped_rep column with 0/1 values such that only the first row for each rep_id across its associated users gets a 1 and the rest get 0.
Expected result:
rep_id user_id deduped_rep
1 1 1
1 2 0
2 3 1
3 4 1
3 5 0
For reference, in Excel, I would use the following formula:
IF(SUMPRODUCT(($A$2:$A2=A2)*($A$2:$A2=A2))>1,0,1)
I know there is the FIXED() LoD calculation http://kb.tableau.com/articles/howto/removing-duplicate-data-with-lod-calculations, but I only see use cases of it deduplicating based on another column. However, mine are distinct.
Define a field first_reg_date_per_rep_id as
{ fixed rep_id : min(registration_date) }
Then define a field is_first_reg_date? as
registration_date = first_reg_date_per_rep_id
You can use that last Boolean field to distinguish the first record for each rep_id from later ones
Try this query:
select
  rep_id,
  user_id,
  case when row_number() over (partition by rep_id order by user_id) = 1
       then 1 else 0 end as deduped_rep
from
  table
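The same flag can be produced in PySpark; a sketch assuming the data sits in a dataframe named df:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Flag the first user_id within each rep_id with 1, all later rows with 0.
w = Window.partitionBy("rep_id").orderBy("user_id")
df.withColumn(
    "deduped_rep",
    F.when(F.row_number().over(w) == 1, 1).otherwise(0),
).show()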