I am currently working with an API environment that requires the use of PySpark.
I need to run a daily comparison between two dataframes to determine which records are new, updated, or deleted.
Here is an example of two dataframes:
today = spark.createDataFrame([
[1, "Apple", 5000, "A"],
[2, "Banana", 4000, "A"],
[3, "Orange", 3000, "B"],
[4, "Grape", 4500, "C"],
[5, "Watermelon", 2000, "A"]
], ["item_id", "name", "cost", "classification"])
yesterday = spark.createDataFrame([
[1, "Apple", 5000, "B"],
[2, "Bananas", 4000, "A"],
[3, "Tangerine", 3000, "B"],
[4, "Grape", 4500, "C"]
], ["item_id", "name", "cost", "classification"])
I want to compare both dataframes and determine what is new and what was updated. Getting the new items is quite easy:
today.join(yesterday, on="item_id", how="left_anti").show()
# +---------+------------+------+----------------+
# | item_id | name | cost | classification |
# +---------+------------+------+----------------+
# | 5 | Watermelon | 2000 | A |
# +---------+------------+------+----------------+
However, for the items that were updated, I have no idea how to do the comparison. I need to get all the rows that have different values in any of the remaining columns of the dataframe.
My expected result, in the case above, is:
# +---------+------------+------+----------------+
# | item_id | name | cost | classification |
# +---------+------------+------+----------------+
# | 1 | Apple | 5000 | A |
# | 2 | Banana | 4000 | A |
# | 3 | Orange | 3000 | B |
# +---------+------------+------+----------------+
Use the .subtract() method to get today's rows that are not present in yesterday's dataframe, then left-semi join with yesterday:
today.subtract(yesterday).join(yesterday, on="item_id", how="left_semi").show()
# +-------+------+----+--------------+
# |item_id| name|cost|classification|
# +-------+------+----+--------------+
# | 1| Apple|5000| A|
# | 3|Orange|3000| B|
# | 2|Banana|4000| A|
# +-------+------+----+--------------+
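The question also mentions deleted records. By the same left-anti pattern, those are the rows present in yesterday but missing from today; a small sketch using the dataframes above:

deleted = yesterday.join(today, on="item_id", how="left_anti")
deleted.show()
# Empty for this sample data, since every item_id from yesterday still exists today.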
Another approach is to join the two dataframes on item_id and keep the rows whose values differ. Alias each side so that the duplicated column names can be referenced:

from pyspark.sql import functions as F

joined = today.alias("t").join(yesterday.alias("y"), on="item_id", how="inner")

Then apply a filter for the rows whose classification does not match:

filtered = joined.filter(F.col("t.classification") != F.col("y.classification"))

The filtered dataframe is what you need. Note that this checks only the classification column; the sketch below extends the comparison to every non-key column.
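If a row should count as updated when any non-key column changed, not only classification, here is a hedged sketch that builds the filter over every column except item_id (the names compare_cols, changed and updated are just illustrative):

from functools import reduce
from pyspark.sql import functions as F

# Compare every column except the key
compare_cols = [c for c in today.columns if c != "item_id"]

joined = today.alias("t").join(yesterday.alias("y"), on="item_id", how="inner")

# True when at least one non-key column differs between today and yesterday
changed = reduce(
    lambda acc, cond: acc | cond,
    [F.col(f"t.{c}") != F.col(f"y.{c}") for c in compare_cols],
)

updated = joined.filter(changed).select("item_id", *[F.col(f"t.{c}") for c in compare_cols])
updated.show()
# With the sample data this returns item_ids 1, 2 and 3 with today's values,
# matching the expected result above. If the columns can contain nulls, use
# ~F.col(...).eqNullSafe(...) instead of != for the comparison.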
Related
I need the number of respondents who selected each answer choice, for every question in a multiple-choice test. I have data in the following format with, say, 3 people and 100 questions:
+--------+--------+-----+------------+------------+-----+-------------+--------------+
| misc_1 | misc_2 | ... | Answer_A_1 | Answer_A_2 | ... | Answer_D_99 | Answer_D_100 |
+--------+--------+-----+------------+------------+-----+-------------+--------------+
| James  | 2345   | ... | 0          | 1          | ... | 0           | 1            |
| Anna   | 5434   | ... | 1          | 0          | ... | 0           | 1            |
| Robert | 7890   | ... | 0          | 1          | ... | 1           | 0            |
+--------+--------+-----+------------+------------+-----+-------------+--------------+
And I would like to get the sums of each answer choice selected in a dataframe to this effect:
+---+---+---+---+----------+
| A | B | C | D | Question |
+---+---+---+---+----------+
| 1 | 0 | 1 | 1 | 1        |
| 2 | 1 | 0 | 1 | 2        |
| 0 | 3 | 0 | 0 | 3        |
| : | : | : | : | :        |
| 1 | 0 | 0 | 2 | 100      |
+---+---+---+---+----------+
I tried the following:
from pyspark.sql import SparkSession, functions as F
def getSums(df):
    choices = ['A', 'B', 'C', 'D']
    arg = {}
    answers = [column for column in df.columns if column.startswith("Ans")]
    for a in answers:
        arg[a] = 'sum'
    sums = df.agg(arg).withColumn('idx', F.lit(None))
    sums = sums.select(*(F.col(i).alias(i.replace("(",'_').replace(')','')) for i in sums.columns))
    s = [f",'{l}'"+f",{column}" for column in sums.columns for l in choices if f"_{l}_" in column]
    unpivotExpr = "stack(4"+''.join(map(str,s))+") as (A,B,C,D)"
    unpivotDF = sums.select('idx', F.expr(unpivotExpr))
    result = unpivotDF
    return result
I renamed the columns because I assumed the parentheses added by .agg() would cause a syntax error.
The error occurs at unpivotDF = sums.select('idx', F.expr(unpivotExpr)). I misunderstood how the stack() function works and assumed it would pivot the listed columns and rename them to whatever was in the parentheses.
I get the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 200 aliases but got A,B,C,D
Any alternative approaches or solutions without pyspark.pandas would be greatly appreciated.
The logic is:
Sum columns
Collect scores as arrays: "A" = ["A1", "A2", "A3"]; repeat for "B", "C" & "D"
Zip as [{"A1", "B1", "C1", "D1"}, {"A2", "B2", "C2", "D2"}, ...]
Explode to separate rows for each question
Split by fields "A", "B", "C", "D"
Note - Change input parameters as per your requirement.
# ACTION: Change input parameters
total_ques = 3
que_cats = ["A", "B", "C", "D"]
import pyspark.sql.functions as F
# Sum columns
result_df = df.select([F.sum(x).alias(x) for x in df.columns if x not in ["misc.1", "misc.2"]])
# Collect scores as array: "A" = ["A1", "A2", "A3"]. Repeat to "B", "C" & "D".
for c in que_cats:
    col_list = [x for x in result_df.columns if f"Answer_{c}_" in x]
    result_df = result_df.withColumn(c, F.array(col_list))
result_df = result_df.select(que_cats)
result_df = result_df.withColumn("Question", F.array([F.lit(i) for i in range(1,total_ques+1)]))
# Zip as [{"A1", "B1", "C1", "D1"}, {"A2", "B2", "C2", "D2"}, ...]
final_cols = result_df.columns
result_df = result_df.select(F.arrays_zip(*final_cols).alias("zipped"))
# Explode to separate rows for each question
result_df = result_df.select(F.explode("zipped").alias("zipped"))
# Split by fields "A", "B", "C", "D"
for c in final_cols:
    result_df = result_df.withColumn(c, result_df.zipped.getField(c))
result_df = result_df.select(final_cols)
Output:
+---+---+---+---+--------+
|A |B |C |D |Question|
+---+---+---+---+--------+
|0 |3 |0 |3 |1 |
|3 |0 |3 |0 |2 |
|3 |0 |3 |3 |3 |
+---+---+---+---+--------+
Sample dataset used:
df = spark.createDataFrame(data=[
["James",0,1,1,1,0,0,0,1,1,1,0,1],
["Anna",0,1,1,1,0,0,0,1,1,1,0,1],
["Robert",0,1,1,1,0,0,0,1,1,1,0,1],
], schema=["misc.1","Answer_A_1","Answer_A_2","Answer_A_3","Answer_B_1","Answer_B_2","Answer_B_3","Answer_C_1","Answer_C_2","Answer_C_3","Answer_D_1","Answer_D_2","Answer_D_3"])
I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to remove all duplicated lines per key (id1, id2, id3). Note that this is not the same as distinct(): I do not want to keep a single distinct line, I want to remove both lines. The expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
Here I should remove lines 1 and 2 because there are two different values for that key group.
I tried to achieve this using:
ds.groupBy(id1,id2,id3).distinct()
But it's not working.
You can use a window function and filter on the count, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
(1, 1, 2, "tom"),
(1, 1, 2, "tim"),
(1, 3, 2, "tom"),
(2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")
val window = Window.partitionBy("id1", "id2", "id3")
df.withColumn("count", count("value").over(window))
.filter($"count" < 2)
.drop("count")
.show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
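If you are working in PySpark rather than Scala, here is a sketch of the same window-and-count idea. Since a distinct count is not supported directly over a window, it uses collect_set, and the final dropDuplicates is my own assumption for collapsing exact duplicates of a single value (like the two (1, 3, 2, tom) rows in the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id1", "id2", "id3")

result = (
    ds.withColumn("n_values", F.size(F.collect_set("value").over(w)))
      .filter(F.col("n_values") == 1)          # keep keys with a single distinct value
      .drop("n_values")
      .dropDuplicates(["id1", "id2", "id3"])   # collapse exact duplicate rows
)
result.show()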
I have data like this:
+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA | VVVV | SSSS | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| BBB | BBBB | TTTT | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| CCC | DDDD | YYYY | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+
I have tried, but I am not getting anywhere.
val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
I want the output in the form of the table below:
+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1 | 3 | 0.5 |
| Col2 | 4 | 0.4 |
| Col3 | 5 | 0.6 |
+-----------+---------+---------+
Here's a general approach for transposing a DataFrame:
For each of the pivot columns (say c1, c2, c3), combine the column name and associated value columns into a struct (e.g. struct(lit(c1), c1_cnt, c1_wts))
Put all these struct-typed columns into an array which is then explode-ed into rows of struct columns
Group by the pivot column name to aggregate the associated struct elements
The following sample code has been generalized to handle an arbitrary list of columns to be transposed:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")
val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")
val arrStructs = pivotCols.map{ c => struct(
Seq(lit(c).as("_pvt")) ++
valueColSfx.map((c, _)).map{ case (p, s) => col(p + s).as(s) }: _*
).as(c + "_struct")
}
val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))
df.
select(array(arrStructs: _*).as("arr_structs")).
withColumn("struct_col", explode($"arr_structs")).
groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// | c1| 3| 0.5|
// | c3| 5| 0.6|
// | c2| 4| 0.4|
// +----+----------+----------+
Note that function first is used in the above example, but it could be any other aggregate function (e.g. avg, max, collect_list) depending on the specific business requirement.
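For reference, here is a PySpark sketch of the same struct / array / explode / groupBy approach, written against the column names from the question (Col1, Col1_cnt, Col1_wts, ...); the names pivot_cols, suffixes and structs are just illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
    ("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
    ("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6),
], ["Col1", "Col2", "Col3", "Col1_cnt", "Col2_cnt", "Col3_cnt",
    "Col1_wts", "Col2_wts", "Col3_wts"])

pivot_cols = ["Col1", "Col2", "Col3"]
suffixes = ["_cnt", "_wts"]

# One struct per pivot column: (column name, its _cnt value, its _wts value)
structs = [
    F.struct(
        F.lit(c).alias("Cols_name"),
        *[F.col(c + s).alias("Col" + s) for s in suffixes]
    ).alias(c)
    for c in pivot_cols
]

# Explode the array of structs into one row per pivot column, then aggregate
result = (
    df.select(F.explode(F.array(*structs)).alias("s"))
      .select("s.Cols_name", "s.Col_cnt", "s.Col_wts")
      .groupBy("Cols_name")
      .agg(F.first("Col_cnt").alias("Col_cnt"),
           F.first("Col_wts").alias("Col_wts"))
)
result.show()
# One row per original column: (Col1, 3, 0.5), (Col2, 4, 0.4), (Col3, 5, 0.6)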
Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a Scala function on one or more columns to produce another column
2) use a Scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
You can use withColumn to add a new column or to replace an existing column, as follows:
import org.apache.spark.sql.functions.lower
import spark.implicits._

val df = Seq(
(1, "Mary", "ABCD"),
(2, "Joey", "DOGE"),
(3, "Lane", "POOP"),
(4, "Jack", "MEGA"),
(5, "Lynn", "ARGH")
).toDF("id", "name", "data")
val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
.withColumn("data", lower($"data"))
If you want separate dataframes, then:
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if the same column name is provided, and creates a new column if a new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+
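If you are doing the same thing from PySpark, the equivalent is one withColumn call per transformation; a sketch, assuming a DataFrame named df with the same columns:

from pyspark.sql import functions as F

result = (
    df.withColumn("startsWithA", F.col("data").startswith("A"))  # add a new boolean column
      .withColumn("data", F.lower("data"))                       # replace the existing column
)
result.show()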
Given an array of objects, we can turn it into a recordset very easily with jsonb_to_recordset:
select * from jsonb_to_recordset($$[
{"name": "name01", "age": 12},
{"name": "name02", "age": 14},
{"name": "name03", "age": 16},
{"name": "name04", "age": 18}
]$$) as (name text, age int)
name |age |
-------|----|
name01 |12 |
name02 |14 |
name03 |16 |
name04 |18 |
But what can we do if the source data is an array of arrays instead? How can the query below be transformed to yield a result similar to the one above?
select array['name', 'age'] "labels"
, x.value "values"
from jsonb_array_elements($$[
["name01", 12],
["name02", 14],
["name03", 16],
["name04", 18]
]$$) x
labels |values |
-----------|---------------|
{name,age} |["name01", 12] |
{name,age} |["name02", 14] |
{name,age} |["name03", 16] |
{name,age} |["name04", 18] |
You could use ->>:
select x.value->>0 AS name,
x.value->>1 AS age
from jsonb_array_elements($$[
["name01", 12],
["name02", 14],
["name03", 16],
["name04", 18]
]$$) x;
Output:
+--------+-----+
| name | age |
+--------+-----+
| name01 | 12 |
| name02 | 14 |
| name03 | 16 |
| name04 | 18 |
+--------+-----+