I need, for each choice in a multiple-choice test, the number of respondents who selected that choice. I have data in the following format with, say, 3 people and 100 questions:
+--------+--------+-----+------------+------------+-----+-------------+--------------+
| misc_1 | misc_2 | ... | Answer_A_1 | Answer_A_2 | ... | Answer_D_99 | Answer_D_100 |
+--------+--------+-----+------------+------------+-----+-------------+--------------+
| James  | 2345   | ... | 0          | 1          | ... | 0           | 1            |
| Anna   | 5434   | ... | 1          | 0          | ... | 0           | 1            |
| Robert | 7890   | ... | 0          | 1          | ... | 1           | 0            |
+--------+--------+-----+------------+------------+-----+-------------+--------------+
And I would like to get the sum for each answer choice in a dataframe to this effect:
+---+---+---+---+----------+
| A | B | C | D | Question |
+---+---+---+---+----------+
| 1 | 0 | 1 | 1 | 1        |
| 2 | 1 | 0 | 1 | 2        |
| 0 | 3 | 0 | 0 | 3        |
| : | : | : | : | :        |
| 1 | 0 | 0 | 2 | 100      |
+---+---+---+---+----------+
I tried the following:
from pyspark.sql import SparkSession, functions as F
def getSums(df):
    choices = ['A', 'B', 'C', 'D']
    arg = {}
    answers = [column for column in df.columns if column.startswith("Ans")]
    for a in answers:
        arg[a] = 'sum'
    sums = df.agg(arg).withColumn('idx', F.lit(None))
    sums = sums.select(*(F.col(i).alias(i.replace("(", '_').replace(')', '')) for i in sums.columns))
    s = [f",'{l}'" + f",{column}" for column in sums.columns for l in choices if f"_{l}_" in column]
    unpivotExpr = "stack(4" + ''.join(map(str, s)) + ") as (A,B,C,D)"
    unpivotDF = sums.select('idx', F.expr(unpivotExpr))
    result = unpivotDF
    return result
I renamed the columns because I assumed the parentheses added by .agg() would cause a syntax error.
The error occurs at unpivotDF = sums.select('idx', F.expr(unpivotExpr)). I misunderstood how the stack() function works and assumed it would unpivot the listed columns and rename them to whatever was in the parentheses.
I get the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 200 aliases but got A,B,C,D
Any alternative approaches or solutions without pyspark.pandas would be greatly appreciated.
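For what it's worth, the error comes from how stack() actually works: stack(n, expr1, ..., exprk) separates the k expressions into n rows of k/n columns each; it does not pivot and rename the listed columns. With 4 choices and 100 questions the generated expression passes roughly 800 label/value items with n = 4, so Spark expects 200 output aliases rather than the 4 supplied. A minimal illustration of the semantics (a sketch with made-up values, not the question's data):

# stack(2, ...) with 4 expressions produces 2 rows of 4/2 = 2 columns each
spark.range(1).selectExpr(
    "stack(2, 'A', 10, 'B', 20) as (choice, score)"
).show()
# +------+-----+
# |choice|score|
# +------+-----+
# |     A|   10|
# |     B|   20|
# +------+-----+

The answer below avoids stack() entirely.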
The logic is:
Sum columns
Collect scores as array: "A" = ["A1", "A2", "A3"]. Repeat for "B", "C" & "D"
Zip as [{"A1", "B1", "C1", "D1"}, {"A2", "B2", "C2", "D2"}, ...]
Explode to separate rows for each question
Split by fields "A", "B", "C", "D"
Note - Change input parameters as per your requirement (a sketch of the adjustment for the original 100-question data follows the sample dataset below).
# ACTION: Change input parameters
total_ques = 3
que_cats = ["A", "B", "C", "D"]
import pyspark.sql.functions as F
# Sum columns
result_df = df.select([F.sum(x).alias(x) for x in df.columns if x not in ["misc.1", "misc.2"]])
# Collect scores as array: "A" = ["A1", "A2", "A3"]. Repeat for "B", "C" & "D".
for c in que_cats:
    col_list = [x for x in result_df.columns if f"Answer_{c}_" in x]
    result_df = result_df.withColumn(c, F.array(col_list))
result_df = result_df.select(que_cats)
result_df = result_df.withColumn("Question", F.array([F.lit(i) for i in range(1, total_ques + 1)]))
# Zip as [{"A1", "B1", "C1", "D1"}, {"A2", "B2", "C2", "D2"}, ...]
final_cols = result_df.columns
result_df = result_df.select(F.arrays_zip(*final_cols).alias("zipped"))
# Explode to separate rows for each question
result_df = result_df.select(F.explode("zipped").alias("zipped"))
# Split by fields "A", "B", "C", "D"
for c in final_cols:
    result_df = result_df.withColumn(c, result_df.zipped.getField(c))
result_df = result_df.select(final_cols)
Output:
+---+---+---+---+--------+
|A |B |C |D |Question|
+---+---+---+---+--------+
|0 |3 |0 |3 |1 |
|3 |0 |3 |0 |2 |
|3 |0 |3 |3 |3 |
+---+---+---+---+--------+
Sample dataset used:
df = spark.createDataFrame(data=[
    ["James", 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1],
    ["Anna", 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1],
    ["Robert", 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1],
], schema=["misc.1", "Answer_A_1", "Answer_A_2", "Answer_A_3", "Answer_B_1", "Answer_B_2", "Answer_B_3", "Answer_C_1", "Answer_C_2", "Answer_C_3", "Answer_D_1", "Answer_D_2", "Answer_D_3"])
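For the original 100-question layout, presumably only the input parameters and the misc-column filter need to change (a hypothetical adjustment, assuming the real columns are named misc_1 and misc_2 as in the question); the remaining steps stay the same:

# ACTION: Change input parameters for the full dataset (hypothetical column layout from the question)
total_ques = 100
que_cats = ["A", "B", "C", "D"]

# Sum only the answer columns; identifier columns such as misc_1 and misc_2 are skipped by prefix
result_df = df.select([F.sum(x).alias(x) for x in df.columns if x.startswith("Answer_")])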
Related
I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to remove all duplicated lines per key (id1, id2, id3). Note: this is not the same as distinct(); I do not want to keep one line per group, I want to remove both lines. The expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
Here I should remove lines 1 and 2 because there are two different values for that key group.
I tried to achieve this using:
ds.groupBy(id1,id2,id3).distinct()
But it's not working.
You can use a window function with a filter on the count, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
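Note that the question's sample actually contains the row (1, 3, 2, tom) twice; with a plain count over the window, that key group would be dropped as well. If exact duplicates should still collapse into a single row, one possible variant (a sketch reusing the same window, not taken from the original answer) counts distinct values per key and then drops the exact duplicates:

df.withColumn("distinct_cnt", size(collect_set("value").over(window)))
  .filter($"distinct_cnt" === 1) // keep key groups with a single distinct value
  .drop("distinct_cnt")
  .dropDuplicates()              // collapse exact duplicates such as (1, 3, 2, tom)
  .show(false)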
I have a dataframe, df2 such as:
ID | data
--------
1 | New
3 | New
5 | New
and a main dataframe, df1:
ID | data | more
----------------
1 | OLD | a
2 | OLD | b
3 | OLD | c
4 | OLD | d
5 | OLD | e
I want to achieve something of the sort:
ID | data | more
----------------
1 | NEW | a
2 | OLD | b
3 | NEW | c
4 | OLD | d
5 | NEW | e
I want to update df1 based on df2, keeping the original values of df1 when they don't exist in df2.
Is there a faster way to do this than using isin? isin is very slow when df1 and df2 are both very large.
With a left join and coalesce:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq(
  (1, "OLD", "a"),
  (2, "OLD", "b"),
  (3, "OLD", "c"),
  (4, "OLD", "d"),
  (5, "OLD", "e")).toDF("ID", "data", "more")

val df2 = Seq(
  (1, "New"),
  (3, "New"),
  (5, "New")).toDF("ID", "data")

// action
val result = df1.alias("df1")
  .join(df2.alias("df2"), $"df2.ID" === $"df1.ID", "left")
  .select($"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")

result.show(false)
Output:
+---+----+----+
|ID |data|more|
+---+----+----+
|1 |New |a |
|2 |OLD |b |
|3 |New |c |
|4 |OLD |d |
|5 |New |e |
+---+----+----+
I have data like this:
+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA | VVVV | SSSS | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| BBB | BBBB | TTTT | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| CCC | DDDD | YYYY | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+
I have tried, but I have not been able to find any help for this.
val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
I want the output in the form of the table below:
+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1 | 3 | 0.5 |
| Col2 | 4 | 0.4 |
| Col3 | 5 | 0.6 |
+-----------+---------+---------+
Here's a general approach for transposing a DataFrame:
For each of the pivot columns (say c1, c2, c3), combine the column name and associated value columns into a struct (e.g. struct(lit(c1), c1_cnt, c1_wts))
Put all these struct-typed columns into an array, which is then exploded into rows of struct columns
Group by the pivot column name to aggregate the associated struct elements
The following sample code has been generalized to handle an arbitrary list of columns to be transposed:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
  ("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
  ("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")

val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")

val arrStructs = pivotCols.map{ c => struct(
    Seq(lit(c).as("_pvt")) ++
    valueColSfx.map((c, _)).map{ case (p, s) => col(p + s).as(s) }: _*
  ).as(c + "_struct")
}

val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))

df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
  show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// | c1| 3| 0.5|
// | c3| 5| 0.6|
// | c2| 4| 0.4|
// +----+----------+----------+
Note that the function first is used in the above example, but it could be any other aggregate function (e.g. avg, max, collect_list), depending on the specific business requirement; see the sketch below.
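For instance, a minimal sketch of the same aggregation with collect_list swapped in for first (reusing df, arrStructs and valueColSfx defined above):

val valueColAggList = valueColSfx.map(s => collect_list($"struct_col.$s").as(s + "_list"))

df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAggList.head, valueColAggList.tail: _*).
  show
// each output row now holds all three values per pivot column, e.g. _cnt_list = [3, 3, 3] for c1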
I have this example dataframe:
id | A | B | C | D
1 |NULL | 1 | 1 |NULL
2 | 1 | 1 | 1 | 1
3 | 1 |NULL |NULL |NULL
and I want to change to this format:
id | newColumn
1 | {"B", "C"}
2 | {"A","B","C","D"}
3 | {"A"}
In other words, I want to make a new column with a list containing the column names where the row values are not null.
How can I do this in Spark using Scala?
First, get the column names where there is an actual value and not null. This can be done with a function such as:
val notNullColNames = Seq("A", "B", "C", "D").map(c => when(col(c).isNotNull, c))
To create an array of values, array would normally be used; however, it will still produce null elements when the input is null. Instead, one solution is to use concat_ws and split to remove any null values:
df.select($"id", split(concat_ws(",", notNullColNames:_*), ",").as("newColumn"))
For the example input, this will output:
+---+------------+
| id| newColumn|
+---+------------+
| 1| [B, C]|
| 2|[A, B, C, D]|
| 3| [A]|
+---+------------+
I have two Spark dataframes, dfA and dfB.
I want to filter dfA by each row of dfB, which means that if dfB has 10000 rows, I need to filter dfA 10000 times with 10000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is:
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My naive solution is:
def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get range array of id
  // initialize a resultDF to store each query's results
  val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
  val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
  val s = "value1 between " + min_value1 + " and " + max_value1
  var resultDF = dfa.filter(s).groupBy("id")
    .agg(collect_list("value1").as("results"),
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))
  for (i <- 1 to rangArray.length - 1) {
    val temp_min_value1 = rangArray(i).get(0).asInstanceOf[Int]
    val temp_max_value1 = rangArray(i).get(1).asInstanceOf[Int]
    val query = "value1 between " + temp_min_value1 + " and " + temp_max_value1
    val tempResultDF = dfa.filter(query).groupBy("id")
      .agg(collect_list("value1").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  return resultDF
}

def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA)
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()
  // initial result
  var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int], dfA, dfB)
  // traverse all ids
  for (i <- 1 to idArrays.length - 1) {
    val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int], dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  return resultDF
}
Maybe you don't want to read my brute-force code; its idea is:
finalResult = null
for each id in dfB:
    for query condition of this id:
        tempResult = query dfA
        union tempResult to finalResult
I've tried my algorithm; it took almost 50 hours.
Does anybody have a more efficient way? Many thanks.
Assuming that your dfB is a small dataset, I am suggesting the solution below.
Try using a broadcast join, like this:
import org.apache.spark.sql.functions.{broadcast, col, collect_list}

dfA.alias("dfA").join(broadcast(dfB.alias("dfB")),
    col("dfA.id") === col("dfB.id") &&
    col("dfA.value1") >= col("dfB.min_value1") &&
    col("dfA.value1") <= col("dfB.max_value1"))
  .groupBy(col("dfB.id"), col("dfB.min_value1"), col("dfB.max_value1")) // one result row per dfB range
  .agg(collect_list(col("dfA.value2")).as("results"))
A broadcast join is like a map-side join. It materializes the smaller dataset on all the executors, which improves performance by omitting the sort-and-shuffle phase that a reduce-side join would require.
Some points I would like you to keep in mind:
Never use collect() on large data. When a collect operation is issued on an RDD or DataFrame, the whole dataset is copied to the driver.
If your data is too big you might get an out-of-memory exception.
Try using take() or takeSample() instead.
When two dataframes/datasets are involved in a calculation, a join has to be performed, so a join is a necessary step for you. The important question is when to join.
I would suggest aggregating and reducing the rows in the dataframes as much as possible before joining, as that reduces shuffling.
In your case you only need to reduce dfA, since you need dfB exactly as it is, with one column added from dfA that meets the condition.
So you can group dfA by id and aggregate it so that you get one row per id, then perform the join, and then use a udf function for your calculation logic.
Comments are provided in the code for clarity and explanation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// udf function to keep only the collected value2 whose value1 lies within the range [min_value1, max_value1]
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))

dfA.groupBy("id") // grouping by id
  .agg(collect_list(struct("value1", "value2")).as("collection")) // collecting all the value1 and value2 as structs
  .join(dfB, Seq("id"), "right") // joining both dataframes on id
  .select(col("id"), col("min_value1"), col("max_value1"),
    selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) // calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful