How to create a column with the maximum number in each row of another column in PySpark? - pyspark

I have a PySpark dataframe where each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set (469 for this row). I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?

If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
|  a|               d|maxd|
+---+----------------+----+
|  1|  [1, 772, 3, 4]| 772|
|  2|[5, 6, 44, 8, 9]|  44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning the string into an array and then operating on that array as described above. You can use the regexp_replace function to quickly remove the braces, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (df
       .withColumn("as_str", F.regexp_replace(F.col("d"), r"[{}]", ""))          # strip the curly braces
       .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))  # split into an array of longs
       .withColumn("maxd", F.array_max(F.col("as_arr")))
       .drop("as_str"))
df2.show()
+---+---------+------------+----+
|  a|        d|      as_arr|maxd|
+---+---------+------------+----+
|  1|{1,2,3,4}|[1, 2, 3, 4]|   4|
|  2|{5,6,7,8}|[5, 6, 7, 8]|   8|
+---+---------+------------+----+
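If TAGID_LIST can also be missing (the original attempt guards against a 'null' value), the same pipeline can be wrapped in a condition. A minimal sketch, assuming the column is the brace-wrapped string described above and that missing rows are either real nulls or the literal string 'null':
import pyspark.sql.functions as F
wechat_userinfo = (wechat_userinfo
    .withColumn("as_arr",
                F.split(F.regexp_replace(F.col("TAGID_LIST"), r"[{}]", ""), ",")
                 .cast("array<long>"))
    .withColumn("TAG",
                F.when(F.col("TAGID_LIST").isNotNull() & (F.col("TAGID_LIST") != "null"),
                       F.array_max(F.col("as_arr"))))   # null otherwise, equivalent to .otherwise(None)
    .drop("as_arr"))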

Related

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id  Value  Weights
1   2      4
1   5      2
2   1      4
2   6      2
2   9      4
3   2      4
I need to group by 'id' and aggregate to get the weighted mean, median, and quartiles of Value per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the value is repeated as many times as specified in the 'Weights' column (the cast to int is necessary because array_repeat expects an integer count). After this step, the first Value of 2 (with Weight 4) becomes [2, 2, 2, 2].
Then, explode creates a row for every element in the array, so the array [2, 2, 2, 2] turns into 4 rows, each containing the integer 2.
You can then calculate the statistics as usual; the weights are already accounted for, because the dataframe has been expanded according to them.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 4),
     (1, 5, 2),
     (2, 1, 4),
     (2, 6, 2),
     (2, 9, 4),
     (3, 2, 4)],
    ['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
      .groupBy('id')
      .agg(F.mean('col').alias('weighted_mean'),
           F.expr('percentile(col, 0.5)').alias('weighted_median'),
           F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
           F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#|  1|          3.0|            2.0|                    2.0|                   4.25|
#|  2|          5.2|            6.0|                    1.0|                    9.0|
#|  3|          2.0|            2.0|                    2.0|                    2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
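As an optional refinement (not in the original answer), Spark's percentile aggregate also accepts an array of percentages, so the median and both quartiles can come from a single expression. A sketch, where exploded stands for the dataframe produced by the F.explode(...) step above (its value column is named 'col' by default):
import pyspark.sql.functions as F
# 'exploded' is assumed to be the dataframe right after the explode step
quartiles = (exploded
    .groupBy('id')
    .agg(F.mean('col').alias('weighted_mean'),
         F.expr('percentile(col, array(0.25, 0.5, 0.75))').alias('weighted_quartiles')))
# weighted_quartiles is an array: [lower quartile, median, upper quartile]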

Dynamically selecting columns with parameter using Scala - Spark

I need to dynamically select columns from one of the two tables I am joining. The name of one of the columns to be selected is passed to a variable. Here are the details.
The table names are passed to variables. So is the join_id and the join_type.
//Creating scala variables for each table
var table_name_a = dbutils.widgets.get("table_name_a")
var table_name_b = dbutils.widgets.get("table_name_b")
//Create scala variable for Join Id
var join_id = dbutils.widgets.get("table_name_b") + "Id"
// Define join type
var join_type = dbutils.widgets.get("join_type")
Then I join the tables. I want to select all columns from table A and only two columns from table B: one is always called "Description", regardless of which table B is passed in; the other has the same name as table B itself, e.g., if table B is Employee, I want to select a column named "Employee" from table B. The code below selects all columns from table A and the Description column from table B (aliased), but I still need to select the column from table B that shares the table's name. I don't know in advance how many columns table B has, nor their order or names, since table B is passed as a parameter.
// Joining Tables
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"))
My question is: How do I pass the variable table_name_b as a column I am trying to select from table B?
I tried the code below, which is obviously wrong, because in $"df_b.table_name_b" the table_name_b is treated as a literal column name rather than the content of the variable.
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"),$"df_b.table_name_b")
Then I tried the code below and it gives the error: "value table_name_b is not a member of org.apache.spark.sql.DataFrame"
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"),df_b.table_name_b)
How do I pass the variable table_name_b as a column I need to select from table B?
You can build a List[org.apache.spark.sql.Column] and pass it to your select, as in the example below:
// sample input:
val df = Seq(
  ("A", 1, 6, 7),
  ("B", 2, 7, 6),
  ("C", 3, 8, 5),
  ("D", 4, 9, 4),
  ("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
df.printSchema()
val columnNames = List("col1", "col2")        // string column names from your params
val columnsToSelect = columnNames.map(col(_)) // convert the required column names from string to column type
df.select(columnsToSelect: _*).show()         // using the list of columns
// output:
+----+----+
|col1|col2|
+----+----+
|   1|   6|
|   2|   7|
|   3|   8|
|   4|   9|
|   5|   8|
+----+----+
The same approach can be applied to joins.
Update
Here is another example, using dataframe aliases so that columns can be referenced by table name:
val aliasTableA = "tableA"
val aliasTableB = "tableB"
val joinField = "name"
val df1 = Seq(
  ("A", 1, 6, 7),
  ("B", 2, 7, 6),
  ("C", 3, 8, 5),
  ("D", 4, 9, 4),
  ("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
val df2 = Seq(
  ("A", 11, 61, 71),
  ("B", 21, 71, 61),
  ("C", 31, 81, 51)
).toDF("name", "col_1", "col_2", "col_3")
df1.alias(aliasTableA)
  .join(df2.alias(aliasTableB), Seq(joinField))
  .selectExpr(s"${aliasTableA}.*", s"${aliasTableB}.col_1", s"${aliasTableB}.col_2")
  .show()
// output:
+----+----+----+----+-----+-----+
|name|col1|col2|col3|col_1|col_2|
+----+----+----+----+-----+-----+
|   A|   1|   6|   7|   11|   61|
|   B|   2|   7|   6|   21|   71|
|   C|   3|   8|   5|   31|   81|
+----+----+----+----+-----+-----+
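For PySpark readers of this page, the same pattern (turning string column names into Column objects and unpacking the list into select) is sketched below; the dataframe is a hypothetical stand-in mirroring the Scala sample input:
import pyspark.sql.functions as F
# hypothetical data with the same shape as the Scala example above
df = spark.createDataFrame(
    [("A", 1, 6, 7), ("B", 2, 7, 6), ("C", 3, 8, 5)],
    ["name", "col1", "col2", "col3"])
column_names = ["col1", "col2"]                      # column names arriving as strings/parameters
df.select(*[F.col(c) for c in column_names]).show()  # unpack the list of Columns into select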

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
That is, group by column c1 and sum the arrays in the Value column element-wise.
Please help; I couldn't find any way of doing this on Google.
It is not very complicated: as you describe, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
val df = spark.range(6)
  .select('id % 3 as "c1",
          array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")
df.show()
+---+---------------+
| c1|          Value|
+---+---------------+
|  0|[7, 4, 7, 4, 0]|
|  1|[3, 3, 2, 8, 5]|
|  2|[2, 1, 0, 4, 4]|
|  0|[0, 4, 2, 1, 8]|
|  1|[1, 5, 7, 4, 3]|
|  2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5                                              // if you know the size of the array
val n = df.select(size('Value)).first.getAs[Int](0)    // if you do not (keep only one of these two lines)
df
  .groupBy("c1")
  .agg(array((0 until n).map(i => sum(col("Value").getItem(i))) : _*) as "Value")
  .show()
.show()
+---+------------------+
| c1|             Value|
+---+------------------+
|  0|[11, 18, 15, 8, 9]|
|  1|  [2, 10, 5, 7, 4]|
|  2|[7, 14, 15, 10, 4]|
+---+------------------+
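Since this page is otherwise about PySpark, the same index-by-index aggregation can be sketched in PySpark as well; the data below reuses a few rows from the question, and n is the (known) array length:
import pyspark.sql.functions as F
# a few rows from the question's data, as a stand-in
df = spark.createDataFrame(
    [("A", [47, 97, 33, 94, 6]), ("A", [59, 98, 24, 83, 3]),
     ("B", [86, 71, 72, 23, 27]), ("B", [74, 69, 72, 93, 7])],
    ["c1", "Value"])
n = 5  # known array length (or compute it, as in the Scala answer)
result = (df
    .groupBy("c1")
    .agg(F.array(*[F.sum(F.col("Value")[i]) for i in range(n)]).alias("Value")))
result.show(truncate=False)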

pyspark Transpose dataframe

I have a dataframe as given below
ID, Code_Num, Code, Code1, Code2, Code3
10, 1, A1005*B1003, A1005, B1003, null
12, 2, A1007*D1008*C1004, A1007, D1008, C1004
I need help on transposing the above dataset, and output should be displayed as below.
ID, Code_Num, Code, Code_T
10, 1, A1005*B1003, A1005
10, 1, A1005*B1003, B1003
12, 2, A1007*D1008*C1004, A1007
12, 2, A1007*D1008*C1004, D1008
12, 2, A1007*D1008*C1004, C1004
Step 1: Creating the DataFrame.
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),(12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values,['ID','Code','Code1','Code2','Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID|             Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Step 2: Explode the DataFrame -
from pyspark.sql.functions import array, col, explode, lit, struct

def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ["ID","Code"]).drop('key').withColumnRenamed("val","Code_T")
df.show()
+---+-----------------+------+
| ID|             Code|Code_T|
+---+-----------------+------+
| 10|      A1005*B1003| A1005|
| 10|      A1005*B1003| B1003|
| 10|      A1005*B1003|  null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
If you only want non-null values in column Code_T, just run the statement below:
df = df.where(col('Code_T').isNotNull())
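As an aside (not part of the original answer), since the Code column already holds the full delimited string, the same Code_T rows can be produced more directly by splitting on the * separator and exploding the result. A minimal sketch, assuming df refers to the original dataframe from Step 1:
from pyspark.sql.functions import col, explode, split
# split() takes a regex, so the '*' delimiter must be escaped; explode() then
# produces one row per code, and missing codes simply never appear
df_alt = df.select("ID", "Code", explode(split(col("Code"), r"\*")).alias("Code_T"))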

Spark Dataframe Arraytype columns

I would like to create a new column on a dataframe, which is the result of applying a function to an arraytype column.
Something like this:
df = df.withColumn("max_$colname", max(col(colname)))
where each row of the column holds an array of values?
The functions in spark.sql.functions appear to operate on a per-column basis only.
You can apply a user-defined function (UDF) to the array column.
1.DataFrame
+------------------+
|               arr|
+------------------+
|   [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
2.Creating UDF
import org.apache.spark.sql.functions._

// Ordinary Scala function that returns the maximum element of the array
def max(arr: TraversableOnce[Int]) = arr.toList.max
// Wrap it as a Spark UDF so it can be applied to an array column
val maxUDF = udf(max(_: Traversable[Int]))
3.Applying UDF in query
df.withColumn("arrMax",maxUDF(df("arr"))).show
4.Result
+------------------+------+
|               arr|arrMax|
+------------------+------+
|   [1, 2, 3, 4, 5]|     5|
|[4, 5, 6, 7, 8, 9]|     9|
+------------------+------+
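If the dataframe is in PySpark (as the question's snippet suggests), recent Spark versions also offer built-in array functions and higher-order functions that avoid a UDF altogether. A minimal sketch, assuming an array column named arr (array_max needs Spark 2.4+, F.aggregate needs 3.1+):
import pyspark.sql.functions as F
df = spark.createDataFrame([([1, 2, 3, 4, 5],), ([4, 5, 6, 7, 8, 9],)], ["arr"])
df.select(
    "arr",
    F.array_max("arr").alias("arrMax"),                                    # built-in max over the array
    F.aggregate("arr", F.lit(0), lambda acc, x: acc + x).alias("arrSum"),  # arbitrary fold over the array
).show(truncate=False)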