Adding hours to a timestamp in pyspark dynamically

import pyspark.sql.functions as F
from datetime import datetime
data = [
    (1, datetime(2017, 3, 12, 3, 19, 58), 'Raising', 2),
    (2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping', 1),
    (3, datetime(2017, 3, 12, 3, 29, 40), 'walking', 3),
    (4, datetime(2017, 3, 12, 3, 31, 23), 'talking', 5),
    (5, datetime(2017, 3, 12, 4, 19, 47), 'eating', 6),
    (6, datetime(2017, 3, 12, 4, 33, 51), 'working', 7),
]
df = spark.createDataFrame(data, ['id', 'testing_time', 'test_name', 'shift'])
df.show()
+---+-------------------+---------+-----+
| id|       testing_time|test_name|shift|
+---+-------------------+---------+-----+
|  1|2017-03-12 03:19:58|  Raising|    2|
|  2|2017-03-12 03:21:30| sleeping|    1|
|  3|2017-03-12 03:29:40|  walking|    3|
|  4|2017-03-12 03:31:23|  talking|    5|
|  5|2017-03-12 04:19:47|   eating|    6|
|  6|2017-03-12 04:33:51|  working|    7|
+---+-------------------+---------+-----+
Now I want to add shift (hours) to the testing time. Can anybody help me out with a quick solution?

You can use something like the snippet below. You need to convert the shift field to seconds, so multiply it by 3600:
>>> df.withColumn("testing_time", (F.unix_timestamp("testing_time") + F.col("shift")*3600).cast('timestamp')).show()
+---+-------------------+---------+-----+
| id| testing_time|test_name|shift|
+---+-------------------+---------+-----+
| 1|2017-03-12 05:19:58| Raising| 2|
| 2|2017-03-12 04:21:30| sleeping| 1|
| 3|2017-03-12 06:29:40| walking| 3|
| 4|2017-03-12 08:31:23| talking| 5|
| 5|2017-03-12 10:19:47| eating| 6|
| 6|2017-03-12 11:33:51| working| 7|
+---+-------------------+---------+-----+
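If you are on Spark 3.0 or later, an alternative worth trying is interval arithmetic with make_interval instead of the round trip through unix seconds. This is a sketch rather than part of the answer above; it assumes Spark 3.0+, where make_interval is available:
import pyspark.sql.functions as F

# Sketch: add `shift` hours as a calendar interval (assumes Spark 3.0+).
# make_interval's positional arguments are years, months, weeks, days,
# hours, mins, secs, so `shift` goes in the hours slot.
df2 = df.withColumn(
    "testing_time",
    F.expr("testing_time + make_interval(0, 0, 0, 0, shift, 0, 0)")
)
df2.show()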

Related

How to calculate ApproxQuantiles from a list of Integers in a Spark DataFrame column using Scala

I have a Spark DataFrame with a column containing arrays of Integers of varying lengths. I need to create new columns holding the quantiles for each of these.
This is the input DataFrame :
+---------+------------------------+
|Comm |List_Nb_total_operations|
+---------+------------------------+
| comm1| [1, 1, 2, 3, 4]|
| comm4| [2, 2]|
| comm3| [2, 2]|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]|
| comm2| [1, 1, 1, 2, 3]|
+---------+------------------------+
This is the desired result :
+---------+------------------------+----+----+
|Comm |List_Nb_total_operations|QT25|QT75|
+---------+------------------------+----+----+
| comm1| [1, 1, 2, 3, 4]| 1| 3|
| comm4| [2, 2]| 2| 2|
| comm3| [2, 2]| 2| 2|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]| 1| 3|
| comm2| [1, 1, 1, 2, 3]| 1| 2|
+---------+------------------------+----+----+
The function you want to use is percentile_approx (since Spark 3.1):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("comm1", Seq(1, 1, 2, 3, 4)),
  ("comm4", Seq(2, 2)),
  ("comm3", Seq(2, 2)),
  ("comm0", Seq(1, 1, 1, 2, 2, 2, 3, 3)),
  ("comm2", Seq(1, 1, 1, 2, 3))
).toDF("Comm", "ops")

val dfQ = df
  .select(col("Comm"), explode(col("ops")) as "ops")
  .groupBy("Comm")
  .agg(
    percentile_approx($"ops", lit(0.25), lit(100)) as "q25",
    percentile_approx($"ops", lit(0.75), lit(100)) as "q75"
  )

val dfWithQ = df.join(dfQ, Seq("Comm"))
The documentation has more information regarding tuning the parameters for accuracy.
Thank you for your help. I've found another solution that works very well in my case:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
def percentile_approxx(col: Column, percentage: Column, accuracy: Column): Column = {
  val expr = new ApproximatePercentile(
    col.expr, percentage.expr, accuracy.expr
  ).toAggregateExpression
  new Column(expr)
}

val perc_df = df.groupBy("Comm").agg(percentile_approxx(col("ops"), lit(0.75), lit(100)))
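Since the main question on this page is a pyspark one, here is roughly the same explode-and-aggregate idea in PySpark. This is a sketch, assuming Spark 3.1+ (where percentile_approx is exposed in pyspark.sql.functions) and the same Comm/ops column names as above:
import pyspark.sql.functions as F

# Sketch: explode the array, then take approximate quantiles per group.
dfQ = (df
       .select("Comm", F.explode("ops").alias("ops"))
       .groupBy("Comm")
       .agg(
           F.percentile_approx("ops", 0.25, 100).alias("QT25"),
           F.percentile_approx("ops", 0.75, 100).alias("QT75"),
       ))

dfWithQ = df.join(dfQ, ["Comm"])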

Count of occurrences of multiple values in array of string column in Spark < 2.2 and Scala

I need to find the count of occurrences of specific elements present in an array column. We could use the array_contains function, but I am looking for another solution that works below Spark 2.2.
Input:
+----+------------------+
|col1| array_col2|
+----+------------------+
| x| [1, 2, 3, 7, 7]|
| z|[3, 2, 8, 9, 4, 9]|
| a| [4, 5, 2, 8]|
+----+------------------+
result1 -> count of occurrences of 1, 2 in the given array column array_col2
result2 -> count of occurrences of 3, 7, 9 in the given array column array_col2
Expected Output:
+----+------------------+----------+----------+
|col1| array_col2| result1| result2|
+----+------------------+----------+----------+
| x| [1, 2, 3, 7, 7]| 2| 3|
| z|[3, 2, 8, 9, 4, 9]| 1| 3|
| a| [4, 5, 2, 8]| 1| 0|
+----+------------------+----------+----------+
You can use a UDF:
import org.apache.spark.sql.functions._
import spark.implicits._

val count_occ = udf((s: Seq[Int], f: Seq[Int]) => s.filter(f.contains(_)).size)

val df1 = df.withColumn(
  "result1",
  count_occ($"array_col2", array(lit(1), lit(2)))
).withColumn(
  "result2",
  count_occ($"array_col2", array(lit(3), lit(7), lit(9)))
)
df1.show
//+----+------------------+-------+-------+
//|col1| array_col2|result1|result2|
//+----+------------------+-------+-------+
//| x| [1, 2, 3, 7, 7]| 2| 3|
//| z|[3, 2, 8, 9, 4, 9]| 1| 3|
//| a| [4, 5, 2, 8]| 1| 0|
//+----+------------------+-------+-------+
You can also explode the array, then group by and count:
val df1 = df.withColumn(
  "col2",
  explode($"array_col2")
).groupBy("col1", "array_col2").agg(
  count(when($"col2".isin(1, 2), 1)).as("result1"),
  count(when($"col2".isin(3, 7, 9), 1)).as("result2")
)
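For completeness, the explode-and-count variant translates almost line for line to PySpark. A sketch, assuming the col1/array_col2 names from the question:
import pyspark.sql.functions as F

# Sketch: explode the array, then count matching elements per group.
df1 = (df
       .withColumn("col2", F.explode("array_col2"))
       .groupBy("col1", "array_col2")
       .agg(
           F.count(F.when(F.col("col2").isin(1, 2), 1)).alias("result1"),
           F.count(F.when(F.col("col2").isin(3, 7, 9), 1)).alias("result2"),
       ))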

How to find the sum of arrays in a column grouped by another column's values in a Spark dataframe using Scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
This is nothing but grouping by column c1 and summing the values in column Value element-wise (index by index).
Please help; I couldn't find any way of doing this on Google.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(6)
  .select(
    'id % 3 as "c1",
    array((1 to 5).map(_ => floor(rand() * 10)): _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // if you know the size of the array
// val n = df.select(size('Value)).first.getAs[Int](0) // if you do not

df
  .groupBy("c1")
  .agg(array((0 until n).map(i => sum(col("Value").getItem(i))): _*) as "Value")
  .show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
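The same index-by-index aggregation can be written in PySpark too. A sketch, assuming the array length n is known up front (or computed with F.size as in the Scala version):
import pyspark.sql.functions as F

n = 5  # array length; alternatively df.select(F.size("Value")).first()[0]

# Sketch: sum each array position separately, then reassemble the array.
df_sum = df.groupBy("c1").agg(
    F.array(*[F.sum(F.col("Value")[i]) for i in range(n)]).alias("Value")
)
df_sum.show()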

Flatten Group By in Pyspark

I have a pyspark dataframe. For example,
d= hiveContext.createDataFrame([("A", 1), ("B", 2), ("D", 3), ("D", 3), ("A", 4), ("D", 3)],["Col1", "Col2"])
+----+----+
|Col1|Col2|
+----+----+
| A| 1|
| B| 2|
| D| 3|
| D| 3|
| A| 4|
| D| 3|
+----+----+
I want to group by Col1 and then create a list of Col2. I need to flatten the groups. I do have a lot of columns.
+----+----------+
|Col1| Col2|
+----+----------+
| A| [1,4] |
| B| [2] |
| D| [3,3,3]|
+----+----------+
You can do a groupBy() and use collect_list() as your aggregate function:
import pyspark.sql.functions as f
d.groupBy('Col1').agg(f.collect_list('Col2').alias('Col2')).show()
#+----+---------+
#|Col1| Col2|
#+----+---------+
#| B| [2]|
#| D|[3, 3, 3]|
#| A| [1, 4]|
#+----+---------+
Update
If you had multiple columns to combine, you could use collect_list() on each, and then combine the resulting lists using struct() and a udf(). Consider the following example:
Create Dummy Data
from functools import reduce
from operator import add
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType

# create example dataframe
d = sqlcx.createDataFrame(
    [
        ("A", 1, 10),
        ("B", 2, 20),
        ("D", 3, 30),
        ("D", 3, 10),
        ("A", 4, 20),
        ("D", 3, 30)
    ],
    ["Col1", "Col2", "Col3"]
)
Collect Desired Columns into lists
Suppose you had a list of columns you wanted to collect into a list. You could do the following:
cols_to_combine = ['Col2', 'Col3']
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine]).show()
#+----+---------+------------+
#|Col1| Col2| Col3|
#+----+---------+------------+
#| B| [2]| [20]|
#| D|[3, 3, 3]|[30, 10, 30]|
#| A| [4, 1]| [20, 10]|
#+----+---------+------------+
Combine Resultant Lists into one Column
Now we want to combine the list columns into one list. If we use struct(), we will get the following:
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', f.struct(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+------------------------------------------------+
#|Col1|Combined |
#+----+------------------------------------------------+
#|B |[WrappedArray(2),WrappedArray(20)] |
#|D |[WrappedArray(3, 3, 3),WrappedArray(10, 30, 30)]|
#|A |[WrappedArray(1, 4),WrappedArray(10, 20)] |
#+----+------------------------------------------------+
Flatten Wrapped Arrays
Almost there. We just need to combine the WrappedArrays. We can achieve this with a udf():
combine_wrapped_arrays = f.udf(lambda val: reduce(add, val), ArrayType(IntegerType()))
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_wrapped_arrays(f.struct(*cols_to_combine)).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
References
Pyspark Merge WrappedArrays Within a Dataframe
Update 2
A simpler way, without having to deal with WrappedArrays:
from operator import add
combine_udf = lambda cols: f.udf(
lambda *args: reduce(add, args),
ArrayType(IntegerType())
)
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_udf(cols_to_combine)(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
Note: this last step only works if the datatypes of all of the columns are the same. You cannot use this function to combine wrapped arrays with mixed types.
From Spark 2.4 you can use pyspark.sql.functions.flatten:
import pyspark.sql.functions as f
df.groupBy('Col1').agg(f.flatten(f.collect_list('Col2')).alias('Col2')).show()
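Under the same Spark 2.4+ assumption, the multi-column case from the updates above can also be handled without a UDF by concatenating the collected lists with concat, which works on array columns since 2.4 (a sketch; like the UDF version, it assumes all columns share one element type):
import pyspark.sql.functions as f

cols_to_combine = ['Col2', 'Col3']

# Sketch: collect each column into a list, then concatenate the lists.
d.groupBy('Col1')\
    .agg(f.concat(*[f.collect_list(c) for c in cols_to_combine]).alias('Combined'))\
    .show(truncate=False)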

Multiple columns aggregation in Spark/Scala [duplicate]

This question already has answers here: Spark SQL: apply aggregate functions to a list of columns
I have a Spark Dataset with numerous columns:
val df = Seq(
  ("a", 2, 3, 5, 3, 4, 2, 6, 7, 3),
  ("a", 1, 1, 2, 4, 5, 7, 3, 5, 2),
  ("b", 5, 7, 3, 6, 8, 8, 9, 4, 2),
  ("b", 2, 2, 3, 5, 6, 3, 2, 4, 8),
  ("b", 2, 5, 5, 4, 3, 6, 7, 8, 8),
  ("c", 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("id", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9")
Now I'd like to do a groupBy over id and get the sum of each p-column for each id.
Currently I'm doing the following:
val dfg =
  df.groupBy("id")
    .agg(
      sum($"p1").alias("p1"),
      sum($"p2").alias("p2"),
      sum($"p3").alias("p3"),
      sum($"p4").alias("p4"),
      sum($"p5").alias("p5"),
      sum($"p6").alias("p6"),
      sum($"p7").alias("p7"),
      sum($"p8").alias("p8"),
      sum($"p9").alias("p9")
    )
Which produces the (correct) output:
+---+---+---+---+---+---+---+---+---+---+
| id| p1| p2| p3| p4| p5| p6| p7| p8| p9|
+---+---+---+---+---+---+---+---+---+---+
| c| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| b| 9| 14| 11| 15| 17| 17| 18| 16| 18|
| a| 3| 4| 7| 7| 9| 9| 9| 12| 5|
+---+---+---+---+---+---+---+---+---+---+
The question is, in reality I have several dozen p-columns like that, and I'd like to be able to write the aggregation in a more concise way.
Based on the answers to this question, I've tried to do the following:
val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))

val dfg =
  df.groupBy("id")
    .agg(ops: _*) // does not compile: agg does not accept *-parameters
Unfortunately, unlike select(), agg() does not seem to accept *-parameters and so this doesn't work, producing a compile-time no ': _*' annotation allowed here error.
agg has this signature: def agg(expr: Column, exprs: Column*): DataFrame
So try this:
df.groupBy("id")
  .agg(ops.head, ops.tail: _*)
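For comparison, the PySpark version of the same aggregation does not need the head/tail split, because agg() accepts an unpacked list directly. A sketch, assuming the same id/p1..p9 layout:
import pyspark.sql.functions as F

pcols = ["p{}".format(k) for k in range(1, 10)]

# Sketch: build the sum expressions in a list and unpack them into agg().
dfg = df.groupBy("id").agg(*[F.sum(c).alias(c) for c in pcols])
dfg.show()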