Alternative to GroupBy for Pyspark Dataframe? - pyspark

I have a dataset like this:
timestamp  vars
2          [1, 2]
2          [1, 2]
3          [1, 2, 3]
3          [1, 2]
And I want a dataframe like this: each value in the input arrays becomes an index, and the value at that index in the output array is the frequency of that value. This computation is done per unique timestamp.
timestamp  vars
2          [0, 2, 2]
3          [0, 2, 2, 1]
Right now, I'm grouping by timestamp and aggregating/flattening vars (to get something like [1,2,1,2] for timestamp 2 or [1,2,3,1,2] for timestamp 3), and then I have a UDF that uses collections.Counter to get a key->value dict. I then turn this dict into the format I want.
The groupBy/agg can get arbitrarily large (array sizes can be in the millions), and this seems like a good use case for the Window function, but I'm not sure how to put it all together.
I think it's also worth mentioning that I've tried repartitioning, and converting to an RDD and using groupByKey. Both are arbitrarily slow (>24 hours) on large datasets.
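For reference, here is a minimal sketch of the Counter-based approach described above (column names from the example, Spark >= 2.4 assumed for flatten); this is the slow path the edit below replaces:
from collections import Counter

from pyspark.sql.functions import collect_list, flatten, udf
from pyspark.sql.types import ArrayType, IntegerType

# Dense frequency array: index i holds the count of value i in the flattened list
def counts_to_array(vals):
    c = Counter(vals)
    return [c.get(i, 0) for i in range(max(vals) + 1)]

counts_udf = udf(counts_to_array, ArrayType(IntegerType()))

slow_result = (df.groupBy("timestamp")
                 .agg(flatten(collect_list("vars")).alias("data"))
                 .select("timestamp", counts_udf("data").alias("vars")))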

Edit: As discussed in the comments, the issue with the original methods is likely that counting through the filter or aggregate functions triggers repeated scans over the (potentially huge) flattened arrays. Below we explode the arrays and do the aggregation (count) before creating the final array column:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(2,[1,2]), (2,[1,2]), (3,[1,2,3]), (3,[1,2])], ['timestamp', 'vars'])

df.selectExpr("timestamp", "explode(vars) as var") \
    .groupby('timestamp', 'var') \
    .count() \
    .groupby("timestamp") \
    .agg(collect_list(struct("var", "count")).alias("data")) \
    .selectExpr(
        "timestamp",
        "transform(data, x -> x.var) as indices",
        "transform(data, x -> x.count) as values"
    ).selectExpr(
        "timestamp",
        "transform(sequence(0, array_max(indices)), i -> IFNULL(values[array_position(indices,i)-1],0)) as new_vars"
    ).show(truncate=False)
+---------+------------+
|timestamp|new_vars |
+---------+------------+
|3 |[0, 2, 2, 1]|
|2 |[0, 2, 2] |
+---------+------------+
Where:
(1) we explode the array and count() for each timestamp + var pair
(2) group by timestamp and create an array of structs containing the two fields var and count
(3) convert the array of structs into two arrays, indices and values (similar to how a SparseVector is defined)
(4) transform the sequence sequence(0, array_max(indices)): for each i in the sequence, use array_position to find the index of i in the indices array, and then retrieve the value from the values array at the same position, see below:
IFNULL(values[array_position(indices,i)-1],0)
Notice that array_position uses 1-based indexing while Spark array indexing is 0-based, hence the -1 in the above expression.
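As a quick illustration of this offset (a minimal sketch, not part of the solution itself):
spark.sql("SELECT array_position(array(10, 20, 30), 20) AS pos").show()
# pos = 2 (1-based), so the matching count sits at values[2 - 1] in the 0-based values array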
Old methods:
(1) Use transform + filter/size
from pyspark.sql.functions import flatten, collect_list

df.groupby('timestamp').agg(flatten(collect_list('vars')).alias('data')) \
    .selectExpr(
        "timestamp",
        "transform(sequence(0, array_max(data)), x -> size(filter(data, y -> y = x))) as vars"
    ).show(truncate=False)
+---------+------------+
|timestamp|vars |
+---------+------------+
|3 |[0, 2, 2, 1]|
|2 |[0, 2, 2] |
+---------+------------+
(2) Use aggregate function:
df.groupby('timestamp').agg(flatten(collect_list('vars')).alias('data')) \
    .selectExpr("timestamp", """
        aggregate(
            data,
            /* use an array as zero_value, size = array_max(data)+1 and all values are zero */
            array_repeat(0, int(array_max(data))+1),
            /* increment the i-th value of the array by 1 if i == y */
            (acc, y) -> transform(acc, (x,i) -> IF(i=y, x+1, x))
        ) as vars
    """).show(truncate=False)

Related

Spark: apply sliding() to each row without UDF

I have a Dataframe with several columns. The i-th column contains strings. I want to apply the string sliding(n) function to each string in the column. Is there a way to do so without using user-defined functions?
Example:
My dataframe is
var df = Seq((0, "hello"), (1, "hola")).toDF("id", "text")
I want to apply the sliding(3) function to each element of column "text" to obtain a dataframe corresponding to
Seq(
  (0, ("hel", "ell", "llo")),
  (1, ("hol", "ola"))
)
How can I do this?
For Spark version >= 2.4.0, this can be done using the built-in functions array_repeat, transform and substring.
import org.apache.spark.sql.functions.{array_repeat, expr, length}

// Repeat the string once per sliding window of length 3, i.e. length(text) - 3 + 1 copies
val repeated_df = df.withColumn("tmp", array_repeat($"text", length($"text") - 3 + 1))

// Get the slices with the transform higher-order function
val res = repeated_df.withColumn("str_slices",
  expr("transform(tmp,(x,i) -> substring(x from i+1 for 3))")
)
//res.show()
+---+-----+---------------------+---------------+
|id |text |tmp |str_slices |
+---+-----+---------------------+---------------+
|0 |hello|[hello, hello, hello]|[hel, ell, llo]|
|1 |hola |[hola, hola] |[hol, ola] |
+---+-----+---------------------+---------------+
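For comparison, a rough PySpark sketch of the same array_repeat + transform + substring idea (Spark >= 2.4 assumed; a translation of the Scala answer above, not part of it):
from pyspark.sql.functions import expr

df = spark.createDataFrame([(0, "hello"), (1, "hola")], ["id", "text"])

res = (df
       # one copy of the string per window of length 3
       .withColumn("tmp", expr("array_repeat(text, length(text) - 3 + 1)"))
       # take the 3-character slice starting at offset i from the i-th copy
       .withColumn("str_slices", expr("transform(tmp, (x, i) -> substring(x, i + 1, 3))")))
res.show(truncate=False)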

Flatten array of arrays (different dimensions) of a sql.dataframe.DataFrame in pyspark

I have a pyspark.sql.dataframe.DataFrame which is something like this:
+---------------------------+--------------------+--------------------+
|collect_list(results) | userid | page |
+---------------------------+--------------------+--------------------+
| [[[roundtrip, fal...|13482f06-9185-47f...|1429d15b-91d0-44b...|
+---------------------------+--------------------+--------------------+
Inside the collect_list(results) column there is an array with len = 2, and the elements are also arrays (the first one has a len = 1, and the second one a len = 9).
Is there a way to flatten this array of arrays into a unique array with len = 10 using pyspark?
Thanks!
You can flatten an array of arrays using pyspark.sql.functions.flatten (see the Spark documentation for details). For example, this will create a new column called results with the flattened result, assuming your dataframe variable is called df:
import pyspark.sql.functions as F
...
df.withColumn('results', F.flatten('collect_list(results)'))
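A small self-contained example of the same call (sample data and the column name are assumed to mirror the question):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [([[1], [2, 3, 4, 5, 6, 7, 8, 9, 10]],)],
    ["collect_list(results)"]
)
df.withColumn("results", F.flatten("collect_list(results)")).show(truncate=False)
# results: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]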
For a version that works before Spark 2.4 (but not before 1.3), you could try to explode the dataset you obtained before grouping, thereby unnesting one level of the array, then call groupBy and collect_list. Like this:
from pyspark.sql.functions import collect_list, explode
df = spark.createDataFrame([("foo", [1,]), ("foo", [2, 3])], schema=("foo", "bar"))
df.show()
# +---+------+
# |foo| bar|
# +---+------+
# |foo| [1]|
# |foo|[2, 3]|
# +---+------+
(df.select(
     df.foo,
     explode(df.bar))
   .groupBy("foo")
   .agg(collect_list("col"))
   .show())
# +---+-----------------+
# |foo|collect_list(col)|
# +---+-----------------+
# |foo| [1, 2, 3]|
# +---+-----------------+

Scala/Spark - How to get first elements of all sub-arrays

I have the following DataFrame in a Spark (I'm using Scala):
[[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]]
I want to get a Dataframe with only the first Int of each sub-array, something like:
[1003014, 15, 754, 1029530, 3066, 1066440, ...]
i.e. keeping only the x[0] of each sub-array x of the Array listed above.
I'm new to Scala and couldn't find the right anonymous map function.
Thanks in advance for any help
For Spark >= 2.4, you can use the higher-order function transform with a lambda function to extract the first element of each value array.
scala> df.show(false)
+----------------------------------------------------------------------------------------+
|arrays |
+----------------------------------------------------------------------------------------+
|[[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785], [1029530.0, 0.880922]]|
+----------------------------------------------------------------------------------------+
scala> df.select(expr("transform(arrays, x -> x[0])").alias("first_array_elements")).show(false)
+-----------------------------------+
|first_array_elements |
+-----------------------------------+
|[1003014.0, 15.0, 754.0, 1029530.0]|
+-----------------------------------+
Spark < 2.4
Explode the initial array and then aggregate with collect_list to collect the first element of each sub array:
df.withColumn("exploded_array", explode(col("arrays")))
  .agg(collect_list(col("exploded_array")(0)))
  .show(false)
EDIT:
In case the array contains structs and not sub-arrays, just change the access expression, using dots for struct fields:
val transform_expr = "transform(arrays, x -> x.canonical_id)"
df.select(expr(transform_expr).alias("first_array_elements")).show(false)
Using Spark 2.4:
val df = Seq(
  Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))
).toDF("arrs")
df.show()
+--------------------+
| arrs|
+--------------------+
|[[1.0, 2.0], [3.0...|
+--------------------+
df
  .select(expr("transform(arrs, x -> x[0])").as("arr_first"))
  .show()
+----------+
| arr_first|
+----------+
|[1.0, 3.0]|
+----------+

How to join and reduce two datasets with arrays?

I need an idea for how to join two datasets with millions of arrays. Each dataset has Longs numbered 1-10,000,000, but with different groupings in each one, e.g. [1,2], [3,4] in one and [1], [2,3], [4] in the other; the output should be [1,2,3,4].
I need some way to join these sets efficiently.
I have tried an approach where I explode and group by multiple times, finally sorting and distincting the arrays. This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
Any ideas on how to use another approach, like a reducer or aggregation, to solve this problem more efficiently?
The following is a Scala code example. However, I would need an approach that works in Java as well.
val rdd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11]}"""))
val rdd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}"""))
val srdd1 = spark.read.json(rdd1)
val srdd2 = spark.read.json(rdd2)
Dataset 1:
+---------+
|groupings|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
| [10]|
| [11]|
+---------+
Dataset 2:
+---------+
|groupings|
+---------+
| [1]|
|[2, 3, 4]|
| [7, 8]|
| [9]|
| [10, 11]|
+---------+
Output should be
+------------------+
| groupings|
+------------------+
|[1, 2, 3, 4, 5, 6]|
| [7, 8, 9]|
| [10, 11]|
+------------------+
Update:
This was my original code, which I had problems running. @AyanGuha had me thinking that perhaps it would be simpler to just use a series of joins instead; I am testing that now and will post a solution if it works out.
srdd1.union(srdd2).withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .select(callUDF("sortLongArray", col("groupings")).alias("groupings"))
  .distinct()
What this code showed was that after 3 iterations the data coalesced; ideally, 3 joins would do the same.
Update 2:
Looks like I have a new working version. It still seems inefficient, but I think this will be handled better by Spark.
val ardd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11,12]}""", """{"groupings":[13,14]}"""))
val ardd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}""", """{"groupings":[12,13]}""", """{"groupings":[14, 15]}"""))
var srdd1 = spark.read.json(ardd1)
var srdd2 = spark.read.json(ardd2)
val addUDF = udf((x: Seq[Long], y: Seq[Long]) => if(y == null) x else (x ++ y).distinct.sorted)
val encompassUDF = udf((x: Seq[Long], y: Seq[Long]) => if(x.size == y.size) false else (x diff y).size == 0)
val arrayContainsAndDiffUDF = udf((x: Seq[Long], y: Seq[Long]) => (x.intersect(y).size > 0) && (y diff x).size > 0)
var rdd1 = srdd1
var rdd2 = srdd2.withColumnRenamed("groupings", "groupings2")

for (i <- 1 to 3) {
  rdd1 = rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2")), "left")
    .select(addUDF(col("groupings"), col("groupings2")).alias("groupings"))
    .distinct
    .alias("rdd1")
  rdd2 = rdd1.select(col("groupings").alias("groupings2")).alias("rdd2")
}

rdd1.join(rdd2, encompassUDF(col("groupings"), col("groupings2")), "leftanti")
  .show(10, false)
Outputs:
+------------------------+
|groupings |
+------------------------+
|[10, 11, 12, 13, 14, 15]|
|[1, 2, 3, 4, 5, 6] |
|[7, 8, 9] |
+------------------------+
I will try this at scale and see what I get.
This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
I don't think you have other options than exploding the arrays, then a join followed by distinct. Spark is fairly good at such computations and tries to do as much of them as possible using internal binary rows. The datasets are compressed and comparisons are often done at the byte level (outside the JVM).
It's then just a matter of having enough memory to hold all the elements, which may not be that big a deal.
I'd recommend giving your solution a try and checking out the physical plan and the stats. It could, in the end, turn out to be the only available solution.
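For what it's worth, a minimal PySpark sketch of that explode-then-join shape (dataframe names assumed from the question; this is a single merge pass only, so as the question found it would still need to be iterated before the groupings fully coalesce):
from pyspark.sql.functions import array_distinct, array_sort, col, concat, explode

exploded1 = srdd1.withColumn("member", explode(col("groupings")))
exploded2 = (srdd2.withColumnRenamed("groupings", "groupings2")
                  .withColumn("member", explode(col("groupings2"))))

# inner join on the shared element, merge the two arrays, then dedupe rows
one_pass = (exploded1.join(exploded2, "member")
            .select(array_sort(array_distinct(
                concat(col("groupings"), col("groupings2")))).alias("groupings"))
            .distinct())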
Here is an alternate solution using the ARRAY data type that is supported as part of HiveQL. This will at least keep the coding simple [i.e. building out the logic]. The code below assumes that the raw data is in a text file.
Step 1: Create table
create table array_table1 (array_col1 array<int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Step 2: Load data into both tables
LOAD DATA INPATH '/path/to/file' OVERWRITE INTO TABLE array_table1;
Step 3: Apply sql functions to get results
select distinct(explode(array_col1)) from array_table1 union
select distinct(explode(array_col2)) from array_table2
I am not clear on what final output you are looking for from the example. Is it just a union of all distinct numbers, or are they supposed to keep a grouping? Either way, with the tables created you can use a combination of distinct, explode(), left anti join and union to get the expected results.
You may want to optimize this code to filter the final data set again for duplicates.
Hope that helps!
OK I finally figured it out.
First of all with my array joins I was doing something very wrong, which I overlooked initially.
When joining two arrays on equality (e.g. does [1,2,3] equal [1,2,3]?), the arrays can be hashed. I was instead doing an intersection match using a UDF (given x in [1,2,3], is any x in [1,2,3,4,5]?). This cannot be hashed and therefore requires a plan that checks every row against every row.
So to do this you have to explode both arrays first, then join them.
You can then apply other criteria. For example, I saved time by only joining arrays which were not equal and whose sums were less than the other's.
Example with a self join:
rdd2 = rdd2.withColumn("single", explode(col("grouping")))             // Explode the grouping
temp = rdd2.withColumnRenamed("grouping", "grouping2").alias("temp")   // Alias for self join
rdd2 = rdd2.join(temp, rdd2.col("single").equalTo(temp.col("single"))  // Compare singles, which will be hashed
    .and(col("grouping").notEqual(col("grouping2")))                   // Apply further conditions
    .and(callUDF("lessThanArray", col("grouping"), col("grouping2")))  // Join only [1,2,3] with [4,5,6], not the duplicate [4,5,6] with [1,2,3], for efficiency
  , "left")                                                            // Left, so that the efficiency criteria do not drop rows
I then grouped by the grouping that was joined against, and aggregated the groupings from the self join.
rdd2.groupBy("grouping")
  .agg(callUDF("list_agg", col("grouping2")).alias("grouping2"))                     // list_agg is a UserDefinedAggregateFunction which aggregates lists into a distinct list
  .select(callUDF("addArray", col("grouping"), col("grouping2")).alias("grouping"))  // addArray is a UDF which concats and distincts 2 arrays
grouping   grouping2
[1,2,3]    [3,4,5]
[1,2,3]    [2,6,7]
[1,2,3]    [2,8,9]

becomes just

[1,2,3]    [2,3,4,5,6,7,8,9]

after addArray

[1,2,3,4,5,6,7,8,9]
I then iterated that code 3 times, which seems to make everything coalesce, and threw in a distinct for good measure.
Notice from the original question that I had two datasets. For my specific problem I could make some assumptions about the first and second sets: the first set had no duplicates, as it was a master list; the second set had duplicates, hence I only needed to apply the above code to the second set and then join it with the first. I would assume that if both sets had duplicates they could be unioned together first.

How to concatenate multiple columns into single column (with no prior knowledge on their number)?

Let say I have the following dataframe:
+----------+-----------+---------+-------+----+
|agentName |original_dt|parsed_dt|user   |text|
+----------+-----------+---------+-------+----+
|qwertyuiop|0          |0        |16102.0|0   |
+----------+-----------+---------+-------+----+
I wish to create a new dataframe with one more column that has the concatenation of all the elements of the row:
+----------+-----------+---------+-------+----+----------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                      |
+----------+-----------+---------+-------+----+----------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop, 0, 0, 16102, 0]|
+----------+-----------+---------+-------+----+----------------------------+
Note: This is just an example. The number of columns and their names are not known; it is dynamic.
TL;DR Use the struct function with the Dataset.columns operator.
Quoting the scaladoc of struct function:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
There are two variants: string-based for column names or using Column expressions (that gives you more flexibility on the calculation you want to apply on the concatenated columns).
From Dataset.columns:
columns: Array[String] Returns all column names as an array.
Your case would then look as follows:
scala> df.withColumn("newCol",
struct(df.columns.head, df.columns.tail: _*)).
show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+
I think this works perfectly for your case; here it is with an example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, split}

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    ("qwertyuiop", 0, 0, 16102.0, 0)
  )
).toDF("agentName", "original_dt", "parsed_dt", "user", "text")

val result = data.withColumn("newCol",
  split(concat_ws(";", data.schema.fieldNames.map(c => col(c)): _*), ";"))
result.show()
+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+
Hope this helped!
In general, you can merge multiple dataframe columns into one using array:
df.select($"*", array($"col1", $"col2").as("newCol"))  // $"*" will capture all existing columns
Here is the one-line solution for your case:
df.select($"*", array($"agentName", $"original_dt", $"parsed_dt", $"user", $"text").as("newCol"))
You can use a udf function to concat all the columns into one. All you have to do is define a udf function, pass it all the columns you want to concat, and call it with the .withColumn function of the dataframe, as sketched below.
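A minimal PySpark sketch of that UDF idea (illustrative only, since the surrounding answers use Scala; the names here are hypothetical):
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import ArrayType, StringType

# pack every column into a struct and let the UDF turn the row into an array of strings
concat_all = udf(lambda row: [str(v) for v in row], ArrayType(StringType()))

df.withColumn("newCol", concat_all(struct(*df.columns)))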
Or
you can use the concat_ws(java.lang.String sep, Column... exprs) function available for dataframes:
var df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")

df = df.withColumn("newCol", concat_ws(",", $"agentName", $"original_dt", $"parsed_dt", $"user", $"text"))
df.show(false)
Will give you output as
+----------+-----------+---------+-------+----+------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+------------------------+
|qwertyuiop|0 |0 |16102.0|0 |qwertyuiop,0,0,16102.0,0|
+----------+-----------+---------+-------+----+------------------------+
That will get you the result you want
There may be syntax errors in my answer. This is useful if you are using Java < 8 and Spark < 2.
String columns = null;
for (String columnName : dataframe.columns()) {
    columns = (columns == null) ? columnName : columns + "," + columnName;
}
// "data_frame" below stands for whatever temp table name the dataframe was registered as
sqlContext.sql("select *, concat_ws('|', " + columns + ") as complete_record "
    + "from data_frame").show();