Multiple aggregation over multiple columns - Scala

I want to write a UDF over a data frame that compares the values of a particular row against the values from the same group, where the grouping is by multiple keys. Since UDFs operate on a single row, I want to write a query that returns the values from the same group as a new column value.
For example over this
Input:
| id | categoryAB | categoryXY | value1 | value2 |
| -- | ---------- | ---------- | ------ | ------ |
| 1  | A          | X          | 0.2    | True   |
| 2  | A          | X          | 0.3    | False  |
| 3  | A          | X          | 0.2    | True   |
| 4  | B          | X          | 0.4    | True   |
| 5  | B          | X          | 0.1    | True   |
| 6  | B          | Y          | 0.5    | False  |
I want to add two columns:
group1: aggregation of the value1s from the same <categoryAB, categoryXY> group
group2: aggregation of the value2s from the same <categoryAB, categoryXY> group, i.e. the same grouping.
Expected result:
| id | categoryAB | categoryXY | value1 | value2 | group1          | group2              |
| -- | ---------- | ---------- | ------ | ------ | --------------- | ------------------- |
| 1  | A          | X          | 0.2    | True   | [0.2, 0.3, 0.2] | [True, False, True] |
| 2  | A          | X          | 0.3    | False  | [0.2, 0.3, 0.2] | [True, False, True] |
| 3  | A          | X          | 0.2    | True   | [0.2, 0.3, 0.2] | [True, False, True] |
| 4  | B          | X          | 0.4    | True   | [0.4, 0.1]      | [True, True]        |
| 5  | B          | X          | 0.1    | True   | [0.4, 0.1]      | [True, True]        |
| 6  | B          | Y          | 0.5    | False  | [0.5]           | [False]             |
To be clearer about the grouping, there are 3 groups in this example:
<A,X> with rows 1, 2 and 3
<B,X> with rows 4 and 5
<B,Y> with row 6
I need to implement this in Scala with Spark SQL structures and functions, but a generic SQL answer could also serve as a guide.

There might be a more optimized method, but here is how I usually do it:
val df = Seq(
  (1, "A", "X", 0.2, true),
  (2, "A", "X", 0.3, false),
  (3, "A", "X", 0.2, true),
  (4, "B", "X", 0.4, true),
  (5, "B", "X", 0.1, true),
  (6, "B", "Y", 0.5, false)
).toDF("id", "categoryAB", "categoryXY", "value1", "value2")

df.join(
  df.groupBy("categoryAB", "categoryXY")
    .agg(
      collect_list('value1) as "group1",
      collect_list('value2) as "group2"
    ),
  Seq("categoryAB", "categoryXY")
).show()
The idea is to compute the aggregation grouped by categoryAB and categoryXY separately, and then join the resulting dataframe back to the original one (make sure df is cached if it is the result of heavy computations, as it will otherwise be computed twice).
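An alternative sketch (assuming Spark 2.x or later, and the df defined above) is to compute collect_list over a window partitioned by the two keys, which expresses the same thing without the explicit self-join:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list}

// without an ORDER BY, the window frame covers the whole partition,
// i.e. every row in the same <categoryAB, categoryXY> group
val byGroup = Window.partitionBy("categoryAB", "categoryXY")

df
  .withColumn("group1", collect_list(col("value1")).over(byGroup))
  .withColumn("group2", collect_list(col("value2")).over(byGroup))
  .show()

Both versions shuffle by the same keys, so which one is faster depends on the data.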

Related

Concatenate two arrays' elements with all possible combinations

I have a Hive table with:
| row                      | column                                          |
| ------------------------ | ----------------------------------------------- |
| null                     | ["black", "blue", "orange"]                     |
| ["mom", "dad", "sister"] | ["amazon", "fiipkart", "meesho", "jiomart", ""] |
Using Spark SQL, I would like to create a new column with an array of all possible combinations:
| row        | column          | output                                 |
| ---------- | --------------- | -------------------------------------- |
| null       | ["b", "s", "m"] | ["b", "s", "m"]                        |
| ["1", "2"] | ["a", "b", ""]  | ["1_a", "1_b", "1", "2_a", "2_b", "2"] |
Two ways to implement this:
The first way includes array transformations:
df
  // we first turn the null value into an empty array
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // we create an array of 1s whose length equals row size times column size
  .withColumn("repeated", array_repeat(lit(1), size(col("row")) * size(col("column"))))
  // we create index pairs according to the sizes
  .withColumn("indexes", expr("transform(repeated, (x, i) -> array(i % size(row), i % size(column)))"))
  // we concatenate the elements
  .withColumn("concat", expr("transform(indexes, (x, i) -> concat_ws('_', row[x[0]], column[x[1]]))"))
  // we remove underscores before and after the value (if found)
  .withColumn("output", expr("transform(concat, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Output:
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|row |column |repeated |indexes |concat |output |
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|[] |[b, s, m]|[1, 1, 1] |[[0, 0], [0, 1], [0, 2]] |[_b, _s, _m] |[b, s, m] |
|[1, 2]|[a, b, ] |[1, 1, 1, 1, 1, 1]|[[0, 0], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]|[1_a, 2_b, 1_, 2_a, 1_b, 2_]|[1_a, 2_b, 1, 2_a, 1_b, 2]|
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
The second way includes explode and other functions:
df
  // we first turn the null value into an empty array
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // we create a unique ID to group by and collect later
  .withColumn("id", monotonically_increasing_id())
  // we explode the columns and the rows
  .withColumn("column", explode(col("column")))
  .withColumn("row", explode(col("row")))
  // we combine the output with underscores as separators
  .withColumn("output", concat_ws("_", col("row"), col("column")))
  // we group by again and collect_set
  .groupBy("id").agg(
    collect_set("row").as("row"),
    collect_set("column").as("column"),
    collect_set("output").as("output")
  )
  .drop("id")
  // we remove whatever ends with _ or starts with _ (e.g. _1 or 1_)
  .withColumn("output", expr("transform(output, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Final output:
+------+---------+--------------------------+
|row |column |output |
+------+---------+--------------------------+
|[1, 2]|[b, a, ] |[2, 1_a, 2_a, 1, 1_b, 2_b]|
|[] |[s, m, b]|[s, m, b] |
+------+---------+--------------------------+
I left the other columns in case you want to see what is happening. Good luck!
One straightforward and easy solution is to use a custom UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def combinations(a, b):
    c = []
    for x in a:
        for y in b:
            if not x:
                c.append(y)
            elif not y:
                c.append(x)
            else:
                c.append(f"{x}_{y}")
    return c

# declare the return type so the result is an array column rather than a string
udf_combination = udf(combinations, ArrayType(StringType()))

df = spark.createDataFrame([
    [["1", "2"], ["a", "b", ""]]
], ["row", "column"])

df.withColumn("res", udf_combination(col("row"), col("column"))).show(truncate=False)
# +------+--------+--------------------------+
# |row |column |res |
# +------+--------+--------------------------+
# |[1, 2]|[a, b, ]|[1_a, 1_b, 1, 2_a, 2_b, 2]|
# +------+--------+--------------------------+
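Since the rest of the thread is Scala, here is a minimal sketch of the same logic as a Scala UDF (assuming both columns are arrays of strings; a null row would still need the empty-array handling shown in the answers above, and combinationsUdf is just a hypothetical name):

import org.apache.spark.sql.functions.{col, udf}

// pairwise concatenation with "_", skipping empty elements
val combinationsUdf = udf { (a: Seq[String], b: Seq[String]) =>
  for {
    x <- a
    y <- b
  } yield {
    if (x.isEmpty) y
    else if (y.isEmpty) x
    else s"${x}_${y}"
  }
}

df.withColumn("output", combinationsUdf(col("row"), col("column")))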

PySpark: Count pair frequency occurrences

Let's say I have a dataset as follows:
1: a, b, c
2: a, d, c
3: c, d, e
I want to write PySpark code to count the occurrences of each pair, such as (a,b), (a,c), (b,c), etc.
Expected output:
(a,b) 1
(b,c) 1
(c,d) 2
etc..
Note that (c,d) and (d,c) should count as the same pair.
How should I go about it?
So far, I have written the code to read the data from the text file as follows:
sc = SparkContext("local", "bp")
spark = SparkSession(sc)
data = sc.textFile('doc.txt')
dataFlatMap = data.flatMap(lambda x: x.split(" "))
Any pointers would be appreciated.
I relied on the answer in this question - How to create a Pyspark Dataframe of combinations from list column
Below is the code that creates a udf in which the itertools.combinations function is applied to the list of items. The combinations in the udf are sorted to avoid double-counting occurrences such as ("a", "b") and ("b", "a"). Once you have the combinations, you can groupBy and count rows. You may want to count distinct rows in case list elements repeat, like ("a", "a", "b"), but this depends on your requirements.
import pyspark.sql.functions as F
import itertools
from pyspark.sql.types import *
data = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]
df = spark.createDataFrame(data, schema = ["id", "arr"])
# df is
# id arr
# 1 ["a", "b", "c"]
# 2 ["a", "d", "c"]
# 3 ["c", "d", "e"]
@udf(returnType=ArrayType(ArrayType(StringType())))
def combinations_udf(arr):
    x = list(itertools.combinations(arr, 2))
    return [sorted([y[0], y[1]]) for y in x]

df1 = df.withColumn("combinations", F.explode(combinations_udf("arr")))

df_ans = (df1
          .groupBy("combinations")
          .agg(F.countDistinct("id").alias("count"))
          .orderBy(F.desc("count")))
For the given dataframe df, df_ans contains each sorted pair with its count: [a, c] and [c, d] appear with count 2, and the remaining pairs with count 1.

Combine two lists with one different element

I'm new to Scala and Spark and I don't know how to do this.
I have preprocessed a CSV file, resulting in an RDD that contains lists with this format:
List("2014-01-01T23:56:06.0", NaN, 1, NaN)
List("2014-01-01T23:56:06.0", NaN, NaN, 2)
All lists have the same number of elements.
What I want to do is combine the lists having the same first element (the timestamp). For example, I want these two example lists to produce only one List, with the following values:
List("2014-01-01T23:56:06.0", NaN, 1, 2)
Thanks for your help :)
// The below can help you achieve your target
val input_rdd1 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "1", "NaN")))
val input_rdd2 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "NaN", "2")))
//added one more row for your data
val input_rdd3 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "2", "NaN", "NaN")))
val input_df1 = input_rdd1.toDF("col1", "col2", "col3", "col4")
val input_df2 = input_rdd2.toDF("col1", "col2", "col3", "col4")
val input_df3 = input_rdd3.toDF("col1", "col2", "col3", "col4")
val output_df = input_df1.union(input_df2).union(input_df3)
  .groupBy($"col1")
  .agg(min($"col2").as("col2"), min($"col3").as("col3"), min($"col4").as("col4"))
output_df.show
output:
+--------------------+----+----+----+
| col1|col2|col3|col4|
+--------------------+----+----+----+
|2014-01-01T23:56:...| 2| 1| 2|
+--------------------+----+----+----+
If the array tail values are doubles, this can be implemented in the following way (as sachav suggests):
val original = spark.sparkContext.parallelize(
  Seq(
    List("2014-01-01T23:56:06.0", Double.NaN, 1.0, Double.NaN),
    List("2014-01-01T23:56:06.0", Double.NaN, Double.NaN, 2.0)
  )
)

val result = original
  .map(v => v.head -> v.tail)
  .reduceByKey(
    (acc, curr) => acc.zip(curr).map { case (left, right) =>
      if (left.asInstanceOf[Double].isNaN) right else left
    }
  )
  .map(v => v._1 :: v._2)

result.foreach(println)
Output is:
List(2014-01-01T23:56:06.0, NaN, 1.0, 2.0)

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 11]),
                        IndexedRow(2, [3, 6, 9, 12]),
                        ])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the columns and rows will be in the following mat format:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix

features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)

transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()

How can I loop through a Spark data frame

How can I loop through a Spark data frame?
I have a data frame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
It is sorted by time, and now I would like to step through it and accumulate the valid ids. The ids enter on direction==True and exit on direction==False,
so the resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know that this will not parallelize, but the df is not that big. So how could this be done in Spark/Scala?
If the data is small ("but the df is not that big"), I'd just collect and process it using Scala collections. If the types are as shown below:
df.printSchema
root
|-- time: integer (nullable = false)
|-- id: integer (nullable = false)
|-- direction: boolean (nullable = false)
you can collect:
val data = df.as[(Int, Int, Boolean)].collect.toSeq
and scanLeft:
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, value, true))  => (time, acc + value)
  case ((_, acc), (time, value, false)) => (time, acc - value)
}.tail
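For completeness, a minimal sketch of inspecting result or turning it back into a DataFrame (assuming import spark.implicits._ is in scope; the Set is mapped to a Seq because Spark has no encoder for Set, and the order inside each set is not guaranteed):

result.foreach(println)
// (10,Set(4))
// (20,Set(4, 5))
// ...

val resultDf = result
  .map { case (time, ids) => (time, ids.toSeq) } // hypothetical conversion so an encoder exists
  .toDF("time", "valid_ids")
resultDf.show()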
Use of var is not recommended for Scala developers, but I am still posting an answer using var:
var collectArray = Array.empty[Int]

df.rdd.collect().map(row => {
  if (row(2).toString.equalsIgnoreCase("true")) collectArray = collectArray :+ row(1).asInstanceOf[Int]
  else collectArray = collectArray.drop(1)
  (row(0), collectArray.toList)
})
This should give you the result:
(10,List(4))
(20,List(4, 5))
(34,List(5))
(67,List(5, 6))
(78,List(6))
(99,List())
Suppose the name of the respective data frame is someDF, then do:
val df1 = someDF.rdd.collect.iterator
while (df1.hasNext) {
  println(df1.next)
}
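Equivalently, a minimal sketch without the explicit while loop (still collecting to the driver, so it only fits small data frames):

someDF.collect().foreach(println)

// or, without collecting, run on the executors
// (the output then goes to the executor logs, not the driver console):
someDF.foreach(row => println(row))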