Does collect_list() maintain relative ordering of rows? - scala

Imagine that I have the following DataFrame df:
+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+
Imagine that I run:
df.groupBy("id")
.agg(collect_list($"featureIndex").as("idx"),
collect_list($"featureValue").as("val"))
Am I GUARANTEED that "idx" and "val" will be aggregated and keep their relative order? i.e.
        GOOD                 GOOD                  BAD
+---+------+------+ +---+------+------+ +---+------+------+
| id|   idx|   val| | id|   idx|   val| | id|   idx|   val|
+---+------+------+ +---+------+------+ +---+------+------+
|id3|   [d]|   [9]| |id3|   [d]|   [9]| |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]| |id1|[b, a]|[4, 3]| |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]| |id2|[c, a]|[5, 2]| |id2|[a, c]|[5, 2]|
+---+------+------+ +---+------+------+ +---+------+------+
NOTE: the third result is BAD because for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]). Same for id2.

I think you can rely on "their relative order", since Spark processes rows one by one in order (and usually does not re-order rows unless explicitly required).
If you are concerned about the order, merge these two columns into one using the struct function before doing the groupBy.
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
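For example, a minimal sketch of that idea against the df above (column names as in the question):
import org.apache.spark.sql.functions.{collect_list, struct}

// Pair each featureName with its featureValue before grouping, so the two
// can never drift apart, whatever order collect_list produces.
val grouped = df
  .groupBy("id")
  .agg(collect_list(struct($"featureName", $"featureValue")).as("features"))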
You could also use the monotonically_increasing_id function to number records and pair it with the other columns (perhaps inside a struct):
monotonically_increasing_id(): Column A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
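A rough sketch of that variant (not from the answer itself): carry the generated id inside the struct and sort the collected array by it, since sort_array compares struct fields in order, so the first field (the id) drives the ordering.
import org.apache.spark.sql.functions.{collect_list, monotonically_increasing_id, sort_array, struct}

// Number the rows first, then sort the collected structs back into row order.
val withRowId = df.withColumn("rowId", monotonically_increasing_id())
val grouped = withRowId
  .groupBy("id")
  .agg(sort_array(collect_list(struct($"rowId", $"featureName", $"featureValue"))).as("features"))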

Related

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ", "a", "a", "b", "b"),
                ("ZZ", "a", "b", "c", "d"),
                ("YY", "b", "e", null, "f"),
                ("YY", "b", "b", null, "f"),
                ("XX", "j", "i", "h", null))
  .toDF("main", "value1", "value2", "value3", "value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
|  ZZ|     a|     a|     b|     b|
|  ZZ|     a|     b|     c|     d|
|  YY|     b|     e|  null|     f|
|  YY|     b|     b|  null|     f|
|  XX|     j|     i|     h|  null|
+----+------+------+------+------+
I need to group by the main column and pick the two most frequently occurring values from the remaining columns for each main value.
I did the following:
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main|              values|
+----+--------------------+
|  ZZ|[a, a, b, b, a, b...|
|  YY|[b, e,, f, b, b,, f]|
|  XX|          [j, i, h,]|
+----+--------------------+
Now I need to pick the two most frequently occurring items from the list as two separate columns, but I don't know how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
|  ZZ|     a|     b|
|  YY|     b|     f|
|  XX|     j|     i|
+----+------+------+
null should not be counted, and the final values should be null only if there are no other values to fill.
Is this the best way to do things? Is there a better way of doing it?
You can use a UDF to select the two values from the array that occur most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
  .groupBy("main").agg(flatten(collect_list("values")).as("values"))
  .withColumn("max", maxUdf('values)) //(1)
  .cache() //(2)
  .withColumn("value1", 'max.getItem(0))
  .withColumn("value2", 'max.getItem(1))
  .drop("values", "max")
  .show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
  array
    .filter(_ != null)                      // remove null values
    .groupBy(identity).mapValues(_.length)  // count occurrences of each value
    .toSeq.sortWith(_._2 > _._2)            // sort (3)
    .map(_._1).take(2)                      // return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
using a UDF here means that the whole array with all entries for a single value of main has to fit into the memory of one Spark executor
cache is required here, or the UDF will be called twice, once for value1 and once for value2
the sortWith here is stable, but it might be necessary to add some extra logic to handle the situation where two elements have the same number of occurrences (like i, j and h for the main value XX); see the sketch below
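One possible tie-break (a sketch, not from the original answer) is to sort by the count first and then by the value itself, so ties are resolved deterministically; this version could replace getMax when defining maxUdf:
def getMaxDeterministic(array: Seq[String]): Seq[String] = {
  array
    .filter(_ != null)
    .groupBy(identity).mapValues(_.length)
    .toSeq
    .sortBy { case (value, count) => (-count, value) } // most frequent first, ties broken alphabetically
    .map(_._1)
    .take(2)
}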
Here is my try without a UDF.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values))
  .groupBy('main, 'values).agg(count('values).as("count"))
  .filter("values is not null")
  .withColumn("target", concat(lit("value"), lit(row_number().over(w))))
  .filter("target < 'value3'")
  .groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
|  ZZ|     a|     b|
|  YY|     b|     f|
|  XX|     j|  null|
+----+------+------+
The last row has a null value because I have modified your dataframe in this way:
+----+--------------------+
|main|              values|
+----+--------------------+
|  ZZ|[a, a, b, b, a, b...|
|  YY|[b, e,, f, b, b,, f]|
|  XX|              [j,,,]| <- For null test
+----+--------------------+

How do I create a new column for my dataframe whose values are maps made up of values from different columns?

I've seen similar questions but haven't been able to find exactly what I need, and I've been struggling to figure out whether I can do what I want without using a UDF.
Say I start with this dataframe:
+---+---+---+
| pk|  a|  b|
+---+---+---+
|  1|  2|  1|
|  2|  4|  2|
+---+---+---+
I want the resulting dataframe to look like
+----------------+---+
|              ab| pk|
+----------------+---+
|[A -> 2, B -> 1]|  1|
|[A -> 4, B -> 2]|  2|
+----------------+---+
Where A and B are names that correspond to a and b (I guess I can fix this with an alias, but currently I'm using a UDF that returns a map of {'A': column a value, 'B': column b value}).
Is there any way to accomplish this using create_map or otherwise without a UDF?
create_map takes arguments as key, value, key, value, ...; for your case:
import pyspark.sql.functions as f
df.select(
    f.create_map(f.lit('A'), f.col('a'), f.lit('B'), f.col('b')).alias('ab'),
    f.col('pk')
).show()
+----------------+---+
|              ab| pk|
+----------------+---+
|[A -> 2, B -> 1]|  1|
|[A -> 4, B -> 2]|  2|
+----------------+---+
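In case you are working in Scala rather than PySpark, a hedged sketch of the equivalent using the map function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.{lit, map}

// Same idea in Scala: keys and values are interleaved, just as with create_map
df.select(
  map(lit("A"), $"a", lit("B"), $"b").as("ab"),
  $"pk"
).show()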

How to add a new column with maximum value?

I have a Dataframe with 2 columns tag and value.
I want to add a new column that contains the max of the value column. (It will be the same value for every row.)
I tried to do something as follows, but it didn't work.
val df2 = df.withColumn("max",max($"value"))
How to add the max column to the dataset?
There are 3 ways to do it (one you already know from the other answer). I avoid collect since it's not really needed.
Here is the dataset with the maximum value 3 appearing twice.
val tags = Seq(
("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1|    1|
|tg2|    2|
|tg1|    3| <-- maximum value
|tg4|    4|
|tg3|    3| <-- another maximum value
+---+-----+
Cartesian Join With "Max" Dataset
I'm going to use a cartesian join of the tags and a single-row dataset with the maximum value.
val maxDF = tags.select(max("value") as "max")
scala> maxDF.show
+---+
|max|
+---+
|  4|
+---+
val solution = tags.crossJoin(maxDF)
scala> solution.show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1|    1|  4|
|tg2|    2|  4|
|tg1|    3|  4|
|tg4|    4|  4|
|tg3|    3|  4|
+---+-----+---+
I'm not worried about the cartesian join here since it's just a single-row dataset.
Windowed Aggregation
My favorite windowed aggregation fits this problem so nicely. On the other hand, I don't really think that'd be the most effective approach due to the number of partitions in use, i.e. just 1, which gives the worst possible parallelism.
The trick is to use the aggregation function max over an empty window specification that informs Spark SQL to use all rows in any order.
val solution = tags.withColumn("max", max("value") over ())
scala> solution.show
18/05/31 21:59:40 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1|    1|  4|
|tg2|    2|  4|
|tg1|    3|  4|
|tg4|    4|  4|
|tg3|    3|  4|
+---+-----+---+
Please note the warning that says it all.
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I would not use this approach given the other solutions and am leaving it here for educational purposes.
If you want the maximum value of a column for all rows, you are going to need to compare all the rows in some form. That means doing an aggregation. withColumn only operates on a single row, so you have no way to get the DataFrame-wide max value.
The easiest way to do this is as below:
val data = Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4))
val df = sc.parallelize(data).toDF("name", "value")
// first is an action, so this will execute spark stages to compute the value
val maxValue = df.groupBy().agg(max($"value")).first.getInt(0)
// Now you can add it to your original DF
val updatedDF = df.withColumn("max", lit(maxValue))
updatedDF.show
There is also one alternative to this that might be a little faster. If you don't need the max value until the end of your processing (after you have already run a Spark action), you can compute it by writing your own Spark Accumulator that gathers the value while doing whatever other Spark action work you have requested.
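A rough sketch of that accumulator idea (the class and variable names here are illustrative, not from the answer):
import org.apache.spark.util.AccumulatorV2

// A minimal accumulator that tracks the maximum value seen across tasks.
class MaxAccumulator extends AccumulatorV2[Long, Long] {
  private var _max: Long = Long.MinValue
  def isZero: Boolean = _max == Long.MinValue
  def copy(): MaxAccumulator = { val acc = new MaxAccumulator; acc._max = _max; acc }
  def reset(): Unit = { _max = Long.MinValue }
  def add(v: Long): Unit = { _max = math.max(_max, v) }
  def merge(other: AccumulatorV2[Long, Long]): Unit = { _max = math.max(_max, other.value) }
  def value: Long = _max
}

val maxAcc = new MaxAccumulator
sc.register(maxAcc, "maxValue")

// Feed the accumulator as a side effect of an action you were going to run anyway.
df.foreach(row => maxAcc.add(row.getAs[Int]("value").toLong))
val maxValue = maxAcc.value // available on the driver once the action has finished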
Max column value as additional column by window function
val tags = Seq(
("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1|    1|
|tg2|    2|
|tg1|    3|
|tg4|    4|
|tg3|    3|
+---+-----+
scala> tags.withColumn("max", max("value").over(Window.partitionBy(lit("1")))).show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1|    1|  4|
|tg2|    2|  4|
|tg1|    3|  4|
|tg4|    4|  4|
|tg3|    3|  4|
+---+-----+---+

pyspark equivalent of pandas groupby('col1').col2.head()

I have a Spark Dataframe where for each set of rows with a given column value (col1), I want to grab a sample of the values in (col2). The number of rows for each possible value of col1 may vary widely, so I'm just looking for a set number, say 10, of each type.
There may be a better way to do this, but the natural approach seemed to be a df.groupby('col1').
In pandas, I could do df.groupby('col1').col2.head().
I understand that Spark dataframes are not pandas dataframes, but this is a good analogy.
I suppose I could loop over all of the col1 types as a filter, but that seems terribly icky.
Any thoughts on how to do this? Thanks.
Let me create a sample Spark dataframe with two columns.
df = SparkSQLContext.createDataFrame([[1, 'r1'],
                                      [1, 'r2'],
                                      [1, 'r2'],
                                      [2, 'r1'],
                                      [3, 'r1'],
                                      [3, 'r2'],
                                      [4, 'r1'],
                                      [5, 'r1'],
                                      [5, 'r2'],
                                      [5, 'r1']], schema=['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   1|  r1|
|   1|  r2|
|   1|  r2|
|   2|  r1|
|   3|  r1|
|   3|  r2|
|   4|  r1|
|   5|  r1|
|   5|  r2|
|   5|  r1|
+----+----+
After grouping by col1, we get a GroupedData object (instead of a Spark Dataframe). You can use aggregate functions like min, max, and average. But getting a head() is a little bit tricky. We need to convert the GroupedData object back to a Spark Dataframe. This can be done using the pyspark collect_list() aggregation function.
from pyspark.sql import functions
df1 = df.groupBy(['col1']).agg(functions.collect_list("col2"))
df1.show(n=3)
Output is:
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
|   5|      [r1, r2, r1]|
|   1|      [r1, r2, r2]|
|   3|          [r1, r2]|
+----+------------------+
only showing top 3 rows

pyspark: SQL count() fails

I have a Spark dataframe that looks something like this
x |count
1 |3
3 |5
4 |3
Below is my spark code:
sdf.createOrReplaceTempView('sdf_view')
spark.sql('SELECT MAX(count), x FROM sdf_view')
This seems like a perfect SQL query and I'm wondering why this doesn't work with Spark. What I want to find is the maximum count along with the x corresponding to it.
Any leads appreciated.
The error message is:
AnalysisException: u"grouping expressions sequence is empty, and 'sdf_view.`x`' is not an aggregate function. Wrap '(max(sdf_view.`count`) AS `max(count)`)' in windowing function(s) or wrap 'sdf_view.`x`' in first() (or first_value) if you don't care which value you get.
I added another row:
x = [{"x": 1, "count": 3}, {"x": 3, "count": 5}, {"x": 4, "count": 3}, {"x": 4, "count": 60}]
sdf = spark.createDataFrame(x)
+-----+---+
|count| x|
+-----+---+
|    3|  1|
|    5|  3|
|    3|  4|
|   60|  4|
+-----+---+
Your SQL statement is odd because you need to say how you want to group things. I'm guessing you want to group by x and get the max count for each of the unique x values?
y = spark.sql('SELECT MAX(count), x FROM sdf_view GROUP BY x ')
y.show()
+----------+---+
|max(count)| x|
+----------+---+
|         3|  1|
|         5|  3|
|        60|  4|
+----------+---+
Or do you want to just find the highest count of them all?
y = spark.sql('SELECT MAX(count) FROM sdf_view')
y.show()
+----------+
|max(count)|
+----------+
|        60|
+----------+