Is there a function similar to the collect_list or collect_set to aggregate a column of maps into a single map in a (grouped) pyspark dataframe? For example, this function might have the following behavior:
>>>df.show()
+--+---------------------------------+
|id| map |
+--+---------------------------------+
| 1| Map(k1 -> v1)|
| 1| Map(k2 -> v2)|
| 1| Map(k3 -> v3)|
| 2| Map(k5 -> v5)|
| 3| Map(k6 -> v6)|
| 3| Map(k7 -> v7)|
+--+---------------------------------+
>>>df.groupBy('id').agg(collect_map('map')).show()
+--+----------------------------------+
|id| collect_map(map) |
+--+----------------------------------+
| 1| Map(k1 -> v1, k2 -> v2, k3 -> v3)|
| 2| Map(k5 -> v5)|
| 3| Map(k6 -> v6, k7 -> v7)|
+--+----------------------------------+
It probably wouldn't be too difficult to produce the desired result using one of the other collect_ aggregations and a udf, but it seems like something like this should already exist.
I know it is probably poor form to provide an answer to your own question before others have had a chance to answer, but in case someone is looking for a udf based version, here is one possible answer.
from pyspark.sql.functions import udf, collect_list
from pyspark.sql.types import MapType, StringType

# merge a list of maps into one dict (later maps win on duplicate keys)
combineMap = udf(lambda maps: {key: f[key] for f in maps for key in f},
                 MapType(StringType(), StringType()))

df.groupBy('id')\
  .agg(collect_list('map').alias('maps'))\
  .select('id', combineMap('maps').alias('combined_map')).show()
The suggested solution with concat_map doesn't work, and this solution doesn't use UDFs.
For Spark >= 2.4:
import pyspark.sql.functions as f

(df
 .groupBy(f.col('id'))
 .agg(f.collect_list(f.col('map')).alias('maps'))
 .select('id',
         f.expr('aggregate(slice(maps, 2, size(maps)), maps[0], (acc, element) -> map_concat(acc, element))').alias('mapsConcatenated'))
)
collect_list ignores null values, so there is no need to worry about them when using map_concat inside the aggregate function.
It's map_concat in PySpark version >= 2.4.
I have two dataframes. One looks like this:
+------------------------------------------------------------+
|docs |
+------------------------------------------------------------+
|{doc1.txt -> 1, doc2.txt -> 3, doc3.txt -> 5, doc4.txt -> 1}|
|{doc1.txt -> 2, doc2.txt -> 2, doc3.txt -> 4} |
|{doc1.txt -> 3, doc2.txt -> 2, doc4.txt -> 2} |
+------------------------------------------------------------+
and the other like this:
+--------------+----------+
| Document|doc_length|
+--------------+----------+
| doc1.txt| 0|
| doc2.txt| 0|
| doc3.txt| 0|
| doc3.txt| 0|
| doc4.txt| 0|
+--------------+----------+
For the sake of example the documents are in order, but in my use case I cannot expect them to be.
Now I want to iterate through the first dataframe and update the values in the second as I go. I have a loop like this:
df1.foreach(r =>
  for (keyValPair <- r(0).asInstanceOf[Map[String, Long]]) {
    // Something needs to happen here
  }
)
In every iteration I want to take the key of the key-value pair to select a specific row in the second dataframe and then add the value to its doc_length, so my final output of df2.show() would look like this:
EDIT: Later down the line I will probably want to do other, more complicated mathematical operations here than just summing all the values up; that's why I was trying to use the structure described above.
+--------------+----------+
| Document|doc_length|
+--------------+----------+
| doc1.txt| 6|
| doc2.txt| 7|
| doc3.txt| 9|
| doc4.txt| 0|
+--------------+----------+
This doesn't look like it should be too hard, but I don't know how I can access specific rows of a dataframe by using a specific column as a key, and change them.
You can explode the map column and group by key to sum up the lengths:
import org.apache.spark.sql.functions.{col, explode, sum}

val df2 = df.select(explode(col("docs")))
  .groupBy(col("key").as("document"))
  .agg(sum("value").as("doc_length"))
df2.show
+--------+----------+
|document|doc_length|
+--------+----------+
|doc1.txt| 6|
|doc4.txt| 3|
|doc3.txt| 9|
|doc2.txt| 7|
+--------+----------+
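Since the question is about updating the second dataframe (the one with Document and doc_length), a possible follow-up is to join these aggregated lengths back in. This is only a sketch under assumptions: the second dataframe is assumed to be called docLengths, and the summed lengths are added to the existing doc_length values, so documents that never appear in any map keep their original value.
import org.apache.spark.sql.functions.{coalesce, lit}

val updated = docLengths
  .join(df2, docLengths("Document") === df2("document"), "left")
  .select(
    docLengths("Document"),
    // add the aggregated length; documents missing from df2 contribute 0
    (docLengths("doc_length") + coalesce(df2("doc_length"), lit(0))).as("doc_length")
  )
updated.show()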
I have 2 dataframes as below,
val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
val y = Seq((5)).toDF("Y")
I want to join the two dataframes by looking up the value from y and searching for it in the Seq/Array in x.select("X"); if it exists, then join the complete row of x with y.
How can I achieve this in Spark?
Cheers!
With Spark 2.4.3 you can use the built-in array_contains function in the join expression:
scala> import org.apache.spark.sql.functions.expr
scala> val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
scala> val y = Seq((5)).toDF("Y")
scala> x.join(y, expr("array_contains(X, Y)"), "left").show
+------+----+----+
| X|NAME| Y|
+------+----+----+
|[4, 5]| XXX| 5|
| [7]| XYX|null|
+------+----+----+
Please confirm that's what you want to achieve.
You can use a UDF for the join; this works for all Spark versions:
import org.apache.spark.sql.functions.udf

// true when the array column contains the element
val array_contains = udf((arr: Seq[Int], element: Int) => arr.contains(element))

x
  .join(y, array_contains($"X", $"Y"), "left")
  .show()
Another approach you can use is to explode your array into rows with a new temporary column. If you run the following code:
x.withColumn("temp", explode('X)).show()
it would show:
+------+----+----+
| X|NAME|temp|
+------+----+----+
|[4, 5]| XXX| 4|
|[4, 5]| XXX| 5|
| [7]| XYX| 7|
+------+----+----+
As you can see, you can now just join using the temp and Y columns (and then drop temp):
x.withColumn("temp", explode('X))
.join(y, 'temp === 'Y)
.drop('temp)
This may create duplicate rows if X contains duplicates. In that case, you'd have to additionally call distinct:
x.withColumn("temp", explode('X))
.distinct()
.join(y, 'temp === 'Y, "left")
.drop('temp)
Since this approach uses Spark native methods, it will be a little bit faster than the one using a UDF, but it is arguably less elegant.
I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all entries that originally had duplicate entries.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove duplicate id rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

df
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1)
  .drop($"cnt")
  .show()
This can be done by grouping by the column (or columns) to look for duplicates in, and then aggregating and filtering the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
import org.apache.spark.sql.functions.{count, first}

val df2 = df.groupBy("id")
  .agg(first($"num").as("num"), count($"id").as("count"))
  .filter($"count" === 1)
  .select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done by using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}
You might want to leverage the spark-daria library to keep this logic out of your codebase.
I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C whose value came from either column A_1 or B_2. The name of the source column A_1 comes from concatenating the value of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting a udf because your requirement is to process the dataframe row by row, in contrast to the inbuilt functions, which operate column by column.
But before that you need the array of column names:
val columns = df.columns
Then write a udf function as
import scala.collection.mutable
import org.apache.spark.sql.functions._

def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))
where
A is the first column value
B is the second column value
array is the array of all the column values
Now just call the udf function using the withColumn API:
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define a map which translates the column name to the column value:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
  val a = row.getAs[String]("A")
  val b = row.getAs[String]("B")
  val key = s"${a}_${b}"
  Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If the data comes from CSV you might consider skipping the default CSV reader and using custom logic to push the column selection directly into the parsing process. With pseudocode:
spark.read.text(...).map { line => {
  val a = ??? // parse A
  val b = ??? // parse B
  val c = ??? // find c, based on a and b
  (a, b, c)
}}
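As a rough illustration of that idea (not the author's code): assuming a headerless CSV laid out exactly as A,B,A_1,B_2 and a hypothetical path data.csv, the relevant value column can be picked while parsing, so only three columns ever materialise:
import spark.implicits._

// assumed positions of the value columns in the file
val valueIndex = Map("A_1" -> 2, "B_2" -> 3)

val parsed = spark.read.textFile("data.csv").map { line =>
  val fields = line.split(",")
  val a = fields(0)
  val b = fields(1)
  // look up only the column named by the A and B values
  val c = fields(valueIndex(s"${a}_${b}")).toDouble
  (a, b, c)
}.toDF("A", "B", "C")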
Let's say I have a table like:
id,date,value
1,2017-02-12,3
2,2017-03-18,2
1,2017-03-20,5
1,2017-04-01,1
3,2017-04-01,3
2,2017-04-10,2
I already have this as a dataframe (it comes from a Hive table)
Now, I want an output that looks like (logically):
id, count($"date">"2017-03"), sum($"value" where $"date">"2017-03"), count($"date">"2017-02"), sum($"value" where $"date">"2017-02")
I've tried to express this in a single agg(), but I just can't figure out how to do the inner conditionals. I know how to filter ahead of the aggregation, but that doesn't do what I need with the two different sub-ranges.
// doesn't do the right thing
myDF.where($"date">"2017-03")
.groupBy("id")
.agg(sum("value") as "value_03", count("value") as "count_03")
.where($"date">"2017-04")
.agg(sum("value") as "value_04", count("value") as "value_04")
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses. How do I do something similar with DataFrames in Spark with Scala?
The closest I can think of is calculating membership for each tuple in each of the windows before the groupBy(), and summing over that membership times value (and a straight sum for the count). It seems like there should be a better way to express this with conditionals inside the agg(), but I can't find it.
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses.
You can do exactly the same thing here:
import org.apache.spark.sql.functions.{sum, when}

myDF
  .groupBy($"id")
  .agg(
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    sum(when($"date" > "2017-04", $"value")).alias("value4")
  )
+---+------+------+
| id|value3|value4|
+---+------+------+
| 1| 6| 1|
| 3| 3| 3|
| 2| 4| 2|
+---+------+------+
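The question also asks for the conditional counts. Those follow the same pattern, because count skips the nulls that when produces whenever the condition is false; here is a sketch using the same assumed column names:
import org.apache.spark.sql.functions.{count, sum, when}

myDF
  .groupBy($"id")
  .agg(
    // count the rows matching the condition (non-matching rows become null and are ignored)
    count(when($"date" > "2017-03", 1)).alias("count3"),
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    count(when($"date" > "2017-04", 1)).alias("count4"),
    sum(when($"date" > "2017-04", $"value")).alias("value4")
  )
  .show()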