I would like to do a "filldown" type operation on a dataframe in order to remove nulls and make sure the last row is a kind of summary row, containing the last known values for each column based on the timestamp, grouped by the itemId. As I'm using Azure Synapse Notebooks, the language can be Scala, PySpark, SparkSQL or even C#. However, the real solution has up to millions of rows and hundreds of columns, so I need a dynamic solution that can take advantage of Spark. We can provision a big cluster, but how do we make sure we take good advantage of it?
Sample data:
// Assign sample data to dataframe
val df = Seq(
( 1, "10/01/2021", 1, "abc", null ),
( 2, "11/01/2021", 1, null, "bbb" ),
( 3, "12/01/2021", 1, "ccc", null ),
( 4, "13/01/2021", 1, null, "ddd" ),
( 5, "10/01/2021", 2, "eee", "fff" ),
( 6, "11/01/2021", 2, null, null ),
( 7, "12/01/2021", 2, null, null )
).
toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")
df.show
Expected results with rows 4 and 7 as summary rows:
+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
| 1|10/01/2021| 1| abc| null|
| 2|11/01/2021| 1| abc| bbb|
| 3|12/01/2021| 1| ccc| bbb|
| 4|13/01/2021| 1| ccc| ddd|
| 5|10/01/2021| 2| eee| fff|
| 6|11/01/2021| 2| eee| fff|
| 7|12/01/2021| 2| eee| fff|
+-------+----------+------+-------+-------+
I have reviewed this option but had trouble adapting it for my use case.
Spark / Scala: forward fill with last observation
I have a kind of working SparkSQL solution but it will be very verbose for the high volume of columns, hoping for something easier to maintain:
%%sql
WITH cte (
SELECT
eventId,
itemId,
ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
attrib1,
attrib2
FROM df
)
SELECT
eventId,
itemId,
CASE rn WHEN 1 THEN attrib1
ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ) )
END AS attrib1_xlast,
CASE rn WHEN 1 THEN attrib2
ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ) )
END AS attrib2_xlast
FROM cte
ORDER BY eventId
For many columns you could create an expression as below
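import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}
val window = Window.partitionBy($"itemId").orderBy($"timestamp")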
// Instead of selecting columns you could create a list of columns
val expr = df.columns
.map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(expr: _*).show(false)
Update:
val mainColumns = df.columns.filterNot(_.startsWith("attrib"))
val aggColumns = df.columns.diff(mainColumns).map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(( mainColumns.map(col) ++ aggColumns): _*).show(false)
Result:
+-------+----------+------+-------+-------+
|eventId|timestamp |itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|1 |10/01/2021|1 |abc |null |
|2 |11/01/2021|1 |abc |bbb |
|3 |12/01/2021|1 |ccc |bbb |
|4 |13/01/2021|1 |ccc |ddd |
|5 |10/01/2021|2 |eee |fff |
|6 |11/01/2021|2 |eee |fff |
|7 |12/01/2021|2 |eee |fff |
+-------+----------+------+-------+-------+
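One thing to watch with the answer above: the timestamp column is a string, so orderBy compares it as text. A minimal sketch of the same idea, assuming the strings are in dd/MM/yyyy format and that eventId, timestamp and itemId are the columns to leave untouched (both are assumptions based on the sample data). It parses the date for ordering and spells out the window frame explicitly. The names keyColumns and filled are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, to_date}

// Local session for the sketch; in a notebook `spark` usually already exists.
val spark = SparkSession.builder().master("local[*]").appName("filldown-sketch").getOrCreate()
import spark.implicits._

val df = Seq[(Int, String, Int, String, String)](
  (1, "10/01/2021", 1, "abc", null),
  (2, "11/01/2021", 1, null, "bbb"),
  (3, "12/01/2021", 1, "ccc", null),
  (4, "13/01/2021", 1, null, "ddd"),
  (5, "10/01/2021", 2, "eee", "fff"),
  (6, "11/01/2021", 2, null, null),
  (7, "12/01/2021", 2, null, null)
).toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")

// Order by the parsed date (assumed format dd/MM/yyyy) and only look at
// rows from the start of the partition up to the current row.
val w = Window
  .partitionBy("itemId")
  .orderBy(to_date(col("timestamp"), "dd/MM/yyyy"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Assumption: these columns are keys and must not be filled.
val keyColumns = Set("eventId", "timestamp", "itemId")

// One select for all columns: keys pass through, everything else takes
// the last non-null value seen so far within the item.
val filled = df.select(df.columns.map { c =>
  if (keyColumns.contains(c)) col(c)
  else last(col(c), ignoreNulls = true).over(w).as(c)
}: _*)

filled.orderBy("eventId").show(false)
```

Because last(..., ignoreNulls = true) already returns the current value when it is not null, the coalesce wrapper is not needed in this variant.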
Related
I have a data frame that looks something along the lines of:
+-----+-----+------+-----+
|col1 |col2 |col3 |col4 |
+-----+-----+------+-----+
|1.1 |2.3 |10.0 |1 |
|2.2 |1.5 |5.0 |1 |
|3.3 |1.3 |1.5 |1 |
|4.4 |0.5 |7.0 |1 |
|5.5 |1.2 |8.1 |2 |
|6.6 |2.3 |8.2 |2 |
|7.7 |4.5 |10.3 |2 |
+-----+-----+------+-----+
I would like to subtract each row from the row above but only if they have the same entry in col4, so 2-1, 3-2 but not 5-4. Also col4 should not be changed, so the result would be
+-----+-----+------+------+
|col1 |col2 |col3 |col4 |
+-----+-----+------+------+
|1.1 |-0.8 |-5.0 |1 |
|1.1 |-0.2 |-3.5 |1 |
|1.1 |-0.8 |5.5 |1 |
|1.1 |1.1 |0.1 |2 |
|1.1 |2.2 |2.1 |2 |
+-----+-----+------+------+
This sounds like it'd be simple, but I can't seem to figure it out
You could accomplish this using spark-sql i.e. creating a temporary view with your dataframe and applying the following sql. It uses window functions LAG to subtract the previous row value ordered by col1 and partitioned by col4. The first row value in each group partitioned by col4 is identified using row_number and filtered.
df.createOrReplaceTempView('my_temp_view')
results = sparkSession.sql('<insert sql below here>')
SELECT
col1,
col2,
col3,
col4
FROM (
SELECT
(col1 - (LAG(col1,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col1,
(col2 - (LAG(col2,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col2,
(col3 - (LAG(col3,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col3,
col4,
ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) rn
FROM
my_temp_view
) t
WHERE rn <> 1
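If you would rather not write one LAG line per column, the same query can be sketched with the DataFrame API. This assumes, as the SQL does, that col1 defines the row order inside each col4 group and that every column other than col4 should be differenced. The names valueCols and diffs are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, row_number}

// Local session for the sketch; in a notebook `spark` usually already exists.
val spark = SparkSession.builder().master("local[*]").appName("lag-diff-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1.1, 2.3, 10.0, 1),
  (2.2, 1.5, 5.0, 1),
  (3.3, 1.3, 1.5, 1),
  (4.4, 0.5, 7.0, 1),
  (5.5, 1.2, 8.1, 2),
  (6.6, 2.3, 8.2, 2),
  (7.7, 4.5, 10.3, 2)
).toDF("col1", "col2", "col3", "col4")

// Same partitioning and ordering as the SQL version.
val w = Window.partitionBy("col4").orderBy("col1")

// Every column except the grouping column gets "current minus previous".
val valueCols = df.columns.filterNot(_ == "col4")
val diffs = valueCols.map(c => (col(c) - lag(col(c), 1).over(w)).as(c))

val selected = diffs :+ col("col4") :+ row_number().over(w).as("rn")

// Drop the first row of each group, which has no previous row.
val result = df
  .select(selected: _*)
  .where(col("rn") > 1)
  .orderBy("col4", "rn")
  .drop("rn")

result.show(false)
```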
Here is just the idea: a self-join on a row index, built by going through the RDD with zipWithIndex and back to a DataFrame. It has some overhead, and you can tailor it; z plays the role of your col4.
I am not sure how well the Catalyst optimizer handles this at scale. I looked at .explain(true) and was not entirely convinced, but I sometimes find that output hard to interpret. Ordering of the data is guaranteed.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType, ArrayType, LongType}
val df = sc.parallelize(Seq( (1.0, 2.0, 1), (0.0, -1.0, 1), (3.0, 4.0, 1), (6.0, -2.3, 4))).toDF("x", "y", "z")
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
val rddWithId = df.rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(rddWithId.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(index))}, newSchema)
dfZippedWithId.show(false)
dfZippedWithId.printSchema()
val res = dfZippedWithId.as("dfZ1").join(dfZippedWithId.as("dfZ2"), $"dfZ1.z" === $"dfZ2.z" &&
$"dfZ1.rowid" === $"dfZ2.rowid" -1
,"inner")
.withColumn("newx", $"dfZ2.x" - $"dfZ1.x")//.explain(true)
res.show(false)
returns the input:
+---+----+---+-----+
|x |y |z |rowid|
+---+----+---+-----+
|1.0|2.0 |1 |0 |
|0.0|-1.0|1 |1 |
|3.0|4.0 |1 |2 |
|6.0|-2.3|4 |3 |
+---+----+---+-----+
and the result which you can tailor by selecting and adding extra calculations:
+---+----+---+-----+---+----+---+-----+----+
|x |y |z |rowid|x |y |z |rowid|newx|
+---+----+---+-----+---+----+---+-----+----+
|1.0|2.0 |1 |0 |0.0|-1.0|1 |1 |-1.0|
|0.0|-1.0|1 |1 |3.0|4.0 |1 |2 |3.0 |
+---+----+---+-----+---+----+---+-----+----+
I have the following dataframe:
val df1 = Seq(("Roger","Rabbit", "ABC123"), ("Roger","Rabit", "ABC123"),("Roger","Rabbit", "ABC123"), ("Trevor","Philips","XYZ987"), ("Trevor","Philips","XYZ987")).toDF("first_name", "last_name", "record")
+----------+---------+------+
|first_name|last_name|record|
+----------+---------+------+
|Roger |Rabbit |ABC123|
|Roger |Rabit |ABC123|
|Roger |Rabbit |ABC123|
|Trevor |Philips |XYZ987|
|Trevor |Philips |XYZ987|
+----------+---------+------+
I want to group records in this dataframe by the column record. And then I want to look for anomalies in the fields first_name and last_name, which should remain constant for all records with same record value.
The best approach I found so far is using approx_count_distinct:
val wind_person = Window.partitionBy("record")
df1.withColumn("unique_fields",concat($"first_name",$"last_name"))
.withColumn("anomaly",approx_count_distinct($"unique_fields") over wind_person)
.show(false)
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabit |ABC123|RogerRabit |2 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
+----------+---------+------+-------------+-------+
An anomaly is detected when the anomaly column is greater than 1.
The problem is that approx_count_distinct gives just an approximation, and I am not sure how confident we can be that it will always return an accurate count.
Some extra information:
The Dataframe may contain over 500M records
The Dataframe is previously repartitioned based on record column
For each different value of record, no more than 15 rows will be there
Is it safe to use approx_count_distinct in this scenario and expect 100% accuracy, or are there better window functions in Spark to achieve this?
You can use collect_set of unique_fields over the window wind_person and take its size, which is equivalent to the distinct count of that field:
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
.withColumn("anomaly", size(collect_set($"unique_fields").over(wind_person)))
.show
//+----------+---------+------+-------------+-------+
//|first_name|last_name|record|unique_fields|anomaly|
//+----------+---------+------+-------------+-------+
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Roger |Rabit |ABC123|RogerRabit |2 |
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//+----------+---------+------+-------------+-------+
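One caveat with concatenating the names: two different pairs can produce the same string ("ab" + "c" and "a" + "bc"). A minimal sketch that collects a struct of the two columns instead, so the pair is compared field by field. The extra R1 rows are made up to show the collision, and the names byRecord and flagged are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size, struct}

// Local session for the sketch; in a notebook `spark` usually already exists.
val spark = SparkSession.builder().master("local[*]").appName("anomaly-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("Roger", "Rabbit", "ABC123"),
  ("Roger", "Rabit", "ABC123"),
  ("Roger", "Rabbit", "ABC123"),
  ("Trevor", "Philips", "XYZ987"),
  ("Trevor", "Philips", "XYZ987"),
  // Hypothetical rows: concat would give "abc" for both of them.
  ("ab", "c", "R1"),
  ("a", "bc", "R1")
).toDF("first_name", "last_name", "record")

val byRecord = Window.partitionBy("record")

// Exact number of distinct (first_name, last_name) pairs per record.
val flagged = df1.withColumn(
  "anomaly",
  size(collect_set(struct(col("first_name"), col("last_name"))).over(byRecord))
)

flagged.show(false)
```

With at most 15 rows per record, the collected set stays tiny, so the exact count should be cheap here.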
You can get the exact countDistinct over a Window using some dense_rank operations:
val df2 = df1.withColumn(
"unique_fields",
concat($"first_name",$"last_name")
).withColumn(
"anomaly",
dense_rank().over(Window.partitionBy("record").orderBy("unique_fields")) +
dense_rank().over(Window.partitionBy("record").orderBy(desc("unique_fields")))
- 1
)
df2.show
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
| Roger| Rabit|ABC123| RogerRabit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
+----------+---------+------+-------------+-------+
I am using Spark 2.3 in my Scala application. I have a dataframe created from Spark SQL, named sqlDF in the sample code I shared. I have a string list with the items below
List[] stringList items
-9,-8,-7,-6
I want to replace every value that matches an item in this list with 0, in all columns of the dataframe.
Initial dataframe
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |-6 |1
-7 |-8 |-7
It must return to
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |0 |1
0 |0 |0
For this I am iterating the query below over all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I get the error below. If I iterate over only one column it works, but if I run the code above over all 500 columns it fails:
Exception in thread "streaming-job-executor-0" java.lang.StackOverflowError
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
    at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What is the thing that I am missing?
Here is a different approach that applies a left anti join between each column and X, where X is your list of items converted into a dataframe. The left anti join returns all the values not present in X. The per-column results are then joined back together with an outer join on the id assigned with monotonically_increasing_id. (The outer join can be replaced with a left join for better performance, but that would exclude records whose values are all replaced, i.e. id == 3.)
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
val df = Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7))
.toDF("c1", "c2", "c3")
.withColumn("id", monotonically_increasing_id())
val exdf = Seq(-9, -8, -7, -6).toDF("x")
df.columns.map{ c =>
df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
}
.reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
.na.fill(0)
.show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works well for your case, as below:
val df = spark.sparkContext.parallelize(Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7)
)).toDF("a", "b", "c")
val list = Seq(-7, -8, -9)
val resultDF = df.columns.foldLeft(df) { (acc, name) => {
acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}
}
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |-6 |1 |
|0 |0 |0 |
+---+---+---+
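As far as I understand it, every withColumn call (whether in a loop or inside foldLeft) wraps the previous plan in another projection, so with 500 columns the plan gets very deep, which seems a likely cause of the StackOverflowError in the question. A minimal sketch that builds all the column expressions first and applies them in a single select, using the question's data and list. The names toZero and exprs are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for the sketch; in a notebook `spark` usually already exists.
val spark = SparkSession.builder().master("local[*]").appName("replace-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 1, 1),
  (2, -5, 1),
  (6, -6, 1),
  (-7, -8, -7)
).toDF("column1", "column2", "column3")

// Values that should become 0 (the list from the question).
val toZero = Seq(-9, -8, -7, -6)

// One expression per column, all applied in a single projection.
val exprs = df.columns.map { c =>
  when(col(c).isin(toZero: _*), lit(0)).otherwise(col(c)).as(c)
}

val result = df.select(exprs: _*)
result.show(false)
```

The plan then contains one projection regardless of the number of columns. If the list is really a list of strings, the column values would need to be compared as strings as well.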
I would suggest broadcasting the list of strings:
val stringList=sc.broadcast(<Your List of List[String]>)
After that use this :
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure the values in currColumnName are also strings. The comparison should be String to String.
I want to compare two columns in a Spark DataFrame: if the value of a column (attr_value) is found in values of another (attr_valuelist) I want only that value to be kept. Otherwise, the column value should be null.
For example, given the following input
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
I would expect the following output
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes
2 1 test1 No No
3 2 test2 value1 Value1
I assume, given your sample input, that the column with the search item contains a string while the search target is a sequence of strings. Also, I assume you're interested in case-insensitive search.
This is going to be the input (I added a column that would have yielded a null to test the behavior of the UDF I wrote):
+---+---+--------+----------+----------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+----------------------+
|1 |2 |test |Yes |[Yes, No] |
|2 |1 |test1 |No |[Yes, No] |
|3 |2 |test2 |value1 |[val1, Value1, value2]|
|3 |2 |test2 |value1 |[val1, value2] |
+---+---+--------+----------+----------------------+
You can solve your problem with a very simple UDF.
val find = udf {
(item: String, collection: Seq[String]) =>
collection.find(_.toLowerCase == item.toLowerCase)
}
val df = spark.createDataFrame(Seq(
(1, 2, "test", "Yes", Seq("Yes", "No")),
(2, 1, "test1", "No", Seq("Yes", "No")),
(3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
(3, 2, "test2", "value1", Seq("val1", "value2"))
)).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")
df.select(
$"id1", $"id2", $"attrname", $"attr_value",
find($"attr_value", $"attr_valuelist") as "attr_valuelist")
Showing the result of the last command yields the following output:
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
| 1| 2| test| Yes| Yes|
| 2| 1| test1| No| No|
| 3| 2| test2| value1| Value1|
| 3| 2| test2| value1| null|
+---+---+--------+----------+--------------+
You can execute this code in any spark-shell. If you are using this from a job you are submitting to a cluster, remember to import spark.implicits._.
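Note that the UDF above calls item.toLowerCase, so a null attr_value would throw a NullPointerException. A minimal null-tolerant sketch of the same lookup, returning an Option so that missing inputs or no match become null in the result. The name findIgnoreCase and the extra row with a null attr_value are assumptions added for illustration, and equalsIgnoreCase stands in for the toLowerCase comparison.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("find-sketch").getOrCreate()
import spark.implicits._

// Returns the first candidate equal to `item` ignoring case, or None
// (null in the DataFrame) if either input is null or nothing matches.
val findIgnoreCase = udf { (item: String, candidates: Seq[String]) =>
  for {
    i   <- Option(item)
    cs  <- Option(candidates)
    hit <- cs.find(c => i.equalsIgnoreCase(c))
  } yield hit
}

val df = Seq[(Int, Int, String, String, Seq[String])](
  (1, 2, "test", "Yes", Seq("Yes", "No")),
  (2, 1, "test1", "No", Seq("Yes", "No")),
  (3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
  (4, 2, "test3", null, Seq("Yes", "No")) // hypothetical row with a null value
).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")

val result = df.select(
  col("id1"), col("id2"), col("attrname"), col("attr_value"),
  findIgnoreCase(col("attr_value"), col("attr_valuelist")).as("attr_valuelist")
)

result.show(false)
```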
Can you try this code? I think a SQL query with CASE WHEN will work.
import org.apache.spark.sql.Row
val emptyRDD = sc.emptyRDD[Row]
var emptyDataframe = sqlContext.createDataFrame(emptyRDD, your_dataframe.schema)
your_dataframe.createOrReplaceTempView("tbl")
// Triple quotes let the SQL string span several lines
emptyDataframe = sqlContext.sql("""select id1, id2, attrname, attr_value, case when
attr_valuelist like concat('%', attr_value, '%') then attr_value else
null end as attr_valuelist from tbl""")
emptyDataframe.show
I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20 21 23 22
26 23 29 24
What I need to do is group by gender and, instead of using an aggregate like sum, count, or avg, create List[age] and List[value]. This should be possible because I am using a Dataset, which allows functional operations.
If the number of rows for males and females are not the same, the columns should be filled with nulls.
One approach I tried was to make a new dataframe using the columns of other dataframes, like so:
df
.select(male.select("sex").where('sex === 1).col("sex"),
female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex,
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the group by:
df.withColumn("sex2", df.col("sex"))
.groupBy("sex")
.pivot("sex2")
.agg(
sum('value).as("mean"),
stddev('value).as("std. dev") )
.show()
|sex| 1.0_mean| 1.0_std. dev| 2.0_mean| 2.0_std. dev|
|1.0|0.4926065526| 1.8110632697| | |
|2.0| | |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so I reckon it should be possible in Spark SQL too...
drop table mytable;
CREATE TABLE mytable
( gender number(10) NOT NULL,
age number(10) NOT NULL,
value number(10) );
insert into mytable values (1,20,21);
insert into mytable values(2,23,22);
insert into mytable values (1,26,23);
insert into mytable values (2,29,24);
insert into mytable values (1,30,25);
select * from mytable;
SELECT A.VALUE AS MALE,
B.VALUE AS FEMALE
FROM
(select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN
(select value, rownum RN from mytable where gender = 2) B
ON A.RN = B.RN
The following should give you the result.
val df = Seq(
(1, 20, 21),
(2, 23, 22),
(1, 26, 23),
(2, 29, 24)
).toDF("Gender", "Age", "Value")
scala> df.show
+------+---+-----+
|Gender|Age|Value|
+------+---+-----+
| 1| 20| 21|
| 2| 23| 22|
| 1| 26| 23|
| 2| 29| 24|
+------+---+-----+
// Gender 1 = Male
// Gender 2 = Female
import org.apache.spark.sql.expressions.Window
val byGender = Window.partitionBy("gender").orderBy("gender")
val males = df
.filter("gender = 1")
.select($"age" as "male_age",
$"value" as "male_value",
row_number() over byGender as "RN")
scala> males.show
+--------+----------+---+
|male_age|male_value| RN|
+--------+----------+---+
| 20| 21| 1|
| 26| 23| 2|
+--------+----------+---+
val females = df
.filter("gender = 2")
.select($"age" as "female_age",
$"value" as "female_value",
row_number() over byGender as "RN")
scala> females.show
+----------+------------+---+
|female_age|female_value| RN|
+----------+------------+---+
| 23| 22| 1|
| 29| 24| 2|
+----------+------------+---+
scala> males.join(females, Seq("RN"), "outer").show
+---+--------+----------+----------+------------+
| RN|male_age|male_value|female_age|female_value|
+---+--------+----------+----------+------------+
| 1| 20| 21| 23| 22|
| 2| 26| 23| 29| 24|
+---+--------+----------+----------+------------+
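Ordering the window by gender, which is also the partition key, leaves the row numbers inside each group up to Spark, so the pairing of male and female rows is not guaranteed to be stable. A minimal sketch that numbers rows by age instead (an assumption about the intended order) and uses five rows, as in the Oracle example, to show the null padding when the groups differ in size.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("side-by-side-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 20, 21),
  (2, 23, 22),
  (1, 26, 23),
  (2, 29, 24),
  (1, 30, 25)
).toDF("gender", "age", "value")

// Number the rows inside each gender by age (assumed ordering).
val byAge = Window.partitionBy("gender").orderBy("age")
val numbered = df.withColumn("rn", row_number().over(byAge))

// Gender 1 = Male, Gender 2 = Female
val males = numbered.where(col("gender") === 1)
  .select(col("rn"), col("age").as("male_age"), col("value").as("male_value"))
val females = numbered.where(col("gender") === 2)
  .select(col("rn"), col("age").as("female_age"), col("value").as("female_value"))

// Full outer join keeps unmatched rows and fills the other side with nulls.
val wide = males.join(females, Seq("rn"), "full_outer").orderBy("rn")
wide.show(false)
```

The third male row has no female counterpart, so its female_age and female_value come out as null.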
Given a DataFrame called df with columns gender, age, and value, you can do this:
df.groupBy($"gender")
.agg(collect_list($"age"), collect_list($"value")).rdd.map { row =>
val ages: Seq[Int] = row.getSeq(1)
val values: Seq[Int] = row.getSeq(2)
(row.getInt(0), ages.head, ages.last, values.head, values.last)
}.toDF("gender", "male_age", "female_age", "male_value", "female_value")
This uses the collect_list aggregating function in the very helpful Spark functions library to aggregate the values you want. (There is also a collect_set.)
After that, I don't know of any higher-level DataFrame functions to expand those columnar arrays into individual columns of their own, so I fall back to the lower-level RDD API our ancestors used. I simply expand everything into a Tuple and then turn it back into a DataFrame. The commenters above mention corner cases I have not addressed; using functions like headOption and lastOption might be useful there. But this should be enough to get you moving.