split content of column into lines in pyspark - pyspark

I have a dataframe df:
+------+----------+--------------------+
|SiteID| LastRecID| Col_to_split|
+------+----------+--------------------+
| 2|1056962584|[214, 207, 206, 205]|
| 2|1056967423| [213, 208]|
| 2|1056870114| [213, 202, 199]|
| 2|1056876861|[203, 213, 212, 1...|
I want to split the column into lines like this:
+----------+-------------+-------------+
| RecID| index| Value|
+----------+-------------+-------------+
|1056962584| 0| 214|
|1056962584| 1| 207|
|1056962584| 2| 206|
|1056962584| 3| 205|
|1056967423| 0| 213|
|1056967423| 1| 208|
|1056870114| 0| 213|
|1056870114| 1| 202|
|1056870114| 2| 199|
|1056876861| 0| 203|
|1056876861| 1| 213|
|1056876861| 2| 212|
|1056876861| 3| 1..|
|1056876861| etc...| etc...|
Value contains the value from the list.
Index contains the index of the value in the list.
How can I do that using PySpark ?

As of Spark 2.1.0, you can use posexplode which unnest array column and output the index for each element as well, (used data from #Herve):
import pyspark.sql.functions as F
df.select(
F.col("LastRecID").alias("RecID"),
F.posexplode(F.col("coltosplit")).alias("index", "value")
).show()
+-----+-----+-----+
|RecID|index|value|
+-----+-----+-----+
|10526| 0| 214|
|10526| 1| 207|
|10526| 2| 206|
|10526| 3| 205|
|10896| 0| 213|
|10896| 1| 208|
+-----+-----+-----+

I quickly tried with Spark 2.0
You can change the query a little bit if you want to order differently.
d = [{'SiteID': '2', 'LastRecId': 10526, 'coltosplit': [214,207,206,205]}, {'SiteID': '2', 'LastRecId': 10896, 'coltosplit': [213,208]}]
df = spark.createDataFrame(d)
+---------+------+--------------------+
|LastRecId|SiteID| coltosplit|
+---------+------+--------------------+
| 10526| 2|[214, 207, 206, 205]|
| 10896| 2| [213, 208]|
+---------+------+--------------------+
query = """
select LastRecId as RecID,
(row_number() over (partition by LastRecId order by 1)) - 1 as index,
t as Value
from test
LATERAL VIEW explode(coltosplit) test AS t
"""
df.createTempView("test")
spark.sql(query).show()
+-----+-----+-----+
|RecID|index|Value|
+-----+-----+-----+
|10896| 0| 213|
|10896| 1| 208|
|10526| 0| 214|
|10526| 1| 207|
|10526| 2| 206|
|10526| 3| 205|
+-----+-----+-----+
So basically I just explode the list into a new column. And apply row number on this column.
Hope this helps

Related

How to do a groupBy by a given column but still keep all the rows of the original DataFrame?

I want to do a groupBy and aggregate by a given column in PySpark but I still want to keep all the rows from the original DataFrame.
For example lets say we have the following DataFrame and we want to do a max on the "value" column then we would get the result below.
Original DataFrame
+--+-----+
|id|value|
+--+-----+
| 1| 1|
| 1| 2|
| 2| 3|
| 2| 4|
+--+-----+
Result
+--+-----+---+
|id|value|max|
+--+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+--+-----+---+
You can do it simply by joining aggregated dataframe with original dataframe
aggregated_df = (
df
.groupby('id')
.agg(F.max('value').alias('max'))
)
max_value_df = (
df
.join(aggregated_df, 'id')
)
Use window function
df.withColumn('max', max('value').over(Window.partitionBy('id'))).show()
+---+-----+---+
| id|value|max|
+---+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+---+-----+---+

Spark Dataframe: Group and rank rows on a certain column value

I am trying to rank a column when the "ID" column numbering starts from 1 to max and then resets from 1.
So, the first three rows have a continuous numbering on "ID"; hence these should be grouped with group rank =1. Rows four and five are in another group, group rank = 2.
The rows are sorted by "rownum" column. I am aware of the row_number window function but I don't think I can apply for this use case as there is no constant window. I can only think of looping through each row in the dataframe but not sure how I can update a column when number resets to 1.
val df = Seq(
(1, 1 ),
(2, 2 ),
(3, 3 ),
(4, 1),
(5, 2),
(6, 1),
(7, 1),
(8, 2)
).toDF("rownum", "ID")
df.show()
Expected result is below:
You can do it with 2 window-functions, the first one to flag the state, the second one to calculate a running sum:
df
.withColumn("increase", $"ID" > lag($"ID",1).over(Window.orderBy($"rownum")))
.withColumn("group_rank_of_ID",sum(when($"increase",lit(0)).otherwise(lit(1))).over(Window.orderBy($"rownum")))
.drop($"increase")
.show()
gives:
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
As #Prithvi noted, we can use lead here.
The tricky part is in order to use window function such as lead, we need to at least provide the order.
Consider
val nextID = lag('ID, 1, -1) over Window.orderBy('rownum)
val isNewGroup = 'ID <= nextID cast "integer"
val group_rank_of_ID = sum(isNewGroup) over Window.orderBy('rownum)
/* you can try
df.withColumn("intermediate", nextID).show
// ^^^^^^^-- can be `isNewGroup`, or other vals
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 1| 1|
| 5| 2| 1|
| 6| 1| 2|
| 7| 1| 3|
| 8| 2| 3|
+------+---+----------------+
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID + 1).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
*/

PySpark: How to groupby with Or in columns

I want to groupby in PySpark, but the value can appear in more than a columns, so if it appear in any of the selected column it will be grouped by.
For example, if I have this table in Pyspark:
I want to sum the visits and investments for each ID, so that the result would be:
Note that the ID1 was the sum of the rows 0,1,3 which have the ID1 in one of the first three columns [ID1 Visits = 500 + 100 + 200 = 800].
The ID2 was the sum of the rows 1,2, etc
OBS 1: For the sake of simplicity my example was a simple dataframe, but in real is a much larger df with a lot of rows and a lot of variables, and other operations, not just "sum".
This can't be worked on pandas, because is too large. Should be in PySpark
OBS2: For ilustration I printed in pandas the tables, but in real it is in the PySpark
I appreciate all the help and thank you very much in advance
First of all let's create our test dataframe.
>>> import pandas as pd
>>> data = {
"ID1": [1, 2, 5, 1],
"ID2": [1, 1, 3, 3],
"ID3": [4, 3, 2, 4],
"Visits": [500, 100, 200, 200],
"Investment": [1000, 200, 400, 200]
}
>>> df = spark.createDataFrame(pd.DataFrame(data))
>>> df.show()
+---+---+---+------+----------+
|ID1|ID2|ID3|Visits|Investment|
+---+---+---+------+----------+
| 1| 1| 4| 500| 1000|
| 2| 1| 3| 100| 200|
| 5| 3| 2| 200| 400|
| 1| 3| 4| 200| 200|
+---+---+---+------+----------+
Once we have DataFrame that we can operate on we have to define a function which will return list of unique IDs from columns ID1, ID2 and ID3.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> #F.udf(returnType=ArrayType(IntegerType()))
... def ids_list(*cols):
... return list(set(cols))
Now it's time to apply our udf on a DataFrame.
>>> df = df.withColumn('ids', ids_list('ID1', 'ID2', 'ID3'))
>>> df.show()
+---+---+---+------+----------+---------+
|ID1|ID2|ID3|Visits|Investment| ids|
+---+---+---+------+----------+---------+
| 1| 1| 4| 500| 1000| [1, 4]|
| 2| 1| 3| 100| 200|[1, 2, 3]|
| 5| 3| 2| 200| 400|[2, 3, 5]|
| 1| 3| 4| 200| 200|[1, 3, 4]|
+---+---+---+------+----------+---------+
To make use of ids column we have to explode it into separate rows and drop ids column.
>>> df = df.withColumn("ID", F.explode('ids')).drop('ids')
>>> df.show()
+---+---+---+------+----------+---+
|ID1|ID2|ID3|Visits|Investment| ID|
+---+---+---+------+----------+---+
| 1| 1| 4| 500| 1000| 1|
| 1| 1| 4| 500| 1000| 4|
| 2| 1| 3| 100| 200| 1|
| 2| 1| 3| 100| 200| 2|
| 2| 1| 3| 100| 200| 3|
| 5| 3| 2| 200| 400| 2|
| 5| 3| 2| 200| 400| 3|
| 5| 3| 2| 200| 400| 5|
| 1| 3| 4| 200| 200| 1|
| 1| 3| 4| 200| 200| 3|
| 1| 3| 4| 200| 200| 4|
+---+---+---+------+----------+---+
Finally we have to group our DataFrame by ID column and calculate sums. Final result is ordered by ID.
>>> final_df = (
... df.groupBy('ID')
... .agg( F.sum('Visits'), F.sum('Investment') )
... .orderBy('ID')
... )
>>> final_df.show()
+---+-----------+---------------+
| ID|sum(Visits)|sum(Investment)|
+---+-----------+---------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+---------------+
I hope you make it useful.
You can do something like below:
Create array of all id columns- > ids column below
explode ids column
Now you will get duplicates, to avoid duplicate aggregation use distinct
Finally groupBy ids column and perform all your aggregations
Note: : If your dataset can have exact duplicate rows then add one columns with df.withColumn('uid', f.monotonically_increasing_id()) before creating array otherwise distinct will drop it.
Example for your dataset:
import pyspark.sql.functions as f
df.withColumn('ids', f.explode(f.array('id1','id2','id3'))).distinct().groupBy('ids').agg(f.sum('visits'), f.sum('investments')).orderBy('ids').show()
+---+-----------+----------------+
|ids|sum(visits)|sum(investments)|
+---+-----------+----------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+----------------+

how to create a dataframe based on the first appearing date and based on additional columns each id column

i try to create a dataframe with following condition:
I have multiple IDs, multiple columns with defaults (0 or 1) and a startdate column. I would like to get a dataframe with the appearing defaults based on the first startdate (default_date) and each id.
the orginal df looks like this:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2019-01-31|
| 02| 1| 1| 0| 2018-12-31|
| 03| 1| 1| 1| 2018-10-31|
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 0| 2018-07-31|
| 03| 1| 1| 1| 2019-05-31|
this is how i would like to have it:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 1| 2018-07-31|
i tried following code:
val w = Window.partitionBy($"id").orderBy($"date".asc)
val reult = join3.withColumn("rn", row_number.over(w)).where($"def_a" === 1 || $"def_b" === 1 ||$"def_c" === 1).filter($"rn" >= 1).drop("rn")
result.show
I would be grateful for any help
This should work for you. First assign the min date to the original df then join the new df2 with df.
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1,1,0,1,"2019-01-31"),
(2,1,1,0,"2018-12-31"),
(3,1,1,1,"2018-10-31"),
(1,1,0,1,"2018-09-30"),
(2,1,1,0,"2018-08-31"),
(3,1,1,0,"2018-07-31"),
(3,1,1,1,"2019-05-31"))
.toDF("id" ,"def_a" , "def_b", "deb_c", "date")
val w = Window.partitionBy($"id").orderBy($"date".asc)
val df2 = df.withColumn("date", $"date".cast("date"))
.withColumn("min_date", min($"date").over(w))
.select("id", "min_date")
.distinct()
df.join(df2, df("id") === df2("id") && df("date") === df2("min_date"))
.select(df("*"))
.show
And the output should be:
+---+-----+-----+-----+----------+
| id|def_a|def_b|deb_c| date|
+---+-----+-----+-----+----------+
| 1| 1| 0| 1|2018-09-30|
| 2| 1| 1| 0|2018-08-31|
| 3| 1| 1| 0|2018-07-31|
+---+-----+-----+-----+----------+
By the way I believe you had a little mistake on your expected results. It is (3, 1, 1, 0, 2018-07-31) not (3, 1, 1, 1, 2018-07-31)

Pivot scala dataframe with conditional counting

I would like to aggregate this DataFrame and count the number of observations with a value less than or equal to the "BUCKET" field for each level. For example:
val myDF = Seq(
("foo", 0),
("foo", 0),
("bar", 0),
("foo", 1),
("foo", 1),
("bar", 1),
("foo", 2),
("bar", 2),
("foo", 3),
("bar", 3)).toDF("COL1", "BUCKET")
myDF.show
+----+------+
|COL1|BUCKET|
+----+------+
| foo| 0|
| foo| 0|
| bar| 0|
| foo| 1|
| foo| 1|
| bar| 1|
| foo| 2|
| bar| 2|
| foo| 3|
| bar| 3|
+----+------+
I can count the number of observations matching each bucket value using this code:
myDF.groupBy("COL1").pivot("BUCKET").count.show
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 1| 1| 1|
| foo| 2| 2| 1| 1|
+----+---+---+---+---+
But I want to count the number of rows with a value in the "BUCKET" field which is less than or equal to the final header after pivoting, like this:
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
You can achieve this using a window function, as follows:
import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.first
myDF.
select(
$"COL1",
$"BUCKET",
count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
groupBy($"COL1").pivot("BUCKET").agg(first("ROLLING_COUNT")).
show()
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
What you are specifying here is that you want to perform a count of your observations, partitioned in windows as determined by a key (COL1 in this case). By specifying an ordering, you are also making the count rolling over the window, thus obtaining the results you want then to be pivoted in your end results.
This is the result of applying the window function:
myDF.
select(
$"COL1",
$"BUCKET",
count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
show()
+----+------+-------------+
|COL1|BUCKET|ROLLING_COUNT|
+----+------+-------------+
| bar| 0| 1|
| bar| 1| 2|
| bar| 2| 3|
| bar| 3| 4|
| foo| 0| 2|
| foo| 0| 2|
| foo| 1| 4|
| foo| 1| 4|
| foo| 2| 5|
| foo| 3| 6|
+----+------+-------------+
Finally, by grouping by COL1, pivoting over BUCKET and only getting the first result of the rolling count (anyone would be good as all of them are applied to the whole window), you finally obtain the result you were looking for.
In a way, window functions are very similar to aggregations over groupings, but are more flexible and powerful. This just scratches the surface of window functions and you can dig a little bit deeper by having a look at this introductory reading.
Here's one approach to get the rolling counts by traversing the pivoted BUCKET value columns using foldLeft to aggregate the counts. Note that a tuple of (DataFrame, Int) is used for foldLeft to transform the DataFrame as well as store the count in the previous iteration:
val pivotedDF = myDF.groupBy($"COL1").pivot("BUCKET").count
val buckets = pivotedDF.columns.filter(_ != "COL1")
buckets.drop(1).foldLeft((pivotedDF, buckets.head))( (acc, c) =>
( acc._1.withColumn(c, col(acc._2) + col(c)), c )
)._1.show
// +----+---+---+---+---+
// |COL1| 0| 1| 2| 3|
// +----+---+---+---+---+
// | bar| 1| 2| 3| 4|
// | foo| 2| 4| 5| 6|
// +----+---+---+---+---+