Pyspark Split Columns

from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat",'Combined')
agg = ''
agg = 'Cat'
tdf = (sc.parallelize
([
row(1,1,'12/10/2016',"A",'Water^World'),
row(1,2,None,'A','Sea^Born'),
row(2,1,'14/10/2016','B','Germ^Any'),
row(3,3,'!~2016/2/276','B','Fin^Land'),
row(None,1,'26/09/2016','A','South^Korea'),
row(1,1,'12/10/2016',"A",'North^America'),
row(1,2,None,'A','South^America'),
row(2,1,'14/10/2016','B','New^Zealand'),
row(None,None,'!~2016/2/276','B','South^Africa'),
row(None,1,'26/09/2016','A','Saudi^Arabia')
]).toDF())
cols = F.split(tdf['Combined'], '^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate = False )
Above is my sample code.
For some reason it is not splitting the column on the ^ character.
Any advice?

The pattern is a regular expression (see the docs for split). In a regex, ^ is an anchor that matches the beginning of the string; to match it literally, you need to escape it:
cols = F.split(tdf['Combined'], r'\^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate = False)
+----+----+------------+---+-------------+-------+-------+
|UK_1|UK_2|Date |Cat|Combined |column1|column2|
+----+----+------------+---+-------------+-------+-------+
|1 |1 |12/10/2016 |A |Water^World |Water |World |
|1 |2 |null |A |Sea^Born |Sea |Born |
|2 |1 |14/10/2016 |B |Germ^Any |Germ |Any |
|3 |3 |!~2016/2/276|B |Fin^Land |Fin |Land |
|null|1 |26/09/2016 |A |South^Korea |South |Korea |
|1 |1 |12/10/2016 |A |North^America|North |America|
|1 |2 |null |A |South^America|South |America|
|2 |1 |14/10/2016 |B |New^Zealand |New |Zealand|
|null|null|!~2016/2/276|B |South^Africa |South |Africa |
|null|1 |26/09/2016 |A |Saudi^Arabia |Saudi |Arabia |
+----+----+------------+---+-------------+-------+-------+

Try with '\^'. The same applies when you use '.' (dot) as the delimiter, since it is also a regex metacharacter.
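For illustration, a minimal sketch of the dot case (the DataFrame and column here are made up, and a SparkSession named spark is assumed):
from pyspark.sql import functions as F

df = spark.createDataFrame([("10.20.30",)], ["ip"])
# '.' matches any character in a regex, so escape it to split on a literal dot
parts = F.split(df["ip"], r"\.")
df.select(parts.getItem(0).alias("first"), parts.getItem(1).alias("second")).show()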

Related

Spark SQL max function not producing the right value

I'm trying to find the max of a column grouped by spark partition id. I'm getting the wrong value when applying the max function though. Here is the code:
val partitionCol = uuid()
val localRankCol = "test"
df = df.withColumn(partitionCol, spark_partition_id())
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs:_*)
val rankDF = df.withColumn(localRankCol, dense_rank().over(windowSpec))
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)
sortExprs is applying an ascending sort on sales.
And the result with some dummy data is (partitionCol is 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard |US |100.0|1 |0 |1 |
|Rambo |US |100.0|1 |0 |1 |
|Die Hard |AU |200.0|1 |0 |2 |
|House of Cards|EU |400.0|1 |0 |3 |
|Summer Break |US |400.0|1 |0 |3 |
|Rambo |EU |100.0|1 |1 |1 |
|Summer Break |APAC |200.0|1 |1 |2 |
|Rambo |APAC |300.0|1 |1 |3 |
|House of Cards|US |500.0|1 |1 |4 |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5 |
+---------+
"test" column has a max value of 4 but 5 is being returned.

Replace values based on a condition in Spark

I have a dataset and I want to replace the result column with the result of the row that has the least quantity within each (id, date) group.
id,date,quantity,result
1,2016-01-01,245,1
1,2016-01-01,345,3
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Here is the output: every row gets the result of the least-quantity row in its (id, date) group. Row ordering doesn't matter; any order is fine.
id,date,quantity,result
1,2016-01-01,245,2
1,2016-01-01,345,2
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Use a Window: keep result only on the row with the minimum quantity, then propagate it across the group with max.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('id', 'date')
# Keep result only where quantity is the group minimum, then spread it across the group
df.withColumn('result', f.when(f.col('quantity') == f.min('quantity').over(w), f.col('result'))) \
  .withColumn('result', f.max('result').over(w)) \
  .show(10, False)
+---+----------+--------+------+
|id |date |quantity|result|
+---+----------+--------+------+
|1 |2016-01-02|120 |5 |
|1 |2016-01-01|245 |2 |
|1 |2016-01-01|345 |2 |
|1 |2016-01-01|123 |2 |
|2 |2016-01-02|453 |1 |
|2 |2016-01-01|567 |1 |
|2 |2016-01-01|568 |1 |
+---+----------+--------+------+
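An alternative sketch that gives the same result here: with a window ordered by quantity, first('result') returns the result of the least-quantity row for every row in the group (this relies on the default window frame starting at the beginning of the partition):
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('id', 'date').orderBy('quantity')
df.withColumn('result', f.first('result').over(w)).show(10, False)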

Getting duplicate count but retaining duplicate rows in PySpark

I am trying to find the duplicate count of rows in a PySpark dataframe. I found a similar answer here,
but it only outputs a binary flag. I would like to have the actual count for each row.
To use the original post's example, if I have a dataframe like so:
+--+--+--+--+
|a |b |c |d |
+--+--+--+--+
|1 |0 |1 |2 |
|0 |2 |0 |1 |
|1 |0 |1 |2 |
|0 |4 |3 |1 |
|1 |0 |1 |2 |
+--+--+--+--+
I would like the result to be something like:
+--+--+--+--+---------+
|a |b |c |d |row_count|
+--+--+--+--+---------+
|1 |0 |1 |2 |3        |
|0 |2 |0 |1 |0        |
|1 |0 |1 |2 |3        |
|0 |4 |3 |1 |0        |
|1 |0 |1 |2 |3        |
+--+--+--+--+---------+
Is this possible?
Thank You
Assuming df is your input dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy(F.col("a"), F.col("b"), F.col("c"), F.col("d"))
df = df.select("a", "b", "c", "d", F.count(F.col("a")).over(w).alias("row_count"))
If, as per your example, you want to replace every count of 1 with 0, do:
w = Window.partitionBy(F.col("a"), F.col("b"), F.col("c"), F.col("d"))
df = (df.select("a", "b", "c", "d", F.count(F.col("a")).over(w).alias("row_count"))
        .select("a", "b", "c", "d",
                F.when(F.col("row_count") == F.lit(1), F.lit(0))
                 .otherwise(F.col("row_count")).alias("row_count")))

How to iterate grouped rows to produce multiple rows in Spark Structured Streaming?

I have the input data set like:
id operation value
1 null 1
1 discard 0
2 null 1
2 null 2
2 max 0
3 null 1
3 null 1
3 list 0
I want to group the input and produce rows according to the "operation" column:
for group 1, operation="discard", no rows are output;
for group 2, operation="max", the output is:
2 null 2
for group 3, operation="list", the output is:
3 null 1
3 null 1
So finally the output is like:
id operation value
2 null 2
3 null 1
3 null 1
Is there a solution for this?
I know there is a similar question, how-to-iterate-grouped-data-in-spark, but the differences compared to that are:
1. I want to produce more than one row for each group. Is that possible, and how?
2. I want my logic to be easily extended when more operations are added in the future. Are user-defined aggregate functions (UDAFs) the only possible solution?
Update 1:
Thanks stack0114106; here are more details in response to his answer. For a group whose operation is "max" (e.g. id=1), I want to iterate over all the items in that group and find the max value, rather than assign a hard-coded value; that's why I want to iterate the rows in each group. Below is an updated example:
The input:
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|0 |null |1 |
|0 |discard |0 |
|1 |null |1 |
|1 |null |2 |
|1 |max |0 |
|2 |null |1 |
|2 |null |3 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
The expected output:
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|2 |null |3 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
Group everything, collecting the values, then write the logic for each operation:
import org.apache.spark.sql.functions._
val grouped=df.groupBy($"id").agg(max($"operation").as("op"),collect_list($"value").as("vals"))
val maxs=grouped.filter($"op"==="max").withColumn("val",explode($"vals")).groupBy($"id").agg(max("val").as("value"))
val lists=grouped.filter($"op"==="list").withColumn("value",explode($"vals")).filter($"value"!==0).select($"id",$"value")
//we don't collect the "discard"
//and we can add additional subsets for new "operations"
val result=maxs.union(lists)
//if you need the null in "operation" column add it with withColumn
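For reference, roughly the same approach in PySpark (a sketch, assuming a DataFrame df with the same columns as above):
from pyspark.sql import functions as F

grouped = df.groupBy("id").agg(F.max("operation").alias("op"),
                               F.collect_list("value").alias("vals"))

maxs = (grouped.filter(F.col("op") == "max")
               .withColumn("val", F.explode("vals"))
               .groupBy("id").agg(F.max("val").alias("value")))

lists = (grouped.filter(F.col("op") == "list")
                .withColumn("value", F.explode("vals"))
                .filter(F.col("value") != 0)
                .select("id", "value"))

# "discard" groups are simply not collected; new operations get their own subset
result = maxs.union(lists)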
You can use the flatMap operation on the dataframe and generate the required rows based on the conditions that you mentioned. Check this out:
scala> val df = Seq((1,null,1),(1,"discard",0),(2,null,1),(2,null,2),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |1 |
|1 |discard |0 |
|2 |null |1 |
|2 |null |2 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
scala> df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0
until s._1).map( i => (r.getInt(0),null,s._2) ) }).show(false)
+---+----+---+
|_1 |_2 |_3 |
+---+----+---+
|2 |null|2 |
|3 |null|1 |
|3 |null|1 |
+---+----+---+
Spark assigns _1, _2, etc., so you can map them to actual names by assigning them as below:
scala> val df2 = df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).toDF("id","operation","value")
df2: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df2.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|2 |null |2 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
scala>
EDIT1:
Since you need the max(value) for each id, you can use window functions to get the max value in a new column, then use the same technique to get the results. Check this out:
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.createOrReplaceTempView("michael")
scala> val df2 = spark.sql(""" select *, max(value) over(partition by id) mx from michael """)
df2: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 2 more fields]
scala> df2.show(false)
+---+---------+-----+---+
|id |operation|value|mx |
+---+---------+-----+---+
|1 |null |1 |2 |
|1 |null |2 |2 |
|1 |max |0 |2 |
|3 |null |1 |1 |
|3 |null |1 |1 |
|3 |list |0 |1 |
|2 |null |1 |3 |
|2 |null |3 |3 |
|2 |max |0 |3 |
|0 |null |1 |1 |
|0 |discard |0 |1 |
+---+---------+-----+---+
scala> val df3 = df2.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => 0 case "max" => 1 case "list" => 2 } ; (0 until s).map( i => (r.getInt(0),null,r.getInt(3) )) }).toDF("id","operation","value")
df3: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df3.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|3 |null |1 |
|3 |null |1 |
|2 |null |3 |
+---+---------+-----+
scala>
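A rough PySpark equivalent of this window-plus-flatMap approach (a sketch; the repeat counts mirror the hard-coded Scala match above, and spark and df are assumed to exist):
from pyspark.sql import functions as F, Window
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Max value per id, mirroring the window step above
df2 = df.withColumn("mx", F.max("value").over(Window.partitionBy("id")))

# How many output rows each operation should emit (hard-coded as in the Scala answer)
repeats = {"discard": 0, "max": 1, "list": 2}

schema = StructType([StructField("id", IntegerType()),
                     StructField("operation", StringType()),
                     StructField("value", IntegerType())])

rows = (df2.filter("operation is not null")
           .rdd.flatMap(lambda r: [(r["id"], None, r["mx"])
                                   for _ in range(repeats.get(r["operation"], 0))]))

spark.createDataFrame(rows, schema).show(truncate=False)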

Filtering out rows of a table based on a column value

I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe, deleting all rows with value != 0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows; you just forgot one = sign (the Spark Column equality operator is ===):
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways by which you can do the filtering.
val newDataFrame = OldDataFrame.filter($"value"===0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value === 0")
You can also use the where function instead of filter.
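If you are in PySpark rather than Scala, the equivalent filters look like this (a sketch assuming a DataFrame named old_df):
new_df = old_df.filter(old_df["value"] == 0)
new_df = old_df.filter("value = 0")          # SQL expression string
new_df = old_df.where(old_df["value"] == 0)  # where is an alias for filter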