Pass Distinct value of one Dataframe into another Dataframe - pyspark

I want to take distinct value of column from DataFrame A and Pass that into DataFrame B's explode
function to create repeat rows (DataFrameB) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct())
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
Function
assign_utilityId = psf.udf(
lambda id: [x for x in id],
ArrayType(LongType()))
How to pass distinctSet values to assign_utilityId
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take Unique value from Dataframe 1 and create new column in dataFrame 2. Like this
+-----+------+--------+--------+
|index|status|timeSlot|utilityId
+-----+------+--------+--------+
| 0| SUN| 0|101
| 0| SUN| 1|101
| 0| SUN| 0|102
| 0| SUN| 1|102

We don't need a udf for this. I have tried with some input,please check
>>> from pyspark.sql import function as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+

df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+

Related

create a new column to increment value when value resets to 1 in another column in pyspark

Logic and columnIn Pyspark DataFrame consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2]. Pyspark Column
create a new column to increment value when value resets to 1.
Expected output is[1,1,1,1,2,2,3,4,4,4,5,5,6,7,7]
i am bit new to pyspark, if anyone can help me it would be great for me.
written the logic as like below
def sequence(row_num):
results = [1, ]
flag = 1
for col in range(0, len(row_num)-1):
if row_num[col][0]>=row_num[col+1][0]:
flag+=1
results.append(flag)
return results
but not able to pass a column through udf. please help me in this
Your Dataframe:
df = spark.createDataFrame(
[
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
], ['group','label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a Window Function to calculate the cumulative sum. No need to use an UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(Window.unboundedPreceding, 0)
df\
.withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
.withColumn('Output', F.sum('Flag').over(w))\
.show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+

Creating a column based on a list or dictionary

I am new to PySpark, I have a dataframe containing a customer id and text, with an associated value
+------+-----+------+
|id |text |value |
+------+-----+------+
| 1 | Cat| 5|
| 1 | Dog| 4|
| 2 | Oil| 1|
I would like to parse the text column, based on a list of keywords, and create a column that tells me wether a keyword is in the text field and extract the associated value, the expected result is this
List_keywords = ["Dog", Cat"]
Out
+------+-----+------+--------+---------+--------+---------+
|id |text |value |bool_Dog|value_Dog|bool_cat|value_cat|
+------+-----+------+--------+---------+--------+---------+
| 1 | Cat| 5|0 | 0| 1| 5|
| 1 | Dog| 4|1 | 4| 0| 0|
| 2 | Oil| 1|0 | 0| 0| 0|
What is the best way to do that? I was thinking of creating a list or a dictionary containing my keywords, and parsing it with a for loop, but I'm sure there is a better way to do that.
Please see the solution
import pyspark.sql.functions as F
data = [[1, 'Cat', 5], [1, 'Dog', 4], [2, 'Oil', 3]]
df = spark.createDataFrame(data, ['id', 'text', 'value'])
df.show()
+---+----+-----+
| id|text|value|
+---+----+-----+
| 1| Cat| 5|
| 1| Dog| 4|
| 2| Oil| 1|
+---+----+-----+
keywords = ['Dog','Cat']
(
df
.groupby('id', 'text', 'value')
.pivot('text', keywords)
.agg(
F.count('value').alias('bool'),
F.max('value').alias('value')
)
.fillna(0)
.sort('text')
).show()
+---+----+-----+--------+---------+--------+---------+
| id|text|value|Dog_bool|Dog_value|Cat_bool|Cat_value|
+---+----+-----+--------+---------+--------+---------+
| 1| Cat| 5| 0| 0| 1| 5|
| 1| Dog| 4| 1| 4| 0| 0|
| 2| Oil| 1| 0| 0| 0| 0|
+---+----+-----+--------+---------+--------+---------+

How to union 2 dataframe without creating additional rows?

I have 2 dataframes and I wanted to do .filter($"item" === "a") while keeping the "S/N" in number numbers.
I tried the following but it ended up with additional rows when I use union. Is there a way to union 2 dataframes without creating additional rows?
var DF1 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
var DF2 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
DF2 = DF2.filter($"item"==="a")
DF3=DF1.withColumn("item",lit(0)).withColumn("value",lit(0))
DF1.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| b| 3|
| 4| b| 4|
| 5| a| 2|
+---+----+-----+
DF2.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
+---+----+-----+
DF3.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
DF2.union(someDF3).show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
Left outer join your S/Ns with filtered dataframe, then use coalesce to get rid of nulls:
val DF3 = DF1.select("S/N")
val DF4 = (DF3.join(DF2, Seq("S/N"), joinType="leftouter")
.withColumn("item", coalesce($"item", lit(0)))
.withColumn("value", coalesce($"value", lit(0))))
DF4.show
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| 0| 0|
| 4| 0| 0|
| 5| a| 2|
+---+----+-----+

Row count broken up by a focal value

I have the following DataFrame in Spark using Scala:
val df = List(
("random", 0),
("words", 1),
("in", 1),
("a", 1),
("column", 1),
("are", 0),
("what", 0),
("have", 1),
("been", 1),
("placed", 0),
("here", 1),
("now", 1)
).toDF(Seq("words", "numbers"): _*)
df.show()
+------+-------+
| words|numbers|
+------+-------+
|random| 0|
| words| 1|
| in| 1|
| a| 1|
|column| 1|
| are| 0|
| what| 0|
| have| 1|
| been| 1|
|placed| 0|
| here| 1|
| now| 1|
+------+-------+
I'd like to add a column that contains the count of rows which is started over at every 0 in the numbers column. It would look like this:
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+
Here is a method using selectExpr with SQL window functions sum and count; sum of 1-numbers generates the group id which increases by 1 when a zero is encountered, then count the number of rows by this group id:
This might be inefficient since you don't have any partition column.
df.selectExpr(
"words", "numbers",
"count(*) over(partition by sum(1-numbers) over (order by monotonically_increasing_id())) as count"
).show
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+

Filtering rows based on subsequent row values in spark dataframe [duplicate]

I have a dataframe(spark):
id value
3 0
3 1
3 0
4 1
4 0
4 0
I want to create a new dataframe:
3 0
3 1
4 1
Need to remove all the rows after 1(value) for each id.I tried with window functions in spark dateframe(Scala). But couldn't able to find a solution.Seems to be I am going in a wrong direction.
I am looking for a solution in Scala.Thanks
Output using monotonically_increasing_id
scala> val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data: org.apache.spark.sql.DataFrame = [id: int, value: int]
scala> val minIdx = dataWithIndex.filter($"value" === 1).groupBy($"id").agg(min($"idx")).toDF("r_id", "min_idx")
minIdx: org.apache.spark.sql.DataFrame = [r_id: int, min_idx: bigint]
scala> dataWithIndex.join(minIdx,($"r_id" === $"id") && ($"idx" <= $"min_idx")).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
The solution wont work if we did a sorted transformation in the original dataframe. That time the monotonically_increasing_id() is generated based on original DF rather that sorted DF.I have missed that requirement before.
All suggestions are welcome.
One way is to use monotonically_increasing_id() and a self-join:
val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data.show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 3| 0|
| 4| 1|
| 4| 0|
| 4| 0|
+---+-----+
Now we generate a column named idx with an increasing Long:
val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
// dataWithIndex.cache()
Now we get the min(idx) for each id where value = 1:
val minIdx = dataWithIndex
.filter($"value" === 1)
.groupBy($"id")
.agg(min($"idx"))
.toDF("r_id", "min_idx")
Now we join the min(idx) back to the original DataFrame:
dataWithIndex.join(
minIdx,
($"r_id" === $"id") && ($"idx" <= $"min_idx")
).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
Note: monotonically_increasing_id() generates its value based on the partition of the row. This value may change each time dataWithIndex is re-evaluated. In my code above, because of lazy evaluation, it's only when I call the final show that monotonically_increasing_id() is evaluated.
If you want to force the value to stay the same, for example so you can use show to evaluate the above step-by-step, uncomment this line above:
// dataWithIndex.cache()
Hi I found the solution using Window and self join.
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
data.show
scala> data.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 3| 0| 1|
| 4| 1| 6|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val sort_df=data.sort($"sorted")
scala> sort_df.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 1|
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 4|
| 4| 0| 5|
| 4| 1| 6|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
var window=Window.partitionBy("id").orderBy("$sorted")
val sort_idx=sort_df.select($"*",rowNumber.over(window).as("count_index"))
val minIdx=sort_idx.filter($"value"===1).groupBy("id").agg(min("count_index")).toDF("idx","min_idx")
val result_id=sort_idx.join(minIdx,($"id"===$"idx") &&($"count_index" <= $"min_idx"))
result_id.show
+---+-----+------+-----------+---+-------+
| id|value|sorted|count_index|idx|min_idx|
+---+-----+------+-----------+---+-------+
| 1| 0| 7| 1| 1| 2|
| 1| 1| 8| 2| 1| 2|
| 2| 1| 10| 1| 2| 1|
| 3| 0| 1| 1| 3| 3|
| 3| 0| 2| 2| 3| 3|
| 3| 1| 3| 3| 3| 3|
| 4| 0| 4| 1| 4| 3|
| 4| 0| 5| 2| 4| 3|
| 4| 1| 6| 3| 4| 3|
+---+-----+------+-----------+---+-------+
Still looking for a more optimized solutions.Thanks
You can simply use groupBy like this
val df2 = df1.groupBy("id","value").count().select("id","value")
Here your df1 is
id value
3 0
3 1
3 0
4 1
4 0
4 0
And resultant dataframe is df2 which is your expected output like this
id value
3 0
3 1
4 1
4 0
use isin method and filter as below:
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
val idFilter = List(1, 2)
data.filter($"id".isin(idFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
Ex: filter based on val
val valFilter = List(0)
data.filter($"value".isin(valFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 0| 1|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 0| 9|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+