I want to join two dataframes on the timestamp column with df2.join(df1, how='left'), where the next timestamp in df1 acts as the stop condition.
Dataframes to join:
df1 = spark.createDataFrame(
[(1, 110, 'walk', 'work', '2019-09-28 13:40:00'),
(2, 110, 'metro', 'work', '2019-09-28 14:00:00'),
(3, 110, 'walk', 'work', '2019-09-28 14:02:00'),
(4, 120, 'bus', 'home', '2019-09-28 17:00:00'),
(5, 120, 'metro', 'home', '2019-09-28 17:20:00'),
(6, 120, 'walk', 'home', '2019-09-28 17:45:00')],
['id', 'u_uuid', 'mode', 'place', 'timestamp']
)
df2 = spark.createDataFrame(
[(1, '2019-09-28 13:30:00'),
(2, '2019-09-28 13:35:00'),
(3, '2019-09-28 13:39:00'),
(4, '2019-09-28 13:50:00'),
(5, '2019-09-28 13:55:00'),
(6, '2019-09-28 14:01:00'),
(7, '2019-09-28 16:30:00'),
(8, '2019-09-28 16:40:00'),
(9, '2019-09-28 16:50:00'),
(10, '2019-09-28 17:25:00'),
(11, '2019-09-28 17:30:00'),
(12, '2019-09-28 17:35:00')],
['id', 'timestamp']
)
Goal: for each df1 row, attach the df2 timestamps that fall between the previous df1 timestamp and that row's timestamp.
IIUC, one way to do this is by using a Window.
import pyspark.sql.functions as f
from pyspark.sql.window import Window
win_spec = Window.orderBy('timestamp')
# A window function without partitionBy has a huge performance impact: it pulls all data into a single partition, so you might see executor OOM errors.
# If you have a big dataset, add a partition column, e.g.:
# Window.partitionBy('SOME_COL').orderBy('timestamp')
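For this particular dataset, u_uuid looks like a natural partition key (my assumption; the original answer keeps a single global window). Note that partitioning changes what lag() sees at the first row of each group. A minimal sketch:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Hypothetical partitioned variant: each u_uuid is processed independently,
# so lag() is NULL (here defaulted to '1') at the first row of every u_uuid group.
win_spec_part = Window.partitionBy('u_uuid').orderBy('timestamp')
df_part = df1.withColumn(
    'start_timestamp',
    f.coalesce(f.lag('timestamp').over(win_spec_part), f.lit('1'))
)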
Now add a start_timestamp column as below:
df = df1.withColumn('start_timestamp', f.coalesce(f.lag('timestamp').over(win_spec),f.lit('1')))
# df.show()
# +---+------+-----+-----+-------------------+-------------------+
# | id|u_uuid| mode|place| timestamp| start_timestamp|
# +---+------+-----+-----+-------------------+-------------------+
# | 1| 110| walk| work|2019-09-28 13:40:00| 1|
# | 2| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:40:00|
# | 3| 110| walk| work|2019-09-28 14:02:00|2019-09-28 14:00:00|
# | 4| 120| bus| home|2019-09-28 17:00:00|2019-09-28 14:02:00|
# | 5| 120|metro| home|2019-09-28 17:20:00|2019-09-28 17:00:00|
# | 6| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:20:00|
# +---+------+-----+-----+-------------------+-------------------+
Now join df with df2 using a left join:
df.join(df2, df2['timestamp'].between(df['start_timestamp'], df['timestamp']), 'left')\
    .where(df2['id'].isNotNull())\
    .select(df['u_uuid'], df['mode'], df['place'], df['timestamp'].alias('df1.timestamp'), df2['timestamp'].alias('df2.timestamp'))\
    .show()
# The where clause is only there to match the goal output:
# there is no entry in df2 for the 2019-09-28 17:00:00 to 2019-09-28 17:20:00 range,
# so the record 120|metro|home|2019-09-28 17:20:00|2019-09-28 17:00:00 is dropped.
+------+-----+-----+-------------------+-------------------+
|u_uuid| mode|place| df1.timestamp| df2.timestamp|
+------+-----+-----+-------------------+-------------------+
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:30:00|
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:35:00|
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:39:00|
| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:50:00|
| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:55:00|
| 110| walk| work|2019-09-28 14:02:00|2019-09-28 14:01:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:30:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:40:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:50:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:25:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:30:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:35:00|
+------+-----+-----+-------------------+-------------------+
Alternatively, you can use a right join to avoid the where clause. Decide between the two based on the sizes of df1 and df2.
df.join(df2, df2['timestamp'].between(df['start_timestamp'], df['timestamp']), 'right')\
.select(df['u_uuid'], df['mode'], df['place'], df['timestamp'].alias('df1.timestamp'), df2['timestamp'].alias('df2.timestamp'))\
.show()
Related
How can I find the complement of a dataframe with respect to another dataframe?
In pandas this can be done with the following code:
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
Example:
+---------+----+
| City|Temp|
+---------+----+
| New York| 59|
| Chicago| 29|
| Tokyo| 73|
| Paris| 56|
|New Delhi| 48|
+---------+----+
+---------+----+
| City|Temp|
+---------+----+
| London| 55|
| New York| 55|
| Tokyo| 73|
|New Delhi| 85|
| Paris| 56|
+---------+----+
Result:
+---------+----+----------+
| City|Temp|_merge |
+---------+----+----------+
| London| 55|right_only|
|New Delhi| 85|right_only|
| New York| 55|right_only|
+---------+----+----------+
You can use subtract.
df = df2.subtract(df1)
Result
+---------+----+
| City|Temp|
+---------+----+
| New York| 55|
| London| 55|
|New Delhi| 85|
+---------+----+
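One caveat worth noting (my addition, not from the original answer): subtract behaves like SQL EXCEPT DISTINCT, so duplicate rows in df2 are collapsed. If duplicates must be preserved, exceptAll (available since Spark 2.4) is an alternative:
# Keeps duplicate rows of df2 that have no match in df1, unlike subtract, which de-duplicates
df = df2.exceptAll(df1)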
Another option is an outer join followed by a filter on the null side:
df1.join(df2, ['City', 'Temp'], 'outer').filter(" id1 IS NULL ")
dt1 = [
(0, 'New York', 59),
(1, 'Chicago', 29),
(2, 'Tokyo', 73),
(3, 'Paris', 56),
(4, 'New Delhi', 48),
]
df1 = spark.createDataFrame(dt1, ['id1','City', 'Temp'])
dt2 = [
(0, 'London', 55),
(1, 'New York', 55),
(2, 'Tokyo', 73),
(3, 'New Delhi', 85),
(4, 'Paris', 56),
]
df2 = spark.createDataFrame(dt2, ['id2','City', 'Temp'])
(
df1.join(df2, ['City', 'Temp'], 'outer')
.filter(" id1 IS NULL ")
.sort('id2')
.show(10, False)
)
# +---------+----+----+---+
# |City |Temp|id1 |id2|
# +---------+----+----+---+
# |London |55 |null|0 |
# |New York |55 |null|1 |
# |New Delhi|85 |null|3 |
# +---------+----+----+---+
You can also try a "left_anti" join, which keeps only the rows of the left dataframe that have no match in the right one (the set-difference region of the Venn diagram). The code would look like this:
df = (
df2
.join(df1, ['City', 'Temp'], 'left_anti')
)
output:
+---------+----+
| City|Temp|
+---------+----+
| London| 55|
|New Delhi| 85|
| New York| 55|
+---------+----+
I have a Spark Scala dataframe like this:
val df = Seq(
(10, 12),
(44, 14),
(32, 25),
(31, 24),
(75, 25),
(80, 20),
(35, 55),
(32, 25),
(67, 72),
(32, 21)
).toDF("x1","x2")
df.show()
+---+---+
| x1| x2|
+---+---+
| 10| 12|
| 44| 14|
| 32| 25|
| 31| 24|
| 75| 25|
| 80| 20|
| 35| 55|
| 32| 25|
| 67| 72|
| 32| 21|
+---+---+
I need to split this data as training and testing where training data would be the first 8 rows (80%) and testing data would be the last 2 rows (20%).
I tried val Array(train, test) = df.randomSplit(Array(0.8, 0.2)), but it selects 8 rows randomly (instead of choosing the first 8 rows) as training and the others as testing.
So can anyone suggest how to select the partitions as I mentioned above?
Thank you
Maybe there is a better way, but nothing else comes to mind since you require the data to stay ordered.
import org.apache.spark.sql.functions.monotonically_increasing_id

val cnt = df.count
val testSize = (0.2 * cnt).toInt
val trainSize = (cnt - testSize).toInt
// monotonically_increasing_id reflects the current partition layout, so this assumes df still has its original row order
val trainDf = df.sort(monotonically_increasing_id()).limit(trainSize)
val testDf = df.sort(monotonically_increasing_id().desc).limit(testSize)
My goal is to create a new column is_end: when a row comes after the last non-null p_uuid and its own p_uuid is null, then is_end = 1, otherwise 0. I don't know how to combine the when() and last() functions.
I tried several times to combine them with windows, but I always get errors :(
df = spark.createDataFrame([
(1, 110, None, '2019-09-28'),
(2, 110, None, '2019-09-28'),
(3, 110, 'aaa', '2019-09-28'),
(4, 110, None, '2019-09-17'),
(5, 110, None, '2019-09-17'),
(6, 110, 'bbb', '2019-09-17'),
(7, 110, None, '2019-09-01'),
(8, 110, None, '2019-09-01'),
(9, 110, None, '2019-09-01'),
(10, 110, None, '2019-09-01'),
(11, 110, 'ccc', '2019-09-01'),
(12, 110, None, '2019-09-01'),
(13, 110, None, '2019-09-01'),
(14, 110, None, '2019-09-01')
],
['idx', 'u_uuid', 'p_uuid', 'timestamp']
)
df.show()
My dataframe:
+---+------+------+----------+
|idx|u_uuid|p_uuid| timestamp|
+---+------+------+----------+
| 1| 110| null|2019-09-28|
| 2| 110| null|2019-09-28|
| 3| 110| aaa|2019-09-28|
| 4| 110| null|2019-09-17|
| 5| 110| null|2019-09-17|
| 6| 110| bbb|2019-09-17|
| 7| 110| null|2019-09-01|
| 8| 110| null|2019-09-01|
| 9| 110| null|2019-09-01|
| 10| 110| null|2019-09-01|
| 11| 110| ccc|2019-09-01|
| 12| 110| null|2019-09-01|
| 13| 110| null|2019-09-01|
| 14| 110| null|2019-09-01|
+---+------+------+----------+
w = Window.partitionBy("u_uuid").orderBy(col("timestamp"))
df.withColumn("p_uuid", when( lag(F.col("p_uuid").isNull()).over(w), 1).otherwise(0))
What I'm looking for:
+---+------+------+----------+------+
|idx|u_uuid|p_uuid| timestamp|is_end|
+---+------+------+----------+------+
| 1| 110| null|2019-09-28| 0|
| 2| 110| null|2019-09-28| 0|
| 3| 110| aaa|2019-09-28| 0|
| 4| 110| null|2019-09-17| 0|
| 5| 110| null|2019-09-17| 0|
| 6| 110| bbb|2019-09-17| 0|
| 7| 110| null|2019-09-01| 0|
| 8| 110| null|2019-09-01| 0|
| 9| 110| null|2019-09-01| 0|
| 10| 110| null|2019-09-01| 0|
| 11| 110| ccc|2019-09-01| 0|
| 12| 110| null|2019-08-29| 1|
| 13| 110| null|2019-08-29| 1|
| 14| 110| null|2019-08-29| 1|
+---+------+------+----------+------+
Below is the PySpark code for your case:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = (Window
     .partitionBy("u_uuid")
     .orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("is_end", F.when(F.last("p_uuid", True).over(w).isNull() & F.col("p_uuid").isNull(), F.lit(1)).otherwise(F.lit(0)))\
    .show()
I am trying to solve a data cleaning step in a Machine Learning problem where I need to group all the elements of the long tail into a common category named "Others". For example, I have a dataframe like this:
val df = sc.parallelize(Seq(
(1, "ABC"),
(2, "ABC"),
(3, "123"),
(4, "FPK"),
(5, "FPK"),
(6, "ABC"),
(7, "ABC"),
(8, "980"),
(9, "abc"),
(10, "FPK")
)).toDF("n", "s")
I want to keep the categories "ABC" and "FPK" since they appear several times, but I don't want a separate category for 123, 980 and abc, since each appears just once. So what I would like to have instead is:
+---+------+
| n| s|
+---+------+
| 1| ABC|
| 2| ABC|
| 3|Others|
| 4| FPK|
| 5| FPK|
| 6| ABC|
| 7| ABC|
| 8|Others|
| 9|Others|
| 10| FPK|
+---+------+
To achieve this, I tried:
val newDF = df.withColumn("s", when($"s".isin("123","980","abc"), "Others").otherwise('s))
This works fine.
But I would like to programmatically decide which categories belong to the long tail, in my case those that appear just once in the original dataframe. So I wrote this to create a dataframe with the categories that only appear once:
val longTail = df.groupBy("s").agg(count("*").alias("cnt")).orderBy($"cnt".desc).filter($"cnt"<2)
+---+---+
| s|cnt|
+---+---+
|980| 1|
|abc| 1|
|123| 1|
+---+---+
Now I was trying to convert the values of column "s" in this longTail dataset into a List, to use in place of the one I hardcoded before. So I tried:
val ar = longTail.select("s").collect().map(_(0)).toList
ar: List[Any] = List(123, 980, abc)
But when I try to use ar:
val newDF = df.withColumn("s",when($"s".isin(ar),"Others").otherwise('s))
I get the following error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.$colon$colon List(123, 980, abc)
What am I missing?
This is the correct syntax:
scala> df.withColumn("s", when($"s".isin(ar : _*), "Others").otherwise('s)).show
+---+------+
| n| s|
+---+------+
| 1| ABC|
| 2| ABC|
| 3|Others|
| 4| FPK|
| 5| FPK|
| 6| ABC|
| 7| ABC|
| 8|Others|
| 9|Others|
| 10| FPK|
+---+------+
This is called a repeated parameter: ar: _* expands the List into the individual arguments that isin(list: Any*) expects (see the Scala documentation on repeated parameters).
You don't have to go through all the hassle you've been going through: you can use a window function to get the count of each group and then use when/otherwise to populate "Others" or not, as below.
val df = sc.parallelize(Seq(
(1, "ABC"),
(2, "ABC"),
(3, "123"),
(4, "FPK"),
(5, "FPK"),
(6, "ABC"),
(7, "ABC"),
(8, "980"),
(9, "abc"),
(10, "FPK")
)).toDF("n", "s")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("s", when(count("s").over(Window.partitionBy("s").orderBy("n").rowsBetween(Long.MinValue, Long.MaxValue)) > 1, col("s")).otherwise("Others")).show(false)
which should give you
+---+------+
|n |s |
+---+------+
|4 |FPK |
|5 |FPK |
|10 |FPK |
|8 |Others|
|9 |Others|
|1 |ABC |
|2 |ABC |
|6 |ABC |
|7 |ABC |
|3 |Others|
+---+------+
I hope the answer is helpful
+------+-----+
|userID|entID|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
I want the result as {0->(5,15), 1->(7,3),..}
Any help would be appreciated.
Here is your table again:
val df = Seq(
(0, 5),
(0, 15),
(1, 7),
(1, 3),
(2, 3),
(2, 4),
(2, 5),
(2, 9),
(3, 25)
).toDF("userId", "entId")
df.show()
Outputs:
+------+-----+
|userId|entId|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
Now you can group by userId and then collect entId into lists, aliasing the resulting column as entIds:
import org.apache.spark.sql.functions._
val entIdsForUserId = df.
groupBy($"userId").
agg(collect_list($"entId").alias("entIds"))
entIdsForUserId.show()
Output:
+------+------------+
|userId| entIds|
+------+------------+
| 1| [7, 3]|
| 3| [25]|
| 2|[3, 4, 5, 9]|
| 0| [5, 15]|
+------+------------+
The order after groupBy is not specified. Depending on what you want to do with it, you could additionally sort it.
You can collect it into a single map on the master node:
val m = entIdsForUserId.
map(r => (r.getAs[Int](0), r.getAs[Seq[Int]](1))).
collect.toMap
this will give you:
Map(1 -> List(7, 3), 3 -> List(25), 2 -> List(3, 4, 5, 9), 0 -> List(5, 15))
One approach would be to convert the Dataset to an RDD and perform a groupByKey. To obtain the result as a Map, you'll need to collect the grouped RDD, provided the dataset isn't too big:
val ds = Seq(
(0, 5), (0, 15), (1, 7), (1, 3),
(2, 3), (2, 4), (2, 5), (2, 9), (3, 25)
).toDF("userID", "entID").as[(Int, Int)]
// ds: org.apache.spark.sql.Dataset[(Int, Int)] =[userID: int, entID: int]
val map = ds.rdd.groupByKey.collectAsMap
// map: scala.collection.Map[Int,Iterable[Int]] = Map(
// 2 -> CompactBuffer(3, 4, 5, 9), 1 -> CompactBuffer(7, 3),
// 3 -> CompactBuffer(25), 0 -> CompactBuffer(5, 15)
// )