I want to join two dataframes on the timestamp column with df2.join(df1, how='left'), where the next timestamp in df1 acts as the stop condition.
Dataframes to join:
df1 = spark.createDataFrame(
[(1, 110, 'walk', 'work', '2019-09-28 13:40:00'),
(2, 110, 'metro', 'work', '2019-09-28 14:00:00'),
(3, 110, 'walk', 'work', '2019-09-28 14:02:00'),
(4, 120, 'bus', 'home', '2019-09-28 17:00:00'),
(5, 120, 'metro', 'home', '2019-09-28 17:20:00'),
(6, 120, 'walk', 'home', '2019-09-28 17:45:00')],
['id', 'u_uuid', 'mode', 'place', 'timestamp']
)
df2 = spark.createDataFrame(
[(1, '2019-09-28 13:30:00'),
(2, '2019-09-28 13:35:00'),
(3, '2019-09-28 13:39:00'),
(4, '2019-09-28 13:50:00'),
(5, '2019-09-28 13:55:00'),
(6, '2019-09-28 14:01:00'),
(7, '2019-09-28 16:30:00'),
(8, '2019-09-28 16:40:00'),
(9, '2019-09-28 16:50:00'),
(10, '2019-09-28 17:25:00'),
(11, '2019-09-28 17:30:00'),
(12, '2019-09-28 17:35:00')],
['id', 'timestamp']
)
Goal: for each df1 row, attach the df2 timestamps that fall between the previous df1 timestamp and that row's timestamp.
IIUC, one way to do this is by using a Window.
import pyspark.sql.functions as f
from pyspark.sql.window import Window
win_spec = Window.orderBy('timestamp')
# A window function without partitionBy has a huge performance impact: it pulls all data into a single partition, so you might see executor OOM errors.
# If you have a big dataset, add a partition column, e.g.:
# Window.partitionBy('SOME_COL').orderBy('timestamp')
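For this particular dataset, u_uuid looks like a natural partition key (my assumption; the original answer keeps a single global window). Note that partitioning changes what lag() sees at the first row of each group. A minimal sketch:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Hypothetical partitioned variant: each u_uuid is processed independently,
# so lag() is NULL (here defaulted to '1') at the first row of every u_uuid group.
win_spec_part = Window.partitionBy('u_uuid').orderBy('timestamp')
df_part = df1.withColumn(
    'start_timestamp',
    f.coalesce(f.lag('timestamp').over(win_spec_part), f.lit('1'))
)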
Now add a start_timestamp column as below:
df = df1.withColumn('start_timestamp', f.coalesce(f.lag('timestamp').over(win_spec),f.lit('1')))
# df.show()
# +---+------+-----+-----+-------------------+-------------------+
# | id|u_uuid| mode|place| timestamp| start_timestamp|
# +---+------+-----+-----+-------------------+-------------------+
# | 1| 110| walk| work|2019-09-28 13:40:00| 1|
# | 2| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:40:00|
# | 3| 110| walk| work|2019-09-28 14:02:00|2019-09-28 14:00:00|
# | 4| 120| bus| home|2019-09-28 17:00:00|2019-09-28 14:02:00|
# | 5| 120|metro| home|2019-09-28 17:20:00|2019-09-28 17:00:00|
# | 6| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:20:00|
# +---+------+-----+-----+-------------------+-------------------+
Now join df with df2 using a left join:
df.join(df2, df2['timestamp'].between(df['start_timestamp'], df['timestamp']), 'left')\
    .where(df2['id'].isNotNull())\
    .select(df['u_uuid'], df['mode'], df['place'], df['timestamp'].alias('df1.timestamp'), df2['timestamp'].alias('df2.timestamp'))\
    .show()
# The where clause is only there to match the goal output:
# there is no entry in df2 for the 2019-09-28 17:00:00 to 2019-09-28 17:20:00 range,
# so the record 120|metro|home|2019-09-28 17:20:00|2019-09-28 17:00:00 is dropped.
+------+-----+-----+-------------------+-------------------+
|u_uuid| mode|place| df1.timestamp| df2.timestamp|
+------+-----+-----+-------------------+-------------------+
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:30:00|
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:35:00|
| 110| walk| work|2019-09-28 13:40:00|2019-09-28 13:39:00|
| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:50:00|
| 110|metro| work|2019-09-28 14:00:00|2019-09-28 13:55:00|
| 110| walk| work|2019-09-28 14:02:00|2019-09-28 14:01:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:30:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:40:00|
| 120| bus| home|2019-09-28 17:00:00|2019-09-28 16:50:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:25:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:30:00|
| 120| walk| home|2019-09-28 17:45:00|2019-09-28 17:35:00|
+------+-----+-----+-------------------+-------------------+
Alternatively, you can use a right join to avoid the where clause. Decide between the two based on the sizes of df1 and df2.
df.join(df2, df2['timestamp'].between(df['start_timestamp'], df['timestamp']), 'right')\
.select(df['u_uuid'], df['mode'], df['place'], df['timestamp'].alias('df1.timestamp'), df2['timestamp'].alias('df2.timestamp'))\
.show()
Related
How can I find the complement of a dataframe with respect to another dataframe?
In pandas this can be done with the following code:
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
Example:
+---------+----+
| City|Temp|
+---------+----+
| New York| 59|
| Chicago| 29|
| Tokyo| 73|
| Paris| 56|
|New Delhi| 48|
+---------+----+
+---------+----+
| City|Temp|
+---------+----+
| London| 55|
| New York| 55|
| Tokyo| 73|
|New Delhi| 85|
| Paris| 56|
+---------+----+
Result:
+---------+----+----------+
| City|Temp|_merge |
+---------+----+----------+
| London| 55|right_only|
|New Delhi| 85|right_only|
| New York| 55|right_only|
+---------+----+----------+
You can use subtract.
df = df2.subtract(df1)
Result
+---------+----+
| City|Temp|
+---------+----+
| New York| 55|
| London| 55|
|New Delhi| 85|
+---------+----+
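One caveat worth noting (my addition, not from the original answer): subtract behaves like SQL EXCEPT DISTINCT, so duplicate rows in df2 are collapsed. If duplicates must be preserved, exceptAll (available since Spark 2.4) is an alternative:
# Keeps duplicate rows of df2 that have no match in df1, unlike subtract, which de-duplicates
df = df2.exceptAll(df1)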
Another option is an outer join followed by a filter on the null side:
df1.join(df2, ['City', 'Temp'], 'outer').filter(" id1 IS NULL ")
dt1 = [
(0, 'New York', 59),
(1, 'Chicago', 29),
(2, 'Tokyo', 73),
(3, 'Paris', 56),
(4, 'New Delhi', 48),
]
df1 = spark.createDataFrame(dt1, ['id1','City', 'Temp'])
dt2 = [
(0, 'London', 55),
(1, 'New York', 55),
(2, 'Tokyo', 73),
(3, 'New Delhi', 85),
(4, 'Paris', 56),
]
df2 = spark.createDataFrame(dt2, ['id2','City', 'Temp'])
(
df1.join(df2, ['City', 'Temp'], 'outer')
.filter(" id1 IS NULL ")
.sort('id2')
.show(10, False)
)
# +---------+----+----+---+
# |City |Temp|id1 |id2|
# +---------+----+----+---+
# |London |55 |null|0 |
# |New York |55 |null|1 |
# |New Delhi|85 |null|3 |
# +---------+----+----+---+
You can also try a "left_anti" join, which keeps only the rows of the left dataframe that have no match in the right one (the set-difference region of the Venn diagram). The code would look like this:
df = (
df2
.join(df1, ['City', 'Temp'], 'left_anti')
)
output:
+---------+----+
| City|Temp|
+---------+----+
| London| 55|
|New Delhi| 85|
| New York| 55|
+---------+----+
I have a Spark Scala dataframe like this:
val df = Seq(
(10, 12),
(44, 14),
(32, 25),
(31, 24),
(75, 25),
(80, 20),
(35, 55),
(32, 25),
(67, 72),
(32, 21)
).toDF("x1","x2")
df.show()
+---+---+
| x1| x2|
+---+---+
| 10| 12|
| 44| 14|
| 32| 25|
| 31| 24|
| 75| 25|
| 80| 20|
| 35| 55|
| 32| 25|
| 67| 72|
| 32| 21|
+---+---+
I need to split this data as training and testing where training data would be the first 8 rows (80%) and testing data would be the last 2 rows (20%).
I tried val Array(train, test) = df.randomSplit(Array(0.8, 0.2)), but it selects 8 rows randomly (instead of choosing the first 8 rows) as training and the others as testing.
So can anyone suggest how to select the partitions as I mentioned above?
Thank you
Maybe there is a better way, but nothing else comes to mind since you require the data to stay ordered.
import org.apache.spark.sql.functions.monotonically_increasing_id

val cnt = df.count
val testSize = (0.2 * cnt).toInt
val trainSize = (cnt - testSize).toInt
// monotonically_increasing_id reflects the current partition layout, so this assumes df still has its original row order
val trainDf = df.sort(monotonically_increasing_id()).limit(trainSize)
val testDf = df.sort(monotonically_increasing_id().desc).limit(testSize)
My goal is to create a new column is_end: when a row comes after the last non-null p_uuid and its own p_uuid is null, then is_end = 1, otherwise 0. I don't know how to combine the when() and last() functions.
I tried several times to combine them with windows, but I always get errors :(
df = spark.createDataFrame([
(1, 110, None, '2019-09-28'),
(2, 110, None, '2019-09-28'),
(3, 110, 'aaa', '2019-09-28'),
(4, 110, None, '2019-09-17'),
(5, 110, None, '2019-09-17'),
(6, 110, 'bbb', '2019-09-17'),
(7, 110, None, '2019-09-01'),
(8, 110, None, '2019-09-01'),
(9, 110, None, '2019-09-01'),
(10, 110, None, '2019-09-01'),
(11, 110, 'ccc', '2019-09-01'),
(12, 110, None, '2019-09-01'),
(13, 110, None, '2019-09-01'),
(14, 110, None, '2019-09-01')
],
['idx', 'u_uuid', 'p_uuid', 'timestamp']
)
df.show()
My dataframe:
+---+------+------+----------+
|idx|u_uuid|p_uuid| timestamp|
+---+------+------+----------+
| 1| 110| null|2019-09-28|
| 2| 110| null|2019-09-28|
| 3| 110| aaa|2019-09-28|
| 4| 110| null|2019-09-17|
| 5| 110| null|2019-09-17|
| 6| 110| bbb|2019-09-17|
| 7| 110| null|2019-09-01|
| 8| 110| null|2019-09-01|
| 9| 110| null|2019-09-01|
| 10| 110| null|2019-09-01|
| 11| 110| ccc|2019-09-01|
| 12| 110| null|2019-09-01|
| 13| 110| null|2019-09-01|
| 14| 110| null|2019-09-01|
+---+------+------+----------+
w = Window.partitionBy("u_uuid").orderBy(col("timestamp"))
df.withColumn("p_uuid", when( lag(F.col("p_uuid").isNull()).over(w), 1).otherwise(0))
What I'm looking for:
+---+------+------+----------+------+
|idx|u_uuid|p_uuid| timestamp|is_end|
+---+------+------+----------+------+
| 1| 110| null|2019-09-28| 0|
| 2| 110| null|2019-09-28| 0|
| 3| 110| aaa|2019-09-28| 0|
| 4| 110| null|2019-09-17| 0|
| 5| 110| null|2019-09-17| 0|
| 6| 110| bbb|2019-09-17| 0|
| 7| 110| null|2019-09-01| 0|
| 8| 110| null|2019-09-01| 0|
| 9| 110| null|2019-09-01| 0|
| 10| 110| null|2019-09-01| 0|
| 11| 110| ccc|2019-09-01| 0|
| 12| 110| null|2019-08-29| 1|
| 13| 110| null|2019-08-29| 1|
| 14| 110| null|2019-08-29| 1|
+---+------+------+----------+------+
Below is the PySpark code for your case:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = (Window
     .partitionBy("u_uuid")
     .orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("is_end", F.when(F.last("p_uuid", True).over(w).isNull() & F.col("p_uuid").isNull(), F.lit(1)).otherwise(F.lit(0)))\
    .show()
I am trying to solve a data cleaning step in a Machine Learning problem where I need to group all the elements of the long tail into a common category named "Others". For example, I have a dataframe like this:
val df = sc.parallelize(Seq(
(1, "ABC"),
(2, "ABC"),
(3, "123"),
(4, "FPK"),
(5, "FPK"),
(6, "ABC"),
(7, "ABC"),
(8, "980"),
(9, "abc"),
(10, "FPK")
)).toDF("n", "s")
I want to keep the categories "ABC" and "FPK" since they appear several times, but I don't want a separate category for 123, 980 and abc, since each appears just once. So what I would like to have instead is:
+---+------+
| n| s|
+---+------+
| 1| ABC|
| 2| ABC|
| 3|Others|
| 4| FPK|
| 5| FPK|
| 6| ABC|
| 7| ABC|
| 8|Others|
| 9|Others|
| 10| FPK|
+---+------+
To achieve this, I tried:
val newDF = df.withColumn("s", when($"s".isin("123","980","abc"), "Others").otherwise('s))
This works fine.
But I would like to programmatically decide which categories belong to the long tail, in my case those that appear just once in the original dataframe. So I wrote this to create a dataframe with the categories that only appear once:
val longTail = df.groupBy("s").agg(count("*").alias("cnt")).orderBy($"cnt".desc).filter($"cnt"<2)
+---+---+
| s|cnt|
+---+---+
|980| 1|
|abc| 1|
|123| 1|
+---+---+
Now I was trying to convert the values of column "s" in this longTail dataset into a List, to use in place of the one I hardcoded before. So I tried:
val ar = longTail.select("s").collect().map(_(0)).toList
ar: List[Any] = List(123, 980, abc)
But when I try to use ar:
val newDF = df.withColumn("s",when($"s".isin(ar),"Others").otherwise('s))
I get the following error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.$colon$colon List(123, 980, abc)
What am I missing?
This is the correct syntax:
scala> df.withColumn("s", when($"s".isin(ar : _*), "Others").otherwise('s)).show
+---+------+
| n| s|
+---+------+
| 1| ABC|
| 2| ABC|
| 3|Others|
| 4| FPK|
| 5| FPK|
| 6| ABC|
| 7| ABC|
| 8|Others|
| 9|Others|
| 10| FPK|
+---+------+
This is called a repeated parameter: ar: _* expands the List into the individual arguments that isin(list: Any*) expects (see the Scala documentation on repeated parameters).
You don't have to go through all the hassle you've been going through: you can use a window function to get the count of each group and then use when/otherwise to populate "Others" or not, as below.
val df = sc.parallelize(Seq(
(1, "ABC"),
(2, "ABC"),
(3, "123"),
(4, "FPK"),
(5, "FPK"),
(6, "ABC"),
(7, "ABC"),
(8, "980"),
(9, "abc"),
(10, "FPK")
)).toDF("n", "s")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("s", when(count("s").over(Window.partitionBy("s").orderBy("n").rowsBetween(Long.MinValue, Long.MaxValue)) > 1, col("s")).otherwise("Others")).show(false)
which should give you
+---+------+
|n |s |
+---+------+
|4 |FPK |
|5 |FPK |
|10 |FPK |
|8 |Others|
|9 |Others|
|1 |ABC |
|2 |ABC |
|6 |ABC |
|7 |ABC |
|3 |Others|
+---+------+
I hope the answer is helpful
+------+-----+
|userID|entID|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
I want the result as {0->(5,15), 1->(7,3),..}
Any help would be appreciated.
Here is your table again:
val df = Seq(
(0, 5),
(0, 15),
(1, 7),
(1, 3),
(2, 3),
(2, 4),
(2, 5),
(2, 9),
(3, 25)
).toDF("userId", "entId")
df.show()
Outputs:
+------+-----+
|userId|entId|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
Now you can group by userId and then collect entId into lists, aliasing the resulting column as entIds:
import org.apache.spark.sql.functions._
val entIdsForUserId = df.
groupBy($"userId").
agg(collect_list($"entId").alias("entIds"))
entIdsForUserId.show()
Output:
+------+------------+
|userId| entIds|
+------+------------+
| 1| [7, 3]|
| 3| [25]|
| 2|[3, 4, 5, 9]|
| 0| [5, 15]|
+------+------------+
The order after groupBy is not specified. Depending on what you want to do with it, you could additionally sort it.
You can collect it into a single map on the master node:
val m = entIdsForUserId.
map(r => (r.getAs[Int](0), r.getAs[Seq[Int]](1))).
collect.toMap
this will give you:
Map(1 -> List(7, 3), 3 -> List(25), 2 -> List(3, 4, 5, 9), 0 -> List(5, 15))
One approach would be to convert the Dataset to an RDD and perform a groupByKey. To obtain the result as a Map, you'll need to collect the grouped RDD, provided the dataset isn't too big:
val ds = Seq(
(0, 5), (0, 15), (1, 7), (1, 3),
(2, 3), (2, 4), (2, 5), (2, 9), (3, 25)
).toDF("userID", "entID").as[(Int, Int)]
// ds: org.apache.spark.sql.Dataset[(Int, Int)] =[userID: int, entID: int]
val map = ds.rdd.groupByKey.collectAsMap
// map: scala.collection.Map[Int,Iterable[Int]] = Map(
// 2 -> CompactBuffer(3, 4, 5, 9), 1 -> CompactBuffer(7, 3),
// 3 -> CompactBuffer(25), 0 -> CompactBuffer(5, 15)
// )