assign unique id to dataframe elements - scala

I have a dataframe recording each person's name and address, denoted as
case class Person(name: String, addr: String)
The dataframe looks like this:
+----+-----+
|name| addr|
+----+-----+
| u1|addr1|
| u1|addr2|
| u2|addr1|
+----+-----+
Now I need to assign a unique Long id to each element in this dataframe, which could be denoted as
case class PersonX(name: String, name_id: Long, addr: String, addr_id: Long)
and the dataframe would look like this:
+----+-------+-----+------+
|name|name_id| addr|addr_id|
+----+-------+-----+------+
| u1| 1|addr1| 2|
| u1| 1|addr2| 3|
| u2| 4|addr1| 2|
+----+-------+-----+------+
Note that the elements in both columns (name and addr) share the same id space, which means name_id should not contain duplicates, addr_id should not either, and furthermore name_ids and addr_ids should not overlap with each other.
How to achieve this?

The easiest way to assign ids is to use the dense_rank function from Spark SQL.
To make sure the ids don't overlap between names and addresses, you can use a trick:
compute the ids of names and addresses separately
shift the ids of addresses by adding the maximum name id
This way, the ids of addresses will come after the ids of names.
import spark.implicits._                    // for toDF
import org.apache.spark.sql.functions.max   // used below to shift the addr ids

val input = spark.sparkContext.parallelize(List(
  Person("u1", "addr1"),
  Person("u1", "addr2"),
  Person("u2", "addr1")
)).toDF("name", "addr")
input.createOrReplaceTempView("people")
val people = spark.sql(
  """select name,
    |       dense_rank() over(partition by 1 order by name) as name_id,
    |       addr,
    |       dense_rank() over(partition by 1 order by addr) as addr_id
    |from people
  """.stripMargin)
people.show()
//+----+-------+-----+-------+
//|name|name_id| addr|addr_id|
//+----+-------+-----+-------+
//| u1| 1|addr1| 1|
//| u2| 2|addr1| 1|
//| u1| 1|addr2| 2|
//+----+-------+-----+-------+
val name = people.col("name")
val nameId = people.col("name_id")
val addr = people.col("addr")
val addrId = people.col("addr_id")
val maxNameId = people.select(max(nameId)).first().getInt(0) // dense_rank yields Int ranks
val shiftedAddrId = (addrId + maxNameId).as("addr_id")       // shift addr ids past all name ids
people.select(name, addr, nameId, shiftedAddrId).as[PersonX].show()
//+----+-----+-------+-------+
//|name| addr|name_id|addr_id|
//+----+-----+-------+-------+
//| u1|addr1| 1| 3|
//| u2|addr1| 2| 3|
//| u1|addr2| 1| 4|
//+----+-----+-------+-------+
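For completeness, here is a rough alternative sketch (not part of the answer above) that builds a single shared Long id space: take the distinct union of names and addresses, assign ids with zipWithIndex, and join them back. It assumes the same input DataFrame and the spark.implicits._ import from above:
// all distinct values from both columns share one id space
val values = input.select($"name".as("value"))
  .union(input.select($"addr".as("value")))
  .distinct()

// assign a Long id to every distinct value
val ids = values.rdd
  .zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx + 1L) }
  .toDF("value", "id")

// join the ids back for both columns
val withIds = input
  .join(ids.withColumnRenamed("value", "name").withColumnRenamed("id", "name_id"), "name")
  .join(ids.withColumnRenamed("value", "addr").withColumnRenamed("id", "addr_id"), "addr")
  .select($"name", $"name_id", $"addr", $"addr_id")
withIds.show()
This yields Long ids directly, at the cost of two joins.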

Calculate sequences of constantly increasing dates Spark

I have a dataframe in Spark with a name column and dates, and I would like to find all continuous sequences of consecutively increasing dates (day after day) for each name and calculate their durations. The output should contain the name, the start date of the sequence, and the duration of that time period (number of days).
How can I do this with Spark functions?
A consecutive sequence of dates example:
2019-03-12
2019-03-13
2019-03-14
2019-03-15
I have defined a solution, but it calculates the overall number of days per name and does not divide it into sequences:
val result = allDataDf
.groupBy($"name")
.agg(count($"date").as("timePeriod"))
.orderBy($"timePeriod".desc)
.head()
I also tried with ranks, but the count column contains only 1s for some reason:
val names = Window
.partitionBy($"name")
.orderBy($"date")
val result = allDataDf
.select($"name", $"date", rank over names as "rank")
.groupBy($"name", $"date", $"rank")
.agg(count($"*") as "count")
The output looks like this:
+-----------+----------+----+-----+
|stationName| date|rank|count|
+-----------+----------+----+-----+
| NAME|2019-03-24| 1| 1|
| NAME|2019-03-25| 2| 1|
| NAME|2019-03-27| 3| 1|
| NAME|2019-03-28| 4| 1|
| NAME|2019-01-29| 5| 1|
| NAME|2019-03-30| 6| 1|
| NAME|2019-03-31| 7| 1|
| NAME|2019-04-02| 8| 1|
| NAME|2019-04-05| 9| 1|
| NAME|2019-04-07| 10| 1|
+-----------+----------+----+-----+
Finding consecutive dates is fairly easy in SQL: if you subtract the row number from each date, the result stays constant within a run of consecutive dates, so it can be used as a grouping key. You could do it with a query like:
WITH s AS (
  SELECT
    stationName,
    date,
    date_add(date, -(row_number() over (partition by stationName order by date))) as discriminator
  FROM stations
)
SELECT
  stationName,
  MIN(date) as start,
  COUNT(1) AS duration
FROM s GROUP BY stationName, discriminator
Fortunately, we can use SQL in Spark. Let's check whether it works (I used different dates):
import spark.implicits._                 // for toDF on a local Seq
import org.apache.spark.sql.functions._  // for col and date_format

val df = Seq(
  ("NAME1", "2019-03-22"),
  ("NAME1", "2019-03-23"),
  ("NAME1", "2019-03-24"),
  ("NAME1", "2019-03-25"),
  ("NAME1", "2019-03-27"),
  ("NAME1", "2019-03-28"),
  ("NAME2", "2019-03-27"),
  ("NAME2", "2019-03-28"),
  ("NAME2", "2019-03-30"),
  ("NAME2", "2019-03-31"),
  ("NAME2", "2019-04-04"),
  ("NAME2", "2019-04-05"),
  ("NAME2", "2019-04-06")
).toDF("stationName", "date")
 .withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
df.createTempView("stations")
val result = spark.sql(
"""
|WITH s AS (
| SELECT
| stationName,
| date,
| date_add(date, -(row_number() over (partition by stationName order by date)) + 1) as discriminator
| FROM stations
|)
|SELECT
| stationName,
| MIN(date) as start,
| COUNT(1) AS duration
|FROM s GROUP BY stationName, discriminator
""".stripMargin)
result.show()
It seems to output the correct dataset:
+-----------+----------+--------+
|stationName| start|duration|
+-----------+----------+--------+
| NAME1|2019-03-22| 4|
| NAME1|2019-03-27| 2|
| NAME2|2019-03-27| 2|
| NAME2|2019-03-30| 2|
| NAME2|2019-04-04| 3|
+-----------+----------+--------+
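For reference, here is a rough DataFrame-API sketch of the same "date minus row_number" trick (not part of the answer above; it assumes the same df and imports as in the snippet before):
val w = Window.partitionBy($"stationName").orderBy($"date")
val result2 = df
  .withColumn("rn", row_number().over(w))
  .withColumn("discriminator", expr("date_add(date, -(rn - 1))"))  // constant within a consecutive run
  .groupBy($"stationName", $"discriminator")
  .agg(min($"date").as("start"), count(lit(1)).as("duration"))
  .drop("discriminator")
result2.show()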

how to select elements in scala dataframe?

Reference: How do I select item with most count in a dataframe and define it as a variable in scala?
Given the table below, how can I select the nth src_ip and store it in a variable?
+--------------+------------+
| src_ip|src_ip_count|
+--------------+------------+
| 58.242.83.11| 52|
|58.218.198.160| 33|
|58.218.198.175| 22|
|221.194.47.221| 6|
+--------------+------------+
You can create another column with the row number as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf = df
  .withColumn("row_number", monotonically_increasing_id())
  .withColumn("row_number", row_number().over(Window.orderBy("row_number")))
which should give you tempdf as
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
| 58.242.83.11| 52| 1|
|58.218.198.160| 33| 2|
|58.218.198.175| 22| 3|
|221.194.47.221| 6| 4|
+--------------+------------+----------+
Now you can use filter to select the nth row:
.filter($"row_number" === n)
That should be it.
For extracting the IP, let's say your n is 2:
val n = 2
Then the above process would give you
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|58.218.198.160| 33| 2|
+--------------+------------+----------+
Getting the IP address is explained in the link you provided in the question, by doing
.head.get(0)
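Putting the pieces together, a short end-to-end sketch (tempdf and n as above, spark.implicits._ in scope; getAs[String] is used instead of get(0) purely for type safety):
val n = 2
val nthIp: String = tempdf
  .filter($"row_number" === n)
  .head
  .getAs[String]("src_ip")   // e.g. 58.218.198.160, given the table above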
The safest way is to use zipWithIndex on the dataframe converted to an RDD and then convert it back to a dataframe, so that we have an unmistakable row_number column.
val finalDF = df.rdd
  .zipWithIndex()
  .map { case (row, idx) => (row(0).toString, row(1).toString, (idx + 1).toInt) }
  .toDF("src_ip", "src_ip_count", "row_number")
Rest of the steps are already explained before.

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
In Reasons I have only 9 records and in the IDs dataframe I have 40k records; I want to assign a reason randomly to each and every ID.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you have just a few reasons (as the OP asks), you can probably broadcast them or embed them in a UDF and easily solve this problem.
The general idea is to create a (sequential) index for the reasons, generate random values from 0 to N (where N is the number of reasons) on the IDs dataset, and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (a simplified version of the OP's data) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons, so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand() * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
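If you want the reasons column to be called reason as in the question, you can simply rename it afterwards, for example:
val named = res.withColumnRenamed("_1", "reason")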

pyspark random join itself

data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (a huge dataframe) and dataB (a smaller dataframe, sorted by any column):
from pyspark.sql import functions as F, Window

dfB = dataB.withColumn(
    "index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over the sorted window with a high degree of parallelism. F.rand() is another highly parallel function, so adding the index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: it can be very expensive to convert a dataframe to an RDD, call zipWithIndex, and then convert back to a dataframe.
monotonically_increasing_id: it needs to be combined with row_number, which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
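For readers working in Scala, here is a rough equivalent sketch of the same approach (dataA, dataB and the sort column col are assumed from the description above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 0-based index over the small, sorted dataframe
val dfB = dataB.withColumn("index", row_number().over(Window.orderBy("col")) - 1)
// random index into dfB for every row of the big dataframe
val dfA = dataA.withColumn("index", (rand() * dfB.count()).cast("bigint"))
// broadcast the small side, as suggested above
val joined = dfA.join(broadcast(dfB), Seq("index"), "left").drop("index")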

Dataframe.map need to result with more than the rows in dataset

I am using Scala and Spark and have a simple dataframe.map to produce the required transformation on the data. However, I need to return an additional row of data along with the modified original. How can I use dataframe.map to do this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1 = sqlContext.createDataFrame(Seq((1, "john", 23), (2, "peter", 32)))
  .toDF("id", "name", "age")

def udfTransform = udf[Int, Int] { age => if (age < 25) 25 else age }

val df2 = df1.withColumn("age2", udfTransform($"age"))
  .where("age != age2")
  .drop("age2")

df1.withColumn("age", udfTransform($"age"))
  .unionAll(df2)
  .orderBy("id")
  .show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1 = sqlContext.createDataFrame(Seq((1, "john", 23), (2, "peter", 32)))
  .toDF("id", "name", "age")

def udfArr = udf[Array[Int], Int] { age =>
  if (age < 25) Array(age, 25) else Array(age)
}

val df2 = df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
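As a side note, in Spark 2.x you could also express this directly as a Dataset flatMap, which emits one or two rows per input row. A minimal sketch, with a hypothetical case class P for illustration:
case class P(id: Int, name: String, age: Int)

val ds = Seq(P(1, "john", 23), P(2, "peter", 32)).toDS()   // needs spark.implicits._
val out = ds.flatMap { p =>
  if (p.age < 25) Seq(p.copy(age = 25), p)   // defaulted row plus the original
  else Seq(p)
}
out.show()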

Want to generate unique Ids as value changes from previous row using scala

I want to generate unique IDs as the value changes from the previous row in a given column. I have a dataframe in Spark Scala and want to add a Unique_ID column to the existing dataframe. I cannot use row_number over partitions or groupBy, because the same Product_IDs appear multiple times and I want a new Unique_ID every time a Product_ID starts a new run in the column.
Product_IDs Unique_ID
Prod_1 1
Prod_1 1
Prod_1 1
Prod_2 2
Prod_3 3
Prod_3 3
Prod_2 4
Prod_3 5
Prod_1 6
Prod_1 6
Prod_4 7
I need this dataframe using Spark Scala.
There are two ways to add a column with unique ids that I can think of just now. One is to use zipWithUniqueId:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val rows = df.rdd.zipWithUniqueId().map {
  case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id)
}
val newDf = sqlContext.createDataFrame(rows, StructType(df.schema.fields :+ StructField("uniqueIdColumn", LongType, false)))
Another one is to use the monotonicallyIncreasingId function:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
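Note that in newer Spark versions the camelCase name is deprecated in favor of the snake_case function from org.apache.spark.sql.functions; the equivalent would be:
import org.apache.spark.sql.functions.monotonically_increasing_id
val newDf = df.withColumn("uniqueIdColumn", monotonically_increasing_id())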
Here's a solution which isn't necessarily the most efficient (I admit I couldn't find a way to optimize it), and a bit long, but works.
I'm assuming the input is composed of records represented by this case class:
case class Record(id: Int, productId: String)
Where the id defines the order.
We'll perform two calculations:
For each record, find the minimum id of any subsequent record with a different productId
Group by that value (which represents a group of consecutive records with the same productId), and then zipWithIndex to create the unique ID we're interested in
I'm mixing RDD operations (for #2) and SQL (for #1), mostly for convenience; I'm assuming both operations can be done in either API (although I didn't try):
val input = sqlContext.createDataFrame(Seq(
Record(1, "Prod_1"),
Record(2, "Prod_1"),
Record(3, "Prod_1"),
Record(4, "Prod_2"),
Record(5, "Prod_3"),
Record(6, "Prod_3"),
Record(7, "Prod_2"),
Record(8, "Prod_3"),
Record(9, "Prod_1"),
Record(10, "Prod_1"),
Record(11, "Prod_4")
))
input.registerTempTable("input")
// Step 1: find "nextShiftId" for each record
val withBlockId = sqlContext.sql(
"""
|SELECT FIRST(a.id) AS id, FIRST(a.productId) AS productId, MIN(b.id) AS nextShiftId
|FROM input a
|LEFT JOIN input b ON a.productId != b.productId AND a.id < b.id
|GROUP BY a.id
""".stripMargin)
withBlockId.show()
// prints:
// +---+---------+-----------+
// | id|productId|nextShiftId|
// +---+---------+-----------+
// | 1| Prod_1| 4|
// | 2| Prod_1| 4|
// | 3| Prod_1| 4|
// | 4| Prod_2| 5|
// | 5| Prod_3| 7|
// | 6| Prod_3| 7|
// | 7| Prod_2| 8|
// | 8| Prod_3| 9|
// | 9| Prod_1| 11|
// | 10| Prod_1| 11|
// | 11| Prod_4| null|
// +---+---------+-----------+
// Step 2: group by "productId" and "nextShiftId"
val resultRdd = withBlockId.rdd
.groupBy(r => (r.getAs[String]("productId"), r.getAs[Int]("nextShiftId")))
// sort by nextShiftId to get the order right before adding index
.sortBy {
case ((prodId, 0), v) => Long.MaxValue // to handle the last batch where nextShiftId is null
case ((prodId, nextShiftId), v) => nextShiftId
}
// zip with index (which would be the "unique id") and flatMap to just what we need:
.values
.zipWithIndex()
.flatMap { case (records, index) => records.map(r => (r.getAs[String]("productId"), index+1))}
// transform back into DataFrame:
val result = sqlContext.createDataFrame(resultRdd)
result.show()
// prints:
// +------+---+
// | _1| _2|
// +------+---+
// |Prod_1| 1|
// |Prod_1| 1|
// |Prod_1| 1|
// |Prod_2| 2|
// |Prod_3| 3|
// |Prod_3| 3|
// |Prod_2| 4|
// |Prod_3| 5|
// |Prod_1| 6|
// |Prod_1| 6|
// |Prod_4| 7|
// +------+---+
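For comparison, here is a rough window-function sketch of the same idea (not part of the answer above): flag each row where productId differs from the previous row, then take a running sum of the flags. It assumes the input DataFrame from step 1 with its ordering column id, and note that the unpartitioned window forces all rows into a single partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("id")   // global ordering => single partition; fine for small data
val withUniqueId = input
  .withColumn("changed",
    when(lag(col("productId"), 1).over(w) === col("productId"), 0).otherwise(1))  // 1 whenever a new run starts
  .withColumn("Unique_ID", sum(col("changed")).over(w))                           // running count of run starts
  .drop("changed")
withUniqueId.show()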