Finding most populated cities per country - scala

I need to write code that returns the most populated city per country, along with its population.
Here is the input data:
/** Input data */
val inputDf = Seq(
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
My solution was:
val topPopulation = inputDf
  // .select("name", "country", "population")
  .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  // .agg(max($"population").alias("population"))
  // .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  // .withColumn("country", $"country")
  // .withColumn("name", $"name")
  // .cast("Integer")
  .groupBy("country")
  .agg(
    max("population").alias("population")
  )
  .orderBy($"population".desc)
  // .orderBy("max(population)")
topPopulation
But I have trouble, because I get:
"Except can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;"
Input:
+-----------------+-------------+----------+
|name |country |population|
+-----------------+-------------+----------+
|Warsaw |Poland |1 764 615 |
|Cracow |Poland |769 498 |
|Paris |France |2 206 488 |
|Villeneuve-Loubet|France |15 020 |
|Pittsburgh PA |United States|302 407 |
|Chicago IL |United States|2 716 000 |
|Milwaukee WI |United States|595 351 |
|Vilnius |Lithuania |580 020 |
|Stockholm |Sweden |972 647 |
|Goteborg |Sweden |580 020 |
+-----------------+-------------+----------+
Expected:
+----------+-------------+----------+
|name |country |population|
+----------+-------------+----------+
|Warsaw |Poland |1 764 615 |
|Paris |France |2 206 488 |
|Chicago IL|United States|2 716 000 |
|Vilnius |Lithuania |580 020 |
|Stockholm |Sweden |972 647 |
+----------+-------------+----------+
Actual:
+-------------+----------+
|country |population|
+-------------+----------+
|United States|2716000 |
|France |2206488 |
|Poland |1764615 |
|Sweden |972647 |
|Lithuania |580020 |
+-------------+----------+

Try this - the groupBy("country").agg(max("population")) keeps only the grouping column and the aggregate, so the name column is dropped (which is why the actual output has 2 columns while the expected one has 3). Packing name together with population into a struct preserves it through the aggregation.
Load the test data:
val inputDf = Seq(
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
/**
* Input:
* +-----------------+-------------+----------+
* |name |country |population|
* +-----------------+-------------+----------+
* |Warsaw |Poland |1 764 615 |
* |Cracow |Poland |769 498 |
* |Paris |France |2 206 488 |
* |Villeneuve-Loubet|France |15 020 |
* |Pittsburgh PA |United States|302 407 |
* |Chicago IL |United States|2 716 000 |
* |Milwaukee WI |United States|595 351 |
* |Vilnius |Lithuania |580 020 |
* |Stockholm |Sweden |972 647 |
* |Goteborg |Sweden |580 020 |
* +-----------------+-------------+----------+
*/
Find the city in each country having the max population:
val topPopulation = inputDf
.withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
.withColumn("population_name", struct($"population", $"name"))
.groupBy("country")
.agg(max("population_name").as("population_name"))
.selectExpr("country", "population_name.*")
topPopulation.show(false)
topPopulation.printSchema()
/**
* +-------------+----------+----------+
* |country |population|name |
* +-------------+----------+----------+
* |France |2206488 |Paris |
* |Poland |1764615 |Warsaw |
* |Lithuania |580020 |Vilnius |
* |Sweden |972647 |Stockholm |
* |United States|2716000 |Chicago IL|
* +-------------+----------+----------+
*
* root
* |-- country: string (nullable = true)
* |-- population: integer (nullable = true)
* |-- name: string (nullable = true)
*/
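This works because Spark compares struct values field by field, so max on struct($"population", $"name") keeps the row with the highest population (the name only matters on ties). An equivalent approach, sketched below under the assumption that ties may be broken arbitrarily, ranks the cities inside each country with a window function and keeps the first one:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byCountry = Window.partitionBy("country").orderBy($"population".desc)

val topPopulationViaWindow = inputDf
  .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  // rank 1 = the most populated city within the country
  .withColumn("rank", row_number().over(byCountry))
  .filter($"rank" === 1)
  .drop("rank")

topPopulationViaWindow.show(false)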

Related

How to perform conditional join with time column in spark scala

I am looking for help in joining 2 DF's with conditional join in time columns, using Spark Scala.
DF1:
+-------------------+--------+--------+
|time_1             |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1       |abc     |
|2022-04-05 10:15:00|2       |abc     |
|2022-04-05 12:15:00|3       |abc     |
|2022-04-05 09:00:00|1       |xyz     |
|2022-04-05 20:20:00|2       |xyz     |
+-------------------+--------+--------+
DF2:
+-------------------+--------+-------+
|time_2             |review_2|value  |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc     |value_1|
|2022-04-05 09:48:00|abc     |value_2|
|2022-04-05 15:40:00|abc     |value_3|
|2022-04-05 08:00:00|xyz     |value_4|
|2022-04-05 09:00:00|xyz     |value_5|
|2022-04-05 10:00:00|xyz     |value_6|
|2022-04-05 11:00:00|xyz     |value_7|
|2022-04-05 12:00:00|xyz     |value_8|
+-------------------+--------+-------+
Desired output DF:
+-------------------+--------+--------+-------+
|time_1             |revision|review_1|value  |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1       |abc     |value_1|
|2022-04-05 10:15:00|2       |abc     |value_2|
|2022-04-05 12:15:00|3       |abc     |null   |
|2022-04-05 09:00:00|1       |xyz     |value_6|
|2022-04-05 20:20:00|2       |xyz     |null   |
+-------------------+--------+--------+-------+
As in the case of row 4 of the final output (where time_1 = 2022-04-05 09:00:00), if multiple values match during the join then only the latest one in time should be taken.
Furthermore, if there is no match for a row of DF1 in the join, then the value column should be null.
Here we need to join on two conditions between the two DFs:
review_1 === review_2 &&
time_1 within +/-1 hour of time_2 (if multiple records match, keep the latest value, as with value_6 above)
Here is the code needed to join the DataFrames. I have commented the code to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which we can do the date range check (with time represented as a single number, making math operations easy):
// add a new column, temporarily, which contains the time in
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
and finally to join:
import org.apache.spark.sql.expressions.Window
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
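A small hardening note, as a sketch (not needed for this sample, where every time_1 value belongs to a single review): if the same time_1 could appear under different review_1 values, partitioning the window by both columns keeps their candidate matches separate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// partition by the review as well, so identical timestamps from different
// reviews never share a window partition when picking the latest match
val window = Window.partitionBy("review_1", "time_1").orderBy(col("time_2").desc)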

How to get the max value between unbounded preceding rows, ignoring the current row's date value, for a given id in pyspark?

I have the below pyspark dataframe df:
id date key1
A1 2020-01-06 K1
A1 2020-01-06 K2
A1 2020-01-07 K3
A1 2020-01-07 K3
A1 2020-01-20 K3
A2 ..
I need to add a column last_date, which is the last max date for a given id, ignoring the current row's date.
id date key1 last_date
A1 2020-01-06 K1
A1 2020-01-06 K2
A1 2020-01-07 K3 2020-01-06
A1 2020-01-07 K3 2020-01-06
A1 2020-01-20 K3 2020-01-07
I am using the code below, but it gives the same date. How do I ignore the current row's date?
unbounded_window = (
Window.partitionBy("id")
.orderBy("date")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
prepared_df =df.withColumn("last_date", F.max("date").over(unbounded_window))
You need to find when the date changes and forward fill it. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([('A1','2020-01-06','K1' ),('A1','2020-01-06','K2'),\
('A1','2020-01-07','K3' ),('A1','2020-01-07','K3'),('A1','2020-01-20','K3')],schema=['id','date','key'])
w=Window.partitionBy('id').orderBy('date')
tst_stp = tst.withColumn("date_lag",F.lag('date').over(w))
tst_dt = tst_stp.withColumn("date_chk",F.when((F.col('date')!=F.col('date_lag')),F.col('date_lag')))
#%%Forward fill
tst_res = tst_dt.withColumn('last_date',F.last('date_chk',ignorenulls=True).over(w))
Results:
tst_res.show()
+---+----------+---+----------+----------+----------+
| id| date|key| date_lag| date_chk| last_date|
+---+----------+---+----------+----------+----------+
| A1|2020-01-06| K1| null| null| null|
| A1|2020-01-06| K2|2020-01-06| null| null|
| A1|2020-01-07| K3|2020-01-06|2020-01-06|2020-01-06|
| A1|2020-01-07| K3|2020-01-07| null|2020-01-06|
| A1|2020-01-20| K3|2020-01-07|2020-01-07|2020-01-07|
+---+----------+---+----------+----------+----------+
Try this-
df1.show(false)
df1.printSchema()
/**
* +---+-------------------+----+
* |id |date |key1|
* +---+-------------------+----+
* |A1 |2020-01-06 00:00:00|K1 |
* |A1 |2020-01-06 00:00:00|K2 |
* |A1 |2020-01-07 00:00:00|K3 |
* |A1 |2020-01-07 00:00:00|K3 |
* |A1 |2020-01-20 00:00:00|K3 |
* +---+-------------------+----+
*
* root
* |-- id: string (nullable = true)
* |-- date: timestamp (nullable = true)
* |-- key1: string (nullable = true)
*/
val w = Window.partitionBy("id").orderBy("date")
val w1 = Window.partitionBy("id", "date")
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df1.withColumn("last_date", lag(col("date"), 1).over(w))
.withColumn("last_date", min(col("last_date")).over(w1))
.withColumn("last_date", when($"date" =!= $"last_date", $"last_date"))
.show(false)
/**
* +---+-------------------+----+-------------------+
* |id |date |key1|last_date |
* +---+-------------------+----+-------------------+
* |A1 |2020-01-06 00:00:00|K1 |null |
* |A1 |2020-01-06 00:00:00|K2 |null |
* |A1 |2020-01-07 00:00:00|K3 |2020-01-06 00:00:00|
* |A1 |2020-01-07 00:00:00|K3 |2020-01-06 00:00:00|
* |A1 |2020-01-20 00:00:00|K3 |2020-01-07 00:00:00|
* +---+-------------------+----+-------------------+
*/
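Another option, sketched here against the Scala df1 above (assuming date is a timestamp column as in the schema shown), is a range frame that stops one second before the current date value, so rows sharing the current row's date are excluded and max() returns the previous distinct date:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// order by the date as epoch seconds; ending the frame at -1 excludes every
// row whose date equals the current row's date
val wPrev = Window
  .partitionBy("id")
  .orderBy(col("date").cast("long"))
  .rangeBetween(Window.unboundedPreceding, -1)

df1.withColumn("last_date", max("date").over(wPrev)).show(false)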

How to split the spark dataframe into 2 using ratio given in terms of months and the unix epoch column?

I want to split a Spark dataframe into 2 using a ratio given in terms of months, based on the unix epoch column.
A sample dataframe is below:
unixepoch
---------
1539754800
1539754800
1539931200
1539927600
1539927600
1539931200
1539931200
1539931200
1539927600
1540014000
1540014000
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
Strategy of splitting:
If the total months of data is, say, 30 months and the splitting ratio is, say, 0.6,
then expected dataframe 1 should have 30 * 0.6 = 18 months of data
and expected dataframe 2 should have 30 * 0.4 = 12 months of data.
EDIT-1
Most of the answers were given considering a splitting ratio over the number of records, i.e. if the total record count = 100 and the split ratio = 0.6,
then split1DF ~= 60 records and split2DF ~= 40 records.
To be clear, this is not what I am looking for. Here the splitting ratio is given in months, which can be derived from the unix epoch timestamp column in the sample dataframe above.
Suppose the epoch column above covers some distribution of 30 months; then I want the first 18 months of epochs in dataframe 1 and the last 12 months of epoch rows in the second dataframe. You can think of this as splitting a dataframe of time series data in Spark.
EDIT-2
If the data covers July 2018 to May 2019 (10 months of data), then split1 (0.6 = first 6 months) = (July 2018 to Jan 2019) and split2 (0.4 = last 4 months) = (Feb 2019 to May 2019). There should be no randomized picking.
Use row_number & filter to split the data into two DataFrames.
scala> val totalMonths = 10
totalMonths: Int = 10
scala> val splitRatio = 0.6
splitRatio: Double = 0.6
scala> val condition = (totalMonths * splitRatio).floor + 1
condition: Double = 7.0
scala> epochDF.show(false)
+----------+-----+
|dt |month|
+----------+-----+
|1530383400|7 |
|1533061800|8 |
|1535740200|9 |
|1538332200|10 |
|1541010600|11 |
|1543602600|12 |
|1546281000|1 |
|1548959400|2 |
|1551378600|3 |
|1554057000|4 |
|1556649000|5 |
+----------+-----+
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> epochDF.orderBy($"dt".asc).withColumn("id",row_number().over(Window.orderBy($"dt".asc))).filter($"id" <= condition).show(false)
+----------+-----+---+
|dt |month|id |
+----------+-----+---+
|2018-07-01|7 |1 |
|2018-08-01|8 |2 |
|2018-09-01|9 |3 |
|2018-10-01|10 |4 |
|2018-11-01|11 |5 |
|2018-12-01|12 |6 |
|2019-01-01|1 |7 |
+----------+-----+---+
scala> epochDF.orderBy($"dt".asc).withColumn("id",row_number().over(Window.orderBy($"dt".asc))).filter($"id" > condition).show(false)
+----------+-----+---+
|dt |month|id |
+----------+-----+---+
|2019-02-01|2 |8 |
|2019-03-01|3 |9 |
|2019-04-01|4 |10 |
|2019-05-01|5 |11 |
+----------+-----+---+
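For reference, a minimal sketch (in spark-shell, with names assumed) of how an epochDF like the one above could be built from a raw epoch column, deriving the month with from_unixtime:
import org.apache.spark.sql.functions._

// hypothetical construction of epochDF from the raw epoch values shown above
val epochDF = Seq(1530383400L, 1533061800L, 1535740200L, 1538332200L, 1541010600L,
  1543602600L, 1546281000L, 1548959400L, 1551378600L, 1554057000L, 1556649000L)
  .toDF("dt")
  .withColumn("month", month(from_unixtime($"dt")))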
I have divided the data based on months, and then on days if the data covers only one month.
I prefer this method since it does not depend on a windowing function. The other answer given here uses Window without partitionBy, which seriously degrades performance because all the data is shuffled to a single executor.
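To illustrate that concern, a sketch (someKey is a hypothetical partitioning column, not one from this data):
import org.apache.spark.sql.expressions.Window

// no partitionBy: Spark moves every row into a single partition before ranking
val globalWindow = Window.orderBy($"dt".asc)

// with a partition key, the ranking work is spread across executors
val partitionedWindow = Window.partitionBy($"someKey").orderBy($"dt".asc)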
1. splitting method given a train ratio in terms of months
val EPOCH = "epoch"
def splitTrainTest(inputDF: DataFrame,
trainRatio: Double): (DataFrame, DataFrame) = {
require(trainRatio >= 0 && trainRatio <= 0.9, s"trainRatio must between 0 and 0.9, found : $trainRatio")
def extractDateCols(tuples: (String, Column)*): DataFrame = {
tuples.foldLeft(inputDF) {
case (df, (dateColPrefix, dateColumn)) =>
df
.withColumn(s"${dateColPrefix}_month", month(from_unixtime(dateColumn))) // month
.withColumn(s"${dateColPrefix}_dayofmonth", dayofmonth(from_unixtime(dateColumn))) // dayofmonth
.withColumn(s"${dateColPrefix}_year", year(from_unixtime(dateColumn))) // year
}
}
val extractDF = extractDateCols((EPOCH, inputDF(EPOCH)))
// derive min/max(yyyy-MM)
val yearCol = s"${EPOCH}_year"
val monthCol = s"${EPOCH}_month"
val dayCol = s"${EPOCH}_dayofmonth"
val SPLIT = "split"
val dateCol = to_date(date_format(
concat_ws("-", Seq(yearCol, monthCol).map(col): _*), "yyyy-MM-01"))
val minMaxDF = extractDF.agg(max(dateCol).as("max_date"), min(dateCol).as("min_date"))
val min_max_date = minMaxDF.head()
import java.sql.{Date => SqlDate}
val minDate = min_max_date.getAs[SqlDate]("min_date")
val maxDate = min_max_date.getAs[SqlDate]("max_date")
println(s"Min Date Found: $minDate")
println(s"Max Date Found: $maxDate")
// Get the total months for which the data exist
val totalMonths = (maxDate.toLocalDate.getYear - minDate.toLocalDate.getYear) * 12 +
maxDate.toLocalDate.getMonthValue - minDate.toLocalDate.getMonthValue
println(s"Total Months of data found for is $totalMonths months")
// difference starts with 0
val splitDF = extractDF.withColumn(SPLIT, round(months_between(dateCol, to_date(lit(minDate)))).cast(DataTypes.IntegerType))
val (trainDF, testDF) = totalMonths match {
// data is provided for more than a month
case tm if tm > 0 =>
val trainMonths = Math.round(totalMonths * trainRatio)
println(s"Data considered for training is < $trainMonths months")
println(s"Data considered for testing is >= $trainMonths months")
(splitDF.filter(col(SPLIT) < trainMonths), splitDF.filter(col(SPLIT) >= trainMonths))
// data is provided for a month, split based on the total records in terms of days
case tm if tm == 0 =>
// val dayCol = PLANNED_START_DAYOFMONTH
val splitDF1 = splitDF.withColumn(SPLIT,
datediff(date_format(
concat_ws("-", Seq(yearCol, monthCol, dayCol).map(col): _*), "yyyy-MM-dd"), lit(minDate))
)
// Get the total days for which the data exist
val totalDays = splitDF1.select(max(SPLIT).as("total_days")).head.getAs[Int]("total_days")
if (totalDays <= 1) {
throw new RuntimeException(s"Insufficient data provided for training, data found for $totalDays days but " +
s"more than 1 day is required")
}
println(s"Total Days of data found is $totalDays days")
val trainDays = Math.round(totalDays * trainRatio)
(splitDF1.filter(col(SPLIT) < trainDays), splitDF1.filter(col(SPLIT) >= trainDays))
// data should be there
case default => throw new RuntimeException(s"Insufficient data provided for training, Data found for $totalMonths " +
s"months but $totalMonths >= 1 required")
}
(trainDF.cache(), testDF.cache())
}
2. Test using the data from multiple months across years
// call methods
val implicits = sqlContext.sparkSession.implicits
import implicits._
val monthData = sc.parallelize(Seq(
1539754800,
1539754800,
1539931200,
1539927600,
1539927600,
1539931200,
1539931200,
1539931200,
1539927600,
1540449600,
1540449600,
1540536000,
1540536000,
1540536000,
1540424400,
1540424400,
1540618800,
1540618800,
1545979320,
1546062120,
1545892920,
1545892920,
1545892920,
1545201720,
1545892920,
1545892920
)).toDF(EPOCH)
val (split1, split2) = splitTrainTest(monthData, 0.6)
split1.show(false)
split2.show(false)
/**
* Min Date Found: 2018-10-01
* Max Date Found: 2018-12-01
* Total Months of data found for is 2 months
* Data considered for training is < 1 months
* Data considered for testing is >= 1 months
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1539754800|10 |17 |2018 |0 |
* |1539754800|10 |17 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1540449600|10 |25 |2018 |0 |
* |1540449600|10 |25 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540424400|10 |25 |2018 |0 |
* |1540424400|10 |25 |2018 |0 |
* |1540618800|10 |27 |2018 |0 |
* |1540618800|10 |27 |2018 |0 |
* +----------+-----------+----------------+----------+-----+
*
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1545979320|12 |28 |2018 |2 |
* |1546062120|12 |29 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545201720|12 |19 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* +----------+-----------+----------------+----------+-----+
*/
3. Test using one month of data from a year
val oneMonthData = sc.parallelize(Seq(
1589514575, // Friday, May 15, 2020 3:49:35 AM
1589600975, // Saturday, May 16, 2020 3:49:35 AM
1589946575, // Wednesday, May 20, 2020 3:49:35 AM
1590378575, // Monday, May 25, 2020 3:49:35 AM
1590464975, // Tuesday, May 26, 2020 3:49:35 AM
1590470135 // Tuesday, May 26, 2020 5:15:35 AM
)).toDF(EPOCH)
val (split3, split4) = splitTrainTest(oneMonthData, 0.6)
split3.show(false)
split4.show(false)
/**
* Min Date Found: 2020-05-01
* Max Date Found: 2020-05-01
* Total Months of data found for is 0 months
* Total Days of data found is 25 days
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1589514575|5 |15 |2020 |14 |
* +----------+-----------+----------------+----------+-----+
*
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1589600975|5 |16 |2020 |15 |
* |1589946575|5 |20 |2020 |19 |
* |1590378575|5 |25 |2020 |24 |
* |1590464975|5 |26 |2020 |25 |
* |1590470135|5 |26 |2020 |25 |
* +----------+-----------+----------------+----------+-----+
*/

Unify the information of a user in a single row of a DataFrame using Scala

dfFilter.show()
+-----+----+----------+------+------+
|CONTR|COD |DATE      |TYPCOD|Amount|
+-----+----+----------+------+------+
|0004 |4433|2006-11-04|RMA   |150.0 |
|0004 |4433|2012-05-14|FCB   |300.0 |
|0004 |1122|2011-10-17|RMA   |100.0 |
|0004 |1122|2015-12-05|FCB   |500.0 |
+-----+----+----------+------+------+
//
val addColumn = dfFilter.withColumn("RMA_AMOUNT", when(col("TYPCOD")==="RMA", col("Amount")))
.withColumn("DATE_RMA", when(col("TYPCOD")==="RMA", col("DATE")))
.withColumn("FCB_AMOUNT", when(col("TYPCOD")==="FCB", col("Amount")))
.withColumn("DATE_FCB", when(col("TYPCOD")==="FCB", col("DATE")))
addColumn.show()
+-----+----+----------+------+------+----------+----------+----------+----------+
|CONTR|COD |DATE      |TYPCOD|Amount|RMA_AMOUNT|DATE_RMA  |FCB_AMOUNT|DATE_FCB  |
+-----+----+----------+------+------+----------+----------+----------+----------+
|0004 |4433|2006-11-04|RMA   |150.0 |150.0     |2006-11-04|null      |null      |
|0004 |4433|2012-05-14|FCB   |300.0 |null      |null      |300.0     |2012-05-14|
|0004 |1122|2011-10-17|RMA   |100.0 |100.0     |2011-10-17|null      |null      |
|0004 |1122|2015-12-05|FCB   |500.0 |null      |null      |500.0     |2015-12-05|
+-----+----+----------+------+------+----------+----------+----------+----------+
I have the same CONTR and COD, but that client has different dates and amounts. I want to group them and keep two rows in the DataFrame. I have added columns derived from the TYPCOD and DATE fields so that later I can keep only two rows in the DataFrame without losing information.
Is it possible?
Expected:
+-----+----+----------+----------+----------+----------+
|CONTR|COD |RMA_AMOUNT|DATE_RMA  |FCB_AMOUNT|DATE_FCB  |
+-----+----+----------+----------+----------+----------+
|0004 |4433|150.0     |2006-11-04|300.0     |2012-05-14|
|0004 |1122|100.0     |2011-10-17|500.0     |2015-12-05|
+-----+----+----------+----------+----------+----------+
Use groupBy and then the first(col, ignoreNulls = true) function for this case.
val df = Seq(
  ("0004","4433","2006-11-04","RMA","150.0","150.0","2006-11-04",null.asInstanceOf[String],null.asInstanceOf[String]),
  ("0004","4433","2012-05-14","FCB","300.0",null.asInstanceOf[String],null.asInstanceOf[String],"300.0","2012-05-14"),
  ("0004","1122","2011-10-17","RMA","100.0","100.0","2011-10-17",null.asInstanceOf[String],null.asInstanceOf[String]),
  ("0004","1122","2015-12-05","FCB","500.0",null.asInstanceOf[String],null.asInstanceOf[String],"500.0","2015-12-05")
).toDF("CONTR","COD","DATE","TYPCOD","Amount","RMA_AMOUNT","DATE_RMA","FCB_AMOUNT","DATE_FCB")
//+-----+----+----------+------+------+----------+----------+----------+----------+
//|CONTR| COD| DATE|TYPCOD|Amount|RMA_AMOUNT| DATE_RMA|FCB_AMOUNT| DATE_FCB|
//+-----+----+----------+------+------+----------+----------+----------+----------+
//| 0004|4433|2006-11-04| RMA| 150.0| 150.0|2006-11-04| null| null|
//| 0004|4433|2012-05-14| FCB| 300.0| null| null| 300.0|2012-05-14|
//| 0004|1122|2011-10-17| RMA| 100.0| 100.0|2011-10-17| null| null|
//| 0004|1122|2015-12-05| FCB| 500.0| null| null| 500.0|2015-12-05|
//+-----+----+----------+------+------+----------+----------+----------+----------+
df.groupBy("CONTR","COD").agg(first(col("RMA_AMOUNT"),true).alias("RMA_AMOUNT"),first(col("DATE_RMA"),true).alias("DATE_RMA"),first(col("FCB_AMOUNT"),true).alias("FCB_AMOUNT"),first(col("DATE_FCB"),true).alias("DATE_FCB")).show()
//+-----+----+----------+----------+----------+----------+
//|CONTR| COD|RMA_AMOUNT| DATE_RMA|FCB_AMOUNT| DATE_FCB|
//+-----+----+----------+----------+----------+----------+
//| 0004|4433| 150.0|2006-11-04| 300.0|2012-05-14|
//| 0004|1122| 100.0|2011-10-17| 500.0|2015-12-05|
//+-----+----+----------+----------+----------+----------+
//incase if you want to keep TYPCOD and DATE values
df.groupBy("CONTR","COD").agg(concat_ws(",",collect_list(col("TYPCOD"))).alias("TYPECOD"),concat_ws(",",collect_list(col("DATE"))).alias("DATE"),first(col("RMA_AMOUNT"),true).alias("RMA_AMOUNT"),first(col("DATE_RMA"),true).alias("DATE_RMA"),first(col("FCB_AMOUNT"),true).alias("FCB_AMOUNT"),first(col("DATE_FCB"),true).alias("DATE_FCB")).show(false)
//+-----+----+-------+---------------------+----------+----------+----------+----------+
//|CONTR|COD |TYPECOD|DATE |RMA_AMOUNT|DATE_RMA |FCB_AMOUNT|DATE_FCB |
//+-----+----+-------+---------------------+----------+----------+----------+----------+
//|0004 |4433|RMA,FCB|2006-11-04,2012-05-14|150.0 |2006-11-04|300.0 |2012-05-14|
//|0004 |1122|RMA,FCB|2011-10-17,2015-12-05|100.0 |2011-10-17|500.0 |2015-12-05|
//+-----+----+-------+---------------------+----------+----------+----------+----------+
Yes, it's possible. Please check the code below.
scala> val df = Seq(("0004",4433,"2006-11-04","RMA",150.0),("0004",4433,"2012-05-14","FCB",300.0),("0004",1122,"2011-10-17","RMA",100.0),("0004",1122,"2015-12-05","FCB",500.0)).toDF("contr","cod","date","typcod","amount")
df: org.apache.spark.sql.DataFrame = [contr: string, cod: int ... 3 more fields]
scala> val rma = df.filter($"typcod" === "RMA").select($"contr",$"cod",$"date".as("rma_date"),$"typcod",$"amount".as("rma_amount"))
rma: org.apache.spark.sql.DataFrame = [contr: string, cod: int ... 3 more fields]
scala> rma.show(false)
+-----+----+----------+------+----------+
|contr|cod |rma_date |typcod|rma_amount|
+-----+----+----------+------+----------+
|0004 |4433|2006-11-04|RMA |150.0 |
|0004 |1122|2011-10-17|RMA |100.0 |
+-----+----+----------+------+----------+
scala> val fcb = df.filter($"typcod" === "FCB").select($"contr",$"cod",$"date".as("fcb_date"),$"typcod",$"amount".as("fcb_amount")).drop("contr")
fcb: org.apache.spark.sql.DataFrame = [cod: int, fcb_date: string ... 2 more fields]
scala> fcb.show(false)
+----+----------+------+----------+
|cod |fcb_date |typcod|fcb_amount|
+----+----------+------+----------+
|4433|2012-05-14|FCB |300.0 |
|1122|2015-12-05|FCB |500.0 |
+----+----------+------+----------+
scala> rma.join(fcb,Seq("cod"),"inner").select("contr","cod","rma_amount","rma_date","fcb_amount","fcb_date").show(false)
+-----+----+----------+----------+----------+----------+
|contr|cod |rma_amount|rma_date |fcb_amount|fcb_date |
+-----+----+----------+----------+----------+----------+
|0004 |4433|150.0 |2006-11-04|300.0 |2012-05-14|
|0004 |1122|100.0 |2011-10-17|500.0 |2015-12-05|
+-----+----+----------+----------+----------+----------+
scala>
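A third option, sketched as an alternative (not from either answer above), is to pivot on TYPCOD directly from the original four-value dfFilter, which avoids building the intermediate null columns. Note the pivoted columns come out as RMA_AMOUNT, RMA_DATE, FCB_AMOUNT, FCB_DATE, so a rename may be needed to match the exact expected headers:
import org.apache.spark.sql.functions.first

// pivot TYPCOD into column groups; each (CONTR, COD, TYPCOD) group has one row,
// so first() simply picks that row's Amount and DATE
val pivoted = dfFilter
  .groupBy("CONTR", "COD")
  .pivot("TYPCOD", Seq("RMA", "FCB"))
  .agg(first("Amount").as("AMOUNT"), first("DATE").as("DATE"))

pivoted.show(false)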

Add column to dataframe based on operation (min, max, sum) between other columns

I have this dataframe:
val df = Seq(
("thin", "Cell phone", 6000, 150, "01/01/2018"),
("Normal", "Tablet", 1500, 200, "01/01/2018"),
("Mini", "Tablet", 2000, 250, "02/01/2018"),
("Ultra thin", "Cell phone", 5000, 300, "02/01/2018"),
("Very thin", "Cell phone", 6000, 400, "03/01/2018"),
("Big", "Tablet", 4500, 250, "03/01/2018"),
("Bendable", "Cell phone", 3000, 200, "04/01/2018"),
("Fordable", "Cell phone", 3000, 150, "05/01/2018"),
("Pro", "Cell phone", 4500, 300, "06/01/2018"),
("Pro2", "Tablet", 6500, 350, "04/01/2018")).toDF("product", "category",
"revenue", "extra", "date")
I am trying to add a column to this dataframe that contains an operation based on the columns revenue and extra. Let's say a min operation, so that I get a column such as this:
df.withColumn("output", min("revenue", "extra"))
The problem I am finding with Spark functions is that these min, max aggregations are applied vertically, within a column. However, my goal here is to apply these concepts horizontally, across columns.
Thanks
You can use a UDF for that. Check this out.
scala> val df = Seq(
| ("thin", "Cell phone", 6000, 150, "01/01/2018"),
| ("Normal", "Tablet", 1500, 200, "01/01/2018"),
| ("Mini", "Tablet", 2000, 250, "02/01/2018"),
| ("Ultra thin", "Cell phone", 5000, 300, "02/01/2018"),
| ("Very thin", "Cell phone", 6000, 400, "03/01/2018"),
| ("Big", "Tablet", 4500, 250, "03/01/2018"),
| ("Bendable", "Cell phone", 3000, 200, "04/01/2018"),
| ("Fordable", "Cell phone", 3000, 150, "05/01/2018"),
| ("Pro", "Cell phone", 4500, 300, "06/01/2018"),
| ("Pro2", "Tablet", 6500, 350, "04/01/2018")).toDF("product", "category",
| "revenue", "extra", "date")
df: org.apache.spark.sql.DataFrame = [product: string, category: string ... 3 more fields]
scala> df.printSchema
root
|-- product: string (nullable = true)
|-- category: string (nullable = true)
|-- revenue: integer (nullable = false)
|-- extra: integer (nullable = false)
|-- date: string (nullable = true)
scala> def min2col(x:Int,y:Int):Int =
| return if(x<y) x else y
min2col: (x: Int, y: Int)Int
scala> val myudfmin2col = udf( min2col(_:Int,_:Int):Int )
myudfmin2col: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(IntegerType, IntegerType)))
scala> df.withColumn("output",myudfmin2col('extra,'revenue)).show(false)
+----------+----------+-------+-----+----------+------+
|product |category |revenue|extra|date |output|
+----------+----------+-------+-----+----------+------+
|thin |Cell phone|6000 |150 |01/01/2018|150 |
|Normal |Tablet |1500 |200 |01/01/2018|200 |
|Mini |Tablet |2000 |250 |02/01/2018|250 |
|Ultra thin|Cell phone|5000 |300 |02/01/2018|300 |
|Very thin |Cell phone|6000 |400 |03/01/2018|400 |
|Big |Tablet |4500 |250 |03/01/2018|250 |
|Bendable |Cell phone|3000 |200 |04/01/2018|200 |
|Fordable |Cell phone|3000 |150 |05/01/2018|150 |
|Pro |Cell phone|4500 |300 |06/01/2018|300 |
|Pro2 |Tablet |6500 |350 |04/01/2018|350 |
+----------+----------+-------+-----+----------+------+
scala>
EDIT1:
scala> df.createOrReplaceTempView("product")
scala> spark.sql("select product,category,revenue,extra,date, case when revenue<extra then revenue else extra end as minextra from product ").show(false)
+----------+----------+-------+-----+----------+--------+
|product |category |revenue|extra|date |minextra|
+----------+----------+-------+-----+----------+--------+
|thin |Cell phone|6000 |150 |01/01/2018|150 |
|Normal |Tablet |1500 |200 |01/01/2018|200 |
|Mini |Tablet |2000 |250 |02/01/2018|250 |
|Ultra thin|Cell phone|5000 |300 |02/01/2018|300 |
|Very thin |Cell phone|6000 |400 |03/01/2018|400 |
|Big |Tablet |4500 |250 |03/01/2018|250 |
|Bendable |Cell phone|3000 |200 |04/01/2018|200 |
|Fordable |Cell phone|3000 |150 |05/01/2018|150 |
|Pro |Cell phone|4500 |300 |06/01/2018|300 |
|Pro2 |Tablet |6500 |350 |04/01/2018|350 |
+----------+----------+-------+-----+----------+--------+
scala>
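For completeness, Spark also ships built-in functions for row-wise min/max across columns, so a UDF is not strictly required; a minimal sketch:
import org.apache.spark.sql.functions.{col, greatest, least}

// least() takes the row-wise minimum of the given columns; greatest() the maximum
df.withColumn("output", least(col("revenue"), col("extra"))).show(false)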