I was asked to aggregate my data into two-week periods. My data starts on June 1st 2020, which happens to be a Monday, and from that day onward I need to aggregate it in two-week buckets.
I will illustrate with the simple dataset below:
+----+-----------+----------+
| id|day_revenue| ts_string|
+----+-----------+----------+
| 1| 10|2020-06-01|
| 1| 8|2020-06-04|
| 2| 10|2020-06-30|
|1081| 100|2020-07-07|
+----+-----------+----------+
I have skipped a lot of data; this is only a sample.
My goal is to make this dataframe look like the one below.
+----+-----------+----------+-------------+---------------------+
|id |day_revenue|ts_string |bi_week_start|bi_week_full |
+----+-----------+----------+-------------+---------------------+
|1 |10 |2020-06-01|2020-06-01 |2020-06-01/2020-06-14|
|1 |8 |2020-06-04|2020-06-01 |2020-06-01/2020-06-14|
|2 |10 |2020-06-30|2020-06-29 |2020-06-29/2020-07-12|
|1081|100 |2020-07-07|2020-06-29 |2020-06-29/2020-07-12|
+----+-----------+----------+-------------+---------------------+
So, whatever you do, 2020-06-01 is the anchor day: every two-week period is split based on this date. The bi_week_full column is not strictly necessary, but if you can create a column in that format it will be much more convenient for me.
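To spell out the rule with one row from the sample: 2020-06-30 falls 29 days after 2020-06-01, which is 2 complete 14-day periods plus 1 day, so it belongs to the bucket starting at 2020-06-01 + 28 days = 2020-06-29 and ending on 2020-07-12, exactly as in the expected output above.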
There are two things to keep in mind:
1. My dataframe is still growing every single day, so I would like to find a universal function or general way to handle this kind of ad-hoc request.
2. My data starts from 2020-06-01 and has accumulated for more than two years, so it is a big dataframe.
Thank you in advance.
You can create the sample dataframe with the code below:
data_ls = [('1', '10', '2020-06-01'),
('1', '8', '2020-06-04'),
('2', '10', '2020-06-30'),
('1081', '100', '2020-07-07'),]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'day_revenue', 'ts_string'])
You can use a combination of datediff() and date_add():
import pyspark.sql.functions as F
(
data_sdf
.withColumn('start', F.lit('2020-06-01'))
.withColumn('n', F.floor(F.datediff('ts_string', 'start') / 14).cast('int'))
.select(
'id', 'day_revenue', 'ts_string',
F.date_add('start', F.col('n') * 14).alias('bi_week_start'),
F.date_add('start', (F.col('n') + 1) * 14 - 1).alias('bi_week_end'),
)
.withColumn('bi_week_full', F.concat_ws('/', 'bi_week_start', 'bi_week_end'))
.show(truncate=False)
)
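Since the question asks for a universal function that keeps working as the dataframe grows, here is a minimal sketch that wraps the same datediff/date_add logic into a reusable helper. The function name and parameter defaults are my own choices, and it assumes a Spark version where date_add accepts a Column for the day count, as the snippet above already does.

import pyspark.sql.functions as F

def with_bi_week(sdf, ts_col='ts_string', anchor='2020-06-01', period_days=14):
    # bucket `ts_col` into fixed-length periods anchored at `anchor`;
    # new rows simply fall into later buckets as the data grows
    n = F.floor(F.datediff(F.col(ts_col), F.lit(anchor)) / period_days)
    start = F.date_add(F.lit(anchor), (n * period_days).cast('int'))
    end = F.date_add(F.lit(anchor), ((n + 1) * period_days - 1).cast('int'))
    return (sdf
            .withColumn('bi_week_start', start)
            .withColumn('bi_week_full', F.concat_ws('/', start, end)))

# usage against the sample dataframe from the question:
# with_bi_week(data_sdf).show(truncate=False)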
Related
Here's a link to an example of what I want to achieve: https://community.powerbi.com/t5/Desktop/Append-Rows-using-Another-columns/m-p/401836. Basically, I need to merge all the rows of a pair of columns into another pair of columns. How can I do this in Spark Scala?
Correct me if I'm wrong, but as I understand it you have a dataframe with two pairs of date/cost columns, and you want the second pair stacked as rows under the first pair, right?
For instance with this input (only two rows for simplicity)
df.show
+----+----------+-----------+----------+---------+
|name| date1| cost1| date2| cost2|
+----+----------+-----------+----------+---------+
| A|2013-03-25|19923245.06| | |
| B|2015-06-04| 4104660.00|2017-10-16|392073.48|
+----+----------+-----------+----------+---------+
With just a couple of selects and a union you can achieve what you want:
df.select("name", "date1", "cost1")
.union(df.select("name", "date2", "cost2"))
.withColumnRenamed("date1", "date")
.withColumnRenamed("cost1", "cost")
+----+----------+-----------+
|name| date| cost|
+----+----------+-----------+
| A|2013-03-25|19923245.06|
| B|2015-06-04| 4104660.00|
| A| | |
| B|2017-10-16| 392073.48|
+----+----------+-----------+
I have a dataset within which there are multiple groups. I have a rank column which incrementally counts each entry per group. An example of this structure is shown below:
+-----------+---------+---------+
| equipment| run_id|run_order|
+-----------+---------+---------+
|1 |430032589| 1|
|1 |430332632| 2|
|1 |430563033| 3|
|1 |430785715| 4|
|1 |431368577| 5|
|1 |431672148| 6|
|2 |435497596| 1|
|1 |435522469| 7|
+-----------+---------+---------+
Each group (equipment) has a different number of runs. As shown above, equipment 1 has 7 runs whilst equipment 2 has 1 run. I would like to select the first and last n runs per equipment. Selecting the first n runs is straightforward:
df.select("equipment", "run_id").distinct().where(df.run_order <= n).orderBy("equipment").show()
The distinct is in the query because each row corresponds to a timestep and logs the sensor readings for that timestep, so there are many rows with the same equipment, run_id and run_order; these should be preserved in the end result and not aggregated.
As the number of runs differs per equipment, I can't (I think) write an equivalent select query with a static where clause to get the last n runs:
df.select("equipment", "run_id").distinct().where(df.rank >= total_runs - n).orderBy("equipment").show()
I can run a groupBy to get the highest run_order for each equipment:
+-----------+----------------+
| equipment| max(run_order) |
+-----------+----------------+
|1 | 7|
|2 | 1|
+-----------+----------------+
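For reference, a minimal sketch of the groupBy behind that summary (assuming the same df and column names as above):

import pyspark.sql.functions as F

# highest run_order seen for each equipment
max_runs = df.groupBy('equipment').agg(F.max('run_order'))
max_runs.orderBy('equipment').show()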
But I am unsure whether there is a way to construct a dynamic where clause that works like this, so that I get the last n runs (including all timestep data for each run).
You can add a column of the max rank for each equipment and do a filter based on that column:
from pyspark.sql import functions as F, Window
n = 3
df2 = df.withColumn(
'max_run',
F.max('run_order').over(Window.partitionBy('equipment'))
).where(F.col('run_order') > F.col('max_run') - n)  # strictly greater keeps exactly the last n runs
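If you also need the first n runs, a small sketch (my extension of the answer above, not part of it) that combines both conditions in a single pass could look like this:

from pyspark.sql import functions as F, Window

n = 3
w = Window.partitionBy('equipment')
df_first_and_last = (
    df
    .withColumn('max_run', F.max('run_order').over(w))
    # first n runs OR last n runs of each equipment, keeping every timestep row
    .where((F.col('run_order') <= n) | (F.col('run_order') > F.col('max_run') - n))
)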
I have a sample CSV file with columns as shown below.
col1,col2
1,57.5
2,24.0
3,56.7
4,12.5
5,75.5
I want a new timestamp column in HH:mm:ss format, and the timestamp should keep increasing by one second per row, as shown below.
col1,col2,ts
1,57.5,00:00:00
2,24.0,00:00:01
3,56.7,00:00:02
4,12.5,00:00:03
5,75.5,00:00:04
Thanks in advance for your help.
I can propose a solution based on PySpark; the Scala implementation should be almost identical.
My idea is to create a column filled with a fixed timestamp (here 1980 as an example, but it does not matter) and add seconds based on your first column (the row number). Then you just reformat the timestamp to show only the time of day:
import pyspark.sql.functions as psf
df = (df
.withColumn("ts", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("ts", psf.col("ts") + psf.col("i") - 1)
.withColumn("ts", psf.from_unixtime("ts", format='HH:mm:ss'))
)
df.show(2)
+---+----+---------+
| i| x| ts|
+---+----+---------+
| 1|57.5| 00:00:00|
| 2|24.0| 00:00:01|
+---+----+---------+
only showing top 2 rows
Data generation
df = spark.createDataFrame([(1,57.5),
(2,24.0),
(3,56.7),
(4,12.5),
(5,75.5)], ['i','x'])
df.show(2)
+---+----+
| i| x|
+---+----+
| 1|57.5|
| 2|24.0|
+---+----+
only showing top 2 rows
Update: if you don't have a row number in your csv (from your comment)
In that case, you will need the row_number function.
Numbering rows is not straightforward in Spark because the data is distributed across independent partitions and locations, and the order observed in the csv is not guaranteed to be preserved when Spark maps file rows to partitions. If the order in the file is important, I think it is better not to use Spark to number your rows; a pre-processing step based on pandas, looping over all your files one at a time, could make it work (a rough sketch follows).
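As an illustration of that pandas pre-processing idea (the 'data/*.csv' glob and the column name 'i' are placeholders, not from the original post):

import glob
import pandas as pd

# add an explicit 1-based row number to each csv before handing it to Spark,
# so the original file order no longer has to be inferred
for path in glob.glob('data/*.csv'):
    pdf = pd.read_csv(path)
    pdf.insert(0, 'i', range(1, len(pdf) + 1))
    pdf.to_csv(path, index=False)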
Anyway, here is a Spark-only solution if you don't mind the row order being different from the one in the csv stored on disk.
import pyspark.sql.window as psw
w = psw.Window.partitionBy().orderBy("x")
(df
.drop("i")
.withColumn("i", psf.row_number().over(w))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+---+---------+
| x| i|Timestamp|
+----+---+---------+
|12.5| 1| 00:00:00|
|24.0| 2| 00:00:01|
+----+---+---------+
only showing top 2 rows
In terms of efficiency this is bad (it is like collecting all the data into a single partition) because the window has no partitionBy; for this step, using Spark is overkill.
You could also add a temporary constant column and order by that. In this particular example it produces the expected output, but I am not sure it works well in general:
w2 = psw.Window.partitionBy().orderBy("temp")
(df
.drop("i")
.withColumn("temp", psf.lit(1))
.withColumn("i", psf.row_number().over(w2))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+----+---+---------+
| x|temp| i|Timestamp|
+----+----+---+---------+
|57.5| 1| 1| 00:00:00|
|24.0| 1| 2| 00:00:01|
+----+----+---+---------+
only showing top 2 rows
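A variant of the same workaround (my suggestion, with the usual caveat that Spark does not guarantee this matches the order of the csv on disk) is to materialise monotonically_increasing_id(), which increases with the read order inside each partition, and order the window by it:

import pyspark.sql.functions as psf
import pyspark.sql.window as psw

df_with_id = df.drop("i").withColumn("mono_id", psf.monotonically_increasing_id())
w3 = psw.Window.partitionBy().orderBy("mono_id")

(df_with_id
 .withColumn("i", psf.row_number().over(w3))
 .withColumn("Timestamp", psf.unix_timestamp(psf.lit('1980-01-01 00:00:00'), 'yyyy-MM-dd HH:mm:ss') + psf.col("i") - 1)
 .withColumn("Timestamp", psf.from_unixtime("Timestamp", 'HH:mm:ss'))
 .drop("mono_id")
 .show(2)
)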
For a given order_id, I am trying to count how many orders in the past 365 days had a payment. That part is not the problem: I use a window function.
Where it gets tricky is: I don't want to count orders in this time window whose payment_date is after the order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1 |2017-01-01 |2017-01-10 |A |
|2 |2017-02-01 |2017-02-10 |A |
|3 |2017-02-02 |2017-02-20 |A |
+---------+-----------+-------------+------------+
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1 |2017-01-01 |2017-01-10 |A |0 |
|2 |2017-02-01 |2017-02-10 |A |1 |
|3 |2017-02-02 |2017-02-20 |A |1 |
+---------+-----------+-------------+------------+-----------------+
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!
Very good question!!!
A couple of remarks. rangeBetween does not work with Date and Timestamp datatypes in the DataFrame API (at least not in the Spark version at hand), and falling back to a row-based frame (rowsBetween) counts rows rather than values, which is problematic in two cases:
1. the customer does not have orders every single day, so a 365-row window might contain rows with an order_date well before one year ago
2. the customer has more than one order per day, which also breaks the one-year coverage
or any combination of 1 and 2.
To solve it, you can use a window function that collects lists, plus a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.sparkContext.parallelize(Seq(
(1, "2017-01-01", "2017-01-10", "A")
, (2, "2017-02-01", "2017-02-10", "A")
, (3, "2017-02-02", "2017-02-20", "A")
)
).toDF("order_id", "order_date", "payment_date", "customer_id")
.withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
.withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
val bottom = top - (days * 60 * 60 * 24).toLong // Spark timestamps are in seconds; this is the cutoff `days` days before `top`
array.count(v => v >= bottom && v < top)
}
)
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
.withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting dates to Spark timestamps in seconds makes the lists more memory efficient.
This is the easiest code to implement, but not the most optimal, as the lists take up some memory; a custom UDAF would be better but requires more coding (I might do it later). It will still work even if you have thousands of orders per customer.
The original DataFrame looks like:
+--------------------+--------------------+--------------------+
| user_id| measurement_date| features|
+--------------------+--------------------+--------------------+
|b6d0bb3d-7a8e-4ac...|2016-06-28 02:00:...|[3492.68576170840...|
..
|048ffee9-a942-4d1...|2016-04-28 02:00:...|[1404.42230898422...|
|05101595-5a6f-4cd...|2016-07-10 02:00:...|[1898.50082132108...|
+--------------------+--------------------+--------------------+
My pivoting efforts:
data = data.select(data.col("user_id"),data.col("features"),data.col("measurement_date").cast(DateType).alias("date")).filter(data.col("measurement_date").between("2016-01-01", "2016-01-07"))
data = data.select(data.col("user_id"),data.col("features"),data.col("date")).groupBy("user_id","features").pivot("date").min()
The output is:
+--------------------+--------------------+
| user_id| features|
+--------------------+--------------------+
|14cd26dc-200a-436...|[2281.34579074947...|
..
|d8ae1b5e-c1e0-4bf...|[2568.49641198251...|
|1cceb175-12b4-4c3...|[4436.36029554227...|
+--------------------+--------------------+
The columns I want (2016-01-01 through 2016-01-07) are missing; nothing was pivoted at all.
What am I doing wrong?
EDIT:
This is how the DataFrame looks after the first statement:
+--------------------+--------------------+----------+
| user_id| features| date|
+--------------------+--------------------+----------+
|60f1cd63-0d5a-4f2...|[1553.35305181118...|2016-01-05|
|a56d1fef-5f17-4c9...|[1704.34897309186...|2016-01-02|
..
|992b6a34-803d-44b...|[1518.14292508305...|2016-01-05|
+--------------------+--------------------+----------+
It might be noteworthy that (user_id, features) is not a complete time series; there are gaps in the data. Sometimes there are no measurements for certain dates, and in that case I want null as the entry.
You forgot the aggregation part, so your second line of code should be:
data = data.select(data.col("user_id"),data.col("features"),data.col("date")).groupBy("user_id","features").pivot("date").agg(min("date"))