Let's say I have flight data (from Foundry Academy).
Starting dataset:

| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| jan  | 000000001 | California   | delta air    |
| jan  | 000000002 | Alabama      | delta air    |
| jan  | 000000003 | California   | southwest    |
| feb  | 000000004 | California   | southwest    |
| ...  | ...       | ...          | ...          |
I'm doing a monthly data aggregation by state and by carrier. My aggregated data looks like this:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| Alabama      | delta air    | 1   | 0   | ... |
| California   | delta air    | 1   | 0   | ... |
| California   | southwest    | 1   | 1   | ... |
I need to get subtotals for each state; I need to sort by most flights; and I want the result ordered by state, then by carrier within each state.
Desired output:

| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| California   | null         | 2   | 1   | ... |
| California   | delta air    | 1   | 0   | ... |
| California   | southwest    | 1   | 1   | ... |
| Alabama      | null         | 1   | 0   | ... |
| Alabama      | delta air    | 1   | 0   | ... |
PIVOT - doesn't provide subtotals for categories;
EXPRESSION - doesn't offer a way to split the date column into separate month columns.
I solved it in Contour. It's not the prettiest solution, but it works.
I created two paths from the same dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ... | ... | ... | ... |
The 1st path was used to calculate the full aggregation, via a pivot table switched to pivoted data:
Switch to pivoted data: using column "date",
grouped by "origin_state" and "carrier_name",
aggregated by Count
The 2nd path was used to get the subtotals:
Switch to pivoted data: using column "date",
grouped by "origin_state",
aggregated by Count
Afterwards I added an empty "carrier_name" column to the second dataset and made a union of both datasets:
Add rows that appear in "second_path" by column name
After that I added an additional column with an expression:
Add new column "order" from
max("Jan") OVER (PARTITION BY "origin_state")
After that I sorted the resulting dataset:
Sort dataset by "order" descending, then by "Jan" descending
I got the result I wanted, but it has the extra "order" column, and now I'd like to change the row formatting of the subtotal rows.
Other approaches are welcome, as my real data has more hierarchical levels.
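For what it's worth, the same two-path approach can be written as one PySpark job (e.g. in a Foundry transform). This is a minimal sketch, not the exact Contour boards: the SparkSession/DataFrame setup is mine, and it assumes the month values are known up front.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window as W

spark = SparkSession.builder.getOrCreate()

# Toy copy of the starting dataset above.
flights = spark.createDataFrame(
    [("jan", "000000001", "California", "delta air"),
     ("jan", "000000002", "Alabama", "delta air"),
     ("jan", "000000003", "California", "southwest"),
     ("feb", "000000004", "California", "southwest")],
    ["Date", "flight_id", "origin_state", "carrier_name"],
)
months = ["jan", "feb"]  # assumed to be known up front

# 1st path: full aggregation, one count column per month.
detail = (flights
          .groupBy("origin_state", "carrier_name")
          .pivot("Date", months)
          .count()
          .na.fill(0, months))

# 2nd path: per-state subtotals, with carrier_name left null.
subtotals = (flights
             .groupBy("origin_state")
             .pivot("Date", months)
             .count()
             .na.fill(0, months)
             .withColumn("carrier_name", F.lit(None).cast("string")))

# Union, then sort. The per-state max of "jan" is the subtotal itself,
# so it doubles as the state-level sort key; nulls-first on carrier_name
# keeps each subtotal row ahead of its detail rows on ties.
result = (detail.unionByName(subtotals)
          .withColumn("order", F.max("jan").over(W.partitionBy("origin_state")))
          .orderBy(F.col("order").desc(),
                   F.col("jan").desc(),
                   F.col("carrier_name").asc_nulls_first())
          .drop("order"))

result.show()
```

For deeper hierarchies the same pattern extends by unioning one aggregation per level. Spark's rollup() can also emit every subtotal level in one pass, although, as far as I know, it cannot be combined with pivot() directly.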
Related
I have two Spark DataFrames. The first one (Events) contains event information as follows:

| Event_id | Date       | User_id |
| -------- | ---------- | ------- |
| 1        | 2019-04-19 | 1       |
| 2        | 2019-05-30 | 2       |
| 3        | 2020-01-20 | 1       |
The second one (User) contains information about users, as below:

| Id | User_id | Date       | Weight-kg |
| -- | ------- | ---------- | --------- |
| 1  | 1       | 2019-04-05 | 78        |
| 2  | 1       | 2019-04-17 | 75        |
| 3  | 2       | 2019-10-10 | 50        |
| 4  | 1       | 2020-02-10 | 76        |
What I'd like to know is: how do I bring in the latest weight value from User before the event Date, using PySpark?
The result should be the following table:

| Event_id | Date       | User_id | Weight-kg |
| -------- | ---------- | ------- | --------- |
| 1        | 2019-04-19 | 1       | 75        |
| 2        | 2019-05-30 | 2       | null      |
| 3        | 2020-01-20 | 1       | 75        |
The idea is to left join events and users, then rank the weights by date so we can keep the latest one before each event:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(event
    # left join to keep all events;
    # note the join condition: event's date >= user's date,
    # so only weights recorded on or before the event date survive
    .join(
        user,
        on=[
            event['User_id'] == user['User_id'],
            event['Date'] >= user['Date'],
        ],
        how='left'
    )
    # rank each user's weights per event, latest date first;
    # partitioning by Event_id keeps the ranking independent for each event
    .withColumn(
        'rank_weight',
        F.rank().over(W.partitionBy(event['Event_id']).orderBy(user['Date'].desc()))
    )
    .where(F.col('rank_weight') == 1)
    .drop('rank_weight')
    # drop unnecessary columns
    .drop(user['User_id'])
    .drop(user['Date'])
    .drop('Id')
    .orderBy('Event_id')
    .show()
)

# Output
# +--------+----------+-------+---------+
# |Event_id|      Date|User_id|Weight-kg|
# +--------+----------+-------+---------+
# |       1|2019-04-19|      1|       75|
# |       2|2019-05-30|      2|     null|
# |       3|2020-01-20|      1|       75|
# +--------+----------+-------+---------+
I have a table named hotel with 2 columns: hotel_name, hotel_price
hotel_name | hotel_price
hotel1 | 5
hotel2 | 20
hotel3 | 100
hotel4 | 50
and another table named city that contains the columns: city_name, average_prices
city_name | average_prices
paris | 20
london | 30
rome | 75
madrid | 100
I want to find which hotels have a price that's more expensive than the average prices in the cities. For example, I want to end up with something like this:
hotel_name | city_name
hotel3 | paris --hotel3 is more expensive than the average price in paris
hotel3 | london --hotel3 is more expensive than the average price in london etc.
hotel3 | rome
hotel4 | paris
hotel4 | london
(I found the hotels that are more expensive than the average prices of the cities)
Any help would be valuable, thank you.
A simple join is all that is needed. Typically tables are joined on a defined relationship (PK/FK) but there is nothing requiring that. See fiddle.
select h.hotel_name, c.city_name
from hotel h
join city c
  on h.hotel_price > c.average_prices;
However, while you can get the desired results, it's pretty meaningless. You cannot tell whether a particular hotel is even in a given city.
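For comparison, here is the same non-equi (theta) join in PySpark; a minimal sketch in which the SparkSession and DataFrame setup are mine:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hotel = spark.createDataFrame(
    [("hotel1", 5), ("hotel2", 20), ("hotel3", 100), ("hotel4", 50)],
    ["hotel_name", "hotel_price"],
)
city = spark.createDataFrame(
    [("paris", 20), ("london", 30), ("rome", 75), ("madrid", 100)],
    ["city_name", "average_prices"],
)

# The join condition is an inequality rather than an equality, so every
# hotel is paired with every city whose average price it exceeds.
(hotel
 .join(city, hotel["hotel_price"] > city["average_prices"])
 .select("hotel_name", "city_name")
 .orderBy("hotel_name", "city_name")
 .show())
```

The same caveat applies: without a hotel-to-city relationship, the pairing says nothing about where a hotel actually is.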
I have a file with a [start] and [end] date in Tableau and would like to create a calculated field that counts the number of rows, on a rolling basis, that occur between [start] and [end] for each [person]. The data is like so:
| Start    | End       | Person |
| -------- | --------- | ------ |
| 1/1/2019 | 1/7/2019  | A      |
| 1/3/2019 | 1/9/2019  | A      |
| 1/8/2019 | 1/15/2019 | A      |
| 1/1/2019 | 1/7/2019  | B      |
I'd like to create a calculated field [count] with results like so:
| Start    | End       | Person | Count |
| -------- | --------- | ------ | ----- |
| 1/1/2019 | 1/7/2019  | A      | 1     |
| 1/3/2019 | 1/9/2019  | A      | 2     |
| 1/8/2019 | 1/15/2019 | A      | 2     |
| 1/1/2019 | 1/7/2019  | B      | 1     |
EDITED: A good analogy for what [count] represents is: "how many videos does each person have rented at that moment?" In the 1st row for person A, count is 1, with 1 item rented. As of row 2, person A has 2 items rented. But for the 3rd row [count] = 2, since the video rented in the first row is no longer rented.
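One way to make the counting rule concrete: each row's [count] is the number of the same person's rental intervals that contain that row's [start] date. Below is a minimal PySpark sketch of that rule (all DataFrame names are mine), for checking the expected numbers outside Tableau:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rentals = (spark.createDataFrame(
    [("1/1/2019", "1/7/2019", "A"),
     ("1/3/2019", "1/9/2019", "A"),
     ("1/8/2019", "1/15/2019", "A"),
     ("1/1/2019", "1/7/2019", "B")],
    ["Start", "End", "Person"])
    .withColumn("Start", F.to_date("Start", "M/d/yyyy"))
    .withColumn("End", F.to_date("End", "M/d/yyyy")))

r = rentals.alias("r")  # the row we compute Count for
o = rentals.alias("o")  # candidate overlapping rentals

# Rental o is still "out" at r's start when o.Start <= r.Start <= o.End.
counts = (r.join(o,
                 (F.col("r.Person") == F.col("o.Person"))
                 & (F.col("o.Start") <= F.col("r.Start"))
                 & (F.col("r.Start") <= F.col("o.End")))
          # note: exact duplicate rows would collapse in this groupBy
          .groupBy("r.Person", "r.Start", "r.End")
          .agg(F.count(F.lit(1)).alias("Count")))

counts.orderBy("Person", "Start").show()
# A: 1, 2, 2 and B: 1, matching the desired table above
```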
I have two tables:
1. items
2. festival_rents

Sample records for items:
id | name | rent
------------------------
1 | Car | 100
2 | Truck | 150
3 | Van | 200
Sample records for festival_rents:
id | items_id | start_date | end_date | rent
------------------------------------------------
1 | 1 | 2018-07-01 | 2018-07-02 | 200
2 | 1 | 2018-07-04 | 2018-07-06 | 300
3 | 3 | 2018-07-06 | 2018-07-07 | 400
The table items contains a list of items with name and rent. Each item in the items table may or may not have festival_rents. The table festival_rents holds higher rents for an item over a date range, with start_date and end_date. It is possible for an item to have multiple festival_rents with different date ranges, but it is guaranteed that the date ranges of multiple festival_rents belonging to the same item never collide; all date ranges are disjoint.
The query I'm looking for: given a start_date and end_date range, for each item in the items table, calculate the total rent and display each item with its calculated total rent. The rent calculation for each item should also include the festival_rents, if any of the item's festival_rents fall within the given start_date and end_date.
Expected result:
Input: start_date=2018-07-01 and end_date=2018-07-06
Output:
id | name | total_price
------------------------
1 | Car | 1400 // first 2 days festival rent + 1 day normal rent + last 3 days festival rent: (2 * 200) + (1 * 100) + (3 * 300)
2 | Truck | 900 // 6 days normal rent (6 * 150)
3 | Van | 1400 // 5 days normal rent + 1 day festival rent (200 * 5) + (400 * 1)
You need a list of days, either in a table or created on the fly:
How to get list of dates between two dates in mysql select query
Generating time series between two dates in PostgreSQL
SELECT i.name, SUM(f.rent)
FROM allDays a
JOIN festival_rents f
  ON a.day >= f.start_date
 AND a.day < f.end_date
JOIN items i
  ON f.items_id = i.id
WHERE a.day BETWEEN #start_date AND #end_date
GROUP BY i.name
I assume end_date is an open (exclusive) bound, so if you have ranges [A,B) and [B,C), date B gets the rent from [B,C).
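To make the full calculation concrete (festival rent on festival days, normal rent otherwise), here is a minimal PySpark sketch of the same day-list idea. Table and column names follow the question; the day generation via sequence() and the inclusive end_date, chosen to match the expected output above, are my assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

items = spark.createDataFrame(
    [(1, "Car", 100), (2, "Truck", 150), (3, "Van", 200)],
    ["id", "name", "rent"],
)
festival_rents = (spark.createDataFrame(
    [(1, 1, "2018-07-01", "2018-07-02", 200),
     (2, 1, "2018-07-04", "2018-07-06", 300),
     (3, 3, "2018-07-06", "2018-07-07", 400)],
    ["id", "items_id", "start_date", "end_date", "rent"])
    .select("items_id",
            F.to_date("start_date").alias("start_date"),
            F.to_date("end_date").alias("end_date"),
            F.col("rent").alias("festival_rent")))

# One row per day of the requested range, generated on the fly.
days = spark.sql(
    "SELECT explode(sequence(date'2018-07-01', date'2018-07-06')) AS day"
)

result = (items.crossJoin(days)
          .join(festival_rents,
                (items["id"] == festival_rents["items_id"])
                & (F.col("day") >= F.col("start_date"))
                & (F.col("day") <= F.col("end_date")),  # inclusive end_date
                "left")
          # festival rent wins on festival days, base rent otherwise
          .withColumn("day_rent", F.coalesce("festival_rent", "rent"))
          .groupBy(items["id"], "name")
          .agg(F.sum("day_rent").alias("total_price"))
          .orderBy("id"))

result.show()
# Car 1400, Truck 900, Van 1400
```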
I am trying to find, for each customer, the max number of consecutive years in which they bought something. I tried to create a calculated field, but to no avail.
I created two calculated fields:
Consecutive: if max([Count])>0 then previous_value(0)+1+index()-index() else 0 end
max: window_max([Consecutive])
My data looks something like:
Year | Customer | Count
1996 | a | 2
1996 | b | 1
1997 | a | 1
1997 | b | 2
1998 | b | 1
So the result would be:
a:2
b:3
Use nested table calcs.
The first calc, call it running_good_years, is a running count of consecutive years with sales:
If count(Sales) = 0 then 0 else previous_value(0) + 1 end
The second just returns the max:
Window_max(running_good_years)
With table calcs, defining the partitioning and addressing is critical: partition by Customer, address by Year.
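Outside Tableau, the same "max consecutive years" can be computed with the standard gaps-and-islands trick: a year minus its per-customer row number is constant within a consecutive run. A minimal PySpark sketch, with the data setup my own:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window as W

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(1996, "a", 2), (1996, "b", 1), (1997, "a", 1),
     (1997, "b", 2), (1998, "b", 1)],
    ["Year", "Customer", "Count"],
)

w = W.partitionBy("Customer").orderBy("Year")

result = (data
          .where(F.col("Count") > 0)  # keep only years with purchases
          # Year - row_number is constant within a consecutive run,
          # so it labels each run ("island") per customer
          .withColumn("island", F.col("Year") - F.row_number().over(w))
          .groupBy("Customer", "island")
          .agg(F.count(F.lit(1)).alias("run_length"))
          .groupBy("Customer")
          .agg(F.max("run_length").alias("max_consecutive_years")))

result.orderBy("Customer").show()
# a -> 2, b -> 3
```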