Calculating how many touch points a customer has had in the last 6 months - NoSQL

I have a problem. I want to calculate, from a given date such as 2022-06-01, how many touches the customer had in the previous 6 months: simply the number of times the customer has had a collection in that period.
The problem here is that there is no customerId, so the customer has to be identified by a composite key: partyConsignee -> address -> name, partyConsignee -> address -> street and partyConsignee -> address -> city.
Furthermore, the timestamp isArrivedAt -> changed -> timestamp lives in another collection, details, which must also be loaded. It is linked via the edges: deliveries -> _id:customer/0123 to deliveries_details -> _from:deliveries/0123, and deliveries_details -> _to:details/12347 to details -> _id:details/12347.
deliveries (object)
_id:customer/0123
├── currency: USD
├── partyConsignee
│   └── address
│       ├── name
│       ├── street
│       ├── city
│       └── ...
├── ...

details (object)
_id:details/12347
├── currentDetail
├── isArrivedAt
│   ├── value
│   └── changed
│       └── timestamp
├── ...

deliveries_details
_id:deliveries_details/1234
_from:deliveries/0123
_to:details/12347
Example
   customerId    fromDate
0           1  2022-06-01
1           1  2022-05-25
2           1  2022-05-25
3           1  2022-05-20
4           1  2021-09-05
5           2  2022-06-02
6           3  2021-03-01
7           3  2021-02-01
What I want
   customerId    fromDate count_from_date  occur_last_6_months
0           1  2022-06-01      2021-12-01                    3  # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1           1  2022-05-25      2021-11-25                    1  # 2022-05-20 = 1
2           1  2022-05-25      2021-11-25                    1  # 2022-05-20 = 1
3           1  2022-05-20      2021-11-20                    0  # none in the last 6 months
4           1  2021-09-05      2021-03-05                    0  # none in the last 6 months
5           2  2022-06-02      2021-12-02                    0  # none in the last 6 months
6           3  2021-03-01      2020-09-01                    1  # 2021-02-01 = 1
7           3  2021-02-01      2020-08-01                    0  # none in the last 6 months
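In SQL terms, against a hypothetical flat table touches(customer_id, from_date) with one row per collection per customer, the counting rule I am after would look roughly like this (the real query still has to build the composite key and follow the edges described above):
-- touches is a hypothetical flat table, not one of the collections above
SELECT t.customer_id,
       t.from_date,
       (t.from_date - INTERVAL '6 months')::date AS count_from_date,
       (SELECT count(*)
        FROM touches p
        WHERE p.customer_id = t.customer_id
          AND p.from_date >= t.from_date - INTERVAL '6 months'
          AND p.from_date <  t.from_date) AS occur_last_6_months
FROM touches t;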

Related

Failed to determine supertype of Datetime

Polars: 0.16.2
Python: 3.11.1
Windows 10
Attempting to filter a column using a time range via .is_between()
Couldn't find anything on Stack Overflow, but found something that is maybe similar in the GitHub issues (although that one has been solved): https://github.com/pola-rs/polars/issues/5236
To reproduce
import polars as pl
from datetime import datetime, time

df = pl.date_range(low=datetime(2023, 2, 7), high=datetime(2023, 2, 8), interval="30m", name="date").to_frame()

# Attempt to filter by time
df.filter(
    pl.col('date').is_between(time(9, 30), time(14, 30))
)
Traceback:
PanicException Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.filter(
2 pl.col('date').is_between(time(9, 30, 0, 0), time(14, 30, 0, 0))
3 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\dataframe\frame.py:2747, in DataFrame.filter(self, predicate)
2741 if _check_for_numpy(predicate) and isinstance(predicate, np.ndarray):
2742 predicate = pli.Series(predicate)
2744 return (
2745 self.lazy()
2746 .filter(predicate) # type: ignore[arg-type]
-> 2747 .collect(no_optimization=True)
2748 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\lazyframe\frame.py:1146, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
1135 common_subplan_elimination = False
1137 ldf = self._ldf.optimization_toggle(
1138 type_coercion,
1139 predicate_pushdown,
(...)
1144 streaming,
1145 )
-> 1146 return pli.wrap_df(ldf.collect())
PanicException: cannot coerce datatypes: ComputeError(Owned("Failed to determine supertype of Datetime(Microseconds, None) and Time"))
Not sure if I'm doing something wrong, or if this is a bug.
I tried to filter a series using a time range and expected a filtered series with just those times. Instead, I got a PanicException (listed above).
You are trying to filter a Datetime column with a Time. You need to cast to pl.Time before doing the is_between:
df.filter(
    pl.col('date').cast(pl.Time).is_between(time(9, 30), time(14, 30))
)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2023-02-07 10:00:00 │
│ 2023-02-07 10:30:00 │
│ 2023-02-07 11:00:00 │
│ 2023-02-07 11:30:00 │
│ 2023-02-07 12:00:00 │
│ 2023-02-07 12:30:00 │
│ 2023-02-07 13:00:00 │
│ 2023-02-07 13:30:00 │
│ 2023-02-07 14:00:00 │
└─────────────────────┘

Julia @subset on Dates

This should be an easy one but I can't find any documentation or prior Q&A on this. Using Julia to subset is easy, especially with the @chain macro. But I haven't for the life of me figured out a way to subset on a date:
maindf = @chain rawdf begin
    @subset(Dates.year(:travel_date) .== 2019)
end
According to the documentation, Dates.year(today()) should produce 2021, but this ends up tossing me an error:
ERROR: MethodError: no method matching +(::Vector{Date}, ::Int64)
Closest candidates are:
+(::Any, ::Any, ::Any, ::Any...) at operators.jl:560
+(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
+(::T, ::Integer) where T<:AbstractChar at char.jl:223
Not sure exactly why I am getting a method error.
In R using dplyr this would simply be:
maindf = rawdf %>%
    filter(., year(travel_date) == 2019)
Any ideas?
Use:
julia> using DataFramesMeta, Dates

julia> df = DataFrame(travel_date=repeat([Date(2019,1,1), Date(2020,1,1)],3), id=1:6)
6×2 DataFrame
 Row │ travel_date  id
     │ Date         Int64
─────┼────────────────────
   1 │ 2019-01-01       1
   2 │ 2020-01-01       2
   3 │ 2019-01-01       3
   4 │ 2020-01-01       4
   5 │ 2019-01-01       5
   6 │ 2020-01-01       6

julia> @rsubset(df, year(:travel_date) == 2019)
3×2 DataFrame
 Row │ travel_date  id
     │ Date         Int64
─────┼────────────────────
   1 │ 2019-01-01       1
   2 │ 2019-01-01       3
   3 │ 2019-01-01       5

julia> @subset(df, year.(:travel_date) .== 2019)
3×2 DataFrame
 Row │ travel_date  id
     │ Date         Int64
─────┼────────────────────
   1 │ 2019-01-01       1
   2 │ 2019-01-01       3
   3 │ 2019-01-01       5
The difference is that @rsubset works by row and @subset works on whole columns.
Your problem was that in Dates.year(:travel_date) .== 2019 you mixed a non-broadcasted call of the year function with a broadcasted comparison .== 2019. You always need to make sure that you either work row-wise (using @rsubset in this case) or on whole columns (using @subset).
Different scenarios might require a different approach. Here is an example where the whole-column approach is useful:
julia> using Statistics

julia> @subset(df, :id .> mean(:id))
3×2 DataFrame
 Row │ travel_date  id
     │ Date         Int64
─────┼────────────────────
   1 │ 2020-01-01       4
   2 │ 2019-01-01       5
   3 │ 2020-01-01       6
where you want mean to operate on a whole column.
EDIT
Here is the same with @chain:
julia> @chain df begin
           @subset year.(:travel_date) .== 2019
       end
3×2 DataFrame
 Row │ travel_date  id
     │ Date         Int64
─────┼────────────────────
   1 │ 2019-01-01       1
   2 │ 2019-01-01       3
   3 │ 2019-01-01       5

PostgreSQL: detecting the first/last rows of result set

Is there any way to embed a flag in a select that indicates that it is the first or the last row of a result set? I'm thinking something to the effect of:
> SELECT is_first_row() AS f, is_last_row() AS l FROM blah;
f | l
-----------
t | f
f | f
f | f
f | f
f | t
The answer might be in window functions but I've only just learned about them, and I question their efficiency.
SELECT first_value(unique_column) OVER () = unique_column,
       last_value(unique_column) OVER () = unique_column,
       *
FROM blah;
seems to do what I want. Unfortunately, I don't even fully understand that syntax, but since unique_column is unique and NOT NULL it should deliver unambiguous results. But if it involves sorting, then the cure might be worse than the disease. (Actually, in my tests, unique_column is not sorted, so that's something.)
EXPLAIN ANALYZE doesn't indicate there's an efficiency problem, but when has it ever told me what I needed to know?
And I might need to use this in an aggregate function, but I've just been told window functions aren't allowed there. 😕
Edit:
Actually, I just added ORDER BY unique_column to the above query and the rows identified as first and last were thrown into the middle of the result set. It's as if first_value()/last_value() really mean "the first/last value I picked up before I began sorting." I don't think I can safely do this optimally, not unless I gain a much better understanding of how the OVER keyword is used.
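For what it's worth, giving the window an explicit ORDER BY and an explicit frame seems to keep the flags on the right rows when the outer query is sorted by the same column (a sketch only, using the placeholder names from above; I have not measured whether it is any cheaper):
SELECT first_value(unique_column) OVER w = unique_column AS f,
       last_value(unique_column) OVER w = unique_column AS l,
       *
FROM blah
WINDOW w AS (ORDER BY unique_column
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY unique_column;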
I'm running PostgreSQL 9.6 in a Debian 9.5 environment.
This isn't a duplicate, because I'm trying to get the first row and last row of the result set to identify themselves, while Postgres: get min, max, aggregate values in one select is just going for the minimum and maximum values for a column in a result set.
You can use the lead() and lag() window functions (over the appropriate window) and check them for NULL:
-- \i tmp.sql
CREATE TABLE ztable
( id SERIAL PRIMARY KEY
, starttime TIMESTAMP
);
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '1 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '2 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '3 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '4 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '5 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '6 minute');
SELECT id, starttime
     , ( lag(id) OVER www IS NULL) AS is_first
     , ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY id)
ORDER BY id
;
SELECT id, starttime
     , ( lag(id) OVER www IS NULL) AS is_first
     , ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime)
ORDER BY id
;
SELECT id, starttime
     , ( lag(id) OVER www IS NULL) AS is_first
     , ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime)
ORDER BY random()
;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
 id |         starttime          | is_first | is_last
----+----------------------------+----------+---------
  1 | 2018-08-31 18:38:45.567393 | t        | f
  2 | 2018-08-31 18:37:45.575586 | f        | f
  3 | 2018-08-31 18:36:45.587436 | f        | f
  4 | 2018-08-31 18:35:45.592316 | f        | f
  5 | 2018-08-31 18:34:45.600619 | f        | f
  6 | 2018-08-31 18:33:45.60907  | f        | t
(6 rows)
 id |         starttime          | is_first | is_last
----+----------------------------+----------+---------
  1 | 2018-08-31 18:38:45.567393 | f        | t
  2 | 2018-08-31 18:37:45.575586 | f        | f
  3 | 2018-08-31 18:36:45.587436 | f        | f
  4 | 2018-08-31 18:35:45.592316 | f        | f
  5 | 2018-08-31 18:34:45.600619 | f        | f
  6 | 2018-08-31 18:33:45.60907  | t        | f
(6 rows)
 id |         starttime          | is_first | is_last
----+----------------------------+----------+---------
  2 | 2018-08-31 18:37:45.575586 | f        | f
  4 | 2018-08-31 18:35:45.592316 | f        | f
  6 | 2018-08-31 18:33:45.60907  | t        | f
  5 | 2018-08-31 18:34:45.600619 | f        | f
  1 | 2018-08-31 18:38:45.567393 | f        | t
  3 | 2018-08-31 18:36:45.587436 | f        | f
(6 rows)
[updated: added a randomly sorted case]
It is simple using window functions with particular frames:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (rows between unbounded preceding and current row),
count(*) over (rows between current row and unbounded following)
from t;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.543995119165629 │ 1 │ 5 │
│ 2 │ 0.886343683116138 │ 2 │ 4 │
│ 3 │ 0.124682310037315 │ 3 │ 3 │
│ 4 │ 0.668972567655146 │ 4 │ 2 │
│ 5 │ 0.266671542543918 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
As you can see, count(*) over (rows between unbounded preceding and current row) returns the row count from the beginning of the data set up to the current row, and count(*) over (rows between current row and unbounded following) returns the row count from the current row to the end of the data set, so a value of 1 marks the first/last row respectively.
This works until you order the data set with order by. In that case you need to repeat the ordering in the frame definitions:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (order by y rows between unbounded preceding and current row),
count(*) over (order by y rows between current row and unbounded following)
from t order by y;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.125781774986535 │ 1 │ 5 │
│ 4 │ 0.25046408502385 │ 2 │ 4 │
│ 5 │ 0.538880597334355 │ 3 │ 3 │
│ 3 │ 0.802807193249464 │ 4 │ 2 │
│ 2 │ 0.869908029679209 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
PS: As mentioned by a_horse_with_no_name in the comment:
there is no such thing as the "first" or "last" row without sorting.
In fact, window functions are a great approach, and for this requirement of yours they are a very good fit.
Regarding efficiency, window functions work over the data set already at hand, which means the DBMS just adds some extra processing to infer the first/last values.
Just one thing I'd like to suggest: put an ORDER BY criterion inside the OVER clause, to ensure the data set order is the same between executions and therefore always returns the same values to you.
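A minimal sketch of that suggestion, reusing the ztable example from the first answer:
-- ztable, id and starttime come from the example above
SELECT id, starttime
     , (lag(id) OVER (ORDER BY starttime) IS NULL) AS is_first
     , (lead(id) OVER (ORDER BY starttime) IS NULL) AS is_last
FROM ztable
ORDER BY starttime;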
Try using
(SELECT columns
 FROM mytable
 Join conditions
 WHERE conditions
 ORDER BY date DESC LIMIT 1)
UNION ALL
(SELECT columns
 FROM mytable
 Join conditions
 WHERE conditions
 ORDER BY date ASC LIMIT 1);
Note the parentheses: in PostgreSQL each branch needs them so that its ORDER BY ... LIMIT 1 is applied before the UNION ALL. Each SELECT only fetches a single row, which cuts the processing time; you can go for indexing also.
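If you go the indexing route, a minimal sketch (mytable and date are the placeholders from the query above):
-- an index on the sort column lets each LIMIT 1 branch be answered from the index
CREATE INDEX ON mytable (date);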

Cumulative count on history table with deleted attributes

I've got a history table of updates to records, and I want to calculate cumulative totals where values may be added to or deleted from the set (i.e. the cumulative total for one month may be less than the previous one).
For example, here's a table with the history of updates to tags for a person record. (id is the id of the person record).
I want to count how many people had the "established" tag in any given month, accounting for when it was added or removed in a prior month.
+----+------------------------+---------------------+
| id | tags                   | created_at          |
+----+------------------------+---------------------+
|  1 | ["vip", "established"] | 2017-01-01 00:00:00 |
|  2 | ["established"]        | 2017-01-01 00:00:00 |
|  3 | ["established"]        | 2017-02-01 00:00:00 |
|  1 | ["vip"]                | 2017-03-01 00:00:00 |
|  4 | ["established"]        | 2017-05-01 00:00:00 |
+----+------------------------+---------------------+
With some help from these posts, I've gotten this far:
SELECT
    item_month,
    sum(count(distinct(id))) OVER (ORDER BY item_month)
FROM (
    SELECT
        to_char("created_at", 'yyyy-mm') as item_month,
        id
    FROM person_history
    WHERE tags ? 'established'
) t1
GROUP BY item_month;
Which gives me:
month count
2017-01 2
2017-02 3
2017-05 4 <--- should be 3
And it's also missing an entry for 2017-03 which should be 2.
(An entry for 2017-04 would be nice too, but the UI could always infer it from the previous month if need be)
Here is a step-by-step tutorial; you can try to collapse all those CTEs:
with
-- Example data
person_history(id, tags, created_at) as (values
(1, '["vip", "est"]'::jsonb, '2017-01-01'::timestamp),
(2, '["est"]', '2017-01-01'), -- Note that Person 2 changed its tags several times per month
(2, '["vip"]', '2017-01-02'),
(2, '["vip", "est"]', '2017-01-03'),
(3, '["est"]', '2017-02-01'),
(1, '["vip"]', '2017-03-01'),
(4, '["est"]', '2017-05-01')),
-- Get the last tags for each person per month
monthly as (
select distinct on (id, date_trunc('month', created_at))
id,
date_trunc('month', created_at) as month,
tags,
created_at
from person_history
order by 1, 2, created_at desc),
-- Retrieve tags from previous month
monthly_prev as (
select
*,
coalesce((lag(tags) over (partition by id order by month)), '[]') as prev_tags
from monthly),
-- Calculate the delta: 1 if "est" was added, -1 if it was removed, 0 if nothing happened
monthly_delta as (
select
*,
case
when tags ? 'est' and not prev_tags ? 'est' then 1
when not tags ? 'est' and prev_tags ? 'est' then -1
else 0
end as delta
from monthly_prev),
-- Sum all deltas for each month
monthly_total as (
select month, sum(delta) as total
from monthly_delta
group by month)
-- Finally calculate cumulative sum
select *, sum(total) over (order by month) from monthly_total
order by month;
Result:
┌─────────────────────┬───────┬─────┐
│ month │ total │ sum │
├─────────────────────┼───────┼─────┤
│ 2017-01-01 00:00:00 │ 2 │ 2 │
│ 2017-02-01 00:00:00 │ 1 │ 3 │
│ 2017-03-01 00:00:00 │ -1 │ 2 │
│ 2017-05-01 00:00:00 │ 1 │ 3 │
└─────────────────────┴───────┴─────┘
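The question also asks for a row for 2017-04, a month with no changes. One way to get that, sketched here as a hypothetical extension of the CTE chain above (the months CTE and the cum_sum alias are my own), is to left-join a generated month series before taking the cumulative sum:
-- append after monthly_total and replace the final SELECT above with:
, months as (
    select generate_series(min(month), max(month), interval '1 month') as month
    from monthly_total)
select m.month,
       coalesce(t.total, 0) as total,
       sum(coalesce(t.total, 0)) over (order by m.month) as cum_sum
from months m
left join monthly_total t on t.month = m.month
order by m.month;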

Implementing a sort column in Postgres that acts like a linked list

Say I have a database table teams that has an ordering column position; the position can either be null if it is the last entry, or the id of the team that is positioned one place higher. This results in a list that is always strictly sorted, and insertion becomes less complicated (with plain integer positions you have to manage all the other position values when inserting a new team, i.e. increment them all by one)...
But retrieving this table as a sorted query has proved tricky; here is where I'm at so far:
WITH RECURSIVE teams AS (
    SELECT *, 1 AS depth FROM team
    UNION
    SELECT t.*, ts.depth + 1 AS depth
    FROM team t INNER JOIN teams ts ON ts."order" = t.id
)
SELECT
    id, "order", depth
FROM
    teams
;
Which gets me something like:
 id | order | depth
----+-------+-------
 53 |    55 |     1
 55 |    52 |     1
 55 |    52 |     2
 52 |    54 |     2
 52 |    54 |     3
 54 |       |     3
 54 |       |     4
Which kind of reflects where I need to get to in terms of ordering (the max of depth represents the ordering I want...), however I can't work out how to alter the query to get something like:
 id | order | depth
----+-------+-------
 53 |    55 |     1
 55 |    52 |     2
 52 |    54 |     3
 54 |       |     4
However I change the query, it complains at me about applying a GROUP BY across both id and depth... How do I get from where I am now to where I want to be?
Your recursive query needs to start somewhere (right now you are selecting the whole table in the first subquery). I propose starting from the last record, where the order column is null, and walking back to the first record:
with recursive team(id, ord) as (values (53,55), (55,52), (52,54), (54,null)),
teams as (
    select *, 1 as depth from team where ord is null  -- select the last record here
    union all
    select t.*, ts.depth + 1 as depth
    from team t join teams ts on ts.id = t.ord)  -- note that the JOIN condition is reversed compared to the original query
select * from teams order by depth desc;  -- finally reverse the order
┌────┬──────┬───────┐
│ id │ ord  │ depth │
├────┼──────┼───────┤
│ 53 │   55 │     4 │
│ 55 │   52 │     3 │
│ 52 │   54 │     2 │
│ 54 │ ░░░░ │     1 │
└────┴──────┴───────┘
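As a footnote on the question's point that insertion becomes less complicated with this scheme: splicing a new team in only touches the new row and its predecessor. A minimal sketch, assuming a real team table shaped like the example above (columns id and ord) and hard-coding the ids purely for illustration (inserting a new team 60 directly before team 52):
BEGIN;
-- the new team initially points at its successor
INSERT INTO team (id, ord) VALUES (60, 52);
-- repoint whichever team used to point at 52 to the new team
UPDATE team SET ord = 60 WHERE ord = 52 AND id <> 60;
COMMIT;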