Looping through unique dates in PostgreSQL

In Python (pandas) I read from my database and then use a pivot table to aggregate the data for each day. The raw data I am working with is about 2 million rows per day, one row per person per 30-minute interval. I am aggregating it to be daily instead so it is a lot smaller for visualization.
So in pandas, I would read each date into memory and aggregate it and then load it into a fresh table in postgres.
How can I do this directly in Postgres? Can I loop through each unique report_date in my table, group by, and then append the result to another table? I am assuming doing it in Postgres would be faster than reading it over the network in Python, writing a temporary .csv file, and then writing it back over the network.

Here's an example: Suppose that you have a table
CREATE TABLE post (
posted_at timestamptz not null,
user_id integer not null,
score integer not null
);
representing the scores various users have earned from posts they made in an SO-like forum. Then the following query
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;
will aggregate the scores per user per day.
Note that this will consider that the day changes at 00:00 UTC (like SO does). If you want a different time, say midnight Paris time, then you can do it like so:
SELECT user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date;
To get good performance for the above queries, you might want to create an expression index on (user_id, posted_at::date), or similarly for the second case. Note that because posted_at is timestamptz, the bare cast to date is not immutable, so the index expression needs an explicit time zone, e.g. (user_id, ((posted_at AT TIME ZONE 'UTC')::date)).
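To address the "append it to another table" part of the question: no explicit loop over report dates is needed, because a single set-based INSERT ... SELECT runs the whole aggregation inside Postgres. A minimal sketch, assuming an illustrative summary table named daily_score (not part of the original question):
CREATE TABLE daily_score (
user_id integer not null,
day date not null,
score integer not null,
PRIMARY KEY (user_id, day)
);
-- one statement aggregates and appends all days at once; no per-date loop needed
INSERT INTO daily_score (user_id, day, score)
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;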

Related

Get max timestamps efficiently for large table for a set of ids

I have a large PostgreSQL db table (actually many partition tables divided up by yearly quarters) that for simplicity's sake is defined something like
id bigint
ts timestamp
value float
For a particular set of ids, what is an efficient way of finding the last timestamp in the table for each specified id?
The table is indexed by (id, timestamp)
If I do something naive like
SELECT sensor_id, MAX(ts)
FROM sensor_values
WHERE ts >= (NOW() + INTERVAL '-100 days') :: TIMESTAMPTZ
GROUP BY 1;
Things are pretty slow.
Is there a way of perhaps narrowing down the times first by a binary search of one id?
(I can assume the timestamps are similar for a particular set of ids.)
I am accessing the db through psycopg so the solution can be in code or SQL if I am missing something easy to speed this up.
The EXPLAIN output for the query can be seen here: https://explain.depesz.com/s/PVqg
Any ideas appreciated.
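One pattern worth trying here (a sketch, not from the original thread): instead of one big GROUP BY over the whole time range, fetch the latest timestamp per id with a LATERAL subquery, so each id becomes a single backward scan of the (id, timestamp) index. Column names follow the query above (sensor_values, sensor_id, ts); the id values in the array are placeholders.
SELECT ids.sensor_id, latest.ts
FROM unnest(ARRAY[101, 102, 103]::bigint[]) AS ids(sensor_id) -- placeholder ids of interest
CROSS JOIN LATERAL (
SELECT sv.ts
FROM sensor_values sv
WHERE sv.sensor_id = ids.sensor_id -- one index descent per id
ORDER BY sv.ts DESC
LIMIT 1
) AS latest;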

How do I make a simple day dimension table for data warehousing star schema with postgresql?

How would I go about creating and populating a simple DAY dimension table for a star schema in PostgreSQL?
It is for an intro course to data warehousing, so it only has a few fields, but most of the examples online are very involved and seem very complicated for a beginner. This isn't for an assignment - it is for studying, because I am trying to make my own simple star schema with a fact table so I can start getting comfortable with it.
Can anyone give me a simple example of how I'd create the table with just a few fields (day_key as the surrogate key, a string describing the day, and some integer values representing the days or months for example) so I can at least get started on understanding?
Here is a very simple DAY dimension table that should work for most versions of PostgreSQL (I am using 10.5). This is just something that should help someone newer to data warehousing make a basic day dimension when just getting started.
Create a Day Table
CREATE TABLE day (
day_key SERIAL PRIMARY KEY, -- SERIAL is an integer that will auto-increment as new rows added
description VARCHAR(40), -- a 'string' for a description
full_date DATE, -- an actual date type
month_number INTEGER,
month_name VARCHAR(40),
year INTEGER
);
Inserting Rows into the Day dimension
INSERT INTO day(description, full_date, month_number, month_name, year)
SELECT
to_char(days.d, 'FMMonth DD, YYYY'),
days.d::DATE,
to_char(days.d, 'MM')::integer,
to_char(days.d, 'FMMonth'),
to_char(days.d, 'YYYY')::integer
from (
SELECT generate_series(
('2019-01-01')::date, -- 'start' date
('2019-12-31')::date, -- 'end' date
interval '1 day' -- one row for each day between the start and end dates
)) as days(d);
Result: one row for each day of 2019 (365 rows in total).
Notes:
Basically you are just using the rows generated by the nested SELECT generate_series(...) to insert into the Day table.
I used FM above twice to remove some of the whitespace padding that is automatically added by some of these date format patterns.
I'd recommend removing the INSERT INTO day(...) line the first time you do this just to make sure the format of each column is what you're after before inserting it into your table.
This is just what I've seen commonly used; the PostgreSQL documentation has more thorough examples of ways to format date types and derive all kinds of useful dimension columns.
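For context, a hypothetical fact table referencing this dimension might look like the sketch below; sales_fact and its columns are illustrative and not part of the question. A typical star-schema query then joins the fact table to the day dimension and rolls up by the dimension's attributes.
-- a hypothetical fact table that references the day dimension by its surrogate key
CREATE TABLE sales_fact (
sale_id SERIAL PRIMARY KEY,
day_key INTEGER NOT NULL REFERENCES day (day_key),
amount NUMERIC(10, 2) NOT NULL
);
-- roll the facts up by month using the dimension's attributes
SELECT d.year, d.month_name, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN day d ON d.day_key = f.day_key
GROUP BY d.year, d.month_number, d.month_name
ORDER BY d.year, d.month_number;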

Handling of multiple queries as one result

Let's say I have this table
CREATE TABLE device_data_by_year (
year int,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY (year, device_id, nano_since_epoch,sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);
I need to query data for a particular device and sensor in a period between 2017 and 2018. In this case 2 queries will be issued:
select * from device_data_by_year where year = 2017 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
Currently I iterate over the result sets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better approach for handling/merging the query results into one set?
Thanks
You can use IN to specify a list of years, but this is not a very optimal solution: because the year field is the partition key, the data will most probably be on different machines, so one of the nodes will act as "coordinator" and will need to ask the other machines for results and aggregate the data. From a performance point of view, 2 async requests issued in parallel could be faster, with the merge done on the client side.
P.S. Your data model has quite serious problems. You partition by year, which means:
Data isn't well distributed across the cluster: only N=RF machines will hold the data;
These partitions will be huge, even if you have only a hundred devices reporting one measurement per minute;
Only one partition will be "hot": it will receive all the data during the current year, while the other partitions won't be used very often.
You can use months, or even days, as the partition key to decrease the partition size, but that still won't solve the problem of the "hot" partition.
If I remember correctly, the Data Modelling course at DataStax Academy has an example of a data model for a sensor network.
I changed the table structure to:
CREATE TABLE device_data (
week_first_day timestamp,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);
following @AlexOtt's proposal. Some changes to the application logic are required; for example, findAllByYear now needs to iterate over single weeks.
Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week) or would you use the IN operator here?

Optimize large offset in select

I have a table
Users (user_id integer, user_name string, scores integer)
That table will contain 1-6 million records. It has indexes on user_name and scores.
The user will input his name and I should show him one page from that table, ordered by scores, that contains him among the other users.
I do it in 2 queries:
First:
select user_id from (
select row_number() over (order by scores desc),
user_id
from users
where user_name = 'name' limit 1
) as ranked
Second:
select * from users limit 20 offset The_User_Id/20+1
Then I get the page that contains my user among the others.
But when the user is in the middle of a table with millions of records, the offset is around 500000 and the query is slow, about 1-2 seconds. How can I improve it?
Offset itself makes your query slow.
If you don't need pure SQL and can use a programming language to form the query, why not consider Paging Through Results? Order the second query by user_id and use LIMIT 20 for pagination instead of OFFSET, as sketched below.
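As a rough sketch of that keyset ("seek") idea: remember the last row of the previous page and seek past it instead of skipping rows with OFFSET. The ordering below uses (scores, user_id) rather than user_id alone so that pages follow the leaderboard order; :last_scores and :last_user_id are placeholder parameters for the values from the previous page's last row, and the WHERE clause is simply dropped for the first page.
SELECT user_id, user_name, scores
FROM users
WHERE (scores, user_id) < (:last_scores, :last_user_id) -- row-wise comparison; omit for the first page
ORDER BY scores DESC, user_id DESC
LIMIT 20;
An index on (scores, user_id) lets Postgres start directly at the seek position, so a deep page costs about the same as the first one.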

How to optimize a table for queries sorted by insertion order in Postgres

I have a table of time series data where, for almost all queries, I wish to select data ordered by collection time. I do have a timestamp column, but I do not want to use actual timestamps for this, because if two entries have the same timestamp it is crucial that I be able to sort them in the order they were collected, which is information I have at insert time.
My current schema just has a timestamp column. How would I alter my schema to make sure I can sort based on collection/insertion time, and make sure querying in collection/insertion order is efficient?
Add a column based on a sequence (i.e. serial), and create an index on (timestamp_column, serial_column). Then you can get insertion order (more or less) by doing:
ORDER BY timestamp_column, serial_column;
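A minimal sketch of that setup; events is a placeholder table name, and the column names follow this answer:
-- add an auto-incrementing column and index it together with the timestamp
ALTER TABLE events ADD COLUMN serial_column bigserial;
CREATE INDEX events_ts_serial_idx ON events (timestamp_column, serial_column);
-- insertion-order queries can then be served by the index
SELECT * FROM events ORDER BY timestamp_column, serial_column;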
You could use a SERIAL column called insert_order. This way no two rows will have the same value. However, I am not sure that your requirement of absolute time order is possible to achieve.
For example, suppose there are two transactions, T1 and T2, that happen at the same time, and you are running on a machine with multiple processors, so in fact both T1 and T2 did their inserts at exactly the same instant. Is this a case you are concerned about? There was not enough info in your question to know exactly.
Also, with a serial column you have the issue of gaps: for example, T1 could grab serial value 14 and T2 could grab value 15, then T1 rolls back and T2 does not, so you have to expect that the insert_order column might have gaps in it.