Efficient way to store multiple data in Postgres

I need to track user tasks, so I came up with a few ideas; please suggest which is efficient and good. The user has around 6 tasks to finish when they log in for the first time, let's say task1, task2, ..., task6 (the tasks are subject to change in the near future).
Approach 1:
Create a table with 10 columns (4 extra columns for safety, in case we need to extend beyond 6 tasks)
Table UserTasks:
task1 task2 task3 task4 ....... task9 task10
Approach 2
Create a table with three columns and add a new row for each task
Approach 3
Create a table with an array column to store all task results (as I am using Postgres)
Table UserTasks:
id task date
1 [task1:True, task2:false]
Approach 4
Just create a table with a VARCHAR column and add custom data to it. For example, the plan is to store task1 once task1 is completed and concatenate task2 once task2 is completed
Table UserTasks:
id task date
1 task1,task2,task3
These are the ideas I've come up with; please help me pick a good approach. The thing is, this data will be READ very often. There won't be any DELETEs on this data. A single user would probably read this data at least 10 times each time they log in, so performance is my obvious concern.
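For reference, a minimal sketch of what Approach 2 (one row per user per task) could look like; the table and column names here are purely illustrative, not part of the original question:

-- Illustrative schema for Approach 2: one row per user per task.
CREATE TABLE user_tasks (
    user_id    INT         NOT NULL,
    task_name  TEXT        NOT NULL,
    completed  BOOLEAN     NOT NULL DEFAULT FALSE,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, task_name)
);

-- Reading one user's task states is a cheap primary-key lookup:
SELECT task_name, completed
FROM user_tasks
WHERE user_id = 42;

Adding a seventh task later is just another row per user, with no schema change.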

Related

Not updateable record set

I have a set of records linked (1 to many) to another table. The data tables are called tblJobs and tblTasks. For each Job I want to show 1) whose task is next, 2) all the people who are involved in the tasks, and 3) how many tasks are done out of the total.
As soon as I create an aggregated query (maybe there is a better way) I can no longer amend the main job data. And if I try to create a subform with this info, I can't add subforms to continuous records in a form.
I'm doing all this in Access VBA. Here's an example of data seen on form:
tblJob
Company Name: Smiths
Date Start: 01/01/2023
Date Finished: Null
Job Completed: Null (Boolean)
tblTask
Task No: 1
Task: Write email
Who: Bob
Completed: Null (Boolean)
Task No: 2
Task: Create Contacts
Who: Fred
Completed: Null
So here I would want to show, along with the job data, that Bob is next, that the job involves Bob and Fred, and that 0 out of 2 tasks are completed.
Hope that makes sense and many thanks in advance.
I've tried putting all the data in one record, but that locks the data.
I've also tried subforms, but then the screen can only show one job at a time (it's no longer continuous). Currently I am able to show tasks completed by placing a subform in the header of a separate form that only appears using OnCurrent, but this only shows the data one record at a time. If there is a way to bring this onto the main form, so all records can be seen at once, that would be great. Let's start a conversation!
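For reference, the kind of aggregated query described above (which, as noted, makes the recordset non-updatable once bound to the form) might look roughly like this; the JobID join field is an assumption, since the example does not show the key linking tblTask back to tblJob:

SELECT tblJob.[Company Name],
       COUNT(tblTask.[Task No]) AS TotalTasks,
       SUM(IIf(tblTask.Completed = True, 1, 0)) AS TasksDone
FROM tblJob
INNER JOIN tblTask ON tblTask.JobID = tblJob.JobID
GROUP BY tblJob.[Company Name];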

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts        TIMESTAMPTZ NOT NULL,
    value     FLOAT8 NOT NULL DEFAULT 'NaN',
    sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often, though, I need to extract and work with daily values over years (on a web frontend). To help with this I would like a sensor_values_days table that only has the daily sums for each point, which I can then use for faster queries over longer timespans.
I don't want a trigger on every write to the DB, as I am afraid that would slow down what is already the bottleneck: writes to the DB.
Is there a way to trigger only after a certain number of rows have been inserted?
Or perhaps an index that maintains a sum of entries per day? I don't think that is possible.
What would be the best way to do this? It would not have to be very up to date; losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install ClickHouse and use the AggregatingMergeTree table engine.
With postgres:
Create a per-period aggregate table. You can have several with different granularities, like hours, days, and months.
Have a cron job or scheduled task run at the end of each period, plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then aggregate all rows in the main table for the periods that came after the last available one. This process also works if the per-period table is empty, or if it missed the last update; it will simply catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamps of the rows that were aggregated, so if you check the table later you can see it really did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, that should help!
Then, repeat the same process for the "day" and "month" table.
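A rough sketch of that scheduled aggregation step, assuming a per-day rollup table called sensor_values_days (the table and its columns are assumptions, not from the question):

-- Hypothetical per-day rollup table.
CREATE TABLE IF NOT EXISTS sensor_values_days (
    day       DATE   NOT NULL,
    sensor_id INT4   NOT NULL,
    value_sum FLOAT8 NOT NULL,
    PRIMARY KEY (day, sensor_id)
);

-- Run by cron shortly after midnight: aggregate every fully closed day
-- that is newer than what the rollup table already contains.
INSERT INTO sensor_values_days (day, sensor_id, value_sum)
SELECT date_trunc('day', ts)::date, sensor_id, SUM(value)
FROM sensor_values
WHERE ts >= COALESCE((SELECT MAX(day) + 1 FROM sensor_values_days), '-infinity')
  AND ts < date_trunc('day', now())
GROUP BY 1, 2;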
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day from the live table, since all the previous days' worth of data has already been summarized into the "per day" table. Hopefully the current day's data will be cached in RAM.
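A sketch of that up-to-date variant, reusing the hypothetical sensor_values_days table from above:

-- Closed days from the rollup, plus today's partial sums from the live table.
SELECT day, sensor_id, value_sum
FROM sensor_values_days
UNION ALL
SELECT date_trunc('day', ts)::date, sensor_id, SUM(value)
FROM sensor_values
WHERE ts >= date_trunc('day', now())
GROUP BY 1, 2;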
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized Views and a Cron every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14, we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is still in development.
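A minimal sketch of the materialized-view route (the view name is an assumption; the unique index is required if you want REFRESH ... CONCURRENTLY so readers are not blocked during the refresh):

-- Plain (non-incremental) materialized view, refreshed by cron.
CREATE MATERIALIZED VIEW sensor_values_days_mv AS
SELECT date_trunc('day', ts)::date AS day,
       sensor_id,
       SUM(value) AS value_sum
FROM sensor_values
GROUP BY 1, 2;

CREATE UNIQUE INDEX ON sensor_values_days_mv (day, sensor_id);

-- Scheduled, e.g. every 5 minutes:
REFRESH MATERIALIZED VIEW CONCURRENTLY sensor_values_days_mv;

Note that a plain REFRESH recomputes the whole view each time, which is exactly the cost the incremental view maintenance work linked above is meant to remove.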

Simple update query taking too long - Postgres

I have a table with 28 million rows that I want to update. It has around 60 columns and an ID column (primary key) with an index on it. I created four new columns and I want to populate them with data from four columns of another table, which also has an ID column with an index on it. Both tables have the same number of rows, and just the primary key and the index on the IDENTI column. The query has been running for 15 hours, and since it is high-priority work we are starting to get nervous about it, and we don't have much time to experiment with queries. We have never updated a table this big (7 GB), so we are not sure if this amount of time is normal.
This is the query:
UPDATE consolidated
SET IDEDUP2 = uni.IDEDUP2,
    USE21   = uni.USE21,
    USE22   = uni.USE22,
    PESOXX2 = uni.PESOXX2
FROM uni_group uni, consolidated con
WHERE con.IDENTI = uni.IDENTI
How can I make it faster? Is it possible? If not, is there a way to check how much longer it is going to take (without killing the process)?
Just as additional information, we have run much more complex queries before on 3-million-row tables (PostGIS), and they also took about 15 hours.
You should not repeat the target table in the FROM clause. Your statement creates a cartesian join of the consolidated table with itself, which is not what you want.
You should use the following:
UPDATE consolidated con
SET IDEDUP2 = uni.IDEDUP2,
    USE21   = uni.USE21,
    USE22   = uni.USE22,
    PESOXX2 = uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, if a query has a WHERE condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
this will fetch a year's data if executed today. Now if the same query is executed tomorrow, 365 days of data will again be fetched. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data and better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting the single day's data is proving tricky when that "date" column does not feature in the SELECT clause but does feature in the WHERE condition, as the temp table schema will not have the "date" column.
So I thought of executing the bulk query in chunks and assigning an ID to each chunk. This way, I can delete a chunk and add a chunk, and the other data remains unaffected.
Is there a way to achieve this in Postgres or Greenplum? Like some built-in functionality? I went through the whole documentation but could not find any.
Also, if not, is there any better solution to this problem?
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you cut down to one record per day for the closed data, plus the non-closed data. Keeping the aggregates limited to data which cannot change is what is required to avoid the insert/update anomalies that normalization normally prevents.
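A hedged sketch of what such a per-day aggregates table could look like for the rolling 365-day window; all table and column names here (daily_aggregates, source_table, amount) are invented for illustration:

-- One row per closed day.
CREATE TABLE daily_aggregates (
    day       DATE    NOT NULL PRIMARY KEY,
    row_count BIGINT  NOT NULL,
    total     NUMERIC NOT NULL
);

-- Incremental run: add yesterday's now-closed day, then drop the day
-- that fell out of the 365-day window.
INSERT INTO daily_aggregates (day, row_count, total)
SELECT date, COUNT(*), SUM(amount)
FROM source_table
WHERE date = now()::date - 1
GROUP BY date;

DELETE FROM daily_aggregates
WHERE day < now()::date - 365;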

Database design challenge

I'm creating a virtual stamp card program for the iPhone and have run into an issue with implementing my database. The program essentially has a main points system that can be used across all merchants (sort of like Air Miles), but I also want to keep track of how many times you've been to EACH merchant.
So far, I have created 3 main tables for users, merchants, and transactions.
1) The Users table contains basic info like user_id and total points collected.
2) The Merchants table contains info like merchant_id, location, and total points given.
3) The Transactions table simply gets a new row every time someone checks into a merchant, and records a date stamp, user name, merchant name, and points awarded.
So the most basic way to find out how many times you've been to each merchant is to query the entire transactions table for both user and merchant, which gives me a transaction history of how many times you've been to that specific merchant (which is perfect), but in the long run I feel this will be horrible for performance.
The other straightforward, yet "dumb", method would be to create a column in the users table for EACH merchant and keep the running totals there. This seems inappropriate, as I will be adding new merchants on a regular basis, and a new column would have to be added to every user each time that happens.
I've looked into one-to-many and many-to-many relationships for MySQL databases, but can't seem to come up with anything very concrete; as I'm extremely new to web/PHP/MySQL development, I'm guessing this is what I'm looking for...
I've also thought of creating a separate transaction table for each user, with a column for the merchant and another for the # of times visited. Again, not sure if this is the most efficient implementation.
Can someone point me in the right direction?
You're doing the right thing in the sense of thinking up the different options, and weighing up the good and bad for each.
Personally, I'd go with a MerchantCounter table which joins on your Merchant table by id_merchant (for example) and which you keep up-to-date explicitly.
Over time it does not get slower (unlike an activity-search), and does not take up lots of space.
Edit: based on your comment, Janan, no I would use a single MerchantCounter table. So you've got your Merchant table:
id_merchant  nm_merchant
12           Jim
15           Tom
17           Wilbur
You would add a single additional table, MerchantCounter (edited to show how to tally totals for individual users):
id_merchant  id_user  num_visits
12           101      3
12           102      8
15           101      6007
17           102      88
17           104      19
17           105      1
You can see how id_merchant links the table to the Merchant table, and id_user links to a further User table.
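As a sketch, the counter table and the explicit update on each check-in could look like this in MySQL (the DDL and the upsert are illustrative, not part of the original answer):

-- Illustrative schema for the counter table described above.
CREATE TABLE MerchantCounter (
    id_merchant INT NOT NULL,
    id_user     INT NOT NULL,
    num_visits  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (id_merchant, id_user)
);

-- Run whenever a check-in transaction is recorded, e.g. user 101 at merchant 12.
INSERT INTO MerchantCounter (id_merchant, id_user, num_visits)
VALUES (12, 101, 1)
ON DUPLICATE KEY UPDATE num_visits = num_visits + 1;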