Database design to send notifications to all users - postgresql

I'm looking for a solution to create a notification, notify all users, and record their reaches and views. We have around tens of thousands of users.
If every new notification requires writing a record for every user, the database may be overloaded by the surge of writes.
Do you have a better design for this use case? Thank you in advance.
I use PostgreSQL, with two tables roughly like the ones below.
CREATE TABLE notification (
    id BIGSERIAL PRIMARY KEY,
    notification_message VARCHAR(255)
);

CREATE TABLE notification_user (
    user_id BIGINT,
    notification_id BIGINT,
    status VARCHAR
);

Without a lot more details there is not much anyone can advise you on, but do not dwell over a measly 90K rows. First off, I have no idea of your design, but assuming you have normalized, you should have 3 tables here: users, notifications, and user_notifications. Put something together and TEST it; that is the only way to determine whether you actually have an issue or just the presumption of an issue.
I have put together a small demo. I like round numbers, so I used 100K users and a simple query that inserts a notification plus a user_notification row for each user. I then ran a script that inserts 1, 2, 3, 4, 5, 10, and 25 notifications, which results in 100K to 2.5M rows, and captured the time. All on my "play machine". This is not a formal performance test, just more of a back-of-the-envelope test.
Environment
Acer laptop with
Intel I5 1.6GHz 4Core 8GB Ram
Windows 10 Home 64bit
Postgres 12.0
IDE: DBeaver 7.0.0
Overall, a very much underrated server.
Results:
users = 100,000

notices    # rows       time (sec)
-------    ---------    ----------
      1      100,001         1.750
      2      200,002         3.781
      3      300,003         5.500
      4      400,004         7.663
      5      500,005         9.367
     10    1,000,010        21.186
     25    2,500,025        60.600
# rows includes the notification + user_notification inserts
See the fiddle for the full sample, though it has only 100 users, not 100K. I don't know what performance your server can provide, but it should be more than my toy.
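A minimal sketch of this kind of fan-out insert, assuming the notification and notification_user tables from the question plus a users table with an id column (names are illustrative, not taken from the original post):

WITH new_notification AS (
    INSERT INTO notification (notification_message)
    VALUES ('System maintenance tonight')
    RETURNING id
)
-- Fan the new notification out to every user in a single statement.
INSERT INTO notification_user (user_id, notification_id, status)
SELECT u.id, n.id, 'unread'
FROM users u
CROSS JOIN new_notification n;

If the demo above does something similar, the timings suggest that 100K user_notification rows per notification is well within reach of modest hardware.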

Related

What is the best way to move millions of rows of data from one Postgres database to another?

So we have a task at the moment where I need to move millions of records from one database to another.
To complicate things slightly, I need to change an id on each record before inserting the data.
How it works is we have 100 stations in database a.
Each station contains 30+ sensors.
Each sensor contains readings for about the last 10 years.
These readings are recorded at anywhere from 15-minute to daily intervals.
So each station can have at least 5M records.
Database b has the same structure as database a.
The reading table contains the following fields
id: primary key
sensor_id: int4
value: numeric(12)
time: timestamp
What I have done so far for one station is:
Connect to database a and select all readings for station 1
Find all corresponding sensors in database b
Change the sensor_id from database a to its new sensor_id from database b
Chunk the updated sensor_id data to groups of about 5000 parameters
Loop over the chunks and do a mass insert
In theory, this should work.
However, I am getting errors saying duplicate key violates unique constraint.
If I query the database on those records that are failing, the data doesn't exist.
The weird thing about this is that if I run the script 4 or 5 times in a row all the data eventually gets in there. So I am at a loss as to why I would be receiving this error because it doesn't seem accurate.
Is there a way I can get around this error from happening?
Is there a more efficient way of doing this?
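One way around the duplicate-key error (not something the post above tries) is to make each chunked insert idempotent by letting PostgreSQL skip rows whose key already exists instead of aborting the whole batch. This sketch assumes the unique constraint that fires is on (sensor_id, time); adjust the conflict target to whatever unique constraint the reading table actually has:

-- Illustrative values; in practice the chunked rows come from database a
-- with sensor_id already remapped to its id in database b.
INSERT INTO reading (sensor_id, value, "time")
VALUES
    (42, 17.5, '2020-01-01 00:00:00'),
    (42, 18.1, '2020-01-01 00:15:00')
ON CONFLICT (sensor_id, "time") DO NOTHING;

With that in place, re-running the script is harmless, which also sidesteps the "run it 4 or 5 times until everything is in" behaviour described above.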

Concurrent data insert client (golang) results in first 50 rows missing in database (postgres), but the rest of the 390 are okay

I am pulling down stock market data and inserting it into a PostgreSQL database. I have 500 stocks for 60 days of historical data. Each day has 390 trading minutes, and each minute is a row in the database table. The summary of the issue is that the first 20-50 minutes of each day are missing for each stock. Sometimes it's fewer than 50, but it is never more than 50. Every minute after that for each day is fine (EDIT: on further inspection there are missing minutes all over the place). The maximum number of missing rows matches the max number of concurrent goroutines (https://github.com/korovkin/limiter).
The hardware is set up in my home. I have a laptop that pulls the data, and an 8-year-old gaming computer that has been repurposed as a Postgres server running Ubuntu. They are connected through a Netgear Nighthawk X6 router and communicate over the LAN.
The laptop runs a Go program that pulls data down and performs concurrent inserts. I loop through the 60 days, for each day I loop through each stock, and for each stock I loop through each minute and insert it into the database via an INSERT statement. Inside the minute loop I use a library that limits the max number of goroutines.
I am working around it by grabbing the data again and inserting until the first time the Postgres server responds that the entry is a duplicate and violates the unique constraints on the table, then breaking out of the loop for that stock.
However, I'd like to know what happened, as I want to better understand how these problems can arise under load. Any ideas?
limit := NewConcurrencyLimiter(50)
for _, m := range ms {
    limit.Execute(func() {
        m.Insert() // every closure captures the same loop variable m
    })
}
limit.Wait()
The issue is that the closures all capture the loop variable m, which is reused on every iteration, so by the time a goroutine runs it may see a later value rather than the one it was started for. I needed to copy the value I wanted inserted within the for loop, and change the method away from a receiver to one that takes the value as an input parameter:
for i := range ms {
    value := ms[i] // copy made inside the loop, one per iteration
    limit.Execute(func() {
        Insert(value) // each closure now captures its own copy
    })
}
limit.Wait()

PostgreSQL table design for frequent "save" action in web app

Our web based app with 100,000 concurrent users has a use case where we auto-save the user's activity every 5 seconds. Consider a table like this:
create table essays
(
    id uuid not null constraint essays_pkey primary key,
    userId text not null,
    essayparts jsonb default '{}'::jsonb,
    create_date timestamp with time zone default now() not null,
    modify_date timestamp with time zone default now() not null
);
create index essays_create_idx on essays ("create_date");
create index essays_modify_idx on essays ("modify_date");
This works well for us, as all the stuff related to a user's essay such as title, brief byline, requestor, full essay body, etc. is stored in the essayparts column as JSON. For auto-saving the essay, we don't insert new rows all the time though; we update each ID (each essay) with all its components.
So there are plenty of updates per essay, as writing is a time-consuming and thoughtful activity. Given the auto-save every 5 seconds, if a user were to write for half an hour, we'd have updated her essay around 360 times.
This would be fine with the "HOT" (heap only tuples) functionality of PostgreSQL; we're using v10, so we are fine there. However, the challenge is that we also update the modify_date column every time the essay is saved, and that column has an index too. That means these updates do not qualify as HOT updates, and a lot of fragmentation occurs.
I suppose in the web or mobile world this is not an unusual pattern. Many services seem to auto-save content. Are they insert only? If so, if the user logs out and comes back in, how do they show the records, by looking at the max(modify_date)? Or is there any other mechanism to leverage HOT updates while also updating an indexed column in the table?
Appreciate any pointers, thank you!
Performing an update every 5 seconds with 100,000 concurrent users will produce 20,000 updates per second. This is quite challenging as such, and you would need a good system to pull it off, but autovacuum will never be able to keep up if those updates are not HOT.
You have several options:
Choose a relational database management system other than PostgreSQL that updates rows in place.
Do not index modify_date and hope that HOT will do the trick (a sketch of this option follows below).
Perform these updates far less often than once every 5 seconds (who needs auto-save every 5 seconds anyway?).
Auto-save the data somewhere other than in the database.
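A minimal sketch of the second option, assuming the essays schema from the question. The fillfactor line is not mentioned above but is commonly paired with HOT, since it leaves free space on each page for new tuple versions; pg_stat_user_tables then shows whether updates are actually HOT:

-- Drop the index that disqualifies every auto-save from being a HOT update.
DROP INDEX IF EXISTS essays_modify_idx;

-- Optional: leave room on each page for HOT updates (applies to newly written pages).
ALTER TABLE essays SET (fillfactor = 90);

-- Verify: n_tup_hot_upd should track n_tup_upd closely once HOT kicks in.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'essays';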

How can I set up my SQL Server to email me when databases are approaching max size?

I was working on an old project the other day over VPN for a client and found an issue where I was purging data off of the wrong PK; as a result their database was huge and slow to return info, which was causing our software to freak out.
I got to thinking that I would just like to know when I am approaching the max size. I know how to set up SQL Server for email notification, but I've only sent test messages. I looked at my database's properties hoping I would see some options related to email, but I saw nothing.
I've seen where you can send out an email after a job, so I'm hoping you can do this too. Does anyone know how I can achieve this?
sys.database_files has a size column which stores the number of pages. A page is 8 KB (8,192 bytes), so multiplying by 8 * 1.024 = 8.192 and dividing by 1,000 gives the file size on disk in MB. Just replace [database name] with the actual name of your database, and adjust the size check if you want something other than 2 GB as the warning threshold.
DECLARE @size DECIMAL(20,2);

SELECT @size = SUM(size * 8.192) / 1000
FROM [database name].sys.database_files;

IF @size >= 2000 -- this is in MB
BEGIN
    -- send e-mail (a sketch follows below)
    PRINT 'Database approaching size limit';
END
If you want to do it for all databases, you can do this without going into each individual database's sys.database_files view, by using master.sys.sysaltfiles - I have observed that the size column here is not always in sync with the size column in sys.database_files - I would trust the latter first.
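For the send-e-mail step, a hedged sketch using Database Mail; it assumes Database Mail is already configured and that the profile name and recipient below are placeholders to replace with your own:

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'DBA Mail',          -- assumed profile name
    @recipients   = N'dba@example.com',   -- assumed recipient
    @subject      = N'Database approaching size limit',
    @body         = N'A database has crossed the 2 GB warning threshold.';

Scheduling the whole check as a SQL Server Agent job is the natural fit for the "send out the email after a job" approach you mention.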

Database design challenge

I'm creating a virtual stamp card program for the iPhone and have run into an issue with implementing my database. The program essentially has a main points system that can be utilized through all merchants (sort of like Air Miles), but I also want to keep track of how many times you've been to EACH merchant.
So far, I have created 3 main tables for users, merchants, and transactions.
1) Users table contains basic info like user_id and total points collected.
2) Merchants table contains info like merchant_id, location, total points given.
3) Transactions table simply creates a new row for every time someone checks into each merchant, and records date-stamp, user name, merchant name, and points awarded.
So the most basic way to find out how many times you've been to each merchant is to query the entire transaction table for both user and merchant. This gives me a transaction history of how many times you've been to that specific merchant (which is perfect), but in the long run I feel this will be horrible for performance.
The other straightforward, yet "dumb" method for implementing this would be to create a column in the users table for EACH merchant and keep the running totals there. This seems inappropriate, as I will be adding new merchants on a regular basis, and a new column would have to be added to every user each time that happens.
I've looked into one-to-many and many-to-many relationships for MySQL databases, but can't seem to come up with anything concrete, as I'm extremely new to web/PHP/MySQL development, but I'm guessing this is what I'm looking for...
I've also thought of creating a special transaction table for each user, which will have a column for merchant and another for the # of times visited. Again, not sure if this is the most efficient implementation.
Can someone point me in the right direction?
You're doing the right thing in the sense of thinking up the different options, and weighing up the good and bad for each.
Personally, I'd go with a MerchantCounter table which joins on your Merchant table by id_merchant (for example) and which you keep up-to-date explicitly.
Over time it does not get slower (unlike an activity-search), and does not take up lots of space.
Edit: based on your comment, Janan, no I would use a single MerchantCounter table. So you've got your Merchant table:
id_merchant   nm_merchant
         12   Jim
         15   Tom
         17   Wilbur
You would add a single additional table, MerchantCounter (edited to show how to tally totals for individual users):
id_merchant   id_user   num_visits
         12       101            3
         12       102            8
         15       101         6007
         17       102           88
         17       104           19
         17       105            1
You can see how id_merchant links the table to the Merchant table, and id_user links to a further User table.
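A minimal sketch of that design in MySQL, with names taken from the sample data above (the exact columns and types are illustrative):

CREATE TABLE MerchantCounter (
    id_merchant INT NOT NULL,
    id_user     INT NOT NULL,
    num_visits  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (id_merchant, id_user)   -- one running total per merchant/user pair
);

-- On each check-in: create the row on the first visit, increment it afterwards.
INSERT INTO MerchantCounter (id_merchant, id_user, num_visits)
VALUES (12, 101, 1)
ON DUPLICATE KEY UPDATE num_visits = num_visits + 1;

Running that increment in the same transaction that writes the Transactions row keeps the counter and the check-in history in step.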