How does a database handle an auto-number field if its limit is reached? - rdbms

In my tables, I have chosen the id column to be of type int (4 bytes). What I want to know is: how will a database handle it once this limit is reached? Will the database refuse to insert any more records into the table, or what exactly will happen? Also, how should I tackle this type of problem (if the database doesn't handle it by itself)?

I would love to know what you refer to as "the" database, but usually it is an error: the database is full then. You should provide some means of "compacting" the primary key, or, more simply:
Use long integers as keys (8 bytes). Even if you insert 1,000 items per second from now on, an 8-byte key will last for nearly 300 million years. A signed 4-byte integer will only last for about 24 days in this scenario.
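What exactly happens depends on the product, but in PostgreSQL, for example, the failure mode and the fix look roughly like this (a sketch; the table and values are made up):

-- An int column simply rejects values past 2^31 - 1:
CREATE TABLE demo (id integer PRIMARY KEY, payload text);
INSERT INTO demo VALUES (2147483647, 'last id that fits');  -- works
INSERT INTO demo VALUES (2147483648, 'one too many');       -- ERROR: integer out of range
-- A sequence-backed (auto-number) id fails in a similar way once the
-- sequence reaches its maximum value.

-- Widening the key to 8 bytes:
ALTER TABLE demo ALTER COLUMN id TYPE bigint;
INSERT INTO demo VALUES (2147483648, 'fits now');           -- works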

Related

What kind of index should I use in PostgreSQL for a column with 3 values

I have a table with 100 million+ records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with 'N'.
Right now I have a B-tree index built, and the query below takes about 20 minutes to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say at least 15% of each value) and roughly equally distributed throughout the table (some physically at the beginning, some in the middle, some at the end) then no.
If you think about it you'll see why. You'll have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the values as they match. The planner knows this and would probably not use the index at all.
However - if you only have 17 rows with "N", or they are all very recently added to the table and so physically happen to be close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If, however, you mostly insert into this table, you might find a BRIN index helpful. That can let the planner see that e.g. the first 80% of your table doesn't contain any "N" values, so it only needs to look at the last part.
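For illustration, here are the two candidate definitions, using the table and column names from the question; whether either helps depends entirely on the distribution described above:

-- Partial index: only worthwhile if rows with 'N' are a small minority.
CREATE INDEX tableone_export_n_idx ON tableone (export_value) WHERE export_value = 'N';

-- BRIN index: only worthwhile if the 'N' rows are physically clustered
-- (e.g. recent inserts) so whole block ranges can be skipped; its min/max
-- summaries can only rule out ranges whose values don't straddle 'N'.
CREATE INDEX tableone_export_brin_idx ON tableone USING brin (export_value);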

How to quickly insert ~300GB / one billion records with relations into a PostgreSQL database?

I have been working on this for months but still have no solution; I hope I can get help from you ...
The task is, I need to insert/import real time data records from an online data provider into our database.
Real-time data is provided in the form of files. Each file contains up to 200,000 JSON records, one record per line. Every day several (up to tens of) files are provided online, beginning from the year 2013.
I calculated the whole file store and got a total size of around 300GB. I estimated the total number of records (I can get the file sizes via the REST API but not the line count of each file); it should be around one billion records or a little more.
Before I can import/insert a record into our database, I need to determine two values (station, parameter) from the record and establish the relationships.
So the workflow is something like:
1. Find the parameter in our database: if it already exists in our db, just return parameter.id; otherwise insert it into the parameter table as a new entry, and a new parameter.id will be created and returned;
2. Find the station in our database similarly: if that station already exists, just take its id; otherwise a new station will be created along with a new station.id;
3. Then I can insert this JSON record into our main data table with the identified parameter.id and station.id to establish the relationships.
Basically it is a simple database structure with three main tables (data, parameter, station). They are connected by parameter.id and station.id with primary key/foreign key relationships.
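For reference, steps 1 and 2 are a classic "get or create", which PostgreSQL (9.5+) can express per lookup roughly like this (a sketch; the unique constraint on station(longitude, latitude) and the sample coordinates are assumptions, not part of the question):

-- Insert the station only if it is new, then fetch its id.
INSERT INTO station (longitude, latitude)
VALUES (13.4, 52.5)
ON CONFLICT (longitude, latitude) DO NOTHING;

SELECT id FROM station WHERE longitude = 13.4 AND latitude = 52.5;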
But the querying is very time-consuming, and I cannot find a way to insert this amount of data into the database in a foreseeable time.
I did two trials:
Just use normal SQL queries with bulk insert.
The workflow is described above.
For each record a) get parameter.id b) get station.id c) insert this record
Even with bulk insert, I could only insert one million records in a week. The records are not that short and contain about 20 fields.
After that, I tried the following:
I don't check parameter and station in advance, but just use the COPY command to copy the records into an intermediate table with no relations. I calculated that all one billion records can be imported this way in around ten days.
But after the COPY, I need to manually find all distinct stations (there are only a few parameters, so I can ignore the parameter part) with a SELECT DISTINCT or GROUP BY query, create these stations in the station table, and then UPDATE all the records with their corresponding station.id, but this UPDATE operation takes very, very long.
Here is an example:
I spent one and a half days importing 33,000,000 records into the intermediate table.
I queried with select longitude, latitude from records group by longitude, latitude and got 4,500 stations
I inserted these 4,500 stations into the station table
And for each station I run (with that station's values substituted in):
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
The job is still running but I estimate it will take two days
And this is only for 30,000,000 records; I have one billion.
So my question is, how can I insert this amount of data into the database quickly?
Thank you very much in advance!
Update 2020-08-20 10:01:
Thank you all very much for the comments!
#Brits:
Yes and no: all the COPYs took over 24 hours, one COPY command per JSON file. For each file I need to do the following:
1. Download the file
2. Flatten the JSON file (no relation check for station, but with the parameter check; that is very easy and quick, as the parameters are essentially constants in this project) into a CSV-like text file for the COPY command
3. Execute the COPY command via Python
Steps 1, 2 and 3 together take around one minute per JSON file of ~200,000 records if I am on the company network, and around 20 minutes if I work from home.
So just ignore the "one and a half days" statement. My estimate is that, working on the company network, I will be able to import all 300GB / one billion records into my intermediate table, without relation checks, in ten days.
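For context, the COPY in step 3 has this general shape (a sketch; the staging table and its columns are made up, and the CSV is streamed from the Python side, e.g. via psycopg2's copy_expert):

COPY record_staging (obs_time, longitude, latitude, parameter_id, value)
FROM STDIN WITH (FORMAT csv);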
I think my problem now is not moving the JSON content into a flat database table, but building the relationship between data and station.
After I have my JSON content in the flat table, I need to find all stations and update all records as follows:
1. select longitude, latitude, count(*) from records group by longitude, latitude
2. insert these longitude/latitude combinations into the station table
3. for each entry in the station table, do
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
Step 3 above takes very long (and step 1 alone takes several minutes for just 34 million records; I have no idea how long it will take for one billion records).
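For comparison, steps 2 and 3 can be collapsed into two set-based statements instead of a per-station loop (a sketch; it assumes the records/station schema described above and that station.id is generated automatically, e.g. a serial or identity column):

-- Create all missing stations in one pass.
INSERT INTO station (longitude, latitude)
SELECT DISTINCT longitude, latitude FROM records;

-- Stamp every record with its station id in a single joined UPDATE.
UPDATE records
SET stationid = station.id
FROM station
WHERE records.longitude = station.longitude
  AND records.latitude = station.latitude;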
#Mike Organek #Laurenz Albe Thanks a lot. Your comments are still difficult for me to understand; I will study them and give feedback.
The total file count is 100,000+
I am thinking about parsing all the files first to get the individual stations, and then doing the COPYs with station.id and parameter.id already known. I will give feedback.
Update 2020-09-04 09:14:
I finally got what I want, even though I am not sure whether it is correct.
What I have done since my question:
1) Parse all the JSON files, extract unique stations by coordinates, and save them into the station table
2) Parse all the JSON files again and save all the fields into the record table, with the parameter id from constants and the station id from 1), using the COPY command
I did 1) and 2) in a pipeline; they ran in parallel. And 1) took longer than 2), so I always needed to make 2) wait.
And after 10 days, I have my data in Postgres, with ca. 15,000 stations and 0.65 billion records in total; each record has its corresponding station id.
#steven-matison Thank you very much and could you please explain a little bit more?

Does varchar's length have any effect on performance

We are having discussions with our development staff over the use of VARCHAR columns, as they define every varchar field as varchar(255), varchar(500), ... much bigger than the maximum length of the actual data.
Does varchar's length have any effect on performance in DB2? We have found that it is recommended to use char instead of varchar for columns of 30 bytes or less, and our concern is about varchar fields that are greater than 30 bytes.
Allowing excessive column length is not a good idea. If you allow, let’s say, a FirstName column to have maximum length 500, you may find quite a long irrelevant story there eventually, because why not if it’s allowed :)
As for the performance implications:
The only problem arises if extended row size is turned on for the database (you simply can't create such a "wide" table otherwise) and the total length of a row exceeds the tablespace page size. Some varchar column value then gets stored outside the data page, and more I/O will be needed to access such a row in the future. You should keep this behavior in mind, and the probability of it is higher with uncontrolled varchar column lengths.
This can also cause a performance hit with ORGANIZE BY COLUMN tables. There is a limit on the total declared width that can be processed within the Columnar Data Engine; if this limit is breached in a query plan, the remainder of the query will be processed in the Row Data Engine.
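A small illustration of sizing to the data rather than "just in case" (a sketch; the table, columns, and lengths are made up, DB2 syntax assumed):

-- Check what the existing data actually needs before picking a length:
SELECT MAX(LENGTH(first_name)) AS max_first_name FROM person;

-- Then declare lengths close to that, rather than a blanket VARCHAR(500):
CREATE TABLE person_new (
    id         INTEGER NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
);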

Postgresql Hstore and Toast Bloat

I was using hstore on PostgreSQL 9.3.4 to store a count of each time an event happened on a given day of the year, with an update like the following.
days_count = days_count || hstore('x', (coalesce((days_count -> 'x')::integer, 0) + 1)::text)
where x is the day of the year. After running a simulation of expected production behavior, I ended up with a table that was 150MB + 2GB of TOAST + 25-30MB for the index, after ANALYZE and VACUUM.
I am now instead breaking up the above column into one column for each month, like the following:
y_month_days_count = y_month_days_count || hstore('x', (coalesce((y_month_days_count -> 'x')::integer, 0) + 1)::text)
where x is the day of the month and y is the month of the year.
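For context, each such expression sits inside an UPDATE statement along these lines (a sketch; the table name, key column, and WHERE clause are assumptions, and the hstore extension must be installed):

-- Increment the counter for day 15 in that month's column on one row.
UPDATE event_stats
SET y_month_days_count = y_month_days_count
    || hstore('15', (coalesce((y_month_days_count -> '15')::integer, 0) + 1)::text)
WHERE event_id = 42;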
I am still running the simulation right now, but so far, a third of the way through, I am at 60MB + a fairly steady 20-30MB of TOAST + 25-30MB for the index. That means in the end I should have about 180MB + 30-40MB of TOAST + 25-30MB for the index after ANALYZE and VACUUM.
So first, are there any known issues with hstore and TOAST bloat that would explain what I saw with my first setup?
Second, will my current solution of breaking up the columns cause any issues with hstore and performance in the future because of the number of hstore columns on one table? It seems steady now with row counts in the hundreds of thousands, and while I know more columns can make things slower, I am unsure whether this is worse with hstore columns.
Finally, I did find something out. I have one hstore column that represents each hour of a day, so it has 24 different keys. When I run the simulation for just this column I end up with almost no TOAST (in the KB range), but when I run the whole simulation with the days broken up into month columns, my largest hstore has 52 keys.
So for a simple store of either a counter or a word or two, the maximum number of keys before I see any amount of TOAST for hstore is somewhere between 24 and 52 keys.
So first, are there any known issues with hstore and TOAST bloat that would explain what I saw with my first setup?
Yes.
When you update any part of an out-of-line stored TOASTed field like text, hstore or json, the whole field must be re-written as a new row version. This is a consequence of MVCC: it's necessary to retain a copy of every version of the row that might still be visible to another transaction.
The old one can be vacuumed away when it's no longer required by any running transaction, so in practice this has minimal impact so long as autovacuum is running aggressively enough.
So if you're updating lots of rows with big text, hstore or json fields, or updating them frequently, tune autovacuum up so it runs more often and does work faster. Make sure you don't have long running <IDLE> in transaction connections.
You say the table sizes you quoted were "after analyze and vacuum" but I'm guessing you only ran a regular vacuum, so the table bloat would've been freed for re-use by PostgreSQL but not released back to the OS. See if VACUUM FULL compacts it.
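A hedged sketch of that kind of per-table autovacuum tuning (the table name and numbers are placeholders, not recommendations):

-- Vacuum this table much more aggressively than the global defaults.
ALTER TABLE event_stats SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 0
);

-- One-off reclaim of bloat that has already accumulated (takes an
-- exclusive lock and rewrites the table):
VACUUM FULL event_stats;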
Will my current solution of breaking up the columns cause any type of issues with hstore and performance in the future because of the number of hstore columns on one table?
Depends on your query patterns and workload, but probably not.

How to predict PostgreSQL index size

Is it possible to predict the amount of disk space/memory that will be used by a basic index in PostgreSQL 9.0?
E.g. If I have a default B-tree index on an integer column in a table of 1 million rows, how much space would be taken up by the index? Is the entire index held in memory at all times?
Not really a definitive answer, but I looked at a table in a 9.0 test system I have, with a couple of int indexes on a table of 280k rows. The indexes all report a size of 6232kB, so roughly 22 bytes per row.
There is no way to say for sure. It depends on the kind of operations you will perform, as PostgreSQL stores many different versions of the same row, and index entries pointing to those row versions accumulate in the index files.
Just create the table you are interested in, populate it, and check.
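A minimal way to follow that advice and measure directly (a sketch; names are illustrative):

-- Build a throwaway one-million-row table, index it, and measure the index.
CREATE TABLE idx_probe AS SELECT g AS id FROM generate_series(1, 1000000) AS g;
CREATE INDEX idx_probe_id ON idx_probe (id);
SELECT pg_size_pretty(pg_relation_size('idx_probe_id'));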