PostgreSQL - Clustering never completes - long key?

I am having problems with clustering a table where the key consists of one char(23) field and two timestamp fields. The char(23) field contains alphanumeric values. The clustering operation never finishes; I have let it run for 24 hours and it still did not complete.
Has anyone run into this kind of problem before? Does my theory that the long key field is the reason make sense? We have dealt with much larger tables that do not have long keys and we have always been able to perform DB operations on them without any problem. That makes me think it might have to do with the size of the key in this case.

CLUSTER rewrites the table, so it must wait on locks. It is possible that it is never getting the lock it needs. Why are you setting varchar(64000)? Why not just unrestricted varchar? And how big is this index?
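If you suspect it is stuck on a lock, a quick check from another session can tell you whether the CLUSTER backend is actually working or just waiting. A minimal sketch (the wait_event columns assume PostgreSQL 9.6 or later; older versions expose a boolean waiting column instead):

-- Is the CLUSTER backend running or blocked on a lock?
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'cluster%';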
If size is a problem, it has to be based on the index size, not the key size. I don't know what the effect of TOASTed key attributes is on CLUSTER, because these are moved into extended storage. TOAST might complicate CLUSTER, and I have never heard of anyone clustering on a TOASTed attribute; it wouldn't make much sense to do so. TOASTing is necessary for any attribute more than 4k in size.
A better option is to create an index on the columns without the possibly TOASTed value, and then cluster on that. That should give you something very close to what you'd get otherwise.
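For example, a minimal sketch with hypothetical names (big_table, start_ts and end_ts are not from the original post):

-- Index only the columns that will never be TOASTed...
CREATE INDEX big_table_ts_idx ON big_table (start_ts, end_ts);

-- ...and cluster on that index instead of the full key.
CLUSTER big_table USING big_table_ts_idx;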

Related

Updating null value will split row in file Postgresql

I have a colleague who tells me that the reason we add default values instead of null values to our tables is that PostgreSQL allocates a number of bytes in a file when a new row is stored, and if that column gets updated later on, it might end up splitting the row into two rows in the file, so multiple IO operations will have to occur when reading and writing.
I'm not a PostgreSQL expert at all, and I have a hard time finding any documentation suggesting this.
Can someone clarify this for me?
Is this a good reason for not having null values in a column and using some default instead? Will there be any huge performance issues in such cases?
I'm not sure I'd say the documentation is hard to find:
https://www.postgresql.org/docs/10/storage-file-layout.html
https://www.postgresql.org/docs/current/storage-page-layout.html
It's fair to say there is a lot to absorb though.
So, the reason you SHOULD have defaults rather than NULLs is because you don't want to have an "unknown" in your column. Start with the requirements before worrying about efficiency tweaks.
Whether a particular value is null is stored in a bitmap. This bitmap is optional - so if there are no nulls in a row then the bitmap is not created. So - that suggests nulls would make a row bigger. But wait, if a bit is set to show null then you don't need the overhead of the value structure, and (IIRC - you'll need to check the docs) that can end up saving you space. There is a good chance that general per-row overheads and type alignment issues are far more important to you though.
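If you want to see the effect yourself, pg_column_size() lets you compare row sizes directly. A rough sketch (the exact numbers depend on alignment and your PostgreSQL version):

-- Two rows, identical except that one column is NULL.
CREATE TEMP TABLE t (a int, b text);
INSERT INTO t VALUES (1, 'some value'), (1, NULL);

-- pg_column_size(t.*) reports the size of each row value.
SELECT a, b, pg_column_size(t.*) AS row_bytes FROM t;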
However - all of this is ignoring the elephant* in the room which is that if you update a row then PostgreSQL marks the current version of the row as expired and creates a whole new row. So the whole description of how updates work is just confused in that first paragraph you wrote.
So - don't worry about the efficiency of nulls in 99.9% of cases. Worry about using them properly and about the general structure of your database, its indexes and queries.
* no I'm not apologising for that pun.

What about expected performance in Pentaho?

I am using Pentaho to create ETLs and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL, and it takes 17 hours.
I do not know how good or bad this performance is. Do you know how to measure whether the time a process takes is good? At least as a reference, to know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of the ETL process it loads 2M rows. From that I calculated how long it would take to load all the rows; the expected result was 6 hours, but then the performance decreases and it takes 17 hours.
I have been searching Google and I cannot find any time references nor any explanations about performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours, this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has (see the sketch after this list):
the exact same columns (CREATE TABLE dummy (LIKE target_table))
no indexes and no constraints (I think that is the default; double-check the created table)
a BEFORE INSERT trigger on it which returns NULL and drops the row.
The rows will be processed, just not inserted.
Is it fast now? OK, so the problem was insertion.
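Roughly like this, as a sketch with a hypothetical target_table (on PostgreSQL 11+ you can write EXECUTE FUNCTION instead of EXECUTE PROCEDURE):

-- Same columns as the target; indexes, defaults and check constraints are not copied.
CREATE TABLE dummy (LIKE target_table);

-- Returning NULL from a BEFORE trigger silently drops the row.
CREATE FUNCTION discard_row() RETURNS trigger AS $$
BEGIN
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER drop_everything
    BEFORE INSERT ON dummy
    FOR EACH ROW EXECUTE PROCEDURE discard_row();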
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE); see the sketch below. These do not have any crash-resistance because they bypass the WAL, but for importing data that's OK... if it crashes during the insert you're going to wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
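The unlogged variant is just (again with a hypothetical target_table):

-- Skips the WAL entirely; the contents are truncated after a crash.
CREATE UNLOGGED TABLE target_unlogged (LIKE target_table);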
With the table loaded with data, add indexes and constraints one by one, and find out whether you have, say, a CHECK that uses a slow SQL function, or an FK into a table which has no index, that kind of stuff. Just check how long it takes to create each constraint (see the sketch after the note below).
Note: on an import like this you would normally add indices and constraints after the import.
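In psql that can be as simple as the following (constraint, table and column names are hypothetical; \timing prints how long each statement took):

\timing on
-- Slow if customers(id) has no index, or simply because of data volume.
ALTER TABLE target_table
    ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers (id);
CREATE INDEX ON target_table (created_at);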
My gut feeling is that PG is checkpointing like crazy due to the large volume of data, because the checkpoint settings in the config are too low. Or some issue like that, probably related to random IO writes. You put the WAL on a fast SSD, right?
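If checkpointing does turn out to be the culprit, these are the usual knobs (the values are only illustrative, not recommendations; this requires superuser rights and a config reload):

ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();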
17 hours is too much. Far too much. For 200 million rows, even 6 hours is a lot.
Hints for optimization:
Check the memory size: edit spoon.bat, find the line containing -Xmx and change it to half your machine's memory size. Details vary with the Java version; this example is for PDI V7.1.
Check that the query from the source database is not taking too long (because it is too complex, or because of server memory size, or something else).
Check the target commit size (try 25000 for PostgreSQL), that Use batch update for inserts is on, and also that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors that are difficult to identify and debug due to data casting.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

ef-core - bool index vs historical table

I'm using ASP.NET Core and EF Core with SQL Server. I have an 'order' entity. As I'm expecting the orders table to be large, and the most frequent query will get only the active orders for a certain customer (active orders are just a tiny fraction of the whole table), I'd like to optimize the speed of the query, but I can't decide between these two approaches:
1) I don't know if this is possible as I haven't done it before, but I was thinking about creating a Boolean column named 'IsActive' and making it an index, so that querying only active orders would be faster.
2) When an order becomes inactive, move it to another table, e.g. HistoricalOrders, thus keeping the orders table small.
Which of the two would give better results? Or is neither a good solution, and a third approach could be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition. This includes the clustered index. This is quite awkward. The query optimizer requires that you add a dummy predicate WHERE IsActive IN (0, 1) to make it able to still seek on such indexes. Of course, this will now also touch the cold data, so you probably need to know the concrete value of IsActive, or try the 1 value first and be sure that it matches 99% of the time.
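A sketch of what that looks like in T-SQL (dbo.Orders, the column list and the literal customer id are all hypothetical):

-- Every index you want hot/cold partitioned gets IsActive as its leading column.
CREATE NONCLUSTERED INDEX IX_Orders_IsActive_CustomerId
    ON dbo.Orders (IsActive, CustomerId)
    INCLUDE (OrderDate, Total);

-- A query that does not know IsActive needs the dummy predicate so the optimizer
-- can expand it into one seek per value; note it will also touch the cold rows.
SELECT OrderId, OrderDate, Total
FROM dbo.Orders
WHERE IsActive IN (0, 1)
  AND CustomerId = 42;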
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do this is to use partitioning. Here, the query optimizer is accustomed to probing multiple partitions anyway, but again you don't want it to probe cold data. Even if it does not find anything, the probe will pull pages into memory, making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you care about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note, that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before thinking about a new HistoricalOrders table, just create a column named IsActive and test it with your data. You don't need to make it an index column, because indexes eat up storage and slow down writes and updates, so we must be very careful when we create an index. When you query the data, do it as shown below. In the query below, the data selection (or filter) is done on the SQL Server side (IQueryable), so it is very fast.
Note: use AsNoTracking() too. It will boost the performance as well.
var activeOrders = _context.Set<Orders>()
    .Where(o => o.IsActive == true)
    .AsNoTracking()
    .ToList();
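For reference, the SQL that EF Core sends for that query looks roughly like this (the exact column list depends on your model; the shape here is only illustrative):

SELECT [o].[Id], [o].[CustomerId], [o].[IsActive]
FROM [Orders] AS [o]
WHERE [o].[IsActive] = CAST(1 AS bit);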
Reference: AsNoTracking()

designing table for high insert rate

I would like some suggestions on how to design a table that gets 10 to 50 million inserts a day and needs to respond quickly to selects. Should I use indexes, or would the overhead cost be too great?
Edit: I'm not worried about the transaction volume... this is actually an assignment... and I need to figure out a design for a table that "must respond very well to selects not based on the primary key, knowing that this table will receive an enormous amount of inserts day-in-day-out".
Definitely. At least the primary key, foreign keys, and then whatever you need for reporting; just don't overdo it. 10k-50k inserts a day is not a problem. If it was, like, I don't know, a million inserts, then you could start thinking of having separate tables, data dictionaries and what not, but for your needs I wouldn't worry.
Even if you did 50,000 per day and your day was an 8 hour work day, that would still be less than two inserts per second on average. I suppose you might get peaks that are much higher than that, but in general, SQL Server can handle much higher transaction rates than what you seem to have.
If your table is fairly wide (lots of columns or a few really long ones) then you might want to consider clustering by a surrogate (IDENTITY) column. Your volumes aren't enough to make for a bad hot-spot at the end of the table. In combination with this, use indexes for any keys needed for data consistency (i.e. FK's) and retrieval (PK, natural key, etc). Be careful about setting the fill factor on your indexes and consider rebuilding them during a periodic down-time window.
If your table is fairly narrow, then you could possibly consider clustering on the natural key, but you'll have to make sure that your response time expectations can be met.
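A sketch of that layout (the schema, names and fill factor are hypothetical, not a prescription):

-- Minimal parent table so the FK below has something to reference.
CREATE TABLE dbo.Devices (DeviceId INT NOT NULL PRIMARY KEY);

-- Wide, insert-heavy table clustered on a surrogate IDENTITY key.
CREATE TABLE dbo.Readings (
    ReadingId   BIGINT IDENTITY(1,1) NOT NULL,
    DeviceId    INT           NOT NULL,
    ReadingTime DATETIME2     NOT NULL,
    Payload     NVARCHAR(400) NULL,
    CONSTRAINT PK_Readings PRIMARY KEY CLUSTERED (ReadingId),
    CONSTRAINT FK_Readings_Devices FOREIGN KEY (DeviceId) REFERENCES dbo.Devices (DeviceId)
);

-- Only the indexes needed for the FK and for retrieval, with some room to grow.
CREATE NONCLUSTERED INDEX IX_Readings_DeviceId_ReadingTime
    ON dbo.Readings (DeviceId, ReadingTime)
    WITH (FILLFACTOR = 90);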
The best rate comes from having the PK sort order the same as the insert order and no other indexes. 10-50 thousand a day is not that much. If there are only inserts then I don't see any downside to dirty reads.
If you are optimizing for select then use row level locking for inserts.
Measure index fragmentation. Defragment the indexes on a regular basis with a proper fill factor. The fill factor determines how fast the indexes fragment and how often you need to defragment.
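Something along these lines (SQL Server; dbo.Readings is a hypothetical table and 90 is only an example fill factor):

-- How fragmented are the indexes on the table?
SELECT i.name, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Readings'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- Rebuild with an explicit fill factor during a down-time window.
ALTER INDEX ALL ON dbo.Readings REBUILD WITH (FILLFACTOR = 90);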

How do I reset the primary key count/max in Core Data?

I've managed to delete all entities stored using Core Data (following this answer).
The problem is, I've noticed the primary key is still counting upwards. Is there a way (without manually writing a SQL query) to reset the Z_MAX value for the entity? Screenshot below to clarify what I mean.
The value itself isn't an issue, but I'm just concerned that at some point in the future the maximum integer may be reached and I don't want this to happen. My application syncs data with a web service and caches it using core data, so potentially the primary key may increase by hundreds/thousands at a time. Deleting the entire Sqlite DB isn't an option as I need to retain some of the information for other entities.
I've seen the 'reset' method, but surely that will reset the entire Sqlite DB? How can I reset the primary key for just this one set of entities? There are no relationships to other entities with the primary key I want to reset.
I'm just concerned that at some point in the future the maximum integer may be reached and I don't want this to happen.
Really? What type is your primary key? Because if it's anything other than an Int16 you really don't need to care about that. A signed 32-bit integer gives you 2,147,483,647 values. A 64-bit signed integer gives you 9,223,372,036,854,775,807 values.
If you think you're going to use all those up, you probably have more important things to worry about than having an integer overflow.
More importantly, if you're using Core Data you shouldn't need to care about or really use primary keys. Core Data isn't a database - when using Core Data you are meant to use relationships and not really care about primary keys. Core Data has no real need for them.
Core Data uses 64-bit integer primary keys. Unless I/O systems get many orders of magnitude faster (which, unlike CPUs, they have not in recent years), you could save rows as fast as possible for millions of years.
Please file a bug with bugreport.apple.com when you run out.
Ben
From the sqlite faq:
If the largest possible integer key, 9223372036854775807, is already in use, then an unused key value is chosen at random.
9,223,372,036,854,775,807 / (1024^4) = 8,388,608 tera-rows. I suspect you will run into other limits first. :) http://www.sqlite.org/limits.html reviews the more practical limits you'll run into.
asking sqlite3 about a handy core data store yields:
sqlite> .schema zbookmark
CREATE TABLE ZBOOKMARK ( Z_PK INTEGER PRIMARY KEY, ...
Note the lack of "autoincrement" (which in sqlite means never reusing a key). So Core Data does allow old keys to be reused, and you're pretty safe even if you manage to add (and remove most of) that many rows over time.
If you really do want to reset it, poking around in apple's z_ tables is really the only way. [This is not to say that this is a thing you should in fact do. It is not (at least in any code you want to ship), even if it seems to work.]
Besides the fact that directly/manually editing a Core Data store is a horrendously stupid idea, the correct answer is:
Delete the database and re-create it.
Of course, you're going to lose all your data doing that, but if you're that concerned about this little number, then that's ok, right?
Oh, and Core Data will make sure you don't have primary key collisions.