Newsequentialid() as an Index Column? - tsql

I have a GUID column populated by newsequentialid(), and I wanted to know whether I should put a clustered or a non-clustered index on the column, as I will be using the GUID to query on.
I will also be inserting data into the table every week.

In general, yes, put a clustered index on your table. But which columns should you use as your clustered index? Please see this answer; it should point you in the right direction, especially the link to GUIDs as PRIMARY KEYs and/or the clustering key. Much depends on the nature of your GUID. To paraphrase Kimberly Tripp's article: a GUID that is not sequential can be a bad choice. Examples:
Generated in the client (using .NET)
Generated by the newid() function
As there is quite a bit more to this decision than just that, I recommend you read the full answer provided by marc_s and that entire article by Kimberly Tripp.
As far as the weekly data insert via SSIS consider these steps:
Drop your target table's indexes prior to the insert
Load your target table with a Data Flow
Configure your OLE DB Destination with "Data access mode:" = "Table or view - fast load"
Rebuild your target table's indexes after the insert
Depending on your specific situation, this can give you a dramatically faster insert and leave you with clean indexes whose statistics are up to date and whose fragmentation is zero or near zero.
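A minimal T-SQL sketch of steps 1 and 4, with made-up index and table names (the fast load itself runs in the SSIS Data Flow between them):
-- Step 1: drop (or disable) the indexes before the load.
DROP INDEX IX_TargetTable_RowGuid ON dbo.TargetTable;

-- ... the SSIS Data Flow with "Table or view - fast load" runs here ...

-- Step 4: re-create the indexes after the load.
CREATE NONCLUSTERED INDEX IX_TargetTable_RowGuid
    ON dbo.TargetTable (RowGuid);
-- If you only disabled the index instead of dropping it:
-- ALTER INDEX IX_TargetTable_RowGuid ON dbo.TargetTable REBUILD;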
I realize your question states you are using newsequentialid(). While reading this article on TechNet, I noticed that the sequential nature of the values produced by this function comes with a caveat: after restarting Windows, the GUID can start again from a lower range, but it is still globally unique. Would this impact your downstream usage of the GUID?
I guess I am wondering why you find it necessary to use a GUID, and why that is a better choice in your situation as opposed to an integer-based SEQUENCE, which would make your primary key's clustered index significantly smaller.
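For illustration only, a sketch of what that alternative could look like (the table and names are made up, not taken from your schema): a SEQUENCE-backed bigint as the clustered primary key, with the GUID kept in a non-clustered index for lookups.
CREATE SEQUENCE dbo.OrderIdSeq AS bigint START WITH 1 INCREMENT BY 1;

CREATE TABLE dbo.Orders (
    OrderId   bigint NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT (NEXT VALUE FOR dbo.OrderIdSeq)
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,   -- narrow 8-byte clustering key
    OrderGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_OrderGuid DEFAULT NEWSEQUENTIALID(),
    OrderDate datetime2 NOT NULL
);

-- The GUID can still be queried quickly through a non-clustered index.
CREATE NONCLUSTERED INDEX IX_Orders_OrderGuid ON dbo.Orders (OrderGuid);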

Related

Partitioning instead of thousands of tables with identical structure

I have a legacy, but pretty big (~25 GB), questionably designed database. The "pattern" commonly used across the whole database boils down to the following:
separate logical parts into different tables (journal_1, journal_2, journal_n)
all tables have a unique bigserial/autoincrement field (journal_id_seq_1, journal_id_seq_2, journal_id_seq_n)
all tables have one or several foreign keys to one or several reference tables (journal tables have 2 foreign keys; a group of tables with a different structure (log_1, log_2, log_n) references just one)
I'm extremely curious (actually close to panic :) about what happens when there are about 50 thousand such tables (right now there are "just" about 15k).
My idea is to gather everything with identical structure into one huge table with a common name (let's say journal), add a journal_id column (extracted from the journal_{1|2|3} suffix), partition by this column, and obviously create partition tables for each old table with the same naming convention. Moreover, the bigserial fields need to be converted into regular bigints, but I still need to keep a sequence for each partition and manually call nextval on every insert. The primary key also needs to be extended with the journal_id field in addition to seq_id. Finally, I see a bonus in sharding, which can be applied to partitions when the database becomes enormous.
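A rough sketch of that idea (column names are made up; assumes PostgreSQL 11+ declarative partitioning):
CREATE TABLE journal (
    journal_id int    NOT NULL,       -- derived from the old journal_{n} suffix
    seq_id     bigint NOT NULL,       -- old bigserial value, now a plain bigint
    payload    text,
    PRIMARY KEY (journal_id, seq_id)  -- PK extended with journal_id
) PARTITION BY LIST (journal_id);

-- one partition per old table, keeping the naming convention
CREATE TABLE journal_1 PARTITION OF journal FOR VALUES IN (1);
CREATE TABLE journal_2 PARTITION OF journal FOR VALUES IN (2);

-- one sequence per partition, called explicitly on insert
CREATE SEQUENCE journal_id_seq_1;
INSERT INTO journal (journal_id, seq_id, payload)
VALUES (1, nextval('journal_id_seq_1'), 'example row');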
Please share your thoughts about this strategy, and especially about foreign keys. For now we need max_locks_per_transaction set to at least 512, otherwise pg_dump fails with:
ERROR: out of shared memory HINT: You might need to increase max_locks_per_transaction. pg_dump: error: query was: LOCK TABLE.
Besides the locking nightmare, as far as I know, Postgres has limits on the number of relations per database (the total number is huge but not unlimited). Do I need to create foreign keys for each partition table, or will just a part (certain rows) of the partitioned (parent) table be locked on insert, delete or update, since all partitions are just "storage" and not real relational entities?
Thank you in advance.
15K tables == Gag!
Partitioning is not likely to be any better than multiple tables.
Neither provides any performance benefits except in rare cases.
Let's investigate the future need for sharding. That alone may justify the existence of journal_nnn. In that scenario, some journals would be on one machine, some on another, etc. Are all journals in active use? Or are most of them "old" and not really used?
The PRIMARY KEY can be a composite of two (or more) columns.
AUTO_INCREMENT has some advantages over manually creating "serial numbers". (However, the Question does not have enough details for me to elaborate.)
FOREIGN KEYs are two things: an implied INDEX (good for performance) and a constraint (good for integrity). In a well-debugged app, the integrity checks are unnecessary overhead. They must be abandoned in partitioning and probably in sharding.
Why do you use partitioning for such a small database? Your average table is less than 2 MB in size; that's really, really small.
Get rid of the partitioning and your problems are gone.
Having 50000 tables starts to get painful, and it makes no sense with a small database like this. The same holds for partitioning – after all, partitions are tables with a side job.
I would define only one table per object type.
About the auto-generated primary key numbers: make a combined primary key that consists of the old primary key and the table number (journal_id). For new entries, use a sequence that is initialized higher than the existing maximum of all tables.
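A hedged sketch of that suggestion, assuming the rows from journal_1 .. journal_n have already been copied into one merged journal table (column names are made up):
-- one shared sequence, initialized above the maximum of all migrated rows
CREATE SEQUENCE journal_seq;
SELECT setval('journal_seq',
              coalesce((SELECT max(seq_id) FROM journal), 1));

-- new entries take the next value from the shared sequence
INSERT INTO journal (journal_id, seq_id, payload)
VALUES (7, nextval('journal_seq'), 'new entry');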

ef-core - bool index vs historical table

I'm using asp.net-core and ef-core with SQL Server. I have an 'order' entity. As I'm expecting the orders table to be large, and the most frequent query will fetch only the active orders for a certain customer (active orders are just a tiny fraction of the whole table), I would like to optimize the speed of the query, but I can't decide between these two approaches:
1) I don't know if this is possible as I haven't done this before, but I was thinking about creating a Boolean column named 'IsActive' and making it an index, so that querying only active orders would be faster.
2) When an order becomes inactive, move the order to another table, i.e. HistoricalOrders, thus keeping the orders table small.
Which of the two would give better results? Or is neither a good solution, and could a third approach be suggested?
If you want to partition away cold data, then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition, including the clustered index, which is quite awkward. The query optimizer requires that you add a dummy predicate WHERE IsActive IN (0, 1) to still be able to seek on such indexes. Of course, this will now also touch the cold data, so you probably need to know the concrete value of IsActive, or try the value 1 first and be sure that it matches 99% of the time.
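A hedged T-SQL sketch of what that looks like, with made-up table and column names:
-- leading IsActive column on the index
CREATE NONCLUSTERED INDEX IX_Orders_IsActive_CustomerId
    ON dbo.Orders (IsActive, CustomerId)
    INCLUDE (OrderDate);

DECLARE @CustomerId int = 42;

-- the dummy IN (0, 1) predicate lets the optimizer seek on the index
-- even when the flag's value is not known up front
SELECT OrderId, OrderDate
FROM dbo.Orders
WHERE IsActive IN (0, 1)
  AND CustomerId = @CustomerId;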
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do this is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway, but again you don't want it to probe cold data. Even if it does not find anything, such a probe pulls pages into memory, making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you care about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before thinking about a new HistoricalOrders table, just create a column named IsActive and test it with your data. You don't need to make it an index column, because indexes eat up storage and slow down writes and updates, so we must be very careful when we create an index. When you query the data, do it as shown below. In the query below, the data selection (filter) is done on the SQL Server side (IQueryable), so it is very fast.
Note: Use AsNoTracking() too. It will boost the performance as well.
var activeOrders = _context.Set<Orders>()
    .Where(o => o.IsActive == true)
    .AsNoTracking()
    .ToList();
Reference: AsNoTracking()

anything wrong about having MANY sequences in postgres?

I am developing an application using a virtual private database pattern in postgres.
So every user gets an id, and all rows of this user will hold this id to separate them from others. This id should also be part of the primary key. In addition, every row has to have an id which is unique in the scope of the user. This id will be the other part of the primary key.
If we have to scale this across multiple servers we can also append a third column to the pk identifying the shard this id was generated at.
My question now is how to create per-user unique ids. I came up with some options, but I am not sure about all the implications. The 2 solutions that seem most promising to me are:
creating one sequence per user:
this can be done automatically, using a trigger, every time a user is created. This is for sure transaction safe and I think it should be quite ok in terms of performance.
What I am worried about is that this has to work for a lot of users (100k+) and I don't know how postgres will deal with 100k+ sequences. I tried to find out how sequences are implemented but without luck.
counter in user table:
keep all users in a table with a field holding the latest id given for this user.
when a user starts a transaction I can lock the row in the user table and create a temp sequence with the latest id from the user table as a starting value. This sequence can then be used to supply ids for new entries.
before exiting the transaction, the current value has to be written back to the user table and the lock has to be released.
If another transaction from the same user tries to concurrently insert rows, it will stall until the first transaction releases its lock on the user table.
This way I do not need thousands of sequences, and I don't think there will be frequent concurrent accesses from one user (the application has OLTP character, so there will not be long-lasting transactions), and even if this happens it will just stall for about a second and not hurt anything.
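A rough SQL sketch of option 2, assuming a users table with a last_id counter column (all names are made up):
BEGIN;

-- lock the user's row; concurrent transactions of the same user wait here
SELECT last_id FROM users WHERE user_id = 42 FOR UPDATE;

-- temp sequence starting just above the stored counter
-- (the start value would normally be computed from the SELECT above)
CREATE TEMP SEQUENCE user_42_seq START WITH 1001;

-- use the temp sequence to number this user's new rows
INSERT INTO entries (user_id, entry_id, payload)
VALUES (42, nextval('user_42_seq'), 'first'),
       (42, nextval('user_42_seq'), 'second');

-- write the counter back; the lock is released at commit
UPDATE users SET last_id = currval('user_42_seq') WHERE user_id = 42;
COMMIT;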
The second part of my question is whether I should just use 2 columns (or maybe three if the shard_id joins the game) and make them a composite pk, or whether I should put them together in one column. I think handling will be way easier having them in separate columns, but what does performance look like? Let's assume both values are 32-bit integers: is it better to have 2 int columns in an index or 1 bigint column?
Thanks for all answers,
Alex
I do not think sequences would be scalable to the level you want (100k sequences). A sequence is implemented as a relation with just one row in it.
Each sequence will appear in the system catalog (pg_class), which also contains all of the tables, views, etc. Having 100k rows there is sure to slow the system down dramatically. The amount of memory required to hold all of the data structures associated with these sequence relations would also be large.
Your second idea, combined with temporary sequences, might be more practical and more scalable.
For your second question, I don't think a composite key would be any worse than a single column key, so I would go with whatever matches your functional needs.

postgresql immutable read workload tuning

I have a table where the non-primary key columns are deterministic given the primary key.
I think this could be pretty common, for example a table representing memoization/caching of an expensive function, or where the primary key is a hash of the other columns.
Further assume that the workload is mostly reads of 1-100 individual rows, and that writes can be batched or "async" based on what gives the best performance.
What are interesting tuning options on the table/database in this case?
This would be an ideal candidate for index-only scans in versions 9.2 or up: create an index on all the primary key columns plus the frequently queried other columns, and vacuum the table aggressively (i.e. manually after every batch update), because the default autovacuum settings are not aggressive enough to get the maximal benefit from index-only scans.
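For example (a minimal sketch with made-up names; the index simply covers the key plus the frequently read column):
-- made-up memoization table: the key is a hash of the inputs
CREATE TABLE memo (
    arg_hash bytea PRIMARY KEY,
    result   text  NOT NULL
);

-- index covering the key and the frequently queried column,
-- so lookups can be answered by an index-only scan
CREATE INDEX memo_covering_idx ON memo (arg_hash, result);

-- vacuum manually after each batch load so the visibility map is set
-- and index-only scans actually avoid heap fetches
VACUUM (ANALYZE) memo;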

Future proof Primary Key design in postgresql

I've always used either auto-generated keys or sequences in the past for my primary keys. With the current system I'm working on, there is the possibility of having to eventually partition the data, which has never been a requirement in the past. Knowing that I may need to partition the data in the future, is there any advantage to using UUIDs for PKs instead of the database's built-in sequences? If so, is there a design pattern that can safely generate relatively short keys (say 6 characters instead of the usual long one, e6709870-5cbc-11df-a08a-0800200c9a66)? 36^6 keys per table is more than sufficient for any table I could imagine.
I will be using the keys in URLs so conciseness is important.
There is no pattern to reduce a 128-bit UUID to 6 characters, since information would get lost.
Almost all databases implement a surrogate key strategy called incremental keys.
Postgres and Informix have serials, MySQL has auto_increment, and Oracle offers sequence generators. In your case I think it would be safe to use integer IDs.
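A minimal sketch of such an incremental key in Postgres (the table name is made up):
CREATE TABLE items (
    item_id serial PRIMARY KEY,  -- backed by an implicitly created sequence
    name    text NOT NULL
);

-- the id is generated on insert
INSERT INTO items (name) VALUES ('example') RETURNING item_id;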
See this article: Choosing a Primary Key: Natural or Surrogate? for a discussion of available techniques.
I'm not sure what type of partitioning you are planning (this?), but I don't see why you would change the primary key design. Even if the old partitioned tables are "alive" (i.e., you might insert rows into any partitioned table), there is no problem in sharing the sequence among several tables.
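A hedged sketch of sharing one sequence among several partition tables (all names are made up):
CREATE SEQUENCE order_id_seq;

-- two partition tables drawing ids from the same sequence,
-- so the keys stay unique across both
CREATE TABLE orders_2023 (
    order_id  bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
    placed_at date   NOT NULL
);

CREATE TABLE orders_2024 (
    order_id  bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
    placed_at date   NOT NULL
);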