How to implement a high-performing non-incremental ID in PostgreSQL? [duplicate]

I would like to replace some of the sequences I use for IDs in my PostgreSQL db with my own custom-made ID generator. The generator would produce a random number with a check digit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value matching [1-9][0-9]{9}, where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic
How do I avoid returning the same id twice? (This would be caught by trying to insert it into a column with a unique constraint, but by then it's too late, I think.)
Is this a good idea at all?
Note1: I do not want to use a UUID since it is to be communicated with customers, and 10 digits are far simpler to communicate than a 36-character UUID.
Note2: The function would rarely be called with SELECT get_new_rand_id(); it would instead be assigned as the default value on the id column, in place of nextval().
EDIT: Ok, good discussion below! Here is some explanation of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (generated serial number in the db). Since I communicate that number with the customer it is a fairly simple task for my competitors to monitor my business (there are other numbers such as invoice nr and order nr that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible but harder).
Why the check digit?
Before there was any talk of hiding the serial nr, I added a check digit to the order nr since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here.
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?

For generating unique and random-looking identifiers from a serial, using a cipher might be a good idea. Since a cipher's output is bijective (there is a one-to-one mapping between input and output values), you will not have any collisions, unlike with hashes. This also means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
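For reference, the function on that wiki page is (roughly) the following small Feistel network on 32-bit integers; double-check the wiki for the exact current version before using it:
CREATE OR REPLACE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
DECLARE
  l1 int; l2 int; r1 int; r2 int; i int := 0;
BEGIN
  l1 := (value >> 16) & 65535;
  r1 := value & 65535;
  WHILE i < 3 LOOP
    l2 := r1;
    -- '#' is integer XOR in PostgreSQL
    r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
    l1 := l2;
    r1 := r2;
    i := i + 1;
  END LOOP;
  RETURN ((r1 << 16) + l1);
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;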
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
foo_id
------------
1241588087
1500453386
1755259484
(3 rows)

I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key - not the primary key - that is visible to the user. That key could use the primary key as the seed for the hash function you describe and would be the one you use to do lookups. That key would be generated by a trigger on insert (which is much simpler than trying to ensure atomicity of the operation yourself).
That is the key that you share with your clients, never the PK. I know there is debate (albeit I can't understand why) over whether PKs should be invisible to user applications or not. Modern database design practice, and my personal experience, all seem to suggest that PKs should NOT be visible to users. Users tend to attach meaning to them and, over time, that is a very bad thing - regardless of whether the key has a check digit or not.
Your joins will still be done using the PK. This other generated key is just supposed to be used for client lookups. They are the face, the PK is the guts.
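A minimal sketch of that idea (all names here are made up; the transformation happens to be pseudo_encrypt from the other answer, but any hash or cipher would do):
CREATE TABLE customers (
    id        serial PRIMARY KEY,  -- internal key, used for joins
    public_id text UNIQUE,         -- the key you show to clients
    name      text NOT NULL
);

CREATE OR REPLACE FUNCTION set_public_id() RETURNS trigger AS $$
BEGIN
    -- The serial default has already been applied, so NEW.id is available here.
    -- pseudo_encrypt is bijective, so the generated keys cannot collide.
    NEW.public_id := lpad(pseudo_encrypt(NEW.id)::text, 10, '0');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- A BEFORE trigger lets us fill in the column on the same row being inserted.
CREATE TRIGGER customers_set_public_id
    BEFORE INSERT ON customers
    FOR EACH ROW EXECUTE PROCEDURE set_public_id();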
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.

I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use a PostgreSQL SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2, and use that for the column instead? Plus it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to look up clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
Also see here

How you generate the random and unique ids is a useful question - but you seem to be making a counter-productive assumption about when to generate them!
My point is that you do not need to generate these id's at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random id's for future use, that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementally 1,2,3 etc forever. The user does not need to see this internal id.
Then I have another table - random_ids with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
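For what it's worth, the nightly routine could look roughly like this (all table, column and range choices here are assumptions, not the poster's actual code):
CREATE TABLE IF NOT EXISTS random_ids (
    order_id  bigint PRIMARY KEY,   -- the future order this random id is reserved for
    random_id bigint NOT NULL UNIQUE
);

DO $$
DECLARE
    next_order_id bigint;
    i int;
BEGIN
    SELECT coalesce(max(order_id), 0) INTO next_order_id FROM random_ids;
    FOR i IN 1..10000 LOOP          -- pre-generate 10000 ids per run
        next_order_id := next_order_id + 1;
        LOOP
            BEGIN
                INSERT INTO random_ids (order_id, random_id)
                VALUES (next_order_id,
                        (floor(random() * 9000000000) + 1000000000)::bigint);
                EXIT;               -- success, move on to the next order_id
            EXCEPTION WHEN unique_violation THEN
                NULL;               -- collision with an existing random_id: try another value
            END;
        END LOOP;
    END LOOP;
END $$;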

Your best bet would probably be some form of hash function, and then a checksum added to the end.

If you're not using this too often (you do not have a new customer every second, do you?) then it is feasible to just get a random number and try to insert the record. Just be prepared to retry the insert with another number when it fails with a unique constraint violation.
I'd use numbers 100000 to 999999 (900,000 possible numbers of the same length) and a check digit computed with the UPC or ISBN-10 algorithm. Two check digits would be better though, as they'll eliminate 99% of human errors instead of 90%.
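If you go this route, a check-digit helper could look something like the sketch below (the function name and the choice of a UPC-style weighting are just illustrations, not a recommendation of a particular standard):
CREATE OR REPLACE FUNCTION append_check_digit(n int) RETURNS int AS $$
DECLARE
    digits text := n::text;
    total  int  := 0;
    i      int;
BEGIN
    -- UPC-style weighting: from the rightmost digit, weights 3, 1, 3, 1, ...
    FOR i IN 1..length(digits) LOOP
        IF i % 2 = 1 THEN
            total := total + 3 * substr(digits, length(digits) - i + 1, 1)::int;
        ELSE
            total := total + substr(digits, length(digits) - i + 1, 1)::int;
        END IF;
    END LOOP;
    -- Append the check digit as the last digit of the result.
    RETURN n * 10 + (10 - total % 10) % 10;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Example: append_check_digit(123456) returns 1234565 (the trailing 5 is the checksum).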

Related

Postgres case sensitivity : Implications of Lower()

I have run into the issue of case sensitive searching in postgres, and have started to deal with this by using LOWER on each side of every WHERE test.
So far so good. However, I understand that in order to make use of indexes, they should be created using LOWER too, which makes sense.
However, what of the PK? Presumably indexes on the PK are not going to be effective here, because it does not seem possible to create a PK using a function on the chosen PK field. Isn't this a concern for any filtering or joining that is done on PKs?
Is there a way of working around this?
Here are some thoughts on this subject.
First, you could add a constraint for any column requiring that the data stored be lower case. That would solve the problem inside the database.
Second, you could use a trigger to convert any value to lower case.
Third, you can use ilike. This can make use of indexes for case-insensitive searches.
And fourth, if all your primary keys are synthetic numeric primary keys, then you don't need to worry about case sensitivity.
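Rough sketches of the first three ideas (the table and column names are invented for illustration):
-- (1) A constraint that rejects mixed-case data outright:
CREATE TABLE accounts (
    username text PRIMARY KEY,
    CHECK (username = lower(username))
);

-- (2) A trigger that normalizes values to lower case instead of rejecting them:
CREATE OR REPLACE FUNCTION lowercase_username() RETURNS trigger AS $$
BEGIN
    NEW.username := lower(NEW.username);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_lowercase
    BEFORE INSERT OR UPDATE ON accounts
    FOR EACH ROW EXECUTE PROCEDURE lowercase_username();

-- (3) ILIKE; with a trigram index it can avoid a sequential scan:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX accounts_username_trgm ON accounts USING gin (username gin_trgm_ops);
SELECT * FROM accounts WHERE username ILIKE 'bob';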
You can still create a functional index on the PK columns (even when the PK consists of many columns):
CREATE TABLE test(a text, b text, c text, d text, primary key (a,b,c));
CREATE INDEX ON test (lower(a), lower(b), lower(c));
That said, if you are experiencing this kind of behaviour almost everywhere in your database, it sounds like some data cleanup is in order (e.g. storing everything in lower case).

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A,B and C, who share a certain number of attributes (about 10-15). I created a table called ENTITIES, and a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30 of them).
I'm unsure what is the best approach to follow in modelling the tables:
1. Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
2. Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
3. Use an hstore column; each entity will store its unique flags in this column.
4. ???
I'm inclined to use option 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different attributes or a changing set of attributes, especially since you can even have functional indexes on attributes inside the column (a small sketch follows the links below):
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
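A minimal sketch of option 3 with jsonb (all names are made up): shared attributes stay as ordinary columns, the type-specific boolean flags live in one jsonb document, and the flags remain indexable:
CREATE TABLE entities (
    entity_id   bigserial PRIMARY KEY,
    entity_type text NOT NULL,            -- 'A', 'B' or 'C'
    name        text NOT NULL,            -- ...other shared attributes...
    flags       jsonb NOT NULL DEFAULT '{}'
);

-- Expression index on one specific flag:
CREATE INDEX ON entities (((flags ->> 'is_archived')::boolean));

-- Or a GIN index that supports containment queries on any flag:
CREATE INDEX ON entities USING gin (flags);

-- Find all entities of type B with the (hypothetical) flag set:
SELECT * FROM entities
WHERE entity_type = 'B'
  AND flags @> '{"is_archived": true}';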
(2) EAV storage
EAV (entity-attribute-value) storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_type to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600, and that limit applies even when most of the columns are NULL. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, though it's a bit more complex than that) - only around 10 - 20 bytes extra per row. Contrary to what one might assume (!), the table will most probably be much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
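A sketch of what that could look like (column names invented; only a couple of per-type attributes shown):
CREATE TABLE entities (
    entity_id   bigserial PRIMARY KEY,
    entity_type char(1) NOT NULL CHECK (entity_type IN ('A', 'B', 'C')),
    name        text NOT NULL,   -- ...other shared attributes...
    a_flag1     boolean,         -- attributes only type A uses
    a_flag2     boolean,
    b_flag1     boolean,         -- attributes only type B uses
    c_flag1     boolean          -- attributes only type C uses
);

-- One view per entity type, exposing only the relevant columns:
CREATE VIEW entities_a AS
SELECT entity_id, name, a_flag1, a_flag2
FROM entities
WHERE entity_type = 'A';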
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly-changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an hstore or jsonb bucket for the rest.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A MySQL example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
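For instance, assuming PostgreSQL (the syntax is nearly identical in SQLite and MySQL), a composite index that also covers the ORDER BY might look like this:
CREATE INDEX book_content_lookup_idx
    ON book_content (book_id, chapter, verse);

-- Verify that the planner actually uses it:
EXPLAIN ANALYZE
SELECT * FROM book_content
WHERE book_id = 1 AND chapter = 3
ORDER BY verse ASC;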
When you declare a column as a primary key, an index is automatically created on that column. In your case, since the table is used mostly for SELECTs, an index is ideal: indexes improve the speed of selection queries and degrade the speed of insertion. So you can create the indexes you think you need without worrying too much about performance.
Indexes are a very sensitive subject. If you consider using them, you need to be careful how many you create. The primary key, or id, of each table should have a clustered index. All the rest depends on how you plan to use them. I'm fuzzy on the subject of indexes and have never actually worked with them, but from a seminar I watched just yesterday: you don't want too many indexes, because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designed for a particular query somewhere in your software. When one query runs, it uses that one index and doesn't need the other four, so those are unneeded weight on this one query. If you add an index, be sure that it is one which will be useful in many places, not just one.

How to encrypt a string with standard PostgreSQL?

I'm working with PostgreSQL.
I need to transform "http://www.xyz.com/some_uri/index1.html" into something like "scdfdsffd" (some unique key, based on the URL, which is a unique key in the table).
In other words: the URL is a unique key in the table, but I need to generate a small unique key based on the URL.
What can I do with standard PostgreSQL 8.4?
Best Regards,
Several methods:
a) Why not use an auto-incrementing column or sequence generator to generate unique integers per insert? If you have less than 100 million URLs, your identifiers are short and easy to remember. However, if that's not an option (e.g. because you don't want people guessing IDs and attacking the database that way):
b) The built-in MD5() function may help:
INSERT INTO table (pkey, url) VALUES (MD5('http://...'), 'http://...');
MD5() is a hash function and will most likely give you a unique identifier per URL. I say "most likely" because you get a 128-bit hash from MD5, and the likelihood of two different URLs colliding is on the order of 2^-128 (about 10^-39).
If you need smaller identifiers you can chop the result from MD5 down to a smaller number of characters, but you could potentially significantly increase the chance of a hash collision depending on which characters you take.
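For example, a truncated-MD5 version might look like the following (the column names are made up; the UNIQUE constraint is what catches the now more likely collisions):
CREATE TABLE urls (
    short_key char(10) PRIMARY KEY,
    url       text NOT NULL UNIQUE
);

INSERT INTO urls (short_key, url)
VALUES (substr(md5('http://www.xyz.com/some_uri/index1.html'), 1, 10),
        'http://www.xyz.com/some_uri/index1.html');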
[Note: timestamp answer redacted since it in no way solves the original problem. -BobG]

Sequence field, tracking individual sequences for primary key/sequence pair

I have a table of user-uploaded objects. Each user can have an arbitrary number of objects. I want each object to have a sequential identifier, like so:
USERNAME    OBJECTNAME  OBJID
Kerin       cat         1
Kerin       dog         2
Narcolepsy  pie_tins    1
Kerin       mouse       3
I'd like for OBJID to be a sequence, but tracking the sequence number individually per USERNAME field. I can sort of accomplish this by first querying the DB and SELECTing the highest OBJID, then incrementing that value by one and using it in my INSERT. That's probably fine because it'd be difficult for a user to run two uploads at once, but the query overhead and the feeling that I'm doing it wrong make me want to find a better way.
If you don't need them to be sequential then you could probably get away with adding a PK of type serial (or bigserial) to the table. The numbers would still be unique per-username but it would be dead simple to implement and you wouldn't have the ugliness of UUIDs.
You could create one sequence per username through manual CREATE SEQUENCE calls. Then, you could add a BEFORE INSERT trigger to set the objid by figuring out which sequence to use and then calling nextval on it. If your usernames are limited to the usual /[a-z][a-z0-9]*/ pattern, then you could build the sequence names as something like "seq_objid_username" and the trigger would be able to figure out which sequence to use quite easily; the per-username sequences could be created by an INSERT trigger on your user table. This approach will work and it will be safe because it relies on PostgreSQL's existing transaction-safe sequence system.
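A rough sketch of that approach (the users/objects table and column names are assumptions):
-- Create the per-user sequence when the user is registered:
CREATE OR REPLACE FUNCTION create_user_objid_seq() RETURNS trigger AS $$
BEGIN
    EXECUTE 'CREATE SEQUENCE ' || quote_ident('seq_objid_' || NEW.username);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_create_objid_seq
    AFTER INSERT ON users
    FOR EACH ROW EXECUTE PROCEDURE create_user_objid_seq();

-- Assign the next per-user objid on every object insert:
CREATE OR REPLACE FUNCTION set_objid() RETURNS trigger AS $$
BEGIN
    NEW.objid := nextval(quote_ident('seq_objid_' || NEW.username));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER objects_set_objid
    BEFORE INSERT ON objects
    FOR EACH ROW EXECUTE PROCEDURE set_objid();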