Postgres case sensitivity: implications of LOWER()

I have run into the issue of case-sensitive searching in Postgres, and have started to deal with it by using LOWER() on each side of every WHERE test.
So far so good. However, I understand that in order to make use of indexes, they should be created using LOWER() too, which makes sense.
However, what about the PK? Presumably PK lookups are not going to benefit, because it does not seem possible to create a PK using a function on the chosen PK field. Isn't this a concern for any filtering or joining done on PKs?
Is there a way of working around this?

Here are some thoughts on this subject.
First, you could add a constraint on any such column requiring that the data stored be lower case (a sketch follows after these options). That would solve the problem inside the database.
Second, you could use a trigger to convert any value to lower case.
Third, you can use ilike. This can make use of indexes for case-insensitive searches.
And fourth, if all your primary keys are synthetic numeric primary keys, then you don't need to worry about case sensitivity.
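A minimal sketch of the first option plus an ilike lookup, using invented table and column names:
CREATE TABLE customers (
    customer_code text PRIMARY KEY,
    CONSTRAINT customers_code_lower_ck
        CHECK (customer_code = lower(customer_code))
);

-- expression index so case-insensitive equality tests can use an index
CREATE INDEX customers_code_lower_idx ON customers (lower(customer_code));

-- lookup that can use the expression index
SELECT * FROM customers WHERE lower(customer_code) = lower('ABC123');

-- ILIKE does a case-insensitive pattern match (an index only helps here
-- with extra support such as the pg_trgm extension)
SELECT * FROM customers WHERE customer_code ILIKE 'abc123';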

You can still create a functional index on the PK columns (even for a multi-column PK):
CREATE TABLE test(a text, b text, c text, d text, primary key (a,b,c));
CREATE INDEX ON test (lower(a), lower(b), lower(c));
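A lookup then needs to apply lower() on both sides so the planner can use that index, for example:
SELECT * FROM test WHERE lower(a) = lower('Foo') AND lower(b) = lower('Bar');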
That said, if you are running into this behaviour almost everywhere in your database, it may be worth doing some data clean-up as well (e.g. storing everything in lower case).

Related

How to implement a high-performing non-incremental ID in PostgreSQL? [duplicate]

I would like to replace some of the sequences I use for ids in my PostgreSQL db with my own custom-made id generator. The generator would produce a random number with a check digit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value such as: [1-9][0-9]{9} where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic
How do I avoid returning the same id twice (this would be caught by trying to insert it into a column with a unique constraint, but then it's too late, I think)
Is this a good idea at all?
Note 1: I do not want to use a UUID since it is to be communicated to customers, and 10 digits is far simpler to communicate than a 36-character UUID.
Note 2: The function would rarely be called directly with SELECT get_new_rand_id(); it would instead be assigned as the default value on the id column, in place of nextval().
EDIT: OK, good discussion below! Here is some explanation of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (a generated serial number in the db). Since I communicate that number to the customer, it is a fairly simple task for my competitors to monitor my business (there are other numbers, such as invoice nr and order nr, that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible, but harder).
Why the check digit?
Before there was any talk of hiding the serial nr, I added a check digit to the order nr since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here.
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?
For generating unique and random-looking identifiers from a serial, using ciphers might be a good idea. Since their output is bijective (there is a one-to-one mapping between input and output values), you will not get any collisions, unlike with hashes. That also means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
foo_id
------------
1241588087
1500453386
1755259484
(3 rows)
I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key - not the primary key - that is visible to the user. That key could use the primary key as the seed for the hash function you describe and be the one that you use to do lookups. That key would be generated by a trigger after insert (which is much simpler than trying to ensure atomicity of the operation).
That is the key that you share with your clients, never the PK. I know there is debate (albeit, I can't understand why) if PKs are to be invisible to the user applications or not. The modern database design practices, and my personal experience, all seem to suggest that PKs should NOT be visible to users. They tend to attach meaning to them and, over time, that is a very bad thing - regardless if they have a check digit in the key or not.
Your joins will still be done using the PK. This other generated key is just supposed to be used for client lookups. They are the face, the PK is the guts.
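A minimal sketch of that trigger, assuming a customers table with a serial PK named id and a public_id column with a unique constraint (those names, and reusing pseudo_encrypt as the hash, are my assumptions, not part of the original answer):
CREATE OR REPLACE FUNCTION set_public_id() RETURNS trigger AS $$
BEGIN
    -- derive the exposed key from the PK; pseudo_encrypt is the wiki
    -- function mentioned above, but any hash-like function would do
    UPDATE customers
       SET public_id = pseudo_encrypt(NEW.id::int)
     WHERE id = NEW.id;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_set_public_id
AFTER INSERT ON customers
FOR EACH ROW EXECUTE PROCEDURE set_public_id();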
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.
I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use a postgresql SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2? Use that on the column instead. Plus it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to lookup clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
Also see here
How you generate the random and unique ids is a useful question - but you seem to be making a counterproductive assumption about when to generate them!
My point is that you do not need to generate these ids at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random ids for future use; that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementally 1,2,3 etc forever. The user does not need to see this internal id.
Then I have another table - random_ids with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
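A rough sketch of such a nightly routine, with assumed table and column names and an assumed unique constraint on random_id (the answer itself doesn't show code):
DO $$
DECLARE
    last_id bigint;
BEGIN
    -- continue numbering from the highest order_id already pre-generated
    SELECT coalesce(max(order_id), 0) INTO last_id FROM random_ids;

    FOR i IN 1 .. 10000 LOOP   -- enough ids to cover the next day's orders
        LOOP
            BEGIN
                INSERT INTO random_ids (order_id, random_id)
                VALUES (last_id + i,
                        (floor(random() * 9000000) + 1000000)::bigint);
                EXIT;   -- success, move on to the next order_id
            EXCEPTION WHEN unique_violation THEN
                NULL;   -- duplicate random_id: loop and try another one
            END;
        END LOOP;
    END LOOP;
END
$$;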
Your best bet would probably be some form of hash function, and then a checksum added to the end.
If you're not using this too often (you don't have a new customer every second, do you?) then it is feasible to just get a random number and then try to insert the record. Just be prepared to retry the insert with another number when it fails with a unique constraint violation.
I'd use numbers 100000 to 999999 (900,000 possible numbers of the same length) and a check digit using the UPC or ISBN-10 algorithm. Two check digits would be better though, as they'll eliminate 99% of human errors instead of 90%.
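A sketch of what such a generator might look like - illustrative only, not the poster's code: a 6-digit random base with one UPC-style check digit appended, and collisions handled by retrying the INSERT on a unique-constraint violation:
CREATE OR REPLACE FUNCTION get_new_rand_id() RETURNS int AS $$
DECLARE
    base int := floor(random() * 900000)::int + 100000;  -- 100000 .. 999999
    s    int := 0;
BEGIN
    -- UPC-style check digit: weight the digits 3,1,3,1,3,1 and take mod 10
    FOR i IN 1 .. 6 LOOP
        s := s + substr(base::text, i, 1)::int
                 * CASE WHEN i % 2 = 1 THEN 3 ELSE 1 END;
    END LOOP;
    RETURN base * 10 + (10 - s % 10) % 10;  -- append the check digit
END;
$$ LANGUAGE plpgsql;

-- usage: INSERT with this as the id, and retry on a unique_violation error
-- INSERT INTO customers (id, name) VALUES (get_new_rand_id(), 'Acme');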

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A, B and C, which share a certain number of attributes (about 10-15). I created a table called ENTITIES, with a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30 each).
I'm unsure what is the best approach to follow in modelling the tables:
Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
Use an hstore column, each entity will store its unique flags in this column.
???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define common foreign key constraints across all inheriting tables. Read the last chapter about caveats carefully.
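A minimal sketch of this approach, with invented names (common attributes live in the parent, each type adds its own flags):
CREATE TABLE entities (
    entity_id serial PRIMARY KEY,
    name      text
);

-- note: the PRIMARY KEY is not inherited by entities_a,
-- which is one of the caveats mentioned above
CREATE TABLE entities_a (
    flag_x boolean,
    flag_y boolean
) INHERITS (entities);

-- a query on the parent also sees the rows of all children
SELECT entity_id, name FROM entities;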
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different attributes, or for a changing set of attributes, especially since you can even have functional indexes on attributes inside the column:
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
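For illustration, a small sketch of the hstore variant with invented names - one expression index on a single key plus a GIN index on the whole column:
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE entities (
    entity_id serial PRIMARY KEY,
    name      text,
    attrs     hstore         -- the per-type boolean flags go in here
);

-- index one key pulled out of the hstore column ...
CREATE INDEX entities_flag_x_idx ON entities ((attrs -> 'flag_x'));

-- ... or index the whole column for containment queries
CREATE INDEX entities_attrs_idx ON entities USING gin (attrs);

SELECT * FROM entities WHERE attrs @> 'flag_x => true';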
EAV type of storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600. With most of the columns being NULL, this upper limit applies. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, but it's more complex than that.). That's only like 10 - 20 bytes extra per row. Contrary to what one might assume (!), most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
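A sketch of that reverse approach, again with invented column names - one wide table plus a thin view per entity type:
CREATE TABLE entities (
    entity_id   serial PRIMARY KEY,
    entity_type text NOT NULL,     -- 'A', 'B' or 'C'
    name        text,
    flag_a1     boolean,
    flag_a2     boolean,
    flag_b1     boolean
    -- ... more per-type flags, NULL where they don't apply
);

CREATE VIEW entities_a AS
SELECT entity_id, name, flag_a1, flag_a2
FROM   entities
WHERE  entity_type = 'A';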
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an hstore or jsonb bucket.

Postgresql: auto lowercase text while (or before) inserting to a column

I want to achieve case-insensitive uniqueness in a varchar column. But there is no case-insensitive text data type in Postgres. Since the original case of the text is not important, it seems like a good idea to convert everything to lowercase/uppercase before inserting into a column with a UNIQUE constraint. It will also require an INDEX for quick searches.
Is there any way in Postgres to manipulate data before insertion?
I looked at this other question: How to automatically convert a MySQL column to lowercase.
It suggests using triggers on insert/update to lowercase text or to use views with lowercased text. But, none of the suggested methods ensure uniqueness.
Also, since this data will be read/written by various applications, lowercasing data in every individual application is not a good idea.
ALTER TABLE your_table
ADD CONSTRAINT your_table_the_column_lowercase_ck
CHECK (the_column = lower(the_column));
From the manual:
The use of indexes to enforce unique constraints could be considered an implementation detail that should not be accessed directly.
You don't need a case-insensitive data type (although there is one)
CREATE UNIQUE INDEX idx_lower_unique
ON your_table (lower(the_column));
That way you don't even have to mess around with the original data.
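If you do want the stored data itself normalized before insertion (as the question asks), a BEFORE trigger along these lines would also work - the function and trigger names here are invented for illustration:
CREATE OR REPLACE FUNCTION lowercase_the_column() RETURNS trigger AS $$
BEGIN
    NEW.the_column := lower(NEW.the_column);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER your_table_lowercase
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW EXECUTE PROCEDURE lowercase_the_column();
For completeness, the case-insensitive type alluded to above is citext (an extension): a citext column compares case-insensitively, so a plain UNIQUE constraint on it is enough.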

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith to UTF8Type, the keys will be interpreted as UTF8 strings. If you set CompareWith to TimeUUIDType, the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API - it is a good place to start. You might also find this article useful: http://www.sodeso.nl/?p=80 - in the third part or so he talks about slicing ranges in his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys, however, since even the randomization of the default RandomPartitioner is limited and you could end up clustering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know :
A slice range is a range of columns in a row, with a start value and an end value; this is used mostly for wide rows, as columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a WHERE clause uses secondary indexes; you may use inequality operators there, but there must be at least ONE equality clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't preserve lexical ordering.
What you want to use is a Column Family based index using a Wide Row :
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key (a "shard key") that splits the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem; it's how it is designed. So what you must ask yourself is "what do I need to query?" and then design a Column Family for it, rather than trying to fit everything into one CF like you would in an RDBMS.

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = (case when old_content_type = 'a' then 1
                         when old_content_type = 'b' then 2
                         else 3 end);
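Putting the whole migration together, one possible sequence (assuming a mapping of 'a' to 1, 'b' to 2, and so on, which you would adjust to your real type names) is:
ALTER TABLE table_name RENAME COLUMN content_type TO old_content_type;
ALTER TABLE table_name ADD COLUMN content_type integer;

UPDATE table_name
SET    content_type = CASE WHEN old_content_type = 'a' THEN 1
                           WHEN old_content_type = 'b' THEN 2
                           ELSE 3 END;

-- once the new values have been verified:
ALTER TABLE table_name DROP COLUMN old_content_type;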
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
Ideally you'd have these fields referring to a table containing the definitions of type. This should be via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious approach - but it breaks down when you have a table that requires many attributes (a sketch follows after this list).
You can use the Entity-attribute-value model, but beware that this is too easy to abuse and cause problems when things grow.
You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 & 2.
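For option 1, a minimal sketch of such a definitions table with a foreign key, with invented names:
CREATE TABLE content_types (
    content_type_id integer PRIMARY KEY,
    display_name    text NOT NULL
);

CREATE TABLE items (
    item_id         serial PRIMARY KEY,
    content_type_id integer NOT NULL REFERENCES content_types
);

-- the FK guarantees every row refers to a defined type, and the display
-- names can be changed in one place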