PostgreSQL - case-insensitive build to allow all WHEREs, JOINs, GROUP BYs etc. to be case insensitive

I've had this thought brewing for some time but I can't find anyone online who's discussed this as a possibility.
Currently the recommendations available for making case-insensitive searches seem to be either to use "ilike" or "citext".
We're moving away from Microsoft SQL Server to PostgreSQL, and all our code assumes case-insensitive comparisons - but our T-SQL code base is huge, so changing it all to use UPPER() or ilike or citext etc. isn't really feasible as a commercial development project.
However, it must be possible to grab the source of PostgreSQL and change some of the C code so that all string comparisons are case-insensitive, and then make our own build of the whole product. I think it would possibly require only a few lines of code to be changed, so upgradeability might not be a huge issue.
So I'm wondering whether anyone on here knows the PostgreSQL code base well enough to kick around ideas about whether this is feasible, and whereabouts in the code the comparisons are done, just to help us get started. I'm continuing to research this in the meantime, starting with just being able to build PostgreSQL on Windows, but the hope is to bring others on board so that a community project could be started; as well as case insensitivity there might be other tweaks to allow T-SQL code to work better, easing migration projects. My company would contribute strongly.
Sorry if this is off topic, but it seems to lean strongly towards being a developer question, and I'm sure many other Postgres users would appreciate a case-insensitive build in this day and age. Thanks.

I understand your sentiment, but I believe you are wrong to assume that this would be a simple change. Otherwise PostgreSQL would probably already have case-insensitive collations...
I'd say that your best bet is to use citext throughout. What is the problem you have with that?
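For illustration, a minimal sketch of the citext approach (the table and column names are made up for the example):

CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE customer (name citext);

INSERT INTO customer (name) VALUES ('Alice');

-- citext compares case-insensitively in WHEREs, JOINs, GROUP BYs, etc.
SELECT * FROM customer WHERE name = 'ALICE';  -- matches 'Alice'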
You should take this to the hackers list to start a serious discussion, but make sure you read the archives first, because the problem is not a new one.

Related

Unit tests producing different results when using PostgreSQL

I've been working on a module that works pretty well when using MySQL, but when I try to run the unit tests I get an error when testing under PostgreSQL (using Travis).
The module itself is here: https://github.com/silvercommerce/taxable-currency
An example failed build is here: https://travis-ci.org/silvercommerce/taxable-currency/jobs/546838724
I don't have a huge amount of experience with PostgreSQL, and I am not really sure why this might be happening. The only thing I could think of that might cause this is that I am manually setting the IDs in my fixtures file, and maybe PostgreSQL does not support this?
If this is not the case, does anyone have an idea what might be causing this issue?
Edit: I have looked into this again and the errors appear to be because of this assertion, which should find the Tax Rate vat but instead finds the Tax Rate reduced.
I am guessing there is an issue in my logic that is causing the incorrect rate to be returned, though I am unsure why...
In the end it appears that Postgres has different default sorting to MySQL (https://www.postgresql.org/docs/9.1/queries-order.html). The line of interest is:
The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on
In the end I didn't really need to test a list with multiple items, so instead I just removed the additional items.
If you are working on something that needs to support both MySQL and Postgres, though, you might need to consider defining a consistent sort order as part of your query, as in the sketch below.
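For example (the table and column names here are hypothetical):

-- Without ORDER BY, row order depends on the plan and on-disk layout:
SELECT name FROM tax_rate;

-- An explicit ORDER BY makes the result deterministic on both engines:
SELECT name FROM tax_rate ORDER BY name;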

Wanting to obfuscate data in database incrementally

I am looking to obfuscate data in a Postgres database that is quite large, and would like to be able to do it incrementally. What I was thinking is that I could roll the characters of names forward or something like that, but I would need a way to tell whether it has been applied to that "name" already. Any ideas on this? If it could be done this way, i.e. with an is_changed() check, it would be easy to replay on the difference each day.
I pretty much want to find all first/last names, mobile numbers and emails in the db and change them, but not into garbage. Also, some names are in jsonb columns, just to make it more complicated ;)
Cheers
Basically, I have decided to do a text pg_dump and scripted a solution which modifies all relevant data with the same pattern. This allows the relationships to be maintained after the obfuscation has been done.
It is also much simpler and more performant than SQL + updates across a large dataset.
Still open to other ideas if anyone has a better one.
If you're not terribly concerned with how obfuscated the resulting text is, maybe one of the hashing functions included within postgres would suffice, such as md5 just for a simple example.
UPDATE person p SET name = MD5(p.name::text);  -- columns in the SET list must be unqualified
A possible actual implementation might involve using the pgcrypto module to encode your values; this would not be terribly efficient, however.
https://www.postgresql.org/docs/9.6/static/pgcrypto.html
UPDATE person p SET name = crypt(p.name::text, gen_salt('bf'));  -- gen_salt needs a real algorithm name such as 'bf'
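For the names that live in jsonb columns, a similar pass could use jsonb_set; a minimal sketch, assuming a hypothetical details column holding a first_name key:

-- Hash the value stored under the (hypothetical) first_name key
UPDATE person
SET details = jsonb_set(details, '{first_name}',
                        to_jsonb(md5(details->>'first_name')))
WHERE details ? 'first_name';

Because md5 is deterministic, identical names map to identical hashes, so relationships across rows and tables are preserved, much as in the pg_dump approach above.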
But as I asked in the comment, what is the threat profile you're trying to guard against? Obfuscation might not be a great solution for mitigating the effects of a data breach.

How to disable all optimizations of PostgreSQL

I'm studying query optimization and want to know how much each kind of optimization helps a query. Last time I got an answer, but in my experiments, disabling all the optimizations in the link gives a time complexity of O(n^1.8), while enabling all of them gives O(n^0.5). There is not so much difference. If I disable all of them, are there still other optimizations at work? How can I really have only one main optimization enabled each time?
You can't.
PostgreSQL's query planner has no "turn off optimisation" flag.
It'd be interesting to add, but would make the regression tests a lot more complex, and be of very limited utility.
To do what you want, I think you'd want to modify the query planner code, recompile, and reinstall PostgreSQL for each test. Or hack it to add a bunch of custom GUCs (system variables, like enable_seqscan) to let you turn particular optimisations on and off.
I doubt any such patch would be accepted into PostgreSQL, but it'd be worth doing as a throwaway.
The only challenge is that PostgreSQL doesn't differentiate strongly between "optimisation" and "thing we do to execute the query". Sometimes parts of the planner code expect and require that a particular optimisation has been applied in order to work correctly.
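That said, the planner GUCs that do exist can be toggled per session to see how much individual plan types matter; a minimal sketch (the query itself is made up):

-- These settings discourage (heavily penalise) a plan type rather than
-- hard-disable it; the planner may still use it if nothing else works.
SET enable_seqscan = off;
SET enable_hashjoin = off;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Restore the defaults for the session:
RESET enable_seqscan;
RESET enable_hashjoin;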

Why does PostgreSQL default everything to lower case?

I'm trying to familiarize myself with Postgres (9.2) after a fair bit of MySQL (5.1) usage, since I've been bitten by a handful of MySQL's gotchas. However, in my first five minutes with Postgres I ran into one of its gotchas, which I'm sure hits everyone:
By default, PostgreSQL converts everything that isn't quoted to lower case.
This isn't too big of a deal to me, since there are a couple of obvious workarounds:
Encapsulate everything in quotes.
Allow everything to be named in a lower case fashion.
But I'm wondering why. Considering how much contention I imagine this design decision causes, I'm surprised that I couldn't find any rationale on the internet. Does anybody have a thorough explanation, or preferably a link to some developer manifesto, as to why Postgres was designed this way? I'm interested.
The SQL standard specifies folding unquoted identifiers to upper case. Many other RDBMSs follow the standard in this way; Firebird and Oracle both do. This means that identifier matching is, by default, case-insensitive. This behavior is very important for compatibility in basic queries. In this regard MySQL's behavior is a real outlier.
However, PostgreSQL deviates from the standard by folding to lower case. There are general reasons why this is considered more readable: you can use case to cue syntax. Something like:
SELECT foo FROM bar WHERE baz = 1;
This is more natural when identifiers are folded to lower case. The opposite folding would be:
select FOO from BAR where BAZ = 1;
In general I like the former behavior (folding to lower case) because it emphasizes the SQL operations, while folding to upper case de-emphasizes the operations and emphasizes the identifiers. Given the complexity of many queries, I think the former works better.
Most of the discussion I have seen on the postgres mailing lists agrees that the standard-mandated behavior is broken, so the above is my understanding of the issues.
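To see the folding in practice, a minimal sketch (the names are illustrative):

CREATE TABLE Foo (Bar int);        -- unquoted, so stored as foo(bar)

SELECT bar FROM foo;               -- works: both names fold to lower case
SELECT BAR FROM FOO;               -- also works, for the same reason
SELECT "Bar" FROM "Foo";           -- fails: quoted names are matched exactly

CREATE TABLE "CamelCase" (x int);  -- quoting preserves the case exactly
SELECT x FROM "CamelCase";         -- and it must be quoted from then on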

Why “Set based approaches” are better than the “Procedural approaches”?

I am very eager to know the real reason, though I have picked up some knowledge from googling.
Thanks in advance.
Because SQL is a really poor language for writing procedural code, and because the SQL engine, storage, and optimizer are designed to make it efficient to assemble and join sets of records.
(Note that this isn't just applicable to SQL Server, but I'll leave your tags as they are)
Because, in general, the hundreds of man-years of development time that have gone into the database engine and optimizer, and the fact that it has access to real-time statistics about the data, have resulted in it being better than the user in working out the best way to process the data, for a given request.
Therefore, by saying what we want to achieve (with a set-based approach) and letting it decide how to do it, we generally achieve better results than by spelling out exactly how to process the data, line by line.
For example, suppose we have a simple inner join from table A to table B. At design time, we generally don't know 'which way round' will be most efficient to process: keep a list of all the values on the A side and go through B matching them, or vice versa. But the query optimizer knows at runtime the number of rows in each table, and the most recent statistics may provide more information about the values themselves. So this decision is obviously better made at runtime, by the optimizer.
Finally, note that I have put a number of 'generally's in this post - there will always be times when we know better than the optimizer will, and for such times we can provide hints (NOLOCK etc).
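As a concrete illustration, here is a hypothetical row-by-row version next to the equivalent set-based statement (written as PostgreSQL, with made-up table and column names):

-- Procedural: process one row at a time inside a loop
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT id FROM orders WHERE order_date < '2015-01-01' LOOP
        UPDATE orders SET status = 'archived' WHERE id = r.id;
    END LOOP;
END $$;

-- Set-based: one statement; the optimizer chooses the access path
UPDATE orders SET status = 'archived' WHERE order_date < '2015-01-01';

The set-based form hands the optimizer the whole problem at once, so it can choose an index scan or a sequential scan and batch the work as it sees fit.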
Set-based approaches are declarative, so you don't describe the way the work will be done, only what you want the result to look like. The server can decide between several strategies for how to comply with your request, and hopefully choose one that is efficient.
If you write procedural code, that code will at best be less than optimal in some situations.
Because using a set-based approach to SQL development conforms to the design of the data model. SQL is a very set-based language, used to build sets, subsets, unions, etc, from data. Keeping that in mind while developing in TSQL will generally lead to more natural algorithms. TSQL makes many procedural commands available that don't exist in plain SQL, but don't let that switch you to a procedural methodology.
This makes me think of one of my favorite quotes from Rob Pike in Notes on Programming C:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
SQL databases and the way we query them are largely set-based. Thus, so should our algorithms be.
From an even more tangible standpoint, SQL servers are optimized with set-based approaches in mind. Indexing, storage systems, query optimizers, and other optimizations made by various SQL database implementations will do a much better job if you simply tell them what data you need, through a set-based approach, rather than dictating how to get it procedurally. Let the SQL engine worry about the best way to get you the data; you just worry about telling it what data you want.
As everyone else has explained, let the SQL engine help you; believe me, it is very smart.
If you are not used to writing set-based solutions and are used to developing procedural code, you will have to spend some time before you can write well-formed set-based solutions. This is a barrier for most people. A tip if you wish to start coding set-based solutions: stop thinking about what you can do with rows, start thinking about what you can do with columns, and practice with functional languages.