Using T-SQL, CAST() with COLLATE is non-deterministic. How can I make it deterministic, or what is the work-around?

I have a function that includes:
SELECT @pString = CAST(@pString AS VARCHAR(255)) COLLATE SQL_Latin1_General_Cp1251_CS_AS
This is useful, for example, to remove accents in French:
UPPER(CAST('Éléctricité' AS VARCHAR(255)) COLLATE SQL_Latin1_General_Cp1251_CS_AS)
gives ELECTRICITE.
But using COLLATE makes the function non-deterministic, so I cannot use it for a persisted computed column.
Q1. Is there another (quick and easy) way to remove accents like this, with a deterministic function?
Q2. (Bonus question) The reason I use this persisted computed column is for searching. For example, the user may enter the customer's last name as 'Gagne', 'Gagné', 'GAGNE' or 'GAGNÉ' and the app will find it using the persisted computed column. Is there a better way to do this?
EDIT: Using SQL Server 2012 and SQL-Azure.

You will find that it is in fact deterministic; it just behaves differently depending on the character you're trying to collate.
Check the page for the Windows-1251 encoding for its behavior on accepted and unaccepted characters.
Here is a collation chart for Cyrillic_General_CI_AI. This is code page 1251, case-insensitive and accent-insensitive. It will show you the mappings for all acceptable characters within this collation.
As for the search question, as Keith said, I would investigate putting a full text index on the column you are going to be searching on.
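For illustration, a minimal full-text sketch; the dbo.Customers table, LastName column, and PK_Customers key index are hypothetical names, and the accent-insensitive catalog is what makes a search for 'Gagne' match 'Gagné':
CREATE FULLTEXT CATALOG ftCustomers WITH ACCENT_SENSITIVITY = OFF
CREATE FULLTEXT INDEX ON dbo.Customers (LastName) KEY INDEX PK_Customers ON ftCustomers
SELECT CustomerId, LastName FROM dbo.Customers WHERE CONTAINS(LastName, N'Gagne')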

The best answer I got was from Sebastian Sajaroff. I used his example to fix the issue. He suggested a VIEW with a UNIQUE INDEX. This gives a good idea of the solution:
create table dbo.Test (Id int primary key, Name varchar(20))
go
--The view must be schema-bound and reference the table by a two-part name
--(dbo.Test), otherwise it cannot be indexed.
create view dbo.TestCIAI with schemabinding as
select Id, Name collate SQL_Latin1_General_CP1_CI_AI as NameCIAI from dbo.Test
go
--An indexed view needs a unique clustered index before any other index.
create unique clustered index ix_Unique on dbo.TestCIAI (Id)
create unique nonclustered index ix_DistinctNames on dbo.TestCIAI (NameCIAI)
insert into dbo.Test values (1, 'Sébastien')
--Insertion 2 will fail because of the unique nonclustered index on the view
--(which is case-insensitive, accent-insensitive)
insert into dbo.Test values (2, 'Sebastien')
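With the indexed view in place, searches for the bonus question can go against the accent-insensitive column. A minimal sketch (the NOEXPAND hint is needed on editions that don't match indexed views automatically):
select Id, NameCIAI
from dbo.TestCIAI with (noexpand)
where NameCIAI = 'Gagne' --also matches 'Gagné', 'GAGNE' and 'GAGNÉ'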

Related

UNACCENT when checking for UNIQUE constraint violations in PostgreSQL

We have a UNIQUE constraint on a table to prevent our city_name and state_id combinations from being duplicated. The problem we have found is that accents circumvent this.
Example:
"Montréal" "Quebec"
and
"Montreal" "Quebec"
We need a way to have the unique constraint run UNACCENT() and preferably wrap it in LOWER() as well for good measure. Is this possible?
You can create an immutable version of unaccent:
CREATE FUNCTION noaccent(text) RETURNS text
LANGUAGE sql IMMUTABLE STRICT AS
'SELECT unaccent(lower($1))';
and use that in a unique index on the column.
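For example, a unique index built on the wrapper, using the city_name and state_id columns from the question (the table name cities is an assumption):
CREATE UNIQUE INDEX cities_noaccent_key ON cities (noaccent(city_name), state_id);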
An alternative is to use a BEFORE INSERT OR UPDATE trigger that fills a new column with the unaccented value, and put a unique constraint on that column; see the sketch below.
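A sketch of that trigger approach, under the same assumed table name (cities) and a hypothetical extra column:
ALTER TABLE cities ADD COLUMN city_name_plain text;

CREATE FUNCTION cities_fill_plain() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Store the lower-cased, unaccented form alongside the original value.
    NEW.city_name_plain := unaccent(lower(NEW.city_name));
    RETURN NEW;
END;
$$;

CREATE TRIGGER cities_fill_plain_trg
    BEFORE INSERT OR UPDATE ON cities
    FOR EACH ROW EXECUTE PROCEDURE cities_fill_plain();

ALTER TABLE cities
    ADD CONSTRAINT cities_plain_unique UNIQUE (city_name_plain, state_id);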
You can create unique indexes on expressions; see the Postgres manual:
https://www.postgresql.org/docs/9.3/indexes-expressional.html
So in your case it could be something like
CREATE UNIQUE INDEX idx_foo ON my_table ( UNACCENT(LOWER(city_name)), state_id )
(Note that unaccent() is not marked IMMUTABLE, so an index expression like this needs the immutable wrapper from the answer above.)

Prevent non-collation characters in an NVarChar column using a constraint?

Little weird requirement, but here it goes. We have a CustomerId VarChar(25) column in a table. We need to make it NVarChar(25) to work around issues with type conversions.
But we don't want to allow non-Latin characters to be stored in this column. Is there any way to place such a constraint on the column? I'd rather let the database handle this check. In general we're OK with NVarChar for all of our strings, but some columns, like IDs, are not good candidates because of the possibility of look-alike strings from different languages.
Example:
CustomerId NVarChar(1) - PK
Value 1: BOPOH
Value 2: ВОРОН
Those 2 strings are different (the second one is Cyrillic).
I want to prevent this entry scenario: Value 2 must not be saved into the field.
Just in case it helps somebody. Not sure it's the most "elegant" solution, but I placed a constraint like this on those fields:
ALTER TABLE [dbo].[Carrier] WITH CHECK ADD CONSTRAINT [CK_Carrier_CarrierId] CHECK ((CONVERT([varchar](25),[CarrierId],(0))=[CarrierId]))
GO
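With that constraint in place, the round trip through varchar(25) replaces any character outside the column's code page, so the equality check fails for non-Latin input. A hedged illustration using the question's example values (this assumes the column's collation maps to a Latin code page):
INSERT INTO dbo.Carrier (CarrierId) VALUES (N'BOPOH') --Latin letters survive the conversion: passes
INSERT INTO dbo.Carrier (CarrierId) VALUES (N'ВОРОН') --Cyrillic becomes '?????': CK_Carrier_CarrierId rejects it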

T-SQL implicit conversion between 2 varchars

I have some T-SQL (SQL Server 2008) that I inherited and am trying to find out why some of the queries are running really slow. In the Actual Execution Plan I have three clustered index scans which are costing me 19%, 21% and 26%, so this seems to be the source of my problem.
The contents of the fields are usually numeric (but some job numbers have an alpha prefix)
The database design (vendor-supplied) is pretty poor. The max length of a job number in their application is 12 chars, but in the tables that are joined it is defined as varchar(50) in some places and varchar(15) in others. My parameter is a varchar(12), but I get the same thing if I change it to a varchar(50).
The node contains this:
Predicate: [Live_Costing].[dbo].[TSTrans].[JobNo] as [sts1].[JobNo]=CONVERT_IMPLICIT(varchar(50),[@JobNo],0)
sts1 is a derived table, but the column it pulls JobNo from is a varchar(50).
I don't understand why it's doing an implicit conversion between 2 varchars. Is it just because they are different lengths?
I'm fairly new to execution plans.
Is there an easy way to figure out which node in the exec plan relates to which part of the query?
Is the predicate the join clause?
Regards
Mark
Variables can have a collation as well.
Regardless, you need to verify your collations, which can be specified at the server, database, table, and column level.
First, check your collation between tempdb and the vendor supplied database. It should match. If it doesn't, it will tend to do implicit conversions.
Assuming you cannot modify the vendor supplied code base, one or more of the following should help you:
1) Predefine your temp tables and specify the same collation for the key field as in the db in use, rather than tempdb.
2) Provide collations when doing string comparisons (see the sketch after this list).
3) Specify collation for key values if using "select into" with a temp table
4) Make sure your collations on your tables and columns match your database collation (VERY important if you imported only specific tables from a vendor into an existing database.)
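For item 2, a minimal sketch of forcing a collation in a comparison; #Jobs is a hypothetical temp table, and dbo.TSTrans is the vendor table from the question:
SELECT t.JobNo
FROM #Jobs AS j
JOIN dbo.TSTrans AS t
    --COLLATE DATABASE_DEFAULT coerces the temp-table side to the current
    --database's collation, avoiding an implicit conversion on the vendor column.
    ON t.JobNo = j.JobNo COLLATE DATABASE_DEFAULT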
If you can change the vendor-supplied code base, I would suggest reviewing the cost of making all of your char keys the same length and NOT varchar; varchar carries a small per-value storage overhead (two bytes). The caveat is that if you create a fixed-length character field as NOT NULL, it will be padded on the right (unavoidable).
Ideally, you would have int keys, and only use varchar fields for user interaction/lookup:
create table Products(
    ProductID int not null identity(1,1) primary key clustered,
    ProductNumber varchar(50) not null
)
alter table Products add constraint uckProducts_ProductNumber unique(ProductNumber)
Then do all joins on ProductID rather than ProductNumber; filtering on ProductNumber would be perfectly fine.
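A hypothetical query against that design (OrderDetails is an assumed second table): the join runs on the int key, and the varchar column appears only in the filter:
select od.Quantity
from Products as p
join OrderDetails as od on od.ProductID = p.ProductID --int-to-int join, no conversion
where p.ProductNumber = 'AB-1234' --varchar used only as a lookup filter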

Why does Postgres handle NULLs inconsistently where unique constraints are involved?

I recently noticed an inconsistency in how Postgres handles NULLs in columns with a unique constraint.
Consider a table of people:
create table People (
  pid int not null,
  name text not null,
  SSN text unique,
  primary key (pid)
);
The SSN column should be kept unique. We can check that:
-- Add a row.
insert into People(pid, name, SSN)
values(0, 'Bob', '123');
-- Test the unique constraint.
insert into People(pid, name, SSN)
values(1, 'Carol', '123');
The second insert fails because it violates the unique constraint on SSN. So far, so good. But let's try a NULL:
insert into People(pid, name, SSN)
values(1, 'Carol', null);
That works.
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
A unique column will take a null. Interesting. How can Postgres assert that null is in any way unique, or not unique for that matter?
I wonder if I can add two rows with null in a unique column.
insert into People(pid, name, SSN)
values(2, 'Ted', null);
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
2;"Ted";"<NULL>"
Yes I can. Now there are two rows with NULL in the SSN column even though SSN is supposed to be unique.
The Postgres documentation says: "For the purpose of a unique constraint, null values are not considered equal."
Okay. I can see the point of this. It's a nice subtlety in null-handling: By considering all NULLs in a unique-constrained column to be disjoint, we delay the unique constraint enforcement until there is an actual non-null value on which to base that enforcement.
That's pretty cool. But here's where Postgres loses me. If all NULLs in a unique-constrained column are not equal, as the documentation says, then we should see all of the nulls in a select distinct query.
select distinct SSN
from People;
"<NULL>"
"123"
Nope. There's only a single null there. It seems like Postgres has this wrong. But I wonder: Is there another explanation?
Edit:
The Postgres docs do specify, in the section on SELECT DISTINCT, that "null values are considered equal in this comparison". While I do not understand that notion, I'm glad it's spelled out in the docs.
It is almost always a mistake when dealing with null to say: "nulls behave like so-and-so here, so they should behave like such-and-such here."
Here is an excellent essay on the subject from a Postgres perspective. Briefly summed up: nulls are treated differently depending on the context, so don't make the mistake of assuming anything about them.
The bottom line is, PostgreSQL does what it does with nulls because the SQL standard says so.
Nulls are obviously tricky and can be interpreted in multiple ways (unknown value, absent value, etc.), and so when the SQL standard was initially written, the authors had to make some calls at certain places. I'd say time has proved them more or less right, but that doesn't mean that there couldn't be another database language that handles unknown and absent values slightly (or wildly) differently. But PostgreSQL implements SQL, so that's that.
As was already mentioned in a different answer, Jeff Davis has written some good articles and presentations on dealing with nulls.
NULL is considered to be unique because NULL doesn't represent the absence of a value. A NULL in a column is an unknown value. When you compare two unknowns, you don't know whether or not they are equal because you don't know what they are.
Imagine that you have two boxes marked A and B. If you don't open the boxes and you can't see inside, you never know what the contents are. If you're asked "Are the contents of these two boxes the same?" you can only answer "I don't know".
In this case, PostgreSQL will do the same thing. When asked to compare two NULLs, it says "I don't know." This has a lot to do with the crazy semantics around NULL in SQL databases. The article linked to in the other answer is an excellent starting point to understanding how NULLs behave. Just beware: it varies by vendor.
Multiple NULL values in a unique index are okay because x = NULL is false for all x and, in particular, when x is itself NULL. You'll also run into this behavior in WHERE clauses where you have to say WHERE x IS NULL and WHERE x IS NOT NULL rather than WHERE x = NULL and WHERE x <> NULL.
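A quick illustration with the People table from the question:
-- Returns no rows: SSN = null evaluates to null (not true), even for rows where SSN is null.
select * from People where SSN = null;
-- Returns Carol and Ted: IS NULL is the correct test.
select * from People where SSN is null;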

In postgresql: Clarification on "CONSTRAINT foo_key PRIMARY KEY (foo)"

Sorry if this is a dead simple question, but I'm confused by the documentation and I'm not getting any clear answers from searching the web.
If I have the following table schema:
CREATE TABLE footable
(
  foo character varying(10) NOT NULL,
  bar timestamp without time zone,
  CONSTRAINT pk_foo PRIMARY KEY (foo)
);
and then use the query:
SELECT bar FROM footable WHERE foo = '1234567890';
Will the select query find the given row by searching an index or not? In other words: does the table have a primary key (which is foo) or not?
Just to be clear: I'm used to specifying PRIMARY KEY after the column, like this:
"...foo character varying(10) PRIMARY KEY, ..."
Does it change anything?
Why not look at the query plan and find out yourself? The query plan will tell you exactly what indexes are being used, so you don't have to guess. Here's how to do it:
http://www.postgresql.org/docs/current/static/sql-explain.html
But in general, it should use the index in this case since you specified the primary key in the where clause and you didn't use something that could prevent it from using it (a LIKE, for example).
It's always best to look at the query plan to verify it for sure, then there's no doubt.
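For example (a sketch; the exact plan depends on table size and statistics, and on a tiny table the planner may prefer a sequential scan):
EXPLAIN SELECT bar FROM footable WHERE foo = '1234567890';
-- With enough rows you should see something like:
--   Index Scan using pk_foo on footable ...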
In both cases the primary key's index can be used, but it depends: the optimizer will make a choice based on the amount of data, the statistics, etc.
Naming the constraint can make debugging and error handling easier: you know which constraint was violated. Without a meaningful name, it can be confusing.
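For instance, the column-level form creates the same kind of primary key; the only practical difference is that PostgreSQL picks the default constraint name (footable_pkey) rather than your pk_foo:
CREATE TABLE footable
(
  foo character varying(10) PRIMARY KEY,
  bar timestamp without time zone
);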