AnsiString being truncated with plenty of space - sql-server-2008-r2

I'm inserting a row with a JOBCODE field defined as varchar(50). When the string for that field is greater than 20 characters I get an error from SQL Server warning that the string would be truncated.
I suspect this may have to do with Unicode wide characters, but I thought then 25 characters would pass.
Has anyone seen something like this before? What am I missing?

I think there is something else at fault here.
VARCHAR(50) should be 50 characters, irrespective of the encoding
as an example
CREATE TABLE AnsiString
(
JobCode VARCHAR(20), -- ANSI with codepage
JobCodeUnicode NVARCHAR(20) -- Unicode
)
Inserting 20 unicode characters into both columns
INSERT INTO AnsiString(JobCode, JobCodeUnicode) VALUES ('葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0',
N'葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0')
select * from ansistring
Returns
?2?4?6?8?0?2?4?6?8?0 葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0
As expected, ? is inserted for characters which weren't mapped into ANSI, but either way, we can still insert 20 characters.
Do you possibly have a trigger on the table? Could it be another column entirely? Could your data access layer somehow be expanding your unicode string to something else (e.g. byte[])?

Related

How to store word "é" in postgres using limited varchar

I've been having some problems trying to save a string word with limited varchar(9).
create database big_text
LOCALE 'en_US.utf8'
ENCODING UTF8
create table big_text(
description VARCHAR(9) not null
)
# OK
insert into big_text (description) values ('sintético')
# I Got error here
insert into big_text (description) values ('sintético')
I already know that the problem is because one word is using 'é' -> Latin small letter E with Acute (this case only have 1 codepoint) and another word is using 'é' -> Latin Small Letter E + Combining Acute Accent Modifier. (this case I have 2 codepoint).
How can I store the same word using both representation in a limited varchar(9)? There is some configuration that the database is able to understand both ways? I thought that database being UTF8 is enough but not.
I appreciate any explanation that could help me understand where am I wrong? Thank you!
edit: Actually I would like to know if there is any way for postgres automatically normalize for me.
A possible workaround using CHECK to do the character length constraint.
show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
create table big_text(
description VARCHAR not null CHECK (length(normalize(description)) <= 9)
)
-- Note shortened string. Explanation below.
select 'sintético'::varchar(9);
varchar
----------
sintétic
insert into big_text values ('sintético');
INSERT 0 1
select description, length(description) from big_text;
description | length
-------------+--------
sintético | 10
insert into big_text values ('sintético test');
ERROR: new row for relation "big_text" violates check constraint "big_text_description_check"
DETAIL: Failing row contains (sintético test).
From here Character type the explanation for the string truncation vs the error you got when inserting:
An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length.(This somewhat bizarre exception is required by the SQL standard.)
If one explicitly casts a value to character varying(n) or character(n), then an over-length value will be truncated to n characters without raising an error. (This too is required by the SQL standard.)

TSQL LEN(Substring) returning wrong number

I have this bit of code reading in a fixed width file into a table with one column 'A' that's created as varchar(300) then reading that table as...
LTRIM(RTRIM(CONVERT(VARBINARY,SUBSTRING(A,101,4))))
and in most cases it returns a 4 digit year. I am coming across an error where I have a typo in the file that I received where the year is '20' so
LTRIM(RTRIM(CONVERT(VARBINARY,SUBSTRING(A,101,4))))
returns a 20 and I wanted to put in the where statement a filter that says that
LEN( LTRIM(RTRIM(CONVERT(VARBINARY,SUBSTRING(A,101,4)))) ) = 4
the problem is when I run
LEN( LTRIM(RTRIM(CONVERT(VARBINARY,SUBSTRING(A,101,4)))) )
in the select statement the '20' shows up as having a 4 LEN. What is this thing doing?
I'm not sure exactly why the VARBINARY is there, I think it has something to do with when the value from the mainframe is set as integer and NULL it comes across as '.' and that cleans it up. Regardless I still get this issue when I remove the CONVERT(VARBINARY.
LEN Returns the number of characters, rather than the number of bytes, of the given string expression, excluding trailing blanks.
DATALENGTH Returns the number of bytes used to represent any expression. DATALENGTH is especially useful with varchar, varbinary, text, image, nvarchar, and ntext data types because these data types can store variable-length data. The DATALENGTH of NULL is NULL.
EDIT
I think I got it. Many times you mention about ".". In my opinion you have float value which is implicitly converted to varchar. Check below.
SELECT LEN(t.val), t.val
FROM (SELECT 20.0) AS t(val)

Postgresql constraint to check for non-ascii characters

I have a Postgresql 9.3 database that is encoded 'UTF8'. However, there is a column in database that should never contain anything but ASCII. And if non-ascii gets in there, it causes a problem in another system that I have no control over. Therefore, I want to add a constraint to the column. Note: I already have a BEFORE INSERT trigger - so that might be a good place to do the check.
What's the best way to accomplish this in PostgreSQL?
You can define ASCII as ordinal 1 to 127 for this purpose, so the following query will identify a string with "non-ascii" values:
SELECT exists(SELECT 1 from regexp_split_to_table('abcdéfg','') x where ascii(x) not between 1 and 127);
but it's not likely to be super-efficient, and the use of subqueries would force you to do it in a trigger rather than a CHECK constraint.
Instead I'd use a regular expression. If you want all printable characters then you can use a range in a check constraint, like:
CHECK (my_column ~ '^[ -~]*$')
this will match everything from the space to the tilde, which is the printable ASCII range.
If you want all ASCII, printable and nonprintable, you can use byte escapes:
CHECK (my_column ~ '^[\x00-\x7F]*$')
The most strictly correct approach would be to convert_to(my_string, 'ascii') and let an exception be raised if it fails ... but PostgreSQL doesn't offer an ascii (i.e. 7-bit) encoding, so that approach isn't possible.
Use a CHECK constraint built around a regular expression.
Assuming that you mean a certain column should never contain anything but the lowercase letters from a to z, the uppercase letters from A to Z, and the numbers 0 through 9, something like this should work.
alter table your_table
add constraint allow_ascii_only
check (your_column ~ '^[a-zA-Z0-9]+$');
This is what people usually mean when they talk about "only ASCII" with respect to database columns, but ASCII also includes glyphs for punctuation, arithmetic operators, etc. Characters you want to allow go between the square brackets.

How to removing spacing in SQL

I have data in DB2 then i want to insert that data to SQL.
The DB2 data that i had is like :
select char('AAA ') as test from Table_1
But then, when i select in SQL after doing insert, the data become like this.
select test from Table_1
result :
test
------
AAA
why Space character read into box character. How do I fix this so that the space character is read into.
Or is there a setting I need to change? or do I have to use a parameter?
I used AS400 and datastage.
Thank you.
Datastage appends pad characters so you know that there are spaces there. The pad character is 0x00 (NUL) by default and that's what you're seeing.
Research the APT_STRING_PADCHAR environment variable; you can set it to something else if you want.
The 0x00 characters are not actually in your database. The short answer is, you can safely ignore it.
When you said:
select char('AAA ') as test from Table_1
You were not actually showing any data from the table. Instead you were showing an expression casting a constant AAA as a character value, and giving that result column the name test which coincidentally seems to be the name of a column in the table, although that coincidence doesn't matter here.
Then your 2nd statement does show the contents of the database column.
select test from Table_1
Find out what the hexadecimal value actually is.

TSQL Prefixing String Literal on Insert - Any Value to This, or Redundant?

I just inherited a project that has code similar to the following (rather simple) example:
DECLARE #Demo TABLE
(
Quantity INT,
Symbol NVARCHAR(10)
)
INSERT INTO #Demo (Quantity, Symbol)
SELECT 127, N'IBM'
My interest is with the N before the string literal.
I understand that the prefix N is to specify encoding (in this case, Unicode). But since the select is just for inserting into a field that is clearly already Unicode, wouldn't this value be automatically upcast?
I've run the code without the N and it appears to work, but am I missing something that the previous programmer intended? Or was the N an oversight on his/her part?
I expect behavior similar to when I pass an int to a decimal field (auto-upcast). Can I get rid of those Ns?
Your test is not really valid, try something like a Chinese character instead, I remember if you don't prefix it it will not insert the correct character
example, first one shows a question mark while the bottom one shows a square
select '作'
select N'作'
A better example, even here the output is not the same
declare #v nvarchar(50), #v2 nvarchar(50)
select #v = '作', #v2 = N'作'
select #v,#v2
Since what you look like is a stock table why are you using unicode, are there even symbols that are unicode..I have never seen any and this includes ISIN, CUSIPS and SEDOLS
Yes, SQL Server will automatically convert (widen, cast down) varchar to nvarchar, so you can remove the N in this case. Of course, if you're specifying a string literal where the characters aren't actually present in the database's default collation, then you need it.
It's like you can suffix a number with "L" in C et al to indicate it's a long literal instead of an int. Writing N'IBM' is either being precise or a slave to habit, depending on your point of view.
One trap for the unwary: nvarchar doesn't get automatically converted to varchar, and this can be an issue if your application is all Unicode and your database isn't. For example, we had this with the jTDS JDBC driver, which bound all parameter values as nvarchar, resulting in statements effectively like this:
select * from purchase where purchase_reference = N'AB1234'
(where purchase_reference was a varchar column)
Since the automatic conversions are only one way, that became:
select * from purchase where CONVERT(NVARCHAR, purchase_reference) = N'AB1234'
and therefore the index of purchase_reference wasn't used.
By contrast, the reverse is fine: if purchase_reference was an nvarchar, and an application passed in a varchar parameter, then the rewritten query:
select * from purchase where purchase_reference = CONVERT(NVARCHAR, 'AB1234')
would be fine. In the end we had to disable binding parameters as Unicode, hence causing a raft of i18n problems that were considered less serious.