I have a string table with over 1,000,000 rows that has some garbage inside due to encoding errors.
The garbage is minimal, but needs to be found.
The column in question is an NVARCHAR column that normally contains text in one of 11 languages.
All of the text should be Unicode (UTF-8 when we process it application-side).
The corrupt values contain ? characters and/or a very limited, unusual glyph set; by eye they can very easily be seen not to be valid text in any language. It is likely that these values have been encoded back and forth into total garbage.
So in the name of speed, is there anything I can do on SQL Server to detect bad encoding / string garbage?
Thanks.
EDIT to add garbage example:
This was Russian или Ð˜Ð¼Ñ Ð£Ñ‡Ð°Ñтника
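By way of illustration, the naive check I could write myself looks something like this (a rough sketch only; the table and column names are placeholders, and it assumes the garbage either contains literal ? characters or follows the UTF-8-read-as-Windows-1252 pattern in the example above). I'm hoping there is something better or faster:
SELECT id, txt
FROM dbo.MyStrings                                    -- placeholder names
WHERE txt COLLATE Latin1_General_BIN LIKE N'%Ð%'      -- typical mojibake lead character for Cyrillic
   OR txt COLLATE Latin1_General_BIN LIKE N'%Ã%'      -- typical mojibake lead character for accented Latin
   OR txt LIKE N'%??%'                                -- runs of literal question marks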
Consider a table with the following columns:
id bigint Auto Increment
name character varying(255) NULL
category character varying(255) NULL
english character varying(255) NULL
french character varying(255) NULL
pivot character varying(255) NULL
credits character varying(255) NULL
hash character varying(20) NULL
The english column contains data of the following size (in bytes): max 116, min 5, average 42, median: 40.
The number of rows in the table is around 30,000 and will hardly change.
The 107 new columns will contain translations of the English.
Will adding 107 columns hurt performance?
The Postgres site says the maximum number of columns on a Postgres table is
250-1600 depending on column types
and
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page
Will the data fall under that limit?
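Back of the envelope, using the English byte sizes above (my own rough projection, not measured data): 107 extra columns at the median of ~40 bytes each would add roughly 4.3 KB to a typical row, so it should stay comfortably under the 8192-byte page; but a worst-case row at 116 bytes per column would add over 12 KB, which cannot fit without compression and/or TOAST.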
Size of the largest row
What is the actual storage size of the table's rows? pg_column_size is the
Number of bytes used to store a particular value (possibly compressed)
SELECT id, pg_column_size(t.*) FROM my_table AS t ORDER BY pg_column_size(t.*) DESC  -- "my_table" stands in for the real table name
-- Some stats derived from the query:
-- Min 87 bytes
-- Max 514 bytes
-- Average 216 bytes
-- Median: 209 bytes
But no compression is actually happening here, because:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Compression would start to kick in once the rows get wider, i.e. once those new columns are added.
It's unclear to me what the compression ratio would be for such data?
I wonder how effective it will be on lots of short multilingual sentences. I also tried to find the exact name of the compression algorithm used by Postgres: the docs say "the LZ family of compression techniques", but which one – LZ77? LZ78? A twist on one of them?
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
TOAST'ed?
If the size of a row goes beyond the page size limit, then Postgres will rely on TOAST not just to compress the data but also to split it and store it "out of line".
I understand this will increase fetch times for those rows that don't fit… But what's the impact of TOAST on performance? Is it negligible for such a use case?
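For reference, once the columns are in place I suppose I could measure how much actually ends up out of line with something like this (a sketch; 'my_table' is a placeholder for the real table name):
SELECT pg_size_pretty(pg_relation_size('my_table'))  AS main_heap,
       pg_size_pretty(pg_total_relation_size('my_table')
                      - pg_relation_size('my_table')
                      - pg_indexes_size('my_table'))  AS toast_and_maps;  -- TOAST table, its index, FSM/VM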
Bottom-line
At the end of the day…
Is adding those 107 columns a good idea, or should I use a different approach?
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
Or am I approaching this the wrong way, i.e. is it a case of premature optimization where I'd have been better off just adding the columns and only investigate later if faced with problems?
Using Postgres 9.6. Upgrading is an option if needed.
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
I'd just copy the English version into each of the 107 columns. That should be good enough to get some useful findings. You might worry that the repetition would cause the compression to be idiosyncratic; but each value is compressed in isolation so won't "know" it is identical to some other value.
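Something along these lines would do as a throwaway experiment (a sketch; the table and column names are made up):
CREATE TABLE translation_size_test AS
SELECT id,
       english AS lang_001,
       english AS lang_002,
       -- ... and so on up to ...
       english AS lang_107
FROM the_source_table;

SELECT pg_size_pretty(pg_total_relation_size('translation_size_test'));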
It's unclear to me what the compression ratio would be for such data?
Not very much. For example, the paragraph of yours I quoted first doesn't get any benefit from compression (when I copied it into 107 other columns). Short segments of ordinary text do not have enough repetition in them to be very compressible. Translating them to other languages is unlikely to change this.
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
This question has a very clear answer. You should absolutely select only what you need. Assembling a row from 100+ toasted columns, just to throw most of them away, will slow you down.
I don't know if this falls under "premature optimization" so much as under poor design. One way or another you will need some method of knowing which of the 108 versions you need. And what happens when you need to add the 108th translation, or you delete, say, the 93rd? So use this information to form a key into a translation table, something like Translation_Test (for_ref_in bigint, language text, translation text). Then access the necessary text (including perhaps the English version) from that table.
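A minimal sketch of that layout, reusing the names from above:
CREATE TABLE translation_test (
    for_ref_in  bigint NOT NULL,   -- id of the row in the main table
    language    text   NOT NULL,   -- e.g. 'en', 'fr'
    translation text,
    PRIMARY KEY (for_ref_in, language)
);

-- fetch just the one version a user actually needs
SELECT translation
FROM translation_test
WHERE for_ref_in = 42 AND language = 'fr';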
We are migrating some tables from an AS400 database to DB2 LUW (v11.1).
While migrating we found some special characters (€) in the source database (AS400), in CHAR columns, and this leads to errors unless we alter the table columns to CODEUNITS32; the DB2 LUW database is configured with UTF-8 byte encoding.
We want to understand what the behavior of the application will be after changing the CHAR column to CODEUNITS32. Do we need to update any configuration at the application level (C and Java applications) to handle both character encodings?
After changing to CODEUNITS32:
- Will my C application still compile and be able to handle the change from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
- Will my Java application be able to handle the change from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
We did some pilot testing by manually inserting special characters into the table after changing the column definition from CHAR to CODEUNITS32, and the testing was successful.
Using a string units specification of CODEUNITS32 for a column does not change the encoding of the column; the data is still stored as UTF-8 for CHAR/VARCHAR columns.
It alters the physical length (CHAR) or maximum length (VARCHAR) of the column by a factor of 4.
It also enables "character semantics" in some functions such as SUBSTR(), so that they work on characters, not bytes, when processing CODEUNITS32 columns. (SUBSTRING() always uses character semantics, unless it is processing a FOR BIT DATA column.)
So CHAR(4), which is CHAR(4 OCTETS), is 4 bytes long and can hold at most 4 characters if they are all single-byte in UTF-8. Since € is 3 bytes long in UTF-8, such a column could hold, say, €4 but not €42.
A CHAR(4 CODEUNITS32) is 16 bytes long and is allowed to hold up to 4 characters. It could hold €€€€ but not €2345.
It is worth considering avoiding CHAR(x CODEUNITS32) and preferring VARCHAR(x CODEUNITS32). UTF-8 does not really play well with fixed-width data types. The more common UTF-8 characters are 1 or 2 bytes long, so typically a CHAR(x CODEUNITS32) column will be more than 50% space padding.
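To illustrate the length semantics (a sketch against a Unicode database; the table name is made up):
CREATE TABLE units_demo (
    c_octets CHAR(4 OCTETS),       -- 4 bytes: can hold '€4' (3 + 1 bytes) but not '€42'
    c_cu32   CHAR(4 CODEUNITS32)   -- 4 characters (16 bytes): can hold '€€€€' but not '€2345'
);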
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008470.html
CODEUNITS32
Indicates that the units for the length attribute are Unicode UTF-32 code units, which approximate counting in characters. This unit of length does not affect the underlying code page of the data type. The actual length of a data value is determined by counting the UTF-32 code units as if the data was converted to UTF-32. A string unit of CODEUNITS32 can be used only in a Unicode database. CODEUNITS32 can be explicitly specified or determined based on an environment setting.
Also, out of interest, GRAPHIC/VARGRAPHIC columns are stored in UTF-16 and default to CODEUNITS16, but they can also use CODEUNITS32.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008471.html
I am using the query below to get the stored procedure definition, but the TEXT column comes back as NULL in IBM Data Studio, even though I am able to CALL the SP.
SELECT PROCNAME, TEXT FROM SYSCAT.PROCEDURES WHERE PROCNAME LIKE '%USP_ABC%'
Please help.
You have confirmed that the syscat.procedures.language is SQL, and that your query-tool is able to display a substr() of the text.
Workaround depends on the length(text) of the row of interest:
SELECT PROCNAME, substr(TEXT,1, 1024) FROM SYSCAT.PROCEDURES WHERE PROCNAME LIKE '%USP_ABC%'
You may need to adjust the length of the substr extract depending on the length of the text and your configuration, for example substr(TEXT, 1, 2048) or whatever higher value your query-tool can cope with.
You can find the length of the TEXT column with LENGTH(TEXT) for the row of interest.
You can also CAST the CLOB to CHAR or VARCHAR, using a length that fits within their limits and whatever query-tool limitations you have.
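For example (the length here is only an illustration; pick one that covers LENGTH(TEXT) and that your query-tool accepts):
SELECT PROCNAME, CAST(TEXT AS VARCHAR(32000)) AS TEXT FROM SYSCAT.PROCEDURES WHERE PROCNAME LIKE '%USP_ABC%'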
Another option is to use a different query tool that can work with CLOB.
Are you using the latest version of Data Studio with the latest fix? It sounds like you might have an invalid UTF-8 character in your SP, or, as you are using SUBSTR and SUBSTRING, you are breaking a multi-byte character in two.
You could try setting
-Ddb2.jcc.charsetDecoderEncoder=3
in your eclipse.ini to get Java to use a replacement character rather than replacing the invalid string with NULL.
See this tech note
https://www-01.ibm.com/support/docview.wss?uid=swg21684365
Otherwise, do raise this with IBM Support.
The Postgres ltree documentation says:
A label is a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed). Labels must be less than 256 bytes long.
However, it does not say which characters are valid if the locale is set to 'en_US.UTF-8'. So, can a dash (hyphen) be used in an ltree label?
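For what it's worth, this is easy to test on a given installation (just a check, not a fix):
SELECT 'Top.sub-label'::ltree;
-- fails with a syntax error if the installed ltree does not accept '-' in labels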
Sorry for not updating the answer.
Yes, I finally figured out that one of our DBAs changed the source of ltree and recompiled it to support the dash (-) character. We have a single table with more than 6B records.
I don't have much experience with MS SQL Server 2008 R2, but here is the issue; I would appreciate your help:
I have a table with a column/field (type: nvarchar) that stores text. The text is read from a text file and written to the database using a VB.NET application.
The text in the text file contains Turkish characters such as ü (the u with two dots on top); in the future it will be in different languages.
When I open the table, the text in the column is not readable: the Turkish special characters have been converted to unreadable characters.
Is there any way to make the text readable in the table?
Thank you so much.
SQL Server doesn't change any characters stored in tables; I think the problem is that the text is being displayed in a different character set. Try using the UTF-8 character set.
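One way to check whether the stored data is actually intact (a sketch; the table and column names are placeholders) is to look at the code points rather than the rendered text:
SELECT txt, UNICODE(SUBSTRING(txt, 1, 1)) AS first_code_point   -- e.g. ü should come back as 252
FROM dbo.MyTexts;
If the code points are correct, the data is fine and only the display is at fault; if not, the characters were already lost before they reached the NVARCHAR column (for example, inserted as non-Unicode literals without the N prefix).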