AS400 JDBC character conversion - DB2

Using JDBC (jt400) to insert data into an AS400 table. The table's code page is 424 (Host Code Page 424).
The EBCDIC 424 code page does not support many of the characters that may come from the client, for example the → sign (ASCII 26, hex 1A).
The result is an incorrect translation.
Is there any built-in way in the toolbox to remove the unsupported characters?
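For illustration only, here is a minimal client-side sketch of what "removing the unsupported characters" could look like before the string reaches JDBC. It assumes the JRE ships the IBM424 charset under the name "Cp424"; the class and method names are hypothetical, and this is not a jt400 built-in.

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: drop anything the EBCDIC 424 code page cannot
// represent before the string is bound to a PreparedStatement.
public class Ccsid424Filter {

    // Replace characters that Cp424 cannot encode with a plain space.
    public static String sanitize(String input) {
        CharsetEncoder encoder = Charset.forName("Cp424").newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(encoder.canEncode(c) ? c : ' ');
        }
        return out.toString();
    }
}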

You could try to create a logical file over your CCSID 424 physical file with a different code page. On the AS/400 it is possible to create logical files with different code pages for individual columns by adding the keyword CCSID(<num>). You can even set it to a Unicode character set, e.g. CCSID(1200) for UTF-16. Of course your physical file will still only be able to store characters that are in the 424 code page, and the rest will be replaced by a substitution character, but the translation might be better that way.
There is no way to directly store characters that are not in code page 424 in a column with that code page (the only way I can think of is encoding them somehow with multiple characters, but that is most likely not what you want to do, since it will bring more problems than it "solves").
If you have control over that system, and it is possible to make some bigger changes, you could do it the other way around: create a new Unicode version of that physical file with a different name (I'd propose CCSID(1200); that's as close as you get to UTF-16 on the AS/400 as far as I know, and UTF-8 is not supported by all parts of the system in my experience. IBM does recommend 1200 for Unicode). Then transfer all data from your old file to the new one, delete the old one (back it up first!), and then create a logical file over the new physical file with the name of the old physical file. In that logical file, change all CCSID-bearing columns from 1200 to 424. That way, existing programs can still work on the data. Of course there will be invalid characters in the logical file once you insert data that is not in the CCSID 424 subset, so you will most likely have to take a look at all programs that use the new logical file.
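A rough sketch of that migration idea, issued through jt400's JDBC driver. Library, table, and column names are hypothetical and the exact DDL depends on your DB2 for i release; the CCSID(424) logical file over the new table would still be created with DDS as described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative only: create a UTF-16 (CCSID 1200) copy of a CCSID 424 file
// and copy the data across. Requires jt400 (jdbc:as400://) on the classpath.
public class MigrateToUnicode {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:as400://myas400", "user", "password");
             Statement st = con.createStatement()) {
            // New physical file with a UTF-16 character column.
            st.executeUpdate(
                "CREATE TABLE MYLIB.CUSTOMER_NEW (" +
                "  CUSTNO INTEGER NOT NULL, " +
                "  NAME   VARGRAPHIC(100) CCSID 1200)");
            // Copy the existing CCSID 424 data; DB2 converts on assignment
            // (add an explicit CAST if your release requires it).
            st.executeUpdate(
                "INSERT INTO MYLIB.CUSTOMER_NEW " +
                "SELECT CUSTNO, NAME FROM MYLIB.CUSTOMER");
        }
    }
}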

Related

Can't find a varchar chart of acceptable characters

Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.
What codepage? What collation? Varchar stores characters assuming a specific codepage. Only the lower 128 characters (the ASCII subset) are standard; higher characters vary by codepage.
The default codepage used matches the collation of the column, whose defaults are inherited from the table, database, and server. All of the defaults can be overridden.
In short, there IS no "simple chart". You'll have to check the character chart for the specific codepage, e.g. using the "Character Map" utility in Windows.
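If you want to build such a chart yourself, here is a small sketch that lists which code points a given codepage can represent. "windows-1252" is only an example name; substitute the codepage that matches your column's collation.

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Print the printable code points below U+0100 that the chosen codepage can encode.
public class CodepageChart {
    public static void main(String[] args) {
        CharsetEncoder enc = Charset.forName("windows-1252").newEncoder();
        for (char c = 0x20; c < 0x100; c++) {
            if (enc.canEncode(c)) {
                System.out.printf("U+%04X %c%n", (int) c, c);
            }
        }
    }
}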
It's far, far better to use Unicode and nvarchar when storing to the database. If you store text data from the wrong codepage you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used, is to enforce it all the way from the client (ie the desktop app) to the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.

Specifying ASCII columns in a UTF-8 PostgreSQL database

I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.
On some columns however, I am 100% sure there will never be any special characters, i.e. ISO country & currency codes.
I've tried doing something like:
"countryCode" char(3) CHARACTER SET "C" NOT NULL
and
"countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL
but this comes back with the error
ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist
What am I doing wrong?
More importantly, should I even bother with this? I'm coming from a MySQL background where doing this was a performance and space enhancement, is this also the case with PostgreSQL?
TIA
Honestly, I do not see the purpose of such settings, as:
as #JoachimSauer mentions, the ASCII subset of UTF-8 occupies exactly the same number of bytes; that was the main point of inventing UTF-8: keep ASCII unchanged. Therefore I see no size benefit;
all software that is capable of processing strings in different encodings uses a common internal encoding, which is UTF-8 by default for PostgreSQL nowadays. When some textual data comes in to the processing stage, the database will convert it into the internal encoding if the encodings do not match. Therefore, if you specify some columns as being non-UTF-8, this will lead to extra processing of the data, so you will lose some cycles (I don't think it would be a noticeable performance hit, though).
Given there's no space benefit and there's a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.
I think it is for the same reasons that PostgreSQL does not allow specifying encodings for individual objects within a database. Character set and locale are set at the per-database level.

DB2 VARCHAR unicode data storage

We are currently using VARCHAR for storing text data in DB2; however, we are hitting the problem that the length specified for a VARCHAR is not the same as the length of the text, because in DB2 the VARCHAR length is the UTF-8 data length, which varies with the stored text. For example, some texts contain characters from different languages, and because of that some texts of 500 characters can't be saved in a VARCHAR(500).
Now we are planning to migrate to VARGRAPHIC. I need to know the limitations of using VARGRAPHIC for storing Unicode text data in DB2.
Are there any problems with using VARGRAPHIC?
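To make the mismatch described in the question concrete, here is a quick sketch comparing character count with UTF-8 byte count, which is what VARCHAR(n) limits in a Unicode DB2 database (requires Java 11+ for String.repeat):

import java.nio.charset.StandardCharsets;

// 500 characters of non-Latin text can exceed 500 UTF-8 bytes.
public class Utf8LengthCheck {
    public static void main(String[] args) {
        String latin = "a".repeat(500);      // 500 chars, 500 UTF-8 bytes
        String cyrillic = "ж".repeat(500);   // 500 chars, 1000 UTF-8 bytes
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);    // 500
        System.out.println(cyrillic.getBytes(StandardCharsets.UTF_8).length); // 1000 -> does not fit VARCHAR(500)
    }
}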
DB2 doesn't check that the data is in fact a double-byte string, but it assumes it must be. Usually the drivers will do the proper conversions for you, but you might one day bump into a bug. It is unlikely, though.
If you use federated databases, VARGRAPHIC support in queries might fail completely. Overall, the number of bug reports for the VARGRAPHIC data type is somewhat high; support for it probably isn't as well tested and tried as for other data types.
With a Unicode database (i.e. UTF-8 is a requirement), VARGRAPHIC uses big-endian UCS-2, meaning the space requirements for those columns double. VARGRAPHIC is a DB2-proprietary data type, so if you migrate off DB2 some day you will have to do an extra conversion.

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII, how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
This source code will take a given String as input
The byte representation of that string will be taken (UTF-8, ASCII, you decide)
Some magic happens - (this is the part I need your help on)
The resulting bytes will be converted into an int or long (no decimal points)
The number will be converted into a corresponding character using this utility
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the following characters: '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table service has an API that accepts Unicode data for the storage of property names. It is a schema-free database (so columns can be created ad hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
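A minimal sketch of that packing step, assuming plain 7-bit ASCII input; compression/encryption and the base-anything conversion from the linked utility would wrap around it. The class name is a placeholder.

// Pack 7 significant bits per ASCII character, MSB first:
// 8 printable ASCII characters (7 bits each) fit into 7 bytes.
public class AsciiPacker {
    public static byte[] pack(String ascii) {
        int bitLen = ascii.length() * 7;
        byte[] out = new byte[(bitLen + 7) / 8];
        int bitPos = 0;
        for (int i = 0; i < ascii.length(); i++) {
            int c = ascii.charAt(i) & 0x7F;              // keep the 7 significant bits
            for (int b = 6; b >= 0; b--, bitPos++) {
                if (((c >> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pack("CustomerName").length); // 12 chars -> 11 bytes
    }
}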
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.

How to reduce the size of an sqlite3 database for iPhone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is that this hasn't required any changes in the iPhone code.
I have an iPhone application with a large dictionary held in SQLite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary-encode them as a blob (perhaps with RLE, as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in Ruby, and reading it on the phone in Objective-C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried running the "vacuum" command to make sure you don't have extra space in the DB you forgot to reclaim?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
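A toy sketch of that idea; the hash mixing here is arbitrary and not tuned, so treat it as an illustration of the mechanism rather than a production filter.

import java.util.BitSet;

// k hash functions set k bits per word; lookups can give false positives
// but never false negatives.
public class WordBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public WordBloomFilter(int sizeInBits, int hashes) {
        this.bits = new BitSet(sizeInBits);
        this.size = sizeInBits;
        this.hashes = hashes;
    }

    private int index(String word, int seed) {
        int h = word.hashCode() * 31 + seed * 0x9E3779B9;   // simple, arbitrary mixing
        return Math.floorMod(h, size);
    }

    public void add(String word) {
        for (int i = 0; i < hashes; i++) bits.set(index(word, i));
    }

    public boolean mightContain(String word) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(word, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        WordBloomFilter filter = new WordBloomFilter(8 * 1024 * 1024, 7); // ~1 MB of bits
        filter.add("photoengrave");
        System.out.println(filter.mightContain("photoengrave")); // true
        System.out.println(filter.mightContain("zzzzzz"));       // very likely false
    }
}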
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature currently uses at least 26 * 8 bits per entry (26 bytes), but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.
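For illustration, a sketch of the 4-bits-per-letter packing suggested above; it assumes the signature is the 26-digit string from the sample data, and the class name is a placeholder.

// Pack 26 letter counts (each capped at 15) into 13 bytes, two per byte.
public class SignaturePacker {
    public static byte[] pack(String signature) {            // e.g. "10002011000001210101010000"
        byte[] out = new byte[13];
        for (int i = 0; i < 26; i++) {
            int count = Math.min(signature.charAt(i) - '0', 15);
            if (i % 2 == 0) {
                out[i / 2] |= (byte) (count << 4);            // high nibble
            } else {
                out[i / 2] |= (byte) count;                   // low nibble
            }
        }
        return out;
    }
}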
Do I reckon correctly that you have about 450K words like that in your database?
I've got no clue about the iPhone, and I'm not that serious about SQLite either, but... as long as SQLite does not offer a way to save the file as gz right away (maybe it already does internally? No, it doesn't look like that when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, save the data "manually" in a compressed dictionary format, and build the rest on the fly and in memory. That should perform very well on your type of data.
Wait... are you using that signature to allow full-text searching or mistyping recognition? Would full-text search in SQLite not make that field obsolete?
As noted storing "Signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words - since you seem to be taking a root word and then appending "er", "ed", "es", etc why not have a column with a numeric ID that references a root word from a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word.
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.
Hmmm... an iPhone... doesn't it have a permanent data connection?
I think this is where a web application/web service can jump in snugly.
Move most of your business logic to the web server (it will have real SQL with FTS and lots of memory) and fetch that info online to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words, Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.
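A sketch of the reorder-and-trim idea above, assuming the signature digits are stored in a..z order. The frequency ranking string is hypothetical; any fixed order shared by the writer and the reader works, and the result is still a plain digit string that SQLite handles fine.

// Reorder signature digits so frequent letters come first, then drop trailing zeroes.
public class SignatureReorder {
    private static final String ORDER = "ESIARNTOLCDUGPMHBYFVKWZXQJ"; // hypothetical ranking

    public static String compact(String signature) {         // 26 digits, a..z order
        StringBuilder sb = new StringBuilder(26);
        for (int i = 0; i < 26; i++) {
            sb.append(signature.charAt(ORDER.charAt(i) - 'A'));
        }
        int end = sb.length();
        while (end > 0 && sb.charAt(end - 1) == '0') end--;   // trailing zeroes are implied
        return sb.substring(0, end);
    }
}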