I currently have a routine written in .NET which uses SqlBulkCopy. I'm rewriting this routine in conventional SQL, so I'm using BULK INSERT.
I am having a problem when it comes to special characters. In particular, the ŷ character is converted to ? by the SqlBulkCopy routine, but when I use BULK INSERT it is converted to the ² character. I need to somehow end up with the ? character so the output matches the original routine.
If it makes any difference, the file I'm importing is a Unix, ANSI-encoded file.
Any tips please?
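One thing worth looking at is BULK INSERT's CODEPAGE option, which tells SQL Server which ANSI code page the source file was written in; getting that wrong is a common cause of this kind of character substitution. A minimal sketch, with a hypothetical table and file path, and assuming code page 1252 (substitute whatever your file actually uses):

    -- Hypothetical table and path; CODEPAGE must match the file's actual ANSI code page.
    BULK INSERT dbo.TargetTable
    FROM 'C:\import\data.txt'
    WITH (
        CODEPAGE        = '1252',   -- or 'ACP' to use the server's ANSI code page
        DATAFILETYPE    = 'char',
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '0x0a'    -- Unix files end lines with LF only
    );

Note that if the target column is varchar, the character also has to exist in that column's collation code page; otherwise it will be substituted regardless of what CODEPAGE says.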
I am trying to store unicode information in a MongoDB database so that I can render characters on a web page. I understand that MongoDB stores everything in BSON format and, in particular, stores BSON strings with utf-8 encoding (as per this link), so I bet this question can be resolved by someone who knows more than I do.
The problem: I want to render Hebrew characters. I made a CSV file in which I list their unicode code points as plain text and I need to figure out what prefix to include in this text string so that I can properly handle it with MongoDB.
A string such as "05D8" has no problem -- in my CSV file it is represented as "05D8", and in MongoDB it comes through as "05D8".
However, the string "05E0" -- meaning U+05E0, the Hebrew letter "nun" -- is being ingested by MongoDB and coerced into a number via a scientific-notation interpretation (it reads as 5 × 10^0). Ten characters in the Hebrew alphabet (U+05E0 through U+05E9) all have this issue, even though MongoDB is ingesting all of my other strings properly.
Two questions:
Q1: What prefix should I put on the front of the strings in the CSV file in order to get MongoDB to treat "05E0" as U+05E0? u'...'? u"..."? I've tried u'05E0', but that gets stored in MongoDB as "u'05E0'", which is not quite what I want. (That's my problem, not Mongo's problem -- I just can't figure out what to type in the CSV file.)
Q2: Is there a flag for mongoimport with which I can force the information from this CSV to be interpreted as text and not as scientific notation?
Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.
What codepage? Which collation? Varchar stores characters assuming a specific codepage. Only the lower 128 characters (the ASCII subset) are standard; higher characters vary by codepage.
The codepage used matches the collation of the column, whose default is inherited from the table, database, and server. All of the defaults can be overridden.
In short, there IS no "simple chart". You'll have to check the character chart for the specific codepage, e.g. using the "Character Map" utility in Windows.
It's far, far better to use Unicode and nvarchar when storing text in the database. If you store text data under the wrong codepage you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used is to enforce it all the way from the client (i.e. the desktop app), through the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.
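To make the difference concrete, here is a small sketch (the sample character is just illustrative) of how the same character fares in varchar versus nvarchar when the database's default collation uses a codepage that doesn't contain it:

    -- Illustrative only: the varchar result depends on the database's default collation.
    DECLARE @v varchar(10)  = N'ŷ';   -- converted to the collation's codepage; if the
                                      -- character isn't in it, you get a best-fit
                                      -- substitute or '?'
    DECLARE @n nvarchar(10) = N'ŷ';   -- stored as Unicode (UTF-16), no conversion
    SELECT @v AS varchar_value, @n AS nvarchar_value;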
Is there a Huffman or zip compression DLL written for T-SQL? I have searched and can't seem to find one. I want to store compressed data in one field and use a calculated field to display the uncompressed data.
There is no built-in function in T-SQL, but you can write your own assembly (a SQLCLR assembly) in C# or VB.NET and add it to your SQL Server instance.
Then you can use that function from your SQL.
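For example, once the assembly is compiled, registering it and exposing the uncompressed text through a calculated (computed) column might look roughly like this; the assembly, class, function, table, and column names are all hypothetical:

    -- Requires CLR integration to be enabled on the instance:
    --   EXEC sp_configure 'clr enabled', 1; RECONFIGURE;
    CREATE ASSEMBLY CompressionClr
        FROM 'C:\clr\CompressionClr.dll'
        WITH PERMISSION_SET = SAFE;
    GO
    -- Hypothetical CLR scalar function that decompresses a varbinary blob back to text
    CREATE FUNCTION dbo.Decompress (@data varbinary(max))
    RETURNS nvarchar(max)
    AS EXTERNAL NAME CompressionClr.[UserDefinedFunctions].Decompress;
    GO
    -- The calculated field: a (non-persisted) computed column calling the CLR function
    ALTER TABLE dbo.Documents
        ADD BodyText AS dbo.Decompress(BodyCompressed);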
I am using JDBC (jt400) to insert data into an AS/400 table.
The DB table's code page is 424 (Host Code Page 424).
The EBCDIC 424 code page does not support many of the characters that may come from the client.
For example, the → character (ASCII 26, hex 1A).
The result is an incorrect translation.
Is there any built-in way in the Toolbox to remove the unsupported characters?
You could try to create a logical file over your CCSID 424 physical file with a different codepage. On the AS/400 it is possible to create logical files with different codepages for individual columns, by adding the keyword CCSID(<num>). You can even set it to a Unicode charset, e.g. CCSID(1200) for UTF-16. Of course your physical file will still only be able to store characters that are in the 424 codepage, and anything else will be replaced by a substitution character, but the translation might be better that way.
There is no way to store characters that are not in codepage 424 in a column with that codepage directly (the only way I can think of is encoding them somehow with multiple characters, but that is most likely not what you want to do, since it will bring more problems than it "solves").
If you have control over that system, and it is possible to make some bigger changes, you could do it the other way around: create a new Unicode version of that physical file under a different name (I'd propose CCSID(1200); that's as close as you get to UTF-16 on the AS/400 afaik, UTF-8 is not supported by all parts of the system in my experience, and IBM does recommend 1200 for Unicode). Then transfer all the data from the old file to the new one, delete the old one (back it up first!), and then create a logical file over the new physical file, with the name of the old physical file. In that logical file, change all CCSID-bearing columns from 1200 to 424. That way, existing programs can still work on the data. Of course there will be invalid characters in the logical file once you insert data that is not in a subset of CCSID 424, so you will most likely have to take a look at all programs that use the new logical file.
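A rough sketch of the "new Unicode physical file" half of that approach, using SQL on the AS/400 (DB2 for i); the library, file, and column names are hypothetical, and the CCSID 424 logical file over it would still be created with DDS as described above:

    -- Hypothetical library/file/column names; VARGRAPHIC with CCSID 1200 holds UTF-16
    CREATE TABLE MYLIB.CUSTNEW (
        CUSTNO   DECIMAL(7, 0)              NOT NULL,
        CUSTNAME VARGRAPHIC(60)  CCSID 1200 NOT NULL,
        CUSTNOTE VARGRAPHIC(500) CCSID 1200
    );
    -- Copy the data across from the old CCSID 424 physical file
    INSERT INTO MYLIB.CUSTNEW
        SELECT CUSTNO, CUSTNAME, CUSTNOTE
        FROM MYLIB.CUSTOLD;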
I am working on a web development project using Flask and the SQLAlchemy ORM. My post is about the use of Unicode in developing this app.
What I have understood so far about Unicode:
If I want my webapp to handle data in languages other than English, I need to use the unicode data type for my variables, because plain (byte) strings can't handle Unicode data.
I use some database which either stores Unicode data or takes responsibility for converting Unicode to raw bytes while saving, and vice versa while retrieving. SQLAlchemy gives me the option to set automatic conversion both ways, so that I don't have to worry about it.
I am using Python 2.7, so I have to be aware of processing Unicode data properly. Normal string operations on Unicode data may be buggy.
Correct me if any of the above assumptions is wrong.
Now my doubts and questions:
If I don't use Unicode now, will I have problems if I (or the Flask people) decide to port to Python 3?
I don't want to hassle with the thought of my webapp catering to different languages right now. I just want to concentrate on creating the app first. Can I add that later, without using Unicode right now?
If I use Unicode now, how does it affect my code? Do I replace every string input and output with unicode, or what?
Conversion of Unicode when saving to the database: can it be a source of performance problems?
Basically I am asking whether to use Unicode or not, having explained my needs and what I require from the project.
No, but make sure you separate binary data from text data. That makes it easier to port.
It's easier to use Unicode from the start, but of course you can postpone it. It's really not very difficult, though.
You replace everything that should be text data with Unicode, yes.
Only if you make loads of conversions on really massive amounts of text.