What is the methodology that the SSIS data conversion transformation uses when converting a DT_WSTR to a DT_STR? If a character exists in the DT_WSTR which does not exist in the destination code page of the DT_STR, will the mapping fail, or will the character be ignored?
I guess it uses WideCharToMultiByte or something similar.
You control what happens by setting the transformation's error and truncation options, see
http://msdn.microsoft.com/en-us/library/ms141679.aspx
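As a rough analogy (not SSIS internals), any Unicode-to-code-page conversion faces the same choice: raise an error for an unmappable character or substitute a replacement. A small Python sketch, with the code page and sample text invented for the example:

# Illustration only: converting Unicode text to a single-byte code page (here
# Windows-1252) either fails outright or substitutes when a character has no mapping.
text = "naïve 漢字"   # 'ï' exists in cp1252, the CJK characters do not
try:
    text.encode("cp1252")                       # strict: raises, like failing/redirecting the row
except UnicodeEncodeError as exc:
    print("conversion failed:", exc)
print(text.encode("cp1252", errors="replace"))  # or substitute '?' instead of failing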
I can't understand the reason why PostgreSQL stores data in its own format
The "hex" format encodes binary data as 2 hexadecimal digits per byte, most significant nibble first. The entire string is preceded by the sequence \x (to distinguish it from the escape format).
Does this mean that it is not simple hex, that it would not be possible to simply convert this hex to a byte type, and that I should write a parser for PostgreSQL's hex format?
The client driver usually takes care of bytea conversion for you, supplying you a native language data type like byte[] for Java. The representation of bytea on the wire shouldn't generally concern you. The only time it'll really matter is if you're using bytea literals in SQL text, rather than sending them as bind parameters.
Anyway, it is normal hex, it just has a \x prefix, so it's utterly trivial to "parse" if you do need to do so manually. E.g. in Python 3:
bytes.fromhex(r'\x736f6d65737472696e67'[2:])   # strip the \x prefix, then decode the hex digits
(On Python 2 the equivalent was r'\x736f6d65737472696e67'[2:].decode("hex").)
The reason for the \x prefix is largely historical. PostgreSQL used to use an octal escape format for bytea data. When the format was changed to hex - to make it easier for clients to consume and work with and make it a bit more compact - it was necessary for the client to be able to tell what format the data was in. Since \x can never appear in octal ("escape") format literals, any string beginning with \x must be a hex bytea literal. This is even more important when receiving data from a client, which might be sending either hex or escape style literals, and the server must be able to tell which is which.
We could've just required that all clients use the format specified by the server. But that would break compatibility for all old clients that use bytea. Personally I think that's exactly what we should've done, and required that people using old clients set bytea_format = escape or something. That's not what happened, though. The setting bytea_output controls the format the server sends, but it still understands both formats as input. That makes interoperating with old clients and scripts easier. In theory.
In practice lots of old clients blindly interpreted hex literals sent by the server as if they were escape-format even though they were invalid; they'd ignore the backslash or treat it as a literal backslash. So they'd tend to corrupt bytea data when loading it then saving it again. Exactly what we wanted to avoid.
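If you ever do have to handle both formats yourself, the prefix makes the dispatch easy. A minimal Python sketch (not taken from any real driver; decode_bytea is just a name for this example):

def decode_bytea(text):
    # Hex format: "\x" followed by two hex digits per byte.
    if text.startswith(r"\x"):
        return bytes.fromhex(text[2:])
    # Escape format: "\\" is a backslash, "\nnn" is an octal byte, anything else is literal.
    out, i = bytearray(), 0
    while i < len(text):
        if text[i] == "\\":
            if text[i + 1] == "\\":
                out.append(0x5C)
                i += 2
            else:
                out.append(int(text[i + 1:i + 4], 8))
                i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

print(decode_bytea(r"\x736f6d65737472696e67"))   # b'somestring'
print(decode_bytea(r"abc\000\\def"))             # b'abc\x00\\def'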
I'm using this:
COPY (select field1, field2, field3 from table)
TO 'C://Program Files/PostgreSql//8.4//data//output.dat' WITH BINARY
To export some fields to a file; one of them is a bytea field. Now I need to read the file with a custom-made program.
How can I parse this file?
The general format of a file generated by COPY...BINARY is explained in the documentation, and it's non-trivial.
bytea contents are the easiest to deal with, since they're not encoded.
Every other datatype has its own encoding rules, which are described not in the documentation but in the source code. From the doc:
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution).
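To give an idea of what "non-trivial" means in practice, here is a rough Python sketch of walking such a file, following the layout described in the COPY documentation (read_copy_binary and the example file name are made up; interpreting anything other than bytea still needs the per-type *send/*recv rules):

import struct

def read_copy_binary(path):
    with open(path, "rb") as f:
        data = f.read()
    assert data[:11] == b"PGCOPY\n\xff\r\n\x00", "not a COPY BINARY file"
    pos = 11
    flags, ext_len = struct.unpack_from("!iI", data, pos)
    pos += 8 + ext_len                       # skip the flags field and header extension
    rows = []
    while True:
        (nfields,) = struct.unpack_from("!h", data, pos)
        pos += 2
        if nfields == -1:                    # file trailer
            break
        row = []
        for _ in range(nfields):
            (length,) = struct.unpack_from("!i", data, pos)
            pos += 4
            if length == -1:                 # NULL field
                row.append(None)
            else:
                row.append(data[pos:pos + length])   # raw bytes; a bytea field is verbatim
                pos += length
        rows.append(row)
    return rows

for field1, field2, field3 in read_copy_binary("output.dat"):
    print(field3)                            # the bytea column, as raw bytes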
It might be easier to use the text format rather than binary (so just remove the WITH BINARY). The text format has better documentation and is designed for interoperability; the binary format is intended more for moving data between Postgres installations, and even there it has version incompatibilities.
Text format will write the bytea field as if it were text, and encode any non-printable characters with \nnn octal escapes (except for a few special cases that it encodes with C-style escapes such as \n and \t). These are listed in the COPY documentation.
The only caveat is that you need to be absolutely sure that the character encoding you use when saving the file is the same as when reading it, so that the printable characters map to the same numbers. I'd stick to SQL_ASCII, as it keeps things simpler.
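As a sketch of what reading a text-format field back looks like (the escape list follows the COPY documentation; unescape_copy_field is just a name for this example, and a bytea column will still come out as its text representation, which then needs decoding as shown earlier):

ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}

def unescape_copy_field(field):
    if field == r"\N":                       # COPY's marker for NULL
        return None
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\":
            nxt = field[i + 1]
            if nxt in ESCAPES:
                out.append(ESCAPES[nxt])
                i += 2
            elif nxt.isdigit():              # \nnn octal escape
                out.append(chr(int(field[i + 1:i + 4], 8)))
                i += 4
            else:                            # any other escaped character stands for itself
                out.append(nxt)
                i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

# Fields in a text-format COPY file are tab-separated, one row per line.
print([unescape_copy_field(f) for f in "hello\tline1\\nline2\t\\N".split("\t")])
# -> ['hello', 'line1\nline2', None]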
MongoDB uses UTF-8 internally; how do I set the output charset? Is there a command similar to MySQL's "SET NAMES"? I'm using the C++ mongoclient.
I'm not sure if the C++ driver behaves differently, but as far as I know you'll always get a UTF-8 encoded result back. So if you want that data converted to another character set, you need to do the conversion yourself (I don't know what options you have in C++).
MongoDB exclusively deals with UTF-8. You can't change either the input or output character set or encoding. You will need to do that in your application, where you also need to make sure that every string you send to MongoDB is actually UTF-8. None of the drivers currently support anything else, and it's not likely they ever will.
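So the re-encoding is an application-side step. A sketch of the idea in Python (the C++ equivalent would go through something like ICU or iconv; GBK is just an example target charset):

# Everything the driver returns is UTF-8 text; converting it for a legacy consumer
# happens in the application.
doc_value = "naïve café"                                   # a string as handed back by the driver
legacy_bytes = doc_value.encode("gbk", errors="replace")   # re-encode; unmappable chars become '?'
print(legacy_bytes.decode("gbk"))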
We restored from a backup in a different format to a new MySQL structure (which is set up correctly for UTF-8 support). We have weird characters showing in the browser, but we're not sure what these characters are called, so we can't find a master list of what they translate to.
I have noticed that they do, in fact, correlate to a specific character. For example:
â„¢ always translates to ™
— always translates to —
• always translates to ·
I referenced this post, which got me started, but it is far from a complete list. Either I'm not searching for the correct name, or a "master list" of these bad-to-good conversions simply doesn't exist.
Reference:
Detecting utf8 broken characters in MySQL
Also, when searching via a MySQL query, if I search for â, MySQL always treats it as an "a". Is there any way to tweak my MySQL queries so that they do more literal searches? We don't use internationalization much, so I can safely assume that any field containing the â character is a problematic entry, which will need to be remedied by the "fixit" script we're building.
Instead of designing a "fixit" script to go through and replace this data, I think it would be better to simply fix the issue directly. It seems like the data was originally stored in a different format than UTF-8 so that when you brought it into the table that was set up for UTF-8, it garbled the text. If you have the opportunity, go back to your original backup to determine the format the data was stored in. If you can't do that, you will probably need to do a bit of trial and error to figure out which format the data is in. However, once you know that, conversion is easy. Read the following article's section on Repairing:
http://www.istognosis.com/en/mysql/35-garbled-data-set-utf8-characters-to-mysql-
Basically you are going to set the column to BINARY and then set it to the original charset. That should make the text appear properly (a good check to know you are using the correct charset). Once that is done, set the column to UTF-8. This will convert the data properly and it will correct the problems you are currently experiencing.
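For what it's worth, the same repair can be sanity-checked at the string level; the sketch below assumes the original charset was Windows-1252 (cp1252), which matches the symptoms you listed, but yours may differ:

# Text that is really UTF-8 but was decoded as cp1252 shows exactly these symptoms;
# re-encoding as cp1252 and decoding as UTF-8 undoes the damage.
garbled = "â„¢ â€” â€¢"
print(garbled.encode("cp1252").decode("utf-8"))   # ™ — •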
How can I find out the character code of a parameter that comes to my page?
Working with ASP.
Some of the characters that I see can't be found in the DB (Access).
I want to know the Unicode value of the parameter.
Assuming you're using C#, you can get the 16-bit characters of strings as integers using (int) foo[0].
(In VB.NET, I think it would be Convert.ToInt32(foo.Chars(0)). In VBA or VBScript, AscW(foo) gives the Unicode code point of the first character, while Asc(foo) gives the ANSI code.)
If the string you're passing to ADO is correct, but the results are not as expected, either the data in the database was already corrupted by some earlier non-Unicode-friendly process, or else Access is not treating the input string and the database record as equal for some unexpected reason. (Equivalence of Unicode strings is a bit complicated.)
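If it helps to sanity-check the value outside of ASP, the same idea as a quick Python sketch (purely diagnostic; "résumé" stands in for the incoming parameter):

# Print the Unicode code point of every character in the suspect value.
value = "résumé"
for ch in value:
    print(ch, f"U+{ord(ch):04X}")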