SSIS unicode flat file issue "Character not in code page" - unicode

I have a text file created in java using UTF-16 encoding.
When I try to import it, I get a validation error on the flat file source before any data moves. The error says that a character is not in the specified code page:
[Flat File Source [908]] Error: Data conversion failed. The data conversion for column "ACTIVE_INGREDIENT" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page."
In my Flat File connection, I don't have Unicode selected (with it selected, the connection fails to find my CR LF line terminators), but I have set the code page to 65001 (UTF-8).
In my flat file data source, I have changed all Internal and External Columns to DT_WSTR in the advanced editor (I can't seem to change the code page; it is stuck on 0 with this option).
I am not doing a data conversion, as I am mapping to NVARCHAR tables (the SSIS job isn't even getting far enough to try to transfer data).
I can't even redirect the failing rows to a text file to identify them, as I hit the same issue when trying to output to a flat file destination.
Any help appreciated.
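One thing worth checking outside SSIS is whether the file's actual encoding matches the code page on the connection: a UTF-16 file read as code page 65001 (UTF-8) will contain byte sequences that are not valid UTF-8. Java's "UTF-16" charset writes big-endian with a byte-order mark, which may also explain why the Unicode option struggles with the line terminators. The sketch below is not an SSIS fix; it only shows how the byte-order mark could be inspected and the data re-encoded as UTF-8 before import. The helper names `sniff_bom` and `to_utf8` and the sample row are made up for illustration.

```python
import codecs

# BOM signatures, checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def to_utf8(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Decode the raw bytes (the 'utf-16' codec honours the BOM) and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample: a header and one row, written big-endian with a BOM.
text = "ACTIVE_INGREDIENT\r\nparac\u00e9tamol\r\n"
sample = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

detected = sniff_bom(sample)   # "utf-16-be"
converted = to_utf8(sample)    # UTF-8 bytes, CR LF preserved, BOM removed
```

With real data, the same two functions could be applied to the bytes read from the file in binary mode, and the converted output written to a new file that matches the 65001 code page.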

Related

PostgreSQL Escape Microsoft Special Characters In Select Query

PostgreSQL, DBvisualizer and Salesforce
I'm selecting records from a database table and exporting them to a csv file: comma-separated and UTF8 encoded. I send the file to a user who is uploading the data into Salesforce. I do not know Salesforce, so I'm totally ignorant on that side of this. She is reporting that some data in the file is showing up as gibberish (non UTF8) characters (see below).
It seems that some of our users are copy/pasting emails into a web form which then inserts them into our db. Dates from the email headers (I believe) are the text that are showing as gibberish.
11‎/‎17‎/‎2015‎ ‎7‎:‎26‎:‎26‎ ‎AM
becomes
‎11‎/‎16‎/‎2015‎ ‎07‎:‎26‎:‎26‎ ‎AM
The text in the db field looks normal. She only sees the odd characters once the data is exported to a csv file and that file is viewed in a text editor like Wordpad, or in Salesforce.
This only happens with dates from text that was copy/pasted into the form/db. I have no idea how to remove these "unseen" characters, or whether there even is a way.
It's the same three characters each time: â€Ž. I did a regex_replace() on these to strip them out, but it doesn't work. I think that since they are not seen in the db field, the regex does not see them either.
It seems like even though I cannot see these characters, they must be there in some form that is making them show in text-editors like Wordpad or the Salesforce client after being exported to csv.
I can probably do a mass search/find/replace in the text editor, but it would be nice to do this in the sql and avoid the extra step each time.
Hoping someone has seen this and knows an easy fix.
Thanks for any ideas or pointers that may help.
The sequence â€Ž is a left-to-right mark (U+200E), encoded in UTF-8 (as 0xE2 0x80 0x8E), but being read as if it were in Windows-1252.
A left-to-right mark is invisible, so the fact that you can't see it in the database suggests that it's encoded correctly, but without knowing precisely what path the data took after that, it's hard to guess exactly where it was misinterpreted.
In any case, you should be able to replace the character in your Postgres query by using its Unicode escape sequence: E'\u200E'
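The byte-level explanation above can be reproduced outside the database. This is a minimal Python sketch, not part of the original answer: it encodes U+200E as UTF-8, decodes the same bytes as Windows-1252 to show where the three visible characters come from, and strips the mark from a sample string. The `strip_lrm` helper and the sample date are illustrative only.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, invisible when rendered

# The mark is three bytes in UTF-8 ...
utf8_bytes = LRM.encode("utf-8")        # b'\xe2\x80\x8e'

# ... and those three bytes, misread as Windows-1252, become three visible characters.
mojibake = utf8_bytes.decode("cp1252")  # 'â€Ž'


def strip_lrm(text: str) -> str:
    """Remove every left-to-right mark from the text."""
    return text.replace(LRM, "")


# A date as it might arrive from a copy/pasted email header (hypothetical sample).
sample = "\u200e11\u200e/\u200e16\u200e/\u200e2015"
cleaned = strip_lrm(sample)             # '11/16/2015'
```

The same idea applies in the query: the replacement has to target the single character U+200E, not the three characters that a Windows-1252 viewer displays.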

Reading file names inside .zip file

I am familiar with the .zip file format, and able to read the internal file table content so far.
The problem occurs with non-English characters in the file name.
The specification states that file names use the OEM character set, yet sometimes I get a UTF-8 representation and sometimes an OEM representation.
The specification states that the "version made by" field should be in the range 0-20, yet I get versions 31 and 63, which may or may not affect the character set.
Another related problem: when I read the "extra field", there is an "up" entry (unicode path, id=0x7075) which is supposed to store the UTF-8 representation of the file name. However, it starts with 5 seemingly redundant bytes before the actual UTF-8 string (the archive was created by WinRar), yet other software seems to read it correctly.
Any input about the issue?
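For what it is worth, the 5 bytes in front of the name match the layout of the Info-ZIP Unicode Path extra field as I understand it: a 1-byte version followed by a 4-byte CRC-32 of the standard (non-Unicode) file name, then the UTF-8 name. Separately, bit 11 of the general purpose bit flag signals that the standard name itself is UTF-8, which would account for seeing both representations. The sketch below is a hedged illustration under those assumptions; the function names and the synthetic extra field are made up, and it is not a full ZIP parser.

```python
import struct
import zlib
from typing import Optional

UNICODE_PATH_ID = 0x7075  # "up" header ID
UTF8_NAME_FLAG = 0x0800   # general purpose bit 11


def name_is_utf8(general_purpose_flags: int) -> bool:
    """True when the header's file name is flagged as UTF-8 instead of OEM."""
    return bool(general_purpose_flags & UTF8_NAME_FLAG)


def unicode_path_from_extra(extra: bytes, raw_name: bytes) -> Optional[str]:
    """Return the UTF-8 name from a 0x7075 extra field, or None if absent or stale."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        # The 5 leading bytes: 1-byte version, then CRC-32 of the standard name.
        version, name_crc = struct.unpack_from("<BI", data, 0)
        if version != 1 or name_crc != zlib.crc32(raw_name):
            return None  # field does not describe the current standard name
        return data[5:].decode("utf-8")
    return None


# Synthetic example: an OEM (cp437) name plus a matching Unicode Path field.
raw_name = "\u00e9.txt".encode("cp437")
payload = struct.pack("<BI", 1, zlib.crc32(raw_name)) + "\u00e9.txt".encode("utf-8")
extra = struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload

recovered = unicode_path_from_extra(extra, raw_name)  # 'é.txt'
```

The CRC check is there so that a reader can ignore the Unicode name when some tool has renamed the entry without updating the extra field.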

MS Access Convert from Unicode when Reading from Text File

So, I have an Access database where I import data from a text file. The file is semi-colon delimited. Occasionally (and will become more frequent) I receive a file from one of our affiliates from Russia. The file has unicode (I think) characters like "Ìèðîøíèêîâ" instead of "Мирошников". Ultimately, I'd like to translate those into English upon import, but for now, I'll accept the Russian characters.
How should I go about doing this? Currently, I'm reading each line of the file, using the SPLIT function to separate each field by the ";" separator into an array, and sticking each array element into a table. Would changing the system Keyboard Layout to Russian prior to this work, or is it more complicated than that?
Does any of this make sense, or should I just bag it and go grab a beer (or some Vodka)?
Thanks!
You should be able to create an "Import Specification" that will tell Access how to convert the character data. Follow the procedure here...
Importing a text with separators using VBA
...and choose the appropriate character set from the "Code Page" combo box.
If you need to perform the imports from VBA code then you can save the specification (using the "Save As..." button) and then re-use that specification in a DoCmd.TransferText statement.
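The sample in the question suggests that the file is not Unicode at all but Windows-1251 (Cyrillic) text being displayed as Windows-1252, which is why picking the right code page fixes it. The following Python sketch is only an illustration of that mismatch, not an Access solution; `parse_line` is a hypothetical helper that mirrors the split-on-semicolon approach described in the question.

```python
from typing import List

# The garbled text from the question, as shown by a Windows-1252 viewer.
garbled = "\u00cc\u00e8\u00f0\u00ee\u00f8\u00ed\u00e8\u00ea\u00ee\u00e2"  # 'Ìèðîøíèêîâ'

# Undo the wrong decoding, then decode the same bytes as Windows-1251.
repaired = garbled.encode("cp1252").decode("cp1251")  # 'Мирошников'


def parse_line(raw_line: bytes, encoding: str = "cp1251") -> List[str]:
    """Decode one line with the file's real code page and split it on ';'."""
    return raw_line.decode(encoding).rstrip("\r\n").split(";")


# Hypothetical line as it would appear in the affiliate's file.
fields = parse_line("\u041c\u0438\u0440\u043e\u0448\u043d\u0438\u043a\u043e\u0432;42\r\n".encode("cp1251"))
```

Changing the keyboard layout would not affect this, since the layout governs input and not how stored bytes are decoded.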

Identifying hidden characters in text

I have an ETL process that regularly extracts code from an ODBC data source, manipulates it, and inserts it into my postgres database. One of the columns from this data source regularly has odd characters in it.
For the most part I can catch and convert all of the characters appropriately, but I have one character that exists in the ODBC data source, cannot be brought into postgres (all of the text after that character gets truncated), and I'm having a hard time identifying what the character is.
I can't even insert an example of the character directly into this post because it gets stripped out :/ The closest I can get is a screen shot of the character in textmate (the only application I can actually see the character in):
The character is the diamond between the 1 and the 0. When my data comes in, everything after the 0 is truncated.
Is there a good way of identifying what this character is so I can figure out a way of stripping it out?
Per tripleee's comment on the original question post:
To identify the character, I dumped the text as hex and looked up the value of the offending character.
There are a number of ways to do this, but the quickest way for me was to dump the text into a utility application I have called HexFiend. Once the text was in and I highlighted the character, it returned the hex value "00".
A bit more investigation pointed towards the hex null value being used as a line terminator in C applications (which makes sense given the context of my project).
I've fit this null value into my ETL process so that it gets switched out with a new line and now everything is sunshine and daisies.
Thanks again for the help!
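The same inspection can be scripted instead of using a hex editor. This is a minimal sketch, assuming the text is already available as a Python string: `find_hidden` (a made-up helper) reports the position, code point, and Unicode category of each control or format character, and `replace_nul` performs the swap described above. PostgreSQL text values cannot contain a NUL character, so it has to be removed or replaced before the insert.

```python
import unicodedata


def find_hidden(text: str):
    """Return (index, code point, Unicode category) for each non-printable character."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Categories starting with "C" cover control, format, and unassigned characters.
        if category.startswith("C") and ch not in "\r\n\t":
            hits.append((index, f"U+{ord(ch):04X}", category))
    return hits


def replace_nul(text: str, replacement: str = "\n") -> str:
    """Swap NUL characters for a newline (or any other replacement)."""
    return text.replace("\x00", replacement)


# Hypothetical sample with a NUL between the 1 and the 0.
sample = "value 1\x000 and the rest"
found = find_hidden(sample)
fixed = replace_nul(sample)
```

Listing the code points first makes it easier to decide per character whether to strip it, replace it, or keep it.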

Bad MySQL import, now we have garbage showing in place of utf-8 chars

We restored from a backup in a different format to a new MySQL structure (which is set up correctly for UTF-8 support). We have weird characters showing in the browser, but we're not sure what they're called, so we can't find a master list of what they translate to.
I have noticed that they do, in fact, correlate to a specific character. For example:
â„¢ always translates to ™
â€” always translates to —
â€¢ always translates to ·
I referenced this post, which got me started, but this is far from a complete list. Either I'm not searching for the correct name, or the "master list" of these bad-to-good conversions as a reference doesn't exist.
Reference:
Detecting utf8 broken characters in MySQL
Also, when trying to search via a MySQL query, if I search for â, MySQL always treats it as an "a". Is there any way to tweak my MySQL queries so that they are more literal searches? We don't use internationalization much, so I can safely assume that any field containing the â character is a problematic entry that needs to be remedied by the "fixit" script we're building.
Instead of designing a "fixit" script to go through and replace this data, I think it would be better to simply fix the issue directly. It seems like the data was originally stored in a different format than UTF-8 so that when you brought it into the table that was set up for UTF-8, it garbled the text. If you have the opportunity, go back to your original backup to determine the format the data was stored in. If you can't do that, you will probably need to do a bit of trial and error to figure out which format the data is in. However, once you know that, conversion is easy. Read the following article's section on Repairing:
http://www.istognosis.com/en/mysql/35-garbled-data-set-utf8-characters-to-mysql-
Basically you are going to set the column to BINARY and then set it to the original charset. That should make the text appear properly (a good check to know you are using the correct charset). Once that is done, set the column to UTF-8. This will convert the data properly and it will correct the problems you are currently experiencing.
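As for the "master list": the garbage sequences in the question are what UTF-8 bytes look like when decoded as Windows-1252 (which MySQL's latin1 closely follows), so the list can be generated rather than looked up. The sketch below is an illustration under that assumption and is not a replacement for fixing the column charset as described above; `mojibake_of` and `repair` are made-up helper names.

```python
def mojibake_of(ch: str) -> str:
    """Show how a character looks when its UTF-8 bytes are read as Windows-1252."""
    return ch.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the mis-decoding; return the text unchanged if it is not mojibake."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a small bad-to-good lookup table for a few common characters.
# Some characters (for example U+201D) contain bytes that Windows-1252 leaves
# undefined, so they are deliberately left out of this sample.
table = {mojibake_of(ch): ch for ch in "\u2122\u2014\u2022\u00e9"}

fixed = repair("\u00e2\u201e\u00a2 and \u00e2\u20ac\u201d")  # '™ and —'
```

Because `repair` falls back to the original text on failure, correctly stored values such as "café" pass through untouched.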