Using the Icelandic Thorn character as a delimiter in Hive - encoding

I'm currently trying to import some DoubleClick advertising logs into Hadoop.
These logs are stored in a gzipped, delimited file which is encoded using code page 1252 (Windows ANSI?) and which uses the Icelandic Thorn character as the delimiter.
I can happily import these logs into a single column, but I can't seem to find a way to get Hive to understand the Thorn character - I think maybe because it doesn't understand the 1252 encoding?
I've looked at the Create Table documentation - http://hive.apache.org/docs/r0.9.0/language_manual/data-manipulation-statements.html - but can't seem to find any way to get this encoding/delimiter working.
I've also seen from https://karmasphere.com/karmasphere-analyst-faq a suggestion that the encoding for these files is ISO-8859-1 - but I don't see how to use that info in Hive or HDFS.
I know I can do a map job after import to split these rows into multiple records.
But is there an easier way to use this delimiter directly?
Thanks
Stuart

Use '\-2'.
The Thorn character is byte 0xFE (254) in that encoding, and Hive reads the delimiter as a signed byte, so 254 becomes -2.
Apparently the Hive devs don't think this is a problem:
https://issues.apache.org/jira/browse/HIVE-237
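For example, a minimal sketch of a table definition using that delimiter (the column names and HDFS location are made up for illustration, not taken from the DoubleClick logs):

-- Thorn is byte 0xFE (254) in Windows-1252 / ISO-8859-1, i.e. -2 as a signed byte.
-- Column names and LOCATION below are placeholders.
CREATE EXTERNAL TABLE doubleclick_log (
  event_time    STRING,
  user_id       STRING,
  advertiser_id STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\-2'
STORED AS TEXTFILE
LOCATION '/data/doubleclick/';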

PostgreSQL, DBvisualizer and Salesforce
I'm selecting records from a database table and exporting them to a csv file: comma-separated and UTF-8 encoded. I send the file to a user who is uploading the data into Salesforce. I do not know Salesforce, so I'm totally ignorant on that side of this. She is reporting that some data in the file is showing up as gibberish (non-UTF-8) characters (see below).
It seems that some of our users are copy/pasting emails into a web form, which then inserts them into our db. Dates from the email headers (I believe) are the text that is showing up as gibberish.
11/17/2015 7:26:26 AM
becomes
‎11‎/‎16‎/‎2015‎ ‎07‎:‎26‎:‎26‎ ‎AM
The text in the db field looks normal. It's only when it is exported to a csv file and that file is viewed in a text editor like Wordpad, or in Salesforce, that she sees the odd characters.
This only happens with dates in the text that is copy/pasted into the form/db. I have no idea how, or if there is a way, to remove these "unseen" characters.
It's the same three characters each time: ‎. I did a regexp_replace() on these to strip them out, but it doesn't work. I think that since they are not visible in the db field, the regex doesn't see them.
It seems like even though I cannot see these characters, they must be there in some form that makes them show up in text editors like Wordpad, or in the Salesforce client, after being exported to csv.
I can probably do a mass search/find/replace in the text editor, but it would be nice to do this in the sql and avoid the extra step each time.
Hoping someone has seen this and knows an easy fix.
Thanks for any ideas or pointers that may help.
The sequence ‎ is a left-to-right mark (U+200E), encoded in UTF-8 (as 0xE2 0x80 0x8E) but being read as if it were in Windows-1252.
A left-to-right mark is invisible, so the fact that you can't see it in the database suggests that it's encoded correctly, but without knowing precisely what path the data took after that, it's hard to guess exactly where it was misinterpreted.
In any case, you should be able to replace the character in your Postgres query by using its Unicode escape sequence: E'\u200E'
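For example, a minimal sketch of a cleanup query (the table and column names here are placeholders, not from the original post):

-- Strip the invisible left-to-right mark (U+200E) on the way out to CSV.
SELECT replace(note_text, E'\u200e', '') AS note_text_clean
FROM email_notes;

-- Or clean the stored data in place:
UPDATE email_notes
SET note_text = replace(note_text, E'\u200e', '')
WHERE strpos(note_text, E'\u200e') > 0;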

Why can't Google Dataprep handle the encoding in my log files?

We are receiving big log files each month. Before loading them into Google BigQuery they need to be converted from fixed-width to delimited. I found a good article on how to do that in Google Dataprep. However, there seems to be something wrong with the encoding.
Each time a Swedish character appears in the log file, the split function seems to add another space. This messes up the rest of the columns, as can be seen in the attached screenshot.
I can't determine the correct encoding of the log files, but I know they are being created by pretty old Windows servers in Poland.
Can anyone advise on how to solve this challenge?
Screenshot of the issue in Google Dataprep.
What is the exact recipe you are using? Do you use (split every x)?
When I used an ISO Latin-1 text in a test case and ingested it as ISO-8859-1, the output was as expected and only the display was off.
Can you try the same?
Would it be possible to share an example input file with one or two rows?
As a workaround you can use a regex split, which should work.
It's unfortunately a bit more complex, because you would have to use multiple regex splits. Here's an example for the first two splits of 10 characters each: /.{10}/ and split on //

Incorrect Special Character Handling in Informatica Powercenter 9.1

I am currently working on a project in my organisation where we are migrating Informatica Powercenter in our application from v8.1 to v9.1.
Informatica PC is loading data from data files but is not able to maintain certain special characters present in a few of the input .dat files.
The data was getting loaded correctly in v8.1.
I tried changing the character set settings in Informatica as below:
CodePage movement = Unicode
NLS_LANG = AMERICAN_AMERICA.UTF8 to ENGLISH_UNITEDKINGDOM.UTF8
"DataMovementMode" = Unicode
After making the above settings I am getting the following warnings in the Informatica log:
READER_1_2_1> FR_3015 Warning! Row [2258], field [exDestination]: Data [TO] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [exDestination]: Data [IOMR] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [parentOID]: Data [O-MS1109ZTRD00:esm4:iomr-2_20040510_0_0] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2268], field [exDestination]: Data [IOMR] was truncated.
The special characters that are being sent in the data and are not being handled correctly are:
Ø
Ù
Ɨ
¿
Á
Can somebody please guide me on how to resolve this issue? What else needs to be changed at the Informatica end?
Does it need any session parameters to be set in the database?
I posted this in another thread about special characters. Please check if this is of any help.
Start with the source in Designer. Are you able to see the data correctly in the source qualifier preview? If not, you might want to set the flat file source definition encoding to UTF-8.
The Integration Service should be running in Unicode mode and not ASCII mode. You can check this from the Integration Service properties in the Admin Console.
The target should be UTF-8 encoded.
Check the relational connection (if the target is a database) encoding in Workflow Manager to ensure it is UTF-8.
If the problem persists, write the output to a UTF-8 flat file and check if the data loads properly. If yes, then the issue is with writing to the database.
Check the database settings like NLS_LANG, NLS_CHARACTERSET (for Oracle), etc., as sketched in the query below.
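For example, a minimal sketch of how the Oracle character set settings could be checked (assuming an Oracle target; adjust for your environment):

-- Show the database and national character sets configured on the Oracle side
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');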
Also set your Integration Service (IS) to run in Unicode mode for best results, apart from configuring ODBC and relational connections to use Unicode.
Details for Unicode and ASCII modes:
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese/Chinese characters)
b) ASCII - the IS holds all data in a single byte
Make sure that the size of the field is big enough to hold the data. Sometimes the warnings mentioned will be received when the size is too small to hold the incoming data.

How can I check if a character is allowed to be uploaded in Teradata?

Recently I was uploading (using JDBC) a .csv file that contained some weird SUB characters. The upload failed. Later I found out that those weird characters were the older version of the end-of-file marker. So, where can I get a list of all allowed characters so that I could pre-clean my csv files and be sure that they get uploaded?
Thanks
I found this in the Teradata International Character Set Support documentation; it explains why you are encountering the error with the SUB data in your file, and I believe it is what the other user linked to in his/her answer.
The characters 0x1A in LATIN/KANJI1/KANJISJIS and U+FFFD in
UNICODE/GRAPHIC are used internally by Teradata as the error
character; therefore, they are unusable as user data. The user cannot
store or retrieve these values through Teradata.
The list of supported UNICODE characters can be found here: UNICODE Server Character Set (direct download of text file)
@Alex You could try looking here? :)
http://www.info.teradata.com/templates/eSrchResults.cfm?txtrelno=&prodline=all&frmdt=&srtord=Asc&todt=&txtsrchstring=character&rdsort=Title&txtpid=
Update: the link leads to multiple lists of Teradata's supported characters.

Toad unicode input problem

In Toad, I can see Unicode characters that are coming from the Oracle db. But when I click one of the fields in the data grid to put it into edit mode, the Unicode characters are converted to meaningless symbols, but this is not the big issue.
While editing this field, the Unicode characters are displayed correctly as I type. But as soon as I press Enter and exit edit mode, they are converted to the nearest (most similar) non-Unicode character. So I cannot type Unicode characters in data grids. Copying and pasting one of the Unicode characters also does not work.
How can I solve this?
Edit: I am using Toad 9.0.0.160.
We never found a solution for the same problems with Toad. In the end, most people used Enterprise Manager to get around the issues. Sorry I couldn't be of more help.
Quest officially states that they currently do not fully support Unicode, but they promise a full Unicode version of Toad in 2009: http://www.quest.com/public-sector/UTF8-for-Toad-for-Oracle.aspx
An excerpt from the known issues with Toad 9.6:
Toad's data layer does not support UTF8 / Unicode data. Most non-ASCII characters will display as question marks in the data grid and should not produce any conversion errors except in Toad Reports. Toad Reports will produce errors and will not run on UTF8 / Unicode databases. It is therefore not advisable to edit non-ASCII Unicode data in Toad's data grids. Also, some users are still receiving "ORA-01026: multiple buffers of size > 4000 in the bind list" messages, which also seem to be related to Unicode data.