sqoop import - control M character - special-characters

All,
Working on importing data from DB2 using sqoop import. It worked fine for the most part, except for one table which seems to have some special characters (Control-M = ^M) in its contents. During the sqoop import these characters are treated as newlines, so everything after them ends up on the next line in the imported files, which breaks every record that follows a bad one.
I can't figure out how to fix the imports. Is there an easy way?

As a possible solution, you can remove the special characters in the import query:
replace(replace(source_db_column, chr(13), ''), chr(10), '') as source_db_column
In this case the cleanup is done by the source database before Sqoop fetches the data.
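For example, if you use a free-form query import, the cleanup can happen in the SELECT itself. The sketch below is only illustrative; the JDBC URL, table, and column names are hypothetical, not taken from the question:
# strip CR/LF on the DB2 side before the data reaches Sqoop (placeholder names)
sqoop import \
  --connect jdbc:db2://db2host:50000/MYDB \
  --username myuser \
  --query "SELECT id, replace(replace(notes, chr(13), ''), chr(10), '') AS notes FROM myschema.mytable WHERE \$CONDITIONS" \
  --split-by id \
  --target-dir /data/mytable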
Another option is to pass the --hive-drop-import-delims parameter to the sqoop import command, along with the --map-column-java option to specify that the column is of type String:
sqoop import
...
--hive-drop-import-delims
--map-column-java your_column=String
One more option is to use --hive-delims-replacement "some_character", again together with --map-column-java your_column=String, to replace the special characters (\n, \t, and \01) with some other character.
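For completeness, a rough sketch of how that third option could be wired up on the command line; everything except the flags discussed above (the connection string, table, and column name) is a placeholder:
# replace \n, \t and \01 in the string column with a space during import
sqoop import \
  --connect jdbc:db2://db2host:50000/MYDB \
  --username myuser \
  --table MYSCHEMA.MYTABLE \
  --hive-import \
  --hive-delims-replacement ' ' \
  --map-column-java NOTES=String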

Related

What options would load an escape character into Redshift?

I'm having a tough time playing with Redshift's COPY options to load a field that has an escape character immediately followed by a delimiter ('|'). The data looks like this:
00b9e290000f8350b9c780832a210000|MY DATA\|AB
So that row has 3 fields that I'm trying to load. When I run with just ESCAPE, Redshift seems to handle the backslash properly, but then the pipe delimiter after it gets ignored. So Redshift ends up trying to load all of the following into the second field: MY DATA|AB. The error message is that the delimiter was not found, since that text is read as the second field with no delimiter after it.
I've tried running COPY with just the ESCAPE option, CSV + ESCAPE, and a few others with no luck. Is there anything else I should try? Or should I add a pre-processing step to double-escape the backslash?
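For reference, a sketch of the kind of COPY statement being described; the table name, bucket path, and credentials are placeholders:
copy my_table
from 's3://my-bucket/data.txt'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
delimiter '|' escape;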

Uploading data to RedShift using COPY

I am trying to upload data to Redshift using the COPY command.
On this row:
4072462|10013868|default|2015-10-14 21:23:18.0|0|'A=0
I am getting this error:
Delimited value missing end quote
This is the COPY command:
copy test
from 's3://test/test.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx' removequotes escape gzip
First, I hope you know why you are getting this error: you have a single quote in one of the column values. Regarding the removequotes option, the Redshift documentation clearly says:
If a string has a beginning single or double quotation mark but no corresponding ending mark, the COPY command fails to load that row and returns an error.
One thing is certain: removequotes is not what you are looking for.
Second, what are your options?
If preprocessing the S3 file is in your control, consider using the escape option. Per the documentation,
When this parameter is specified, the backslash character (\) in input data is treated as an escape character.
So your input row in S3 should change to something like:
4072462|10013868|default|2015-10-14 21:23:18.0|0|\'A=0
Alternatively, see if CSV DELIMITER '|' works for you; check the COPY documentation for details.
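Putting those two suggestions together, the COPY might look like one of the sketches below, reusing the table, bucket, and credential placeholders from the question:
-- option 1: escape, assuming the file has been pre-processed as shown above
copy test
from 's3://test/test.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
escape gzip;
-- option 2: CSV parsing with an explicit pipe delimiter (CSV cannot be combined
-- with escape or removequotes; in CSV mode a stray single quote is not special)
copy test
from 's3://test/test.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
csv delimiter '|' gzip;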

SQLite update to change double-double quotes ("") to regular quotation marks (")?

I'm working on an iPhone app. I just imported some records into a SQLite database, and all my regular quote marks have been "doubled". An example:
Desired final format:
The song "ABC" will play at 3 PM.
The record is currently appearing in the database as:
The song ""ABC"" will play at 3 PM.
Does anyone know how to do a SQL update to change all "double-double" quotes to just regular quotation marks?
Just to clarify, I'm looking directly at the database, not via code. The code will display these "double-double" quotes exactly as they appear in the database, so I want to remove them. The "double-double" quotes are actually in the import file as well, but if I try to remove them there, the import fails. So I kept them, and now that the records are successfully imported into the database, I just want to correct the "double-double" quote thing with a mass SQL update, if that's possible. Thanks in advance for any insight!
SQLite uses single quotes to delimit string literals. It escapes a single quote inside a literal by adding another single quote (likewise for double quotes). So technically, as long as your SQL is well constructed, the import process should work properly. The strings should be enclosed in single quotes, not double quotes. I suspect that your code may be constructing the SQL by hand instead of binding/properly escaping the values.
SQLite has a built-in function to quote strings. It's called quote. Here are some sample inputs and the corresponding outputs:
sqlite> SELECT quote("foo");
'foo'
sqlite> SELECT quote("foo ""bar""");
'foo "bar"'
sqlite> SELECT quote("foo 'bar'");
'foo ''bar'''
So you could remove the doubled double quotes before the string even goes to SQLite, using NSString methods:
[@"badString\"\"" stringByReplacingOccurrencesOfString:@"\"\"" withString:@"\""];
If the database already contains bad values, then you could run the following update SQL to clean it up:
UPDATE table SET column = REPLACE(column, '""', '"');
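A quick way to sanity-check that REPLACE expression is to run it on the sample sentence from the question:
sqlite> SELECT REPLACE('The song ""ABC"" will play at 3 PM.', '""', '"');
The song "ABC" will play at 3 PM.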

Postgres using FOREIGN TABLE and data include "\"

My text file looks like:
\home\stanley:123456789
c:/kobe:213
\tej\home\ant:222312
and my CREATE FOREIGN TABLE steps:
CREATE FOREIGN TABLE file_check(txt text) SERVER file_server OPTIONS (format 'text', filename '/home/stanley/check.txt');
After selecting from file_check (using: select * from file_check), my console shows me:
homestanley:123456789
c:/kobe:213
ejhomeant:222312
Can anyone help me?
The file foreign-data-wrapper uses the same rules as COPY (presumably because it's the same code underneath). You've got to consider that backslash is an escape character...
http://www.postgresql.org/docs/9.2/static/sql-copy.html
Any other backslashed character that is not mentioned in the above table will be taken to represent itself. However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default). These strings will be recognized before any other backslash processing is done.
So you'll either need to double up the backslashes, or perhaps try it as a single-column csv file and see if that helps.
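A sketch of that second suggestion, keeping the server and file name from the question; the table name is made up, and this assumes the file contains no commas or double quotes, since those are the only characters CSV mode treats specially:
CREATE FOREIGN TABLE file_check_csv (txt text)
  SERVER file_server
  OPTIONS (format 'csv', filename '/home/stanley/check.txt');
SELECT * FROM file_check_csv;  -- backslashes are passed through unchanged in csv format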

Using the Icelandic Thorn character as a delimiter in Hive

I'm currently trying to import some DoubleClick advertising logs into Hadoop.
These logs are stored in a gzipped delimited file which is encoded using code page 1252 (Windows-ANSI?) and which uses the Icelandic Thorn character as a delimiter.
I can happily import these logs into a single column, but I can't seem to find a way to get Hive to understand the Thorn character - I think maybe because it doesn't understand the 1252 encoding?
I've looked at the Create Table documentation - http://hive.apache.org/docs/r0.9.0/language_manual/data-manipulation-statements.html - but can't seem to find any way to get this encoding/delimiter working.
I've also seen from https://karmasphere.com/karmasphere-analyst-faq a suggestion that the encoding for these files is ISO-8859-1 - but I don't see how to use that info in Hive or HDFS.
I know I can do a map job after import to split these rows into multiple records.
But is there an easier way to use this delimiter directly?
Thanks
Stuart
Use '\-2'.
The char is a signed byte: thorn is 0xFE in cp1252, which is -2 when read as a signed byte.
Apparently the Hive devs don't think it is a problem:
https://issues.apache.org/jira/browse/HIVE-237
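In DDL form that could look something like the sketch below; the table and column names are invented, and the delimiter is written exactly as given in the answer above:
CREATE TABLE doubleclick_raw (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\-2'
STORED AS TEXTFILE;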