String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: - amazon-redshift

Team,
I am using Redshift version 8.0.2. While loading data using the COPY command, I get this error: "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)".
It seems COPY is trying to load the UTF-8 byte "bf" into a VARCHAR field. According to Amazon Redshift, error code 3 is defined as below:
Error code 3:
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255
or any character between 128 and 191 (inclusive).
Amazon recommends this as the solution - we need to replace the character with a valid UTF-8 code sequence or remove the character.
Could you please help me with how to replace the character with a valid UTF-8 sequence?
When I checked the database properties in pgAdmin, it shows the encoding as UTF-8.
Please guide me on how to replace the character in the input delimited file.
Thanks...

I've run into this issue in Redshift while loading TPC-DS datasets for experiments.
Here is the documentation and forum chatter I found via AWS: https://forums.aws.amazon.com/ann.jspa?annID=2090
And here are the explicit parameters you can use to handle data conversion errors: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars
You can explicitly replace the invalid UTF-8 characters, or disregard them altogether during the COPY phase, by specifying ACCEPTINVCHARS.
Try this:
copy table from 's3://my-bucket/my-path'
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS
delimiter '|' region 'us-region-1';
Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 33.51s
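To see which rows were patched, you can afterwards query the stl_replacements system table mentioned in the load warning. A minimal sketch, assuming the cluster is reachable with psql; the connection details are placeholders:
psql "host=<cluster-endpoint> port=5439 dbname=<db> user=<user>" \
  -c "select * from stl_replacements limit 20;"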

Sounds like the encoding of your file might not be UTF-8. You might try this technique that we sometimes use:
cat myfile.tsv | iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv
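To confirm what you are dealing with before converting, you can ask file for its best guess at the encoding and list the lines that contain bytes that are not valid UTF-8 (a sketch; the grep trick assumes GNU grep running under a UTF-8 locale):
file -bi myfile.tsv                  # rough guess at MIME type and charset
grep -naxv '.*' myfile.tsv | head    # lines whose bytes are not valid UTF-8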

Many people loading CSVs into databases get their files from someone using Excel, or have access to Excel themselves. If so, this problem is quickly solved by either:
Saving the file out of Excel with Save As and selecting the CSV UTF-8 (Comma delimited) (*.csv) format, by requesting/training those who give you the files to use this export format. Note that many people export to CSV by default using the CSV (Comma delimited) (*.csv) format, and there is a difference.
Or loading the CSV into Excel and then immediately saving it again in the UTF-8 CSV format.
Of course it won't work for files Excel can't handle, i.e. larger than 1 million rows, etc. Then I would use the iconv suggestion by mike_pdb.

I noticed that an Athena external table is able to parse data that the Redshift COPY command cannot. You can use the alternative approach below when encountering "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: 8b (error 3)".
Follow the steps below if you want to load data into Redshift database db2 and table table2.
Have a Glue crawler IAM role ready which has access to S3.
Run the crawler (see the AWS CLI sketch after these steps).
Validate the database and table that the Glue crawler created in Athena, say external database db1_ext and table table1_ext.
Log in to Redshift and link it to the Glue Catalog by creating a Redshift external schema (db1_schema) using the command below.
CREATE EXTERNAL SCHEMA db1_schema
FROM DATA CATALOG
DATABASE 'db1_ext'
IAM_ROLE 'arn:aws:iam:::role/my-redshift-cluster-role';
Load from the external table:
INSERT INTO db2.table2 (SELECT * FROM db1_schema.table1_ext)
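For the "Run the crawler" step above, a sketch with the AWS CLI (the crawler name my-crawler is a placeholder):
aws glue start-crawler --name my-crawler
aws glue get-crawler --name my-crawler --query 'Crawler.State' --output text   # poll until it returns to READY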

Related

Restore a PostgreSQL 9.6 database from an old complete dump and/or an up-to-date base directory plus other rescued files

I'm trying to restore/rescue a database from the pieces that I have:
I have all the recent files in PGDATA/base (/var/lib/postgresql/9.6/main/base/), but not the complete /var/lib/postgresql/9.6/main/.
I have all the files from an old (and not much different) backup dump that I restored into a new installation of PostgreSQL 9.6.
I have a lot of files rescued from the hard drive (via ddrescue), including thousands of files without a name (they have a "#" followed by a number instead, and sit in the lost+found directory). For instance:
I have the pg_class file
I have the pg_clog directory with 0000 file
Edit:
I probably have the contents of pg_xlog, but not the original file names. I have 5 files of 16777216 bytes each:
#288294 (date 2019-04-01)
#288287 (date 2019-05-14)
#288293 (date 2019-07-02)
#261307 (date 2019-11-27)
#270185 (date 2020-01-28)
Also, my old dump is from 2019-04-23, so the first one could be the same?
So my next step is to try to read those files with pg_xlogdump, and/or to rename them following the WAL naming scheme (starting with 00000001000000000000000A, ordered by date, since that is how I saw the system names them) and put them into the new pg_xlog directory. Could that work? I also realized that the last one has the date of the day the hard drive crashed, so I do have the most recent segment.
The PGDATA/base directory I rescued from the (damaged) hard drive contains directories 1, 12406, 12407 and 37972 with a lot of files inside. I checked with pg_filedump -fi that my up-to-date data is stored in the files under directory 37972.
The same (but older) data is stored in the files under PGDATA/base/16387 in the restored dump.
I tried copying the files directly from one to the other, mixing the updated data over the old database, but it doesn't work. After solving permission errors I can get into the "Frankenstein" database this way:
postgres#host:~$ postgres --single -P -D /var/lib/postgresql/9.6/main/ dbname
And I tried to do something, like reindex, and I get this error:
PostgreSQL stand-alone backend 9.6.16
backend> reindex system dbname;
ERROR: could not access status of transaction 136889
DETAIL: Could not read from file "pg_subtrans/0002" at offset 16384: Success.
CONTEXT: while checking uniqueness of tuple (1,7) in relation "pg_toast_2619"
STATEMENT: reindex system dbname;
Certainly the pg_subtrans/0002 file is part of the "Frankenstein" and not of the good database (I haven't found the good one yet, at least not under that name). So I tried copying another file that looked similar, and then generating 8192 zero bytes for that file with dd; in both cases I get the same error (and if the file doesn't exist, I get DETAIL: Could not open file "pg_subtrans/0002": No such file or directory.). Anyway, I have no idea what should be in that file. Do you think I could recover that data from another file? Or could I find the missing file with some tool? pg_filedump shows the other file in that directory, pg_subtrans/0000, as empty.
Extra note: I found this useful blog post that talks about restoring from just rescued files using pg_filedump, pg_class's file, reindex system and other tools, but it is hard for me to understand how to adapt it to my concrete and easier problem (I think my problem is easier because I have a dump): https://www.commandprompt.com/blog/recovering_a_lost-and-found_database/
In the end we completely restored the database based on the PGDATA/base/37972 directory, in 4 parts:
Checking ("grepping") with pg_filedump -fi which file corresponds to each table. To make the grepping easier we made a script.
#!/bin/bash
# Usage: ./grep_filedump.sh <search-string>
# Run pg_filedump -fi on every file in the current directory and grep
# its formatted output for the given string.
for filename in ./*; do
    echo "$filename"
    pg_filedump -fi "$filename" | grep "$1"
done
NOTE: Only useful with small strings.
Running the great tool pg_filedump -D. -D is a newer option (available from postgresql-filedump version ≥ 10) that decodes tuples using a given comma-separated list of types.
Since we know the types (we built the database), we "just" need to give a comma-separated list of the types of the table's columns. I write "just" because in some cases it can be a little complicated. One of our tables needed this kind of command:
pg_filedump -D text,text,text,text,text,text,text,text,timestamp,text,text,text,text,int,text,text,int,text,int,text,text,text,text,text,text,text,text,text,text,int,int,int,int,int,int,int,int,text,int,int 38246 | grep COPY > restored_table1.txt
From pg_filedump -D manual:
Supported types:
bigint
bigserial
bool
char
charN -- char(n)
date
float
float4
float8
int
json
macaddr
name
oid
real
serial
smallint
smallserial
text
time
timestamp
timetz
uuid
varchar
varcharN -- varchar(n)
xid
xml
~ -- ignores all attributes left in a tuple
All those text entries were actually columns of type character varying(255), but varcharN didn't work for us, so after other tests we finally changed it to text.
Our timestamp column was actually of type timestamp with time zone, but timetz didn't work for us, so after other tests we finally changed it to timestamp and opted to lose the time zone data.
These changes worked perfectly for this table.
Other tables were much easier:
pg_filedump -D int,date,int,text 38183 | grep COPY > restored_table2.txt
As we only get "raw" data, we had to reformat it into CSV. So we made a small Python program to convert the pg_filedump -D output to CSV.
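That program isn't reproduced here, but a rough shell equivalent could look like the sketch below, assuming pg_filedump -D prints each decoded tuple as tab-separated values after a "COPY:" prefix (check what your version actually emits), and reusing table2's column list from the COPY further down as the header row:
{
  echo 'id_comentari|id_observacio|text|data|id_usuari|text_old'   # header row for COPY ... CSV HEADER
  sed 's/^COPY: //' restored_table2.txt | tr '\t' '|'              # drop the prefix, turn tabs into the '|' delimiter
} > table2.csv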
We inserted each CSV into PostgreSQL (after creating each empty table again):
COPY scheme."table2"(id_comentari,id_observacio,text,data,id_usuari,text_old)
FROM '<path>/table2.csv' DELIMITER '|' CSV HEADER;
I hope this will help other people :)
That is doomed. Without the information in pg_xlog and (particularly) pg_clog, you cannot get the information back.
A knowledgeable forensics expert might be able to salvage some of your data, but it is not a straightforward process.

BigQuery - create table via UI from cloud storage results in integer error

I am trying to test out BigQuery but am getting stuck on creating a table from data stored in google cloud storage. I am able to reduce the data down to just one value, but it is not making sense.
I have a text file I uploaded to google cloud storage with just one integer value in it, 177790884
I am trying to create a table via the BigQuery web UI, and go through the wizard. When I get to the schema definition section, I enter...
ID:INTEGER
The load always fails with...
Errors:
File: 0 / Line:1 / Field:1: Invalid argument: 177790884 (error code: invalid)
Too many errors encountered. Limit is: 0. (error code: invalid)
Job ID trusty-hangar-120519:job_LREZ5lA8QNdGoG2usU4Q1jeMvvU
Start Time Jan 30, 2016, 12:43:31 AM
End Time Jan 30, 2016, 12:43:34 AM
Destination Table trusty-hangar-120519:.onevalue
Source Format CSV
Allow Jagged Rows true
Ignore Unknown Values true
Source URI gs:///onevalue.txt
Schema
ID: INTEGER
If I load with a schema of ID:STRING, it works fine. The number 177790884 is not larger than a 64-bit signed int; I am really unsure what is going on.
Thanks,
Craig
Your input file likely contains a UTF-8 byte order mark (3 "invisible" bytes at the beginning of the file that indicate the encoding) that can cause BigQuery's CSV parser to fail.
https://en.wikipedia.org/wiki/Byte_order_mark
I'd suggest googling for a platform-specific method to view and remove the byte order mark. (A hex editor would do.)
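For what it's worth, a quick command-line check and fix (a sketch; the in-place edit assumes GNU sed):
head -c 3 onevalue.txt | od -An -tx1        # a UTF-8 BOM shows up as "ef bb bf"
sed -i '1s/^\xEF\xBB\xBF//' onevalue.txt    # strip the BOM in place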
The issue is definitely with the file's encoding. I was able to reproduce the error.
I then "fixed" it by saving the "problematic" file as ANSI (just as a test), and it loaded successfully.

How can I change character code from Shift-JIS to UTF-8 when I copy data from DB2 to Postgres?

I'm trying to migrate data from DB2 to Postgres using Pentaho ETL.
The character encoding on DB2 is Shift-JIS (a Japanese-specific encoding) and on Postgres it is UTF-8.
I can migrate the data from DB2 to Postgres successfully, but the Japanese characters are not converted properly (they come out as garbled characters).
How can I convert the character encoding from Shift-JIS to UTF-8 when I transfer the data?
It was a bit of a tough problem for me, but I finally managed to solve it.
First, you need to add a "Modified Java Script Value" step from the step list and write a script like the one below.
(I'm assuming that the value coming from the table is column1 and the new value is value1.)
Here is an example of the source code. (You can specify multiple values if you need to.)
var value1 = new Packages.java.lang.String(
        new Packages.java.lang.String(column1).getBytes("ISO8859_1"), "Shift-JIS"
    ).replaceAll(" ", "");
// You don't need to use replaceAll() if you don't need to trim the string.
Finally, click "Get variables" and the new value will be shown in the fields table below.
Then you can use "value1" in the next step, and it will have been converted to the correct encoding (the one you specified).
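If a detour through a flat file is acceptable, re-encoding outside Pentaho with iconv is another option (a sketch; db2_export.csv is a hypothetical export of the DB2 data):
iconv -f SHIFT_JIS -t UTF-8 db2_export.csv > db2_export_utf8.csv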

Postgresql COPY encoding, how to?

I am importing a .txt file that contains IMDb information (such as movie name, movie ID, actors, directors, rating, votes, etc.) using the COPY statement. I am using 64-bit Ubuntu. The problem is that some actors have names with non-ASCII characters, such as Jonas Åkerlund. That is why PostgreSQL throws an error:
ERROR: missing data for column "actors"
CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
********** Error **********
ERROR: missing data for column "actors"
SQL state: 22P04
Context: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
My copy statement looks like this:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');
I do not know exactly how to use the collation setting. Could you help me, please? As always, thank you.
Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.
By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.
You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.
It should be as simple as specifying the appropriate encoding in the COPY command:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');
*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.

How should I open a PostgreSQL dump file and add actual data to it?

I have a pretty basic database. I need to drop a good-sized user list into the db. I have the dump file, need to convert it to a .pg file and then somehow load this data into it.
The data I need to add are in CSV format.
I assume you already have a .pg file, which I assume is a database dump in the "custom" format.
PostgreSQL can load data in CSV format using the COPY statement. So the absolute simplest thing to do is just add your data to the database this way.
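For example, from the shell via psql (a sketch; the users table and its column list are placeholders for your actual table):
psql mydb -c "\copy users (name, email) FROM 'users.csv' WITH CSV HEADER"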
If you really must edit your dump, and the file is in the "custom" format, there is unfortunately no way to edit the file manually. However, you can use pg_restore to create a plain SQL backup from the custom format and edit that instead. pg_restore with no -d argument will generate an SQL script for insertion.
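For example (a sketch; mydump.pg stands in for your custom-format dump file):
pg_restore -f mydump.sql mydump.pg    # write a plain SQL script instead of restoring
psql -d mydb -f mydump.sql            # after editing, feed the script back in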
As suggested by Daniel, the simplest solution is to keep your data in CSV format and just import it into Postgres as is.
If you're trying to merge this CSV data into a 3rd-party Postgres dump file, then you'll need to first convert the data into SQL insert statements.
One possible unix solution:
awk -F, '{printf "INSERT INTO my_tab VALUES ('\''%s'\'', '\''%s'\'', '\''%s'\'');\n",$1,$2,$3}' data.csv
(This naive approach assumes the CSV values contain no commas, quotes or other characters that would need escaping.)