PostgreSQL COPY encoding, how to?

I am importing a .txt file that contains IMDb information (such as movie name, movie ID, actors, directors, rating, votes, etc.). I imported it using the COPY statement. I am using Ubuntu 64-bit. The problem is that some actor names contain special characters, such as Jonas Åkerlund. That is why PostgreSQL throws an error:
ERROR: missing data for column "actors"
SQL state: 22P04
CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
My copy statement looks like this:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');
I do not know exactly how to use a collation setting here. Could you please help me? As always, thank you.

Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.
By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.
You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.
It should be as simple as specifying the appropriate encoding in the COPY command:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');
*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.
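To see the mix-up concretely, here is a tiny Python sketch (purely illustrative, not part of the original answer):

data = 'Å'.encode('utf-8')           # b'\xc3\x85' -- the two UTF-8 bytes of "Å"
print(repr(data.decode('latin-1')))  # 'Ã\x85' -- 0xC3 shows up as 'Ã' and 0x85 becomes U+0085 (NEL),
                                     # which is why COPY saw the line end early
print(repr(data.decode('utf-8')))    # 'Å' -- correct once the file is read as UTF8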

Related

SQLAlchemy/MariaDB - incorrectly using Windows-1252 and UTF-8 encodings

I am currently trying to migrate an Oracle database to MariaDB. I have a CHAR column in the Oracle database, let's call it my_col, which I believe is encoded using latin-1.
I have fetched the raw bytes of the column, and decoded them successfully using Python, with:
my_col.decode('latin-1')
No error was thrown, which leads me to believe the column is, in fact, encoded with latin-1.
Now, even though the column on the MariaDB table has the latin1 charset, I found that SQLAlchemy was trying to insert UTF-8 encoded strings into it; so I specified ?charset=latin1 on my MariaDB connection string.
STILL, whenever I try inserting the decoded strings into MariaDB, I get the following error:
UnicodeDecodeError: 'charmap' codec can't encode character \x96 ...
and at the end of my error trace:
File "usr/lib64/python3.9/encodings/cp1252.py", line 12, in encode
This raises two questions:
why is Python trying to use the cp-1252 encoding instead of latin-1, as specified?
how can I tell SQLAlchemy to use latin-1 when fetching the strings from Oracle? I would rather not have to fetch the bytes and decode them myself.
Thanks.
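For reference, a minimal Python sketch (illustrative only, not from the original post) of why the byte 0x96 behaves so differently under latin-1 and cp1252:

raw = b'\x96'                        # the offending byte
print(repr(raw.decode('latin-1')))   # '\x96' -- U+0096, a C1 control character; latin-1 accepts every byte
print(repr(raw.decode('cp1252')))    # '–' -- U+2013, an en dash
try:
    '\x96'.encode('cp1252')          # U+0096 has no mapping in cp1252...
except UnicodeEncodeError as exc:
    print(exc)                       # ...so the 'charmap' codec refuses to encode it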

Restore a PostgreSQL 9.6 database from an old complete dump and/or an up-to-date (but incomplete) base directory and other rescued files

I'm trying to restore/rescue a database from what I have:
I have all the recent files in PGDATA/base (/var/lib/postgresql/9.6/main/base/), but I do not have the complete /var/lib/postgresql/9.6/main/
I have all the files from an old (and not very different) backup dump, which I restored into a new installation of PostgreSQL 9.6.
I have a lot of files rescued from the hard drive (via ddrescue), including thousands of files without a name (just a "#" followed by a number, sitting in the lost+found directory). So, for instance:
I have the pg_class file
I have the pg_clog directory with the 0000 file
Edit:
I probably have the contents of pg_xlog, but without the original file names. I have 5 files of 16777216 bytes each:
#288294 (date 2019-04-01)
#288287 (date 2019-05-14)
#288293 (date 2019-07-02)
#261307 (date 2019-11-27)
#270185 (date 2020-01-28)
Also, my old dump is from 2019-04-23, so the first one could correspond to it?
My next step is going to be trying to read those files with pg_xlogdump, and/or renaming them with WAL segment file names (starting at 00000001000000000000000A, ordered by date, following the naming scheme I saw the system use) and putting them into the new pg_xlog directory. Could that work? I also realized that the last one has the date of the day the hard drive crashed, so I do have the last one.
The PGDATA/base directory I rescued from the (damaged) hard drive contains directories 1, 12406, 12407 and 37972 with a lot of files inside. I checked with pg_filedump -fi that my up-to-date data is stored in the files in directory 37972.
The same (but old) data is stored in the files in directory PGDATA/base/16387 in the restored dump.
I tried to copy the files directly from one to the other, mixing the up-to-date data over the old database, but it doesn't work. After solving permission errors I can get into the "Frankenstein" database this way:
postgres@host:~$ postgres --single -P -D /var/lib/postgresql/9.6/main/ dbname
And I tried to do something, like reindex, and I get this error:
PostgreSQL stand-alone backend 9.6.16
backend> reindex system dbname;
ERROR: could not access status of transaction 136889
DETAIL: Could not read from file "pg_subtrans/0002" at offset 16384: Success.
CONTEXT: while checking uniqueness of tuple (1,7) in relation "pg_toast_2619"
STATEMENT: reindex system dbname;
Certainly the pg_subtrans/0002 file is part of the "Frankenstein" database and not the good one (I haven't found the real one yet, at least not under that name). So I tried copying another file that seemed similar first, and then writing 8192 zero bytes to that file with dd; in both cases I get the same error (and if the file doesn't exist I get DETAIL: Could not open file "pg_subtrans/0002": No such file or directory.). Anyway, I have no idea what should be in that file. Do you think I could get that data from another file, or find the missing file using some tool? pg_filedump shows me nothing for the other file in that directory, pg_subtrans/0000.
Extra note: I found this useful blog post that talks about restoring from just rescued files using pg_filedump, pg_class's file, reindex system and other tools, but it is hard for me to understand how to adapt it to my specific and simpler problem (I think my problem is simpler because I have a dump): https://www.commandprompt.com/blog/recovering_a_lost-and-found_database/
Finally we restored the database completely, based on the PGDATA/base/37972 directory, in 4 steps:
Checking and grepping with pg_filedump -fi which file corresponds to each table. To make the grepping easier we made the following script.
#!/bin/bash
# For every file in the current directory, print its name and then grep the
# pg_filedump output for the string passed as the first argument.
# Usage: pass the string to search for as the first (and only) argument.
for filename in ./*; do
    echo "$filename"
    pg_filedump -fi "$filename" | grep "$1"
done
NOTE: Only useful with small strings.
Running the great tool pg_filedump with -D. -D is a newer option (available from postgresql-filedump version ≥ 10) that decodes tuples using a given comma-separated list of types.
As we know the types, because we made the database, we "just" need to give the comma-separated list of types for that table. I wrote "just" because in some cases it can be a little bit complicated. One of our tables needed this kind of command:
pg_filedump -D text,text,text,text,text,text,text,text,timestamp,text,text,text,text,int,text,text,int,text,int,text,text,text,text,text,text,text,text,text,text,int,int,int,int,int,int,int,int,text,int,int 38246 | grep COPY > restored_table1.txt
From pg_filedump -D manual:
Supported types:
bigint
bigserial
bool
char
charN -- char(n)
date
float
float4
float8
int
json
macaddr
name
oid
real
serial
smallint
smallserial
text
time
timestamp
timetz
uuid
varchar
varcharN -- varchar(n)
xid
xml
~ -- ignores all attributes left in a tuple
All those text fields were for us of type character varying(255), but varcharN didn't work for us, so after other tests we finally changed it to text.
Our timestamp was of type timestamp with time zone, but timetz didn't work for us, so after other tests we finally changed it to timestamp and opted to lose the time zone data.
These changes worked perfectly for this table.
Other tables were much easier:
pg_filedump -D int,date,int,text 38183 | grep COPY > restored_table2.txt
As we get just "raw" data, we have to re-format it into CSV. So we made a Python program to convert the pg_filedump -D output to CSV (a rough sketch of the idea is shown below).
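Our converter is not reproduced here, but a rough sketch of the idea (assuming each line kept by grep COPY has the form COPY: value1<TAB>value2<TAB>..., and using the table2 column names from the COPY command below) could look like this:

# filedump_to_csv.py -- sketch only, not the program we actually used
import csv
import sys

COLUMNS = ["id_comentari", "id_observacio", "text", "data",
           "id_usuari", "text_old"]              # adjust per table

with open(sys.argv[1]) as raw, open(sys.argv[2], "w", newline="") as out:
    writer = csv.writer(out, delimiter="|")      # matches DELIMITER '|' in the COPY below
    writer.writerow(COLUMNS)                     # matches the HEADER option
    for line in raw:
        if not line.startswith("COPY:"):
            continue                             # skip anything that is not a decoded tuple
        values = line[len("COPY:"):].strip("\n").lstrip().split("\t")
        writer.writerow(values)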
We loaded each CSV into PostgreSQL (after creating each empty table again):
COPY scheme."table2"(id_comentari,id_observacio,text,data,id_usuari,text_old)
FROM '<path>/table2.csv' DELIMITER '|' CSV HEADER;
I hope this will help other people :)
That is doomed. Without the information in pg_xlog and (particularly) pg_clog, you cannot get the information back.
A knowledgeable forensics expert might be able to salvage some of your data, but it is not a straightforward process.

How can I change character code from Shift-JIS to UTF-8 when I copy data from DB2 to Postgres?

I'm trying to migrate data from DB2 to Postgres using Pentaho ETL.
The character encoding on DB2 is Shift-JIS (a Japanese-specific encoding) and on Postgres it is UTF-8.
I could migrate the data from DB2 to Postgres successfully, but the Japanese characters were not converted properly (they were changed to strange characters).
How can I convert the character encoding from Shift-JIS to UTF-8 when I transfer the data?
It was a bit of a tough problem for me, but I finally solved it.
First, you need to choose "Modified Java Script value" from the job list and write the script as below.
(I'm assuming that the value in the table is column1 and the new value is value1.)
Here is an example of the source code. (You can specify multiple values if you need to.)
// Re-interpret the bytes that were read as ISO-8859-1 as Shift-JIS instead.
var value1 = new Packages.java.lang.String(
    new Packages.java.lang.String(column1).getBytes("ISO8859_1"), "Shift-JIS"
).replaceAll(" ", "");
// You don't need replaceAll() if you don't need to trim the string.
Finally, click "Get variables" and the value will be shown in the table below.
Then you can use "value1" in the next job, and it will have been converted to the correct encoding (the one you specified).
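Outside of Pentaho, the same round-trip can be sketched in Python (an illustration only, assuming the garbled value is a string that was decoded as ISO-8859-1 while its underlying bytes are really Shift-JIS):

garbled = 'テスト'.encode('shift_jis').decode('iso8859_1')   # simulate the mojibake
fixed = garbled.encode('iso8859_1').decode('shift_jis')      # undo it, like the JavaScript above
print(fixed)                                                 # テスト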

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Team,
I am using Redshift (version 8.0.2). While loading data using the COPY command, I get an error: "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)".
It seems COPY is trying to load the byte "bf" into a VARCHAR field. As per the Amazon Redshift documentation, error code 3 is defined as below:
error code3:
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255
or any character between 128 and 191 (inclusive).
Amazon recommends this as the solution: replace the character with a valid UTF-8 code sequence or remove the character.
Could you please help me with how to replace the character with a valid UTF-8 code?
When I checked the database properties in pgAdmin, it shows the encoding as UTF-8.
Please guide me on how to replace the character in the input delimited file.
Thanks...
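As a side note, here is a tiny Python sketch (illustrative only) of why a lone 0xbf byte is rejected as UTF-8:

print(b'\xbf'.decode('latin-1'))     # '¿' -- perfectly valid in a single-byte codepage
try:
    b'\xbf'.decode('utf-8')          # in UTF-8, 0xbf is a continuation byte...
except UnicodeDecodeError as exc:
    print(exc)                       # ...so on its own it is an "invalid start byte"

This suggests the input file is probably not UTF-8 encoded to begin with.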
I've run into this issue in RedShift while loading TPC-DS datasets for experiments.
Here is the documentation and forum chatter I found via AWS: https://forums.aws.amazon.com/ann.jspa?annID=2090
And here are the explicit commands you can use to solve data conversion errors: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars
You can explicitly replace the invalid UTF-8 characters or disregard them altogether during the COPY phase by stating ACCEPTINVCHARS.
Try this:
copy table from 's3://my-bucket/my-path'
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS
delimiter '|' region 'us-region-1';
Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 33.51s
Sounds like the encoding of your file might not be UTF-8. You might try this technique that we use sometimes:
cat myfile.tsv | iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv
For many people loading CSVs into databases, the files come from someone using Excel, or they have access to Excel themselves. If so, this problem is quickly solved by:
First, saving the file out of Excel using Save As and selecting the CSV UTF-8 (Comma delimited) (*.csv) format, i.e. by requesting/training those giving you the files to use this export format. Note that many people by default export to CSV using the CSV (Comma delimited) (*.csv) format, and there is a difference.
Or, loading the CSV into Excel and then immediately saving it as the UTF-8 CSV format.
Of course this wouldn't work for files unusable by Excel, i.e. larger than 1 million rows, etc. Then I would use the iconv suggestion by mike_pdb.
I noticed that an Athena external table is able to parse data that the Redshift COPY command is unable to. We can use the alternative approach below when encountering "String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: 8b (error 3)".
Follow the steps below if you want to load data into the Redshift database db2 and table table2.
Have a Glue crawler IAM role ready which has access to S3.
Run the crawler.
Validate the database and table created in Athena by the Glue crawler, say db1_ext and table1_ext.
Log in to Redshift and link to the Glue Catalog by creating a Redshift external schema (db1_schema) using the command below.
CREATE EXTERNAL SCHEMA db1_schema
FROM DATA CATALOG
DATABASE 'db1_ext'
IAM_ROLE 'arn:aws:iam:::role/my-redshift-cluster-role';
Load from the external table:
INSERT INTO db2.table2 (SELECT * FROM db1_schema.table1_ext)

UTF-8 - Oracle issue

I set my NLS_LANG variable to 'AMERICAN_AMERICA.AL32UTF8' in the Perl script that connects to Oracle and tries to insert the data.
However, when I insert a record where one value contains the 'ñ' character, the SQL fails.
But if I use 'Ñ' it inserts just fine.
What am I doing wrong here?
Additional info:
If I change my NLS_LANG to 'AMERICAN_AMERICA.UTF8' I can insert 'ñ' just fine...
What does it fail with?
Generally, if there is a problem in character conversion it fails quietly (e.g. recording a character with an inappropriate translation). Sometimes you get an error which indicates that the column isn't large enough. This is typically when trying to store, for example, a character that takes up two or three bytes in a column that only allows one byte.
The first step is to confirm the database settings:
select * from V$NLS_PARAMETERS where parameter like '%CHARACTERSET%';
Then check the byte composition of the strings with:
select dump('ñ',16), dump('Ñ',16) from dual;
The first query gives me:
1 NLS_CHARACTERSET AL32UTF8
2 NLS_NCHAR_CHARACTERSET AL16UTF16
The second query gives me:
1 Typ=96 Len=2: c3,b1 Typ=96 Len=2: c3,91
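Those dumps line up with what a quick Python cross-check shows (illustrative only; the hex separator needs Python 3.8+):

print('ñ'.encode('utf-8').hex(','))   # c3,b1 -- same bytes as Typ=96 Len=2: c3,b1
print('Ñ'.encode('utf-8').hex(','))   # c3,91 -- same bytes as Typ=96 Len=2: c3,91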
My exact db and perl settings are listed in this question:
https://stackoverflow.com/questions/3016128/dbdoracle-and-utf8-issue