How to fix "the octet sequence #(130) cannot be decoded." in pgloader - postgresql

I'm trying to migrate a database from SQLite to PostgreSQL using pgloader.
My SQLite database is data.db, so I tried this:
pgloader ./var/data.db postgres://***#ec2-54-83-50-174.compute-1.amazonaws.com:5432/mydb?sslmode=require
Output:
pgloader version 3.6.1
sb-impl::*default-external-format* :UTF-8
tmpdir: #P"/var/folders/65/x6spw10s4jgd3qkhdq96bk8c0000gn/T/"
KABOOM!
2019-04-11T19:22:47.022000+01:00 NOTICE Starting pgloader, log system is ready.
FATAL error: :UTF-8 stream decoding error on #<SB-SYS:FD-STREAM for "file /Users/mackbookpro/Desktop/dev/www/Beyti/var/data.db" {1005892A93}>: the octet sequence #(130) cannot be decoded.
Date/time: 2019-04-11-18:22
An unhandled error condition has been signalled: :UTF-8 stream decoding error on #<SB-SYS:FD-STREAM for "file /Users/mackbookpro/Desktop/dev/www/Beyti/var/data.db" {1005892A93}>: the octet sequence #(130) cannot be decoded.
Any idea about this problem? Thank you in advance.

This is a character encoding issue.
In my case the culprit, "octet sequence #(130)", was an "é" stored as the single byte \x82 (130 in decimal), which is not valid UTF-8 on its own.
iconv failed on the file.
I replaced every stray \x82 in the byte stream with \x65 (the ASCII character "e"), and that got me past the error:
<bad_file xxd -c1 -p | sed s/82/65/ | xxd -r -p > good_new_file
(cheers to Natacha on irc freenode #gcu :) )
Edit: More French trouble? Same problem with #(133), "à", and the same solution: \x85 -> \x61.
Edit 2: A small generalization I just found:
The "octet sequence" pgloader reports is the value of the offending byte in decimal. Anything above 127 is outside ASCII, so a lone byte in that range (here apparently from an extended-ASCII code page) is not valid UTF-8 and triggers the error.
I then hit #(144), which is \x90. Replace it the same way :)
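The decimal-to-hex relationship described above can be sketched in Python. This is only an illustration: the guess that the stray bytes come from a DOS code page such as CP437 is an assumption based on the three bytes mentioned in this answer (pgloader does not report a code page), and the helper names `describe_octet` and `is_valid_utf8` are hypothetical.

```python
def describe_octet(value: int, codepage: str = "cp437") -> str:
    """Describe a byte value as reported by pgloader, e.g. #(130)."""
    raw = bytes([value])
    try:
        # Assumption: the byte was written using this legacy code page.
        char = repr(raw.decode(codepage))
    except UnicodeDecodeError:
        char = "<undefined>"
    return f"{value} = 0x{value:02x} = {char} in {codepage}"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for octet in (130, 133, 144):
    print(describe_octet(octet))
# 130 = 0x82 = 'é' in cp437
# 133 = 0x85 = 'à' in cp437
# 144 = 0x90 = 'É' in cp437

print(is_valid_utf8(b"\x82"))              # False: a lone byte above 127
print(is_valid_utf8("é".encode("utf-8")))  # True: é is two bytes in UTF-8
```

If a text file really is in that code page throughout, decoding it with that code page and re-encoding as UTF-8 would keep the accents instead of flattening them to plain ASCII letters. That only makes sense for text input, not for a binary file.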

Related

How do I fix an encoding error in order to upload a 5GB text file using psql? (ERROR: invalid byte sequence for encoding "UTF8": 0x92)

I am trying to upload a series of tables (.txt files) into a PostgreSQL database that runs on my Windows 10 desktop. I use psql to upload the files. I have successfully uploaded a couple of tables, but the largest one (5GB with over 20 million rows) is giving me trouble:
databasename=# \copy table1 FROM 'C:\Users\tablename.txt' DELIMITER ',' CSV HEADER;
ERROR: character with byte sequence 0x9d in encoding "WIN1252" has no equivalent in encoding "UTF8"
CONTEXT: COPY table1, line 581330
I found an answer here which suggested I check the client encoding...
databasename=# SHOW client_encoding;
client_encoding
-----------------
WIN1252
(1 row)
and then change it, which I tried:
databasename=# SET CLIENT_ENCODING TO 'utf8';
SET
I then try the same copy command again and get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x92
CONTEXT: COPY table1, line 206051
I've read a little about 0x92 here. It sounds like there is a character in the file which cannot be encoded when I try and perform the \copy command.
Some background:
I was able to upload about 1 million rows into SQL Server 2019 (free version) using the SQL Server Import and Export Wizard. (I stopped the import because it was taking too long.) I was also able to view the file in R using read.csv. Not sure if any of this is helpful. Thank you all in advance.
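Both errors name a byte that is not valid in the encoding the client declared: 0x9d has no mapping in Windows-1252, and a lone 0x92 is not valid UTF-8 (in Windows-1252 it is the right single quotation mark). One way to see which lines are affected before retrying the \copy is to scan the file in binary mode. The sketch below is an assumption-laden illustration, not a fix: `find_undecodable_lines` is a hypothetical helper, and its line numbers count physical lines, which can differ from COPY's count if quoted fields contain newlines.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def find_undecodable_lines(
    stream: BinaryIO, encoding: str = "utf-8", limit: int = 10
) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line number, first offending byte, raw line) for bad lines."""
    found = 0
    # Iterating a binary stream reads one line at a time, so a 5GB file
    # is never loaded into memory at once.
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            yield lineno, raw[exc.start], raw
            found += 1
            if found >= limit:
                return


# Small in-memory stand-in for the real file (made-up rows).
sample = b"id,name\n1,plain\n2,it\x92s\n3,bad\x9d\n"

for lineno, byte, raw in find_undecodable_lines(io.BytesIO(sample), "utf-8"):
    print(f"utf-8: line {lineno}, byte 0x{byte:02x}")

for lineno, byte, raw in find_undecodable_lines(io.BytesIO(sample), "cp1252"):
    print(f"cp1252: line {lineno}, byte 0x{byte:02x}")
```

For the real file, the stream would come from `open(path, "rb")`. Lines that fail under both encodings are the ones that need manual attention or a replacement policy before loading.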

Does the postgres COPY function support utf 16 encoded files?

I am trying to use the postgreSQL COPY function to insert a UTF 16 encoded csv into a table. However, when running the below query:
COPY temp
FROM 'C:\folder\some_file.csv'
WITH (
DELIMITER E'\t',
FORMAT csv,
HEADER);
I get the error below:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY temp, line 1
SQL state: 22021
and when I run the same query, but adding the encoding settings Encoding 'UTF-16' or Encoding 'UTF 16' to the with block, I get the error below:
ERROR: argument to option "encoding" must be a valid encoding name
LINE 13: ENCODING 'UTF 16' );
^
SQL state: 22023
Character: 377
I've looked through the postgres documentation to try to find the correct encoding, but haven't managed to find anything. Is this because the COPY function does not support UTF 16 encoded files? I would have thought that this would almost certainly have been possible!
I'm running postgres 12, on windows 10 pro
Any help would be hugely appreciated!
No, you cannot do that.
UTF-16 is not in the list of supported encodings.
PostgreSQL will never support an encoding that is not an extension of ASCII.
You will have to convert the file to UTF-8.
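The 0xff in the error is consistent with a UTF-16 byte-order mark (FF FE for little-endian) at the very start of the file. A minimal sketch of the conversion step in Python follows; `utf16_to_utf8` is a hypothetical helper, the file names are placeholders, and the sketch assumes the file carries a BOM so that Python's `utf-16` codec can detect the byte order.

```python
import os
import tempfile


def utf16_to_utf8(src_path: str, dst_path: str, chunk_chars: int = 1 << 20) -> None:
    """Transcode a UTF-16 text file to UTF-8 without loading it whole."""
    # newline="" keeps the original line endings untouched in both files.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


# Demonstration on a throwaway file with made-up content.
text = "id\tname\r\n1\tcafé\r\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "some_file_utf16.csv")
    dst = os.path.join(tmp, "some_file_utf8.csv")
    with open(src, "wb") as f:
        f.write(b"\xff\xfe" + text.encode("utf-16-le"))  # BOM + payload
    utf16_to_utf8(src, dst)
    with open(dst, "rb") as f:
        print(f.read() == text.encode("utf-8"))  # True
```

The converted file can then be loaded with the original COPY statement, since the BOM is consumed during decoding and the output is plain UTF-8.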

Postgresql invalid byte encoding "UTF8" 0xa9

I tried to insert Chinese characters and it failed. I used Homebrew on macOS to install the latest version of psql (11.1), and the locale was automatically set to zh_CN UTF8. When I tried to insert words like '店长', I got 'invalid byte sequence for encoding "UTF8": 0xa9'.
Can anyone help fix this T^T

Google Storage: Invalid Unicode path encountered

I'm trying to upload some files to GCS and I get this:
Building synchronization state...
Caught non-retryable exception while listing file:///media/Respaldo: CommandException: Invalid Unicode path encountered
(u'/media/Respaldo/Documentos/Trabajo/Traducciones/Servicio
Preventivo Semanal Hs Rev3 - Ingl\xe9s.doc'). gsutil cannot
proceed with such files present. Please remove or rename this file and
try again. NOTE: the path printed above replaces the problematic
characters with a hex-encoded printable representation. For more
details (including how to convert to a gsutil-compatible encoding) see
`gsutil help encoding`.
But when I run:
convmv -f ISO-8859-1 -t UTF-8 -r --replace /media/Respaldo
it says that all the non-English file names are already UTF-8. How should I proceed?
Edit: example of convmv output:
Skipping, already UTF-8: /media/Respaldo/Multimedia/Mis Imágenes/NOKIA/Memoria/Videoclips/Vídeo004.3gp
Skipping, already UTF-8: /media/Respaldo/Multimedia/Mis Imágenes/NOKIA/Memoria/Videoclips/Vídeo009.3gp
Skipping, already UTF-8: /media/Respaldo/Multimedia/Mis Imágenes/NOKIA/Memoria/Videoclips/Vídeo00133.3gp
Skipping, already UTF-8: /media/Respaldo/Multimedia/Mis Imágenes/NOKIA/Memoria/Videoclips/Vídeo023.3gp
Skipping, already UTF-8: /media/Respaldo/Multimedia/Mis Imágenes/NOKIA/Memoria/Videoclips/Vídeo026.3gp
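The gsutil message shows the name with `\xe9`, a single byte that is "é" in ISO-8859-1 but is not valid UTF-8 on its own, while the convmv output only confirms that the other names are fine. A sketch for listing just the names whose raw bytes do not decode as UTF-8 is shown below. It assumes a Linux-style filesystem where names are arbitrary bytes, and the helpers `non_utf8_names` and `walk_non_utf8` are hypothetical.

```python
import os
from typing import Iterable, Iterator


def non_utf8_names(names: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the names whose raw bytes are not valid UTF-8."""
    for name in names:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            yield name


def walk_non_utf8(root: bytes) -> Iterator[bytes]:
    """Walk a tree using bytes paths and yield the offending full paths."""
    # Passing bytes to os.walk makes it return undecoded bytes names.
    for dirpath, dirnames, filenames in os.walk(root):
        for bad in non_utf8_names(dirnames + filenames):
            yield os.path.join(dirpath, bad)


# The two kinds of name seen in the question, as raw bytes.
names = [b"Ingl\xe9s.doc", "Vídeo004.3gp".encode("utf-8")]
for bad in non_utf8_names(names):
    # Assumption: the stray name was written as ISO-8859-1.
    print(bad, "->", bad.decode("latin-1"))
```

On the real tree this would be called as `walk_non_utf8(b"/media/Respaldo")`. The resulting list is what needs renaming; the names convmv reports as already UTF-8 can be left alone.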

PostgreSQL Russian dictionary for full-text search

I tried to add a Russian dictionary for full-text search to a PostgreSQL database. I've downloaded the dictionary files, converted them to UTF-8, and tried to create a new dictionary:
$ iconv -f koi8-r -t utf-8 < ru_RU.aff > /opt/local/share/postgresql93/tsearch_data/russian.affix
$ iconv -f koi8-r -t utf-8 < ru_RU.dic > /opt/local/share/postgresql93/tsearch_data/russian.dict
CREATE TEXT SEARCH DICTIONARY russian_ispell (
TEMPLATE = ispell,
DictFile = russian,
AffFile = russian,
StopWords = russian
);
But got an ERROR:
ERROR: invalid byte sequence for encoding "UTF8": 0xd1
CONTEXT: line 341 of configuration file "/opt/local/share/postgresql93/tsearch_data/russian.affix": "SFX Y хаться шутся хаться"
I then tried other Russian dictionaries, but the same error occurred. How can I handle this error? Thanks.
You can try to execute the following command:
export LC_ALL=C
I think you have a locale issue. Run this command in the same command-line session in which you run the command that creates the dictionary.