invalid byte sequence for encoding "UTF8" - Talend

This is an age-old problem: I am moving data from MySQL (Latin1) to Postgres (UTF8) and getting the invalid byte error.
My setup for all solutions:
Additional JDBC parameter for Postgres: "characterEncoding=utf8"
tDBRow_1: "SET NAMES 'utf8'"
And yes, I've checked Stack Overflow on the matter. So far nothing has worked.
Options tried:
Only - "SET NAMES 'utf8'"
convert(cast(convert(data using latin1) as binary) using utf8) as data - in iot SQL query
CONVERT(CAST(data as BINARY) USING utf8) as data - in iot SQL query
CAST(CONVERT(data USING utf8) AS binary) - in iot SQL query
trim(both CHAR(0x00) from data) - in iot SQL query
row1.data.replace("\x00", " ") - in tMap
data.replace('\0', ' ') - in tJava
data.replaceAll("\0", "") - in tJava
What is left:
- change additional params in target to: noDatetimeStringSync=true&characterEncoding=utf8
- change additional params in target to: useOldUTF8Behavior=true
- change tDBRow_1 to SET CLIENT_ENCODING TO utf8
But I have run out of ideas at the moment, and so has the Internet.
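For anyone reproducing this outside Talend: the attempts above target two separate causes, bytes that are Latin-1 rather than UTF-8, and embedded NUL characters, which Postgres text values cannot store. Below is a minimal Python sketch of handling both, assuming the MySQL column really holds Latin-1 bytes. The function name and the sample value are made up for illustration and are not part of the job.

```python
def latin1_to_clean_utf8(raw: bytes) -> bytes:
    """Decode Latin-1 bytes, drop NUL characters, and re-encode as UTF-8."""
    text = raw.decode("latin-1")      # every byte maps to exactly one code point
    text = text.replace("\x00", "")   # Postgres text values cannot contain NUL
    return text.encode("utf-8")


sample = b"caf\xe9\x00"               # hypothetical value: "cafe" with e-acute plus a trailing NUL
cleaned = latin1_to_clean_utf8(sample)
print(cleaned)                        # the UTF-8 bytes that would be sent to Postgres
```

If the cleaned bytes load fine but the job still fails, the encoding declared on the MySQL connection is the next thing to check.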

Where exactly do we place this postgresql.conf configuration file in spring boot application?

I am trying to encrypt a column in my Postgres DB. The column name is "test" of type "bytea".
My entity code is below,
@ColumnTransformer(read = "pgp_sym_decrypt(" + " test, "
+ " current_setting('encrypt.key')"
+ ")", write = "pgp_sym_encrypt( " + " ?, "
+ " current_setting('encrypt.key')" + ") ")
@Column(columnDefinition = "bytea")
private String test;
postgresql.conf configuration file:
encrypt.key = 'Wow! So much security.'
I placed the postgresql.conf configuration file in src/main/resources of the Spring Boot application, but the encrypt.key value is not being picked up. Is there a way to pass the key using application.properties?
postgresql.conf is the configuration file for the Postgres server. It's stored inside the data directory (aka "cluster") on the server.
You can't put it on the client side (where your application runs). It has no meaning there.
To change values in there, you need to edit the file (on the server) or use ALTER SYSTEM.
If you want to change a configuration setting for the current session only, use SET or set_config().
The latter two are probably the ones you are looking for to initialize the custom property for your encryption/decryption functions.
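As a rough illustration of the session-level route, here is a minimal Python sketch that builds a parameterized set_config() call. It assumes a DB-API driver that accepts %s placeholders (psycopg2 does). The helper name and the key value are hypothetical, and in a Spring Boot application the same statement would have to be issued from Java on each new connection.

```python
def session_setting_query(name: str, value: str):
    """Return (sql, params) that set a custom option for the current session."""
    # set_config(setting_name, new_value, is_local): is_local = false keeps the
    # value for the whole session instead of only the current transaction.
    sql = "SELECT set_config(%s, %s, false)"
    return sql, (name, value)


sql, params = session_setting_query("encrypt.key", "hypothetical-secret")
print(sql, params)
```

With an open cursor this would be run as cursor.execute(sql, params) before any query that calls current_setting('encrypt.key').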
To make encrypt.key available beyond the current session, store it in postgresql.conf.
The correct place is at the end of this file, in the "Customized Options" section:
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
# Add settings for extensions here
encrypt.key=123456
Reload the configuration of the database server:
systemctl reload postgresql.service
To test that it is working correctly, open a psql session and type:
mydb=# show encrypt.key;
encrypt.key
-------------
123456
(1 row)
Example of encryption:
mydb=# select pgp_sym_encrypt('Hola mundo',current_setting('encrypt.key'));
pgp_sym_encrypt
------------------------------------------------------------------------------------------------------------------------------------------------------------
\xc30d04070302255230e388dfe25e7dd23b01c5b8e62d148088a3417d3c27ed2cc11655d863b271672b9f076fffb82f1a7f074f2ecbe973df04642cd7a4f76ca5cff4a13b9a71e7cc6e693827
(1 row)
Example of decryption:
mydb=# select pgp_sym_decrypt('\xc30d04070302255230e388dfe25e7dd23b01c5b8e62d148088a3417d3c27ed2cc11655d863b271672b9f076fffb82f1a7f074f2ecbe973df04642cd7a4f76ca5cff4a13b9a71e7cc6e693827',current_setting('encrypt.key'));
pgp_sym_decrypt
-----------------
Hola mundo
(1 row)

Convert Character Set from WIN1252 to UTF8 - Firebird 3

I'm facing problems trying to convert a Firebird 3 database with character set WIN1252 to UTF8.
I've performed the following procedures:
I extracted the DDL and definitions from the database and used them to create a new database with character set UTF8 and collation UNICODE_CI_AI. The database structure was created correctly.
Then, when I try to use FBCopy to copy data from the WIN1252 database to the new UTF8 database, the process aborts with this error:
Message: isc_dsql_execute2 failed
SQL Message: -104
can not format message 13:896 - message file C:\WINDOWS\SYSTEM32\firebird.msg not found
Engine Code: 335544849
Engine Message:
Malformed string
Enabling triggers ... done.
Before using FBCopy, I also tried a backup and restore of the WIN1252 database with the following gbak switches:
-FIX_FSS_D UTF8 -FIX_FSS_M UTF8
or
-FIX_FSS_D WIN1252 -FIX_FSS_M WIN1252
However, I still get the same error.

Loading Data from PostgreSQL into Stata

When I load data from PostgreSQL into Stata some of the data has unexpected characters appended. How can I avoid this?
Here is the Stata code I am using:
odbc query mydatabase, schema $odbc
odbc load, exec("SELECT * FROM my_table") $odbc allstring
Here is an example of the output I see:
198734/0 one/0/r April/0/0/0
893476/0 two/0/r May/0/0/0
324192/0 three/0/r June/0/0/0
In Postgres the data is:
198734 one April
893476 two May
324192 three June
I see this mostly in larger tables, and with fields of all datatypes in PostgreSQL. If I export the data to a csv there are no trailing characters.
The odbc.ini file I am using looks like this:
[ODBC Data Sources]
mydatabase = PostgreSQL
[mydatabase]
Debug = 1
CommLog = 1
ReadOnly = no
Driver = /usr/lib64/psqlodbcw.so
Servername = myserver
Servertype = postgres
FetchBufferSize = 99
Port = 5432
Database = mydatabase
[Default]
Driver = /usr/lib64/psqlodbcw.so
I am using unixODBC 2.3.1, PostgreSQL 9.4.9 with server encoding UTF8, and Stata 14.1.
What is causing the unexpected characters in the data imported into Stata? I know that I can clean the data once it’s in Stata but I would like to avoid this.
I was able to fix this by adding the line
set odbcdriver ansi
to the Stata code.

PostgreSQL: Export data from SQL Server 2008 R2 to PostgreSQL 9.5

I have a table whose data I want to export from SQL Server to PostgreSQL.
Steps I followed:
Step 1: Export data from SQL Server:
Source: SQL Server Table
Destination: Flat file Destination
Table Or Query to copy: Query
Query:
SELECT
COALESCE(convert(varchar(max),id),'NULL') + '|'
+ COALESCE(convert(varchar(max),Name),'NULL') + '|'
+ COALESCE(convert(varchar(max),EDate,121),'NULL') AS A
FROM tbl_Employee;
File Name: file.txt
Step 2: Copy to PostgreSQL.
Command:
\COPY tbl_employee FROM '$FilePath\file.txt' DELIMITER '|' NULL AS 'NULL' ENCODING 'LATIN1'
I get the following error message:
ERROR: invalid byte sequence for encoding "UTF8": 0xc1 0x20
You tell Postgres that the source is encoded as LATIN1:
\copy ... ENCODING 'LATIN1'
But that's either not the case or the file is damaged. Else we would not see the error message. What is the true encoding of '$FilePath\file.txt'?
The current client_encoding is not relevant for this since, quoting the manual on COPY:
ENCODING
Specifies that the file is encoded in the encoding_name. If this option is omitted, the current client encoding is used.
(\copy is just a wrapper for SQL COPY in psql.)
And your server_encoding is largely irrelevant, too - as long as Postgres can use a built-in conversion and the target encoding contains all characters of the source encoding - which is the case for LATIN1 -> UTF8: iso_8859_1_to_utf8.
So the remaining source of error is your file, which is almost certainly not valid LATIN1.
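One way to approach the "true encoding" question is to look at the raw bytes of the file before loading it. The following Python sketch is only a diagnostic aid, not part of the COPY workflow: it reports the first position where the data stops being valid UTF-8 and shows how the same bytes read as LATIN1. The sample line is hypothetical, built around the bytes 0xc1 0x20 from the error message; for the real file you would read its contents in binary mode instead.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending bytes) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.start + 2]
    return None


sample = b"1|\xc1 lvaro|2016-01-01"   # hypothetical row containing the bytes 0xc1 0x20
print(first_invalid_utf8(sample))      # where UTF-8 decoding breaks
print(sample.decode("latin-1"))        # the same bytes interpreted as LATIN1
```

If the function returns None for the whole file, the file is already valid UTF-8 and declaring it as LATIN1 would be the wrong choice.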

Can not insert German characters in Postgres

I am using UTF8 as encoding for my Postgres 8.4.11 database:
CREATE DATABASE test
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = mydata
LC_COLLATE = 'de_DE.UTF-8'
LC_CTYPE = 'de_DE.UTF-8'
CONNECTION LIMIT = -1;
ALTER DATABASE test SET default_tablespace='mydata';
ALTER DATABASE test SET temp_tablespaces=mydata;
And the output of \l:
test | postgres | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 |
When I try to insert a German character:
create table x(a text);
insert into x values('ä,ß,ö');
ERROR: invalid byte sequence for encoding "UTF8": 0xe42cdf
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
I am using PuTTY to connect. Any ideas?
The key element is the client_encoding - the encoding the server expects from your client. It has to match what is actually sent. What do you get for show client_encoding? Is it UNICODE?
Read more in the chapter Automatic Character Set Conversion Between Server and Client of the manual.
If you are using psql as client, you can set client_encoding with \encoding. Check the encoding your local system uses (on Linux, type locale in the shell) and set a matching client_encoding in psql. You can avoid such complications if you use the same locale on your system as you use as encoding for your PostgreSQL server.
If you use PuTTY (on Windows), make sure to set its "Translation" accordingly. Have a look at Settings: Window - Translation. It must match client_encoding. You can right-click in a running session and choose Change Settings. You can also save these settings with your saved sessions.
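The bytes in the error message already hint at what the terminal is sending. A small Python check, offered as an illustration only, shows that 0xe42cdf is the LATIN1 encoding of the first three characters of the inserted string, and is not valid UTF-8:

```python
reported = bytes.fromhex("e42cdf")     # bytes quoted in the error: 0xe42cdf

as_latin1 = reported.decode("latin-1") # the same bytes interpreted as LATIN1
print(as_latin1)

try:
    reported.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xe4 announces a 3-byte UTF-8 sequence, but 0x2c (a comma) cannot continue it
    valid_utf8 = False
print(valid_utf8)

# What a UTF-8 terminal would have sent for the same three characters
as_utf8 = "ä,ß".encode("utf-8")
print(as_utf8.hex())
```

So the terminal is sending LATIN1 while the session declares UTF8. Either side can be changed, as long as the two agree.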