How to save Non-English characters in an ADF Copy Sink to a CSV dataset - azure-data-factory

In an ADF Copy activity, I am reading data from Databricks Delta tables whose columns may contain non-English characters. It reads the data perfectly, as I can see in the data preview. Next, I save (sink) the data to a CSV file.
When I open the CSV file, the non-English characters show up as either unreadable characters or question marks, depending on which encoding I use.
When the encoding is UTF-8 (default), the non-English characters become unreadable.
When the encoding is ISO 8859-15, they become question marks.
Below are the sample non-English characters:
With encoding UTF-8 (default)
With encoding ISO 8859-15
Any suggestions, please?

When you create the sink table, you need to define the column with a UTF-8 collation, as below:
create table xxxx
(
varchar(500) collate Latin1_General_100_CI_AI_SC_UTF8
)
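
The two symptoms described in the question can be reproduced outside ADF. The sketch below is plain Python, not ADF configuration, and it rests on an assumption the question does not confirm: that the CSV is opened by a viewer that guesses a legacy code page such as Windows-1252. The sample string is made up for illustration.

```python
# Hypothetical sample text; any characters outside Latin-9 behave the same way.
sample = "日本語 café"

# Symptom 1: the UTF-8 bytes are valid, but a viewer that assumes
# Windows-1252 maps every single byte to its own character, so each
# multi-byte sequence shows up as two or three unreadable characters.
utf8_bytes = sample.encode("utf-8")
seen_as_cp1252 = utf8_bytes.decode("cp1252", errors="replace")

# Symptom 2: ISO 8859-15 has no code points for the Japanese characters,
# so each one is replaced with "?" and the original text is lost.
latin9_bytes = sample.encode("iso8859_15", errors="replace")

# A byte order mark in front of the UTF-8 data is one hint some viewers
# use to recognise the encoding; Python's "utf-8-sig" codec adds it.
utf8_with_bom = sample.encode("utf-8-sig")

print(len(sample), len(seen_as_cp1252))  # the misread text is longer
print(latin9_bytes)
print(utf8_with_bom[:3])
```

If the first symptom matches, the file itself is probably intact and only the viewer's encoding guess is wrong, whereas the question marks from the second symptom cannot be recovered from the file.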

Related

What will be the change in Application behavior after altering table with CODEUNITS32 for supporting unicode behavior?

We are migrating some tables from an AS400 DB to DB2 LUW (V11.1).
While migrating, we found a special character (€) in a CHAR column of the source database (AS400), which leads to an error unless we alter the table column to CODEUNITS32. The DB2 LUW database is configured with the UTF-8 encoding.
We want to understand how the application will behave after the CHAR column is changed to CODEUNITS32. Do we need to update any configuration at the application level (C and Java applications) to handle both character encoding sets?
After changing to CODEUNITS32:
- Will my C application still compile and handle the change from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
- Will my Java application handle the change from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
We did some pilot testing by manually inserting the special character into the table after changing the column definition from CHAR to CODEUNITS32, and the testing was successful.
Using a string units specification of CODEUNITS32 for a column does not change the encoding of the column; the data is still stored in UTF-8 for CHAR/VARCHAR columns.
It multiplies the physical length (CHAR) or maximum length (VARCHAR) of the column in bytes by a factor of 4.
It also enables "character semantics" in some functions such as SUBSTR(), so that they work on characters, not bytes, when processing CODEUNITS32 columns. (SUBSTRING() always uses character semantics, unless it is processing a FOR BIT DATA column.)
So a CHAR(4), which is the same as CHAR(4 OCTETS), is 4 bytes long and can hold 4 characters only if they are all single-byte in UTF-8. Because € is 3 bytes long, the column could hold, say, €4 but not €42.
A CHAR(4 CODEUNITS32) is 16 bytes long and is allowed to hold up to 4 characters. It could hold €€€€ but not €2345.
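
The length rules above can be checked with a small Python model. This is only a sketch of the counting, not of Db2 itself: the function names `fits_octets` and `fits_codeunits32` are hypothetical, and padding and truncation behaviour are not modelled.

```python
def fits_octets(value: str, n: int) -> bool:
    """Model of CHAR(n OCTETS): the limit counts UTF-8 bytes."""
    return len(value.encode("utf-8")) <= n


def fits_codeunits32(value: str, n: int) -> bool:
    """Model of CHAR(n CODEUNITS32): the limit counts UTF-32 code units.

    Python's len() counts code points, which is the same unit.
    """
    return len(value) <= n


print(len("€".encode("utf-8")))        # 3
print(fits_octets("€4", 4))            # True:  3 + 1 bytes
print(fits_octets("€42", 4))           # False: 3 + 1 + 1 bytes
print(fits_codeunits32("€€€€", 4))     # True:  4 characters
print(fits_codeunits32("€2345", 4))    # False: 5 characters
```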
It is worth considering avoiding CHAR(x CODEUNITS32) and preferring VARCHAR(x CODEUNITS32). UTF-8 does not really play well with fixed-width data types. The more common UTF-8 characters are 1 or 2 bytes long, so typically a CHAR(x CODEUNITS32) column will be more than 50% space padding.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008470.html
CODEUNITS32
Indicates that the units for the length attribute are Unicode UTF-32 code units which approximate counting in characters.
This unit of length does not affect the underlying code page of the data type.
The actual length of a data value is determined by counting the UTF-32 code units as if the data was converted to UTF-32.
A string unit of CODEUNITS32 can be used only in a Unicode database.
CODEUNITS32 can be explicitly specified or determined based on an environment setting.
Also, out of interest, GRAPHIC and VARGRAPHIC columns are stored in UTF-16 and default to CODEUNITS16, but can also use CODEUNITS32.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008471.html

How to import csv file containing german characters into postgresql db (remote db)

I have Excel files containing German characters (äöüß).
I convert these Excel files to CSV files with UTF-8 encoding using the method mentioned here, in the section "How to convert Excel to CSV UTF-8".
I use these CSV files to import the data into the PostgreSQL database with the HeidiSQL tool. But once the data is imported, all the German characters are converted to weird characters (e.g. ö to Ã¶ and ü to Ã¼).
How else can I import the CSV files so that all the German characters remain the same when they are imported into the PostgreSQL database?
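
This kind of substitution is what UTF-8 bytes look like when they are decoded as Windows-1252. The Python sketch below reproduces it and adds a check that a file's bytes are valid UTF-8. It assumes that some step in the import reads the file with that code page; the question does not say which step, and both function names are hypothetical.

```python
def misread_as_cp1252(text: str) -> str:
    """Show how UTF-8 encoded text appears when decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Each German character becomes two characters, the first of which is "Ã".
garbled = misread_as_cp1252("ö")
print(len(garbled), garbled.startswith("\u00c3"))    # 2 True

# A file saved as UTF-8 passes the check; one saved as Windows-1252 does not.
print(is_valid_utf8("äöüß".encode("utf-8")))         # True
print(is_valid_utf8("äöüß".encode("cp1252")))        # False
```

If the CSV passes the UTF-8 check, the file is fine and the place to look is the encoding chosen in the import dialog or on the client connection.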

The pgsql2shp.exe cuts-off text to max 254 characters (varchar(254))

I'm using the pgsql2shp tool to generate *.shp files from geometries in Postgres. The thing is that I have a description column with a lot of text. In the Postgres DB it is of type text. But when I use pgsql2shp, this column is cut off at 254 characters; it is turned into a varchar(254).
Any ideas to make this work?
After some more googling and asking around, I found out that the dbf file that accompanies the *.shp is based on the dBase IV format, which has a maximum text field length of 254 characters. Therefore the text is cut off.
So I need to find some other solution.
As you have discovered, this is a limitation of Shapefiles. To get more characters in the output, you need to export to a different format.
You can use ogr2ogr to convert the spatial data into several different formats, such as Spatialite, GeoJSON, etc.

SQL Server 2008 R2 : detect encoding in nvarchar field

I have a string table with over 1,000,000 rows that has some garbage inside due to encoding errors.
The garbage is minimal, but needs to be found.
The column in question is an NVARCHAR column that normally contains text in one of 11 languages.
All of the text should be Unicode (UTF-8 when we process it application side).
The corrupt values contain ? characters and/or a very limited set of unusual glyphs; by eye they can very easily be seen not to be valid language. It is likely that these values have been encoded back and forth into total garbage.
So, in the name of speed, is there anything I can do in SQL Server to detect bad encoding or string garbage?
Thanks.
EDIT to add garbage example:
This was Russian или Ð˜Ð¼Ñ Ð£Ñ‡Ð°Ñтника
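
One heuristic for this kind of garbage, sketched here in Python on the application side rather than inside SQL Server, is to test whether a value survives a round trip from Windows-1252 bytes back to UTF-8. The function name `repair_mojibake` is hypothetical, and the check assumes the damage came from UTF-8 being decoded as Windows-1252 exactly once. A string that mixes intact and garbled text, or that was converted more than once, would not be caught.

```python
from typing import Optional


def repair_mojibake(text: str) -> Optional[str]:
    """Return repaired text if `text` looks like UTF-8 decoded as
    Windows-1252; return None if it does not fit that pattern."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Windows-1252 (for example
        # intact Cyrillic) or its bytes are not valid UTF-8.
        return None
    # Plain ASCII round-trips unchanged, so it is not flagged.
    return repaired if repaired != text else None


# Build a garbled value the same way the corruption is assumed to happen.
garbled = "Привет".encode("utf-8").decode("cp1252")

print(repair_mojibake(garbled) == "Привет")   # True
print(repair_mojibake("Привет"))              # None
print(repair_mojibake("plain ASCII"))         # None
```

Rows for which the function returns a value are candidates for manual review; the heuristic can also flag rare legitimate strings, so it is a filter, not a verdict.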

SQL- unreadable special characters

I don't have much experience with MS SQL Server 2008 R2, but here is the issue, if you could help me please:
I have a table with a column (type: nvarchar) that stores text. The text is read from a text file and written to the database using a VB.NET application.
The text in the text file contains Turkish characters such as the u with two dots on top (in the future it will be in different languages).
When I open the table, the text in the column is not readable. The Turkish special characters are converted to unreadable characters.
Is there any way to make the text readable in the table?
Thank you so much.
SQL Server doesn't change any character stored in tables. I think the problem is that the text is being displayed in a different character set. Try using the UTF-8 character set.