I have an Azure Data Factory pipeline that reads a CSV via an HTTP connection and stores the data in Azure Data Lake Storage Gen2. The file encoding is UTF-8. It seems the file somehow gets corrupted because of the polygon definitions.
The file content looks like this:
Shape123|"MULTIPOLYGON (((496000 6908000, 495000 6908000, 495000 6909000, 496000 6909000, 496000 6908000)))"|"Red"|"Long"|"208336"|"5"|"-1"
Problem 1:
PolyBase complains about the encoding and cannot read the file.
Problem 2:
A Databricks DataFrame cannot handle this either; it cuts the row and reads only "Shape123|"MULTIPOLYGON (((496000 6908000,"
Quick workaround:
Open the CSV file with Notepad++ and re-save it with UTF-8 encoding. After that, PolyBase is able to handle it.
Questions:
What is an automatic way to fix the CSV file?
How can I make the DataFrame handle the entire row if the CSV file cannot be fixed?
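For the second question, the truncation in a Databricks DataFrame typically happens when the file is read with the default comma separator, so the first comma inside the polygon ends the field. A minimal sketch in Spark SQL, assuming a hypothetical ADLS path and file name, is to declare the pipe separator and the double-quote character explicitly (for the first question, it is also worth double-checking the encoding setting on the ADF delimited-text dataset):
CREATE TEMPORARY VIEW shape_raw
USING csv
OPTIONS (
  path 'abfss://container@account.dfs.core.windows.net/shape/shape_working.txt',  -- hypothetical path
  sep '|',        -- pipe, not the default comma
  quote '"',      -- the polygon column is wrapped in double quotes
  encoding 'UTF-8',
  header 'false'
);
-- the full MULTIPOLYGON value should now land in a single column
SELECT * FROM shape_raw LIMIT 5;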
PolyBase can cope perfectly well with UTF-8 files and various delimiters. Did you create an external file format with a pipe delimiter and double quote as the string delimiter, something like this?
CREATE EXTERNAL FILE FORMAT ff_pipeFileFormatSHAPE
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
STRING_DELIMITER = '"',
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE shape_data (
col1 VARCHAR(20),
col2 VARCHAR(8000),
col3 VARCHAR(20),
col4 VARCHAR(20),
col5 VARCHAR(20),
col6 VARCHAR(20),
col7 VARCHAR(20)
)
WITH (
LOCATION = 'yourPath/shape/shape working.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_pipeFileFormatSHAPE,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
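Once the file format and external table exist, a quick sanity check (assuming the LOCATION and DATA_SOURCE above match your environment) is simply:
SELECT TOP 10 *
FROM shape_data;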
My results:
I'm trying to load CSV data into Snowflake using Snowflake COPY. The CSV column separator is the pipe character (|). One of the columns in the input CSV contains the | character in its data, escaped by a backslash. The data type of the target Snowflake column is VARCHAR(n). With the addition of the escape character (\) the data size exceeds the target column definition size, which causes the copy to fail.
Is there a way I can remove the escape character (\) from the data before it is loaded into the table?
copy into table_name from 's3path.csv.gz'
file_format = (type = 'csv', field_optionally_enclosed_by ='"' escape_unenclosed_field = NONE empty_field_as_null = true escape ='\' field_delimiter ='|' skip_header=1 NULL_IF = ('\\N', '', '\N'))
ON_ERROR ='ABORT_STATEMENT' PURGE = FALSE
Sample data that causes the failure: "Data 50k | $200K "
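One option, if the file is reachable through a named stage, is to load with a COPY transformation and strip the backslashes with REPLACE before the value reaches the VARCHAR column. This is only a sketch: @my_stage is a hypothetical stage, the pipe-containing column is assumed to be the second field ($1, $2, $3 of a three-column file), and it assumes the affected values are enclosed in double quotes, so the escape option can be dropped and the literal backslashes removed in the SELECT instead (carry over your other file-format options such as NULL_IF as needed):
copy into table_name
from (
    select
        $1,
        replace($2, '\\', ''),  -- strip the literal backslashes from the affected column
        $3
    from @my_stage/s3path.csv.gz  -- hypothetical stage path; COPY transformations need a stage reference
)
file_format = (type = 'csv' field_optionally_enclosed_by = '"' field_delimiter = '|' skip_header = 1)
on_error = 'ABORT_STATEMENT' purge = FALSE;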
I'm using Hive 3.1.2 and tried to create a bucketed table with bucketing_version=2.
When I created the buckets and checked the bucket files using hdfs dfs -cat, I could see that the hashing results differed between MR and Tez.
Are the hash algorithms of Tez and MR different? Shouldn't they be the same if bucketing_version=2?
Here's the test method and its results.
1. Create Bucket table & Data table
CREATE EXTERNAL TABLE `bucket_test`(
`id` int COMMENT ' ',
`name` string COMMENT ' ',
`age` int COMMENT ' ',
`phone` string COMMENT ' ')
CLUSTERED BY (id, name, age) SORTED BY(phone) INTO 2 Buckets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'bucketing_version'='2',
'orc.compress'='ZLIB');
CREATE TABLE data_table (id int, name string, age int, phone string)
row format delimited fields terminated by ',';
2. Insert data into DATA_TABLE
INSERT INTO TABLE data_table
select stack
( 20
,1, 'a', 11, '111'
,1,'a',11,'222'
,3,'b',14,'333'
,3,'b',13,'444'
,5,'c',18,'555'
,5,'c',18,'666'
,5,'c',21,'777'
,8,'d',23,'888'
,9,'d',24,'999'
,10,'d',26,'1110'
,11,'d',27,'1112'
,12,'e',28,'1113'
,13,'f',28,'1114'
,14,'g',30,'1115'
,15,'q',31,'1116'
,16,'w',32,'1117'
,17,'e',33,'1118'
,18,'r',34,'1119'
,19,'t',36,'1120'
,20,'y',36,'1130')
3. Create Bucket with MR
set hive.enforce.bucketing = true;
set hive.execution.engine = mr;
set mapreduce.job.queuename=root.test;
Insert overwrite table bucket_test
select * from data_table ;
4. Check Bucket contents
# bucket0 : 6 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
10d261110
11d271112
18r341119
3b13444
5c18555
5c18666
# bucket1 : 14 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
1a11111
12e281113
13f281114
14g301115
15q311116
16w321117
17e331118
19t361120
20y361130
1a11222
3b14333
5c21777
8d23888
9d24999
5. Create Bucket with Tez
set hive.enforce.bucketing = true;
set hive.execution.engine = tez;
set tez.queue.name=root.test;
Insert overwrite table bucket_test
select * from data_table ;
6. Check Bucket contents
# bucket0 : 11 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
1a11111
10d261110
11d271112
13f281114
16w321117
17e331118
18r341119
20y361130
1a11222
5c18555
5c18666
# bucket1 : 9 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
12e281113
14g301115
15q311116
19t361120
3b14333
3b13444
5c21777
8d23888
9d24999
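As a side note, it is worth confirming that both runs really wrote with the same bucketing version by checking the table metadata, for example:
DESCRIBE FORMATTED bucket_test;
-- bucketing_version should show up as 2 under Table Parameters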
I am trying to move data from Snowflake to PostgreSQL, and to do so I first unload it to S3 in CSV format. Commas can appear in the text, so I use the FIELD_OPTIONALLY_ENCLOSED_BY Snowflake unloading option to quote the content of the problematic cells. However, when that happens in combination with NULL values, I can't manage to produce a CSV that is valid for PostgreSQL.
I created a simple table so you can understand the issue. Here it is:
CREATE OR REPLACE TABLE PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
INSERT INTO PUBLIC.TEST VALUES
('A', 1),
(NULL, 2),
('B', NULL),
(NULL, NULL),
('Hello, world', NULL)
;
COPY INTO @STAGE/test
FROM PUBLIC.TEST
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ''
)
OVERWRITE = TRUE;
From that, Snowflake will create the following CSV:
"A",1
"",2
"B",""
"",""
"Hello, world",""
But after that, it is impossible for me to copy this CSV into a PostgreSQL table as it is.
Even though the PostgreSQL documentation says, next to the NULL option:
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format.
Not setting the NULL option in the PostgreSQL COPY results in a failed load. It won't work anyway, because we also have to specify the quote character via QUOTE; here it will be QUOTE '"'.
Therefore, during the PostgreSQL load, using:
FORMAT csv, HEADER false, QUOTE '"' will give :
DataError: invalid input syntax for integer: "" CONTEXT: COPY test, line 3, column numeric_field: ""
FORMAT csv, HEADER false, NULL '""', QUOTE '"' will give :
NotSupportedError: CSV quote character must not appear in the NULL specification
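One PostgreSQL-side idea that may be worth testing (a sketch only, using the same hypothetical bucket/key/region placeholders as in the test command below) is the FORCE_NULL option of COPY, which matches quoted null strings for the listed columns. It would cure the integer column, although the quoted empty strings in the text column would still load as empty strings rather than NULL:
SELECT aws_s3.table_import_from_s3(
    'PUBLIC.TEST',
    '',
    '(FORMAT csv, HEADER false, QUOTE ''"'', FORCE_NULL (numeric_field))',
    'bucket',
    'test_0_0_0.csv',
    'aws_region'
);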
FYI, to test the load from S3 I use this command in PostgreSQL:
CREATE TABLE IF NOT EXISTS PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
SELECT aws_s3.table_import_from_s3(
'PUBLIC.TEST',
'',
'(FORMAT csv, HEADER false, NULL ''""'', QUOTE ''"'')',
'bucket',
'test_0_0_0.csv',
'aws_region'
)
Thanks a lot for any ideas on how I could make this work. I would love to find a solution that doesn't require modifying the CSV between Snowflake and PostgreSQL. I think the issue is more on the Snowflake side, as it doesn't really make sense to quote NULL values, but PostgreSQL is not helping either.
When you set the NULL_IF value to '', you are actually telling Snowflake to convert NULLs to a blank string, which then gets quoted. When you are copying out of Snowflake, the copy options are "backwards" in a sense, and NULL_IF acts more like an IFNULL.
This is the code that I'd use on the Snowflake side, which will result in an unquoted empty string in your CSV file:
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
)
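Applied to the COPY from the question (same stage path), the unload would then be:
COPY INTO @STAGE/test
FROM PUBLIC.TEST
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
)
OVERWRITE = TRUE;
With NULL_IF = () the NULL cells should come out as unquoted empty fields, which PostgreSQL's CSV defaults already treat as NULL.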
There's not much info out there for Perl DBI and Snowflake, so I'll give this a shot. I have a raw file whose headers are contained in line 1. This exact COPY INTO command works from the Snowflake GUI. I'm not sure if I can just take the command as-is and put it into a Perl prepare and execute.
COPY INTO DBTABLE.LND_LND_STANDARD_DATA FROM (
SELECT SPLIT_PART(METADATA$FILENAME,'/',4) as SEAT_ID,
$1:auction_id_64 as AUCTION_ID_64,
DATEADD(S,$1:date_time,'1970-01-01') as DATE_TIME,
$1:user_tz_offset as USER_TZ_OFFSET,
$1:creative_width as CREATIVE_WIDTH,
$1:creative_height as CREATIVE_HEIGHT,
$1:media_type as MEDIA_TYPE,
$1:fold_position as FOLD_POSITION,
$1:event_type as EVENT_TYPE
FROM @DBTABLE.lnd.S3_STAGE_READY/pr/data/standard/data_dt=20200825/00/STANDARD_FILE.gz.parquet)
pattern = '.*.parquet' file_format = (TYPE = 'PARQUET' SNAPPY_COMPRESSION = TRUE)
ON_ERROR = 'SKIP_FILE_10%'
my $SQL = "COPY INTO DBTABLE.LND_LND_STANDARD_DATA FROM (
SELECT SPLIT_PART(METADATA\$FILENAME,'/',4) as SEAT_ID,
\$1:auction_id_64 as AUCTION_ID_64,
DATEADD(S,\$1:date_time,'1970-01-01') as DATE_TIME,
\$1:user_tz_offset as USER_TZ_OFFSET,
\$1:creative_width as CREATIVE_WIDTH,
\$1:creative_height as CREATIVE_HEIGHT,
\$1:media_type as MEDIA_TYPE,
\$1:fold_position as FOLD_POSITION,
\$1:event_type as EVENT_TYPE
FROM \@DBTABLE.lnd.S3_STAGE_READY/pr/data/standard/data_dt=20200825/00/STANDARD_FILE.gz.parquet)
pattern = '.*.parquet' file_format = (TYPE = 'PARQUET' SNAPPY_COMPRESSION = TRUE)
ON_ERROR = 'SKIP_FILE_10%'";
my $sth = $dbh->prepare($SQL);
$sth->execute;
Looking at the output from Snowflake, I see these errors:
syntax error line 3 at position 4 unexpected '?'.
syntax error line 4 at position 13 unexpected '?'.
COPY INTO DBTABLE.LND_LND_STANDARD_DATA FROM (
SELECT SPLIT_PART(METADATA$FILENAME,'/',4) as SEAT_ID,
$1? as AUCTION_ID_64,
DATEADD(S,$1?,'1970-01-01') as DATE_TIME,
$1? as USER_TZ_OFFSET,
$1? as CREATIVE_WIDTH,
$1? as CREATIVE_HEIGHT,
$1? as MEDIA_TYPE
Do I need to create bind variables for each of the columns? I usually pull the data in from the file and put it into variables, but this is different: I can't read the raw file first, it has to come directly from the COPY INTO command.
Any help would be appreciated.
It was interpreting the : as a bind variable placeholder, rather than as a path into the variant. I used bracket notation instead, like the following:
my $SQL = "COPY INTO DBTABLE.LND_LND_STANDARD_DATA FROM (
SELECT SPLIT_PART(METADATA\$FILENAME,'/',4) as SEAT_ID,
\$1['auction_id_64'] as AUCTION_ID_64,
DATEADD(S,\$1['date_time'],'1970-01-01') as DATE_TIME,
\$1['user_tz_offset'] as USER_TZ_OFFSET,
\$1['creative_width'] as CREATIVE_WIDTH,
etc...
That worked.
I am trying to execute the following BTEQ command in a Linux environment, but I couldn't load the data properly into the Teradata DB server. Can someone please advise me on how to resolve the issue I am facing while loading?
BTEQ command used:
.SET width 64000;
.SET session transaction btet;
.logmech ldap
.logon XXXXXXX/XXXXXXXX,********;
DATABASE corecm;
.PACK 1000
.IMPORT VARTEXT '~' FILE=/v/global/user/application_event_bus_evt
.REPEAT *
USING(APPLICATION_EVENT_ID CHAR(24),BUS_EVT_ID CHAR(24),BUS_EVT_VID BIGINT,BUS_EVT_RESTATE_IN SMALLINT)
insert into corecm.application_event_bus_evt (APPLICATION_EVENT_ID
, BUS_EVT_ID
, BUS_EVT_VID
, BUS_EVT_RESTATE_IN
)
values
( COALESCE(:APPLICATION_EVENT_ID,1)
, COALESCE(:BUS_EVT_ID,1)
, COALESCE(:BUS_EVT_VID,1)
, COALESCE(:BUS_EVT_RESTATE_IN,1)
) ;
.LOGOFF;
.EXIT;
SAMPLE INPUT FILE, DELIMITER "~" [ /v/global/user/application_event_bus_evt ]:
Ckn3gMxLEeOgIQBQVgErYA==~g+GDDtlaY3n7BdUrYshDFA==~1~1
CL1kEcxLEeOgIQBQVgErYA==~qoKoiuGDbClpcGt/z6RKGw==~1~1
oYIVcMxKEeOgIQBQVgErYA==~mfmQiwl7yAteevzJfilMvA==~1~1
5N7ME5bM4xGhM7exj3ykUw==~yFM2FZbM4xGhM7exj3ykUw==~1~0
JLBH4JfM4xGDH9s5+Ds/8w==~doZ/7pfM4xGDH9s5+Ds/8w==~1~0
fGvpoMxKEeOgIQBQVgErYA==~mQUQIK2mY6WIPcszfp5BTQ==~1~1
Table Definition:
CREATE MULTISET TABLE CORECM.APPLICATION_EVENT_BUS_EVT ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
APPLICATION_EVENT_ID CHAR(26) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
BUS_EVT_ID CHAR(26) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
BUS_EVT_VID BIGINT NOT NULL,
BUS_EVT_RESTATE_IN SMALLINT)
UNIQUE PRIMARY INDEX ( APPLICATION_EVENT_ID ,BUS_EVT_ID ,BUS_EVT_VID )
INDEX APPLICATION_EVENT_BUS_EVT_IDX1 ( APPLICATION_EVENT_ID )
INDEX APPLICATION_EVENT_BUS_EVT_IDX2 ( BUS_EVT_ID ,BUS_EVT_VID );
Result set in the DB server:
APPLICATION_EVENT_ID BUS_EVT_ID BUS_EVT_VID BUS_EVT_RESTATE_IN
1 Ckn3gMxLEeOgIQBQVgErYA == g+GDDtlaY3n7BdUrYshD 85,849,873,219,141,958 12,544
2 CL1kEcxLEeOgIQBQVgErYA == qoKoiuGDbClpcGt/z6RK 85,849,873,219,155,783 12,544
3 oYIVcMxKEeOgIQBQVgErYA == mfmQiwl7yAteevzJfilM 85,849,873,219,142,006 12,544
4 5N7ME5bM4xGhM7exj3ykUw == JAf0GpbM4xGhM7exj3yk 85,849,873,219,155,797 12,288
5 JLBH4JfM4xGDH9s5+Ds/8w == Du6T7pfM4xGDH9s5+Ds/ 85,849,873,219,155,768 12,288
6 fGvpoMxKEeOgIQBQVgErYA == mQUQIK2mY6WIPcszfp5B 85,849,873,219,146,068 12,544
If we look at the data, we can see two issues:
The first two columns' data is 24 characters long (as per the input file), but it has been shifted by two characters into the next column.
Columns BUS_EVT_VID and BUS_EVT_RESTATE_IN contain wrong data, 85,849,873,219,141,958 and 12,544, instead of 1 and 1 respectively (this may be because the first two columns' data got shifted).
I tried the following options to resolve the issue, but none of them worked:
Modified the table definition, i.e. changed the data types to CHAR(28), CHAR(24), CHAR(26).
Modified the table definition column data types to VARCHAR(24), VARCHAR(26).
Modified the BTEQ command, i.e. altered the data types in the line below:
USING(APPLICATION_EVENT_ID CHAR(24),BUS_EVT_ID CHAR(24),BUS_EVT_VID BIGINT,BUS_EVT_RESTATE_IN SMALLINT)
Thanks in advance.
When you define VARTEXT, all input columns must be defined as VARCHAR, but you used CHAR and integer types.
This should work, VARCHAR length based on the definition of your target table:
USING(
APPLICATION_EVENT_ID VARCHAR(26),
BUS_EVT_ID VARCHAR(26),
BUS_EVT_VID VARCHAR(19),
BUS_EVT_RESTATE_IN VARCHAR(6)
)