How to query parquet data files from Azure Synapse when data may be structured and exceed 8000 bytes in length - tsql

I am having trouble reading, querying and creating external tables from Parquet files stored in Data Lake Storage Gen2 from Azure Synapse.
Specifically I see this error while trying to create an external table through the UI:
"Error details
New external table
Previewing the file data failed. Details: Failed to execute query. Error: Column 'members' of type 'NVARCHAR' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'. File/External table name: [DELETED] Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes.
. If the issue persists, contact support and provide the following id :"
My main hunch is that since a couple of columns were originally JSON types, and some of the rows are quite long (up to 9000 characters right now, which could increase at any point during my ETL), this is some kind of conflict with the default limits I have seen referenced in the documentation. The data internally looks like the following example; bear in mind it can sometimes be much longer:
["100.001", "100.002", "100.003", "100.004", "100.005", "100.006", "100.023"]
If I try to manually create the external table (which has worked every other time I have tried it) using code similar to this:
CREATE EXTERNAL TABLE example1(
[id] bigint,
[column1] nvarchar(4000),
[column2] nvarchar(4000),
[column3] datetime2(7)
)
WITH (
LOCATION = 'location/**',
DATA_SOURCE = [datasource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
the table is created with no errors or warnings, but running a very simple select such as
SELECT TOP (100) [id],
    [column1],
    [column2],
    [column3]
FROM [schema1].[example1]
The following error is shown:
"External table 'dbo' is not accessible because content of directory cannot be listed."
It can also show the equivalent:
"External table 'schema1' is not accessible because content of directory cannot be listed."
This error persists even when creating the external table with the "max" argument (i.e. nvarchar(max)), as it appears in this doc.
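In other words, a variant along these lines (table and column names are placeholders, as above):
CREATE EXTERNAL TABLE example1_max(
    [id] bigint,
    [column1] nvarchar(max), -- the long, originally-JSON column
    [column2] nvarchar(max),
    [column3] datetime2(7)
)
WITH (
    LOCATION = 'location/**',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [SynapseParquetFormat]
)
GO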
Summary: how can I create an external table from parquet files with fields exceeding 4000 or 8000 bytes, or even up to 2 GB, which would be the maximum size according to this?
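Ideally an ad-hoc query along these lines would also work against the same files (path, data source and column list are illustrative; per the error message, the nested/repeated parquet column would be read as plain varchar):
SELECT TOP (100) *
FROM OPENROWSET(
    BULK 'location/**',
    DATA_SOURCE = 'datasource',
    FORMAT = 'PARQUET'
) WITH (
    [id] bigint,
    [column1] varchar(max), -- nested/repeated parquet column read as VARCHAR
    [column3] datetime2(7)
) AS [r];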
Thank you all in advance

Related

SQL3116W The field value in row and column is missing, but the target column is not nullable. How to specify to use Column Default

I'm using the LOAD command to get data into a table where one of the columns has a default value of the current timestamp. I had NULL values in the data being read, as I thought that would cause the table to use the default value, but based on the above error that's not the case. How do I avoid the above error in this case?
Here is the full command; the input file is a text file:
LOAD FROM ${LOADDIR}/${InputFile}.exp OF DEL MODIFIED BY COLDEL| INSERT INTO TEMP_TABLE NONRECOVERABLE
Try:
LOAD FROM ${LOADDIR}/${InputFile}.exp OF DEL MODIFIED BY USEDEFAULTS COLDEL| INSERT INTO TEMP_TABLE NONRECOVERABLE
The usedefaults modifier has been available in Db2-LUW since V7.x, as long as the installation is fully serviced (i.e. has had the final fixpack correctly applied).
Note that some Db2-LUW versions place restrictions on the usage of the usedefaults modifier, as detailed in the documentation; for example, restrictions relating to its use with other modifiers, load modes, or target table types.
Always specify your Db2-server version and platform when asking for help, because the answer can depend on these facts.
You can specify which columns from the input file go into which columns of the table using METHOD P. If you omit the column you want the default for, the load will throw a warning but the default will be populated:
$ db2 "create table testtab1 (cola int, colb int, colc timestamp not null default)"
DB20000I The SQL command completed successfully.
$ cat tt1.del
1,1,1
2,2,2
3,3,99
$ db2 "load from tt1.del of del method P(1,2) insert into testtab1 (cola, colb)"
SQL27967W The COPY NO recoverability parameter of the Load has been converted
to NONRECOVERABLE within the HADR environment.
SQL3109N The utility is beginning to load data from file
"/home/db2inst1/tt1.del".
SQL3500W The utility is beginning the "LOAD" phase at time "07/12/2021
10:14:04.362385".
SQL3112W There are fewer input file columns specified than database columns.
SQL3519W Begin Load Consistency Point. Input record count = "0".
SQL3520W Load Consistency Point was successful.
SQL3110N The utility has completed processing. "3" rows were read from the
input file.
SQL3519W Begin Load Consistency Point. Input record count = "3".
SQL3520W Load Consistency Point was successful.
SQL3515W The utility has finished the "LOAD" phase at time "07/12/2021
10:14:04.496670".
Number of rows read = 3
Number of rows skipped = 0
Number of rows loaded = 3
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 3
$ db2 "select * from testtab1"
COLA        COLB        COLC
----------- ----------- --------------------------
          1           1 2021-12-07-10.14.04.244232
          2           2 2021-12-07-10.14.04.244232
          3           3 2021-12-07-10.14.04.244232
3 record(s) selected.

How to Resolve the Maximum Rejected Threshold was reached while reading data from SRCTable to TGTTable using ADF

I am getting the below-mentioned error while loading data from a Synapse source (SRC) table to a Synapse target (TGT) table.
SQLServerException: Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.\nColumn ordinal: 26, Expected data type: VARCHAR(255) collate SQL_Latin1_General_CP1_CI_AS NOT
Could you please suggest how to overcome the above-mentioned issue?
Regards,
Ashok
The error may be due to data truncation in column 26 of your source file.
As a first check, I would suggest increasing the destination table column from VARCHAR(255) to VARCHAR(MAX) and then trying to run the copy again.
ALTER TABLE TGT ALTER COLUMN [column 29] VARCHAR(MAX);
If it succeeds, you can easily run a MAX(LEN()) on that destination table column to determine how big it really needs to be.
SELECT MAX(LEN([column 29])) FROM TGT
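Once the real maximum length is known, the column can be resized to a tighter bound than MAX; a sketch, with a purely hypothetical length:
-- 1000 is a hypothetical size; use the value returned by the MAX(LEN()) check above
ALTER TABLE TGT ALTER COLUMN [column 29] VARCHAR(1000);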
Some related reading about polybase copy:
https://medium.com/microsoftazure/azure-synapse-data-load-using-polybase-or-copy-command-from-vnet-protected-azure-storage-da8aa6a9ac68

Snowflake : Unsupported subquery type cannot be evaluated

I am using Snowflake as a data warehouse. I have a CSV file in AWS S3, and I am writing a merge SQL statement to merge the data received in the CSV into a table in Snowflake. I have a column in the time dimension table with the NUMBER(38,0) data type in SF. This table holds all date/time values; one example row is:
time_id= 232 and time=12:00
In the CSV I am getting a column labelled time with values like 12:00.
In the merge SQL I am fetching this value and trying to look up the time_id for it.
update table_name set start_time_dim_id = (select time_id from time_dim t where t.time_name = csv_data.start_time_dim_id)
On this statement I am getting this error "SQL compilation error: Unsupported subquery type cannot be evaluated"
I am struggling to solve it; while googling for it I found one reference:
https://github.com/snowflakedb/snowflake-connector-python/issues/251
So I want to check whether anyone has encountered this issue; if yes, I would appreciate any pointers on it.
It seems like a conversion issue. I suggest you check the data in the CSV file; maybe there is a wrong or missing value. Please check your data and make sure it returns numeric values:
create table simpleone ( id number );
insert into simpleone values ( True );
The last statement fails with:
SQL compilation error: Expression type does not match column data type, expecting NUMBER(38,0) but got BOOLEAN for column ID
If you provide sample data, and SQL to produce this error, maybe we can provide a solution.
Unfortunately, correlated and nested subqueries in Snowflake are a bit limited at this stage.
I would try running something like this:
update table_name
set start_time_dim_id = t.time_id
from time_dim t
where t.time_name = csv_data.start_time_dim_id
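If csv_data is a separate staging table rather than the source alias of your MERGE, it also has to appear in the FROM clause, joined to the table being updated on some key; a sketch, assuming a hypothetical order_id column shared by both tables:
update table_name
set start_time_dim_id = t.time_id
from csv_data c, time_dim t
where c.order_id = table_name.order_id      -- hypothetical join key between staging and target
  and t.time_name = c.start_time_dim_id;    -- look up the surrogate key for the time value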

Trim/whitespace issue when loading data from Db2 source to PostgreSQL DB using Talend Open Studio

We are seeing an issue with table values which are populated from DB2 (source) into Postgres (target).
I have included here all the job details for each component.
Based on the above approach, once the data has been populated, we run the below queries in the Postgres DB:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_gssn_cd='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_cntry_cd='847' ;
No records are returned; however, when we run the same queries with trim as below, they work:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_gssn_cd)='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_cntry_cd)='847' ;
Below are the ways we have tried to overcome this, but with no luck:
Used a tMap between the source and target components.
Used trim in the source component under Advanced settings.
Changed the data type of cust_cntry_cd in the Postgres DB from char(5) to character varying, which allows values without any length restriction.
Please suggest what is missing, as we have this issue in almost all tables where we have character/varchar columns.
We are using TOS.
The data type is probably character(5) in DB2.
That means that the trailing spaces are part of the column and will be migrated. You have to compare with the value padded to five characters,
cust_cntry_cd = '847  '
or cast the right argument to character(5):
cust_cntry_cd = CAST ('847' AS character(5))
Maybe you could delete all spaces in the advanced settings of the tDB2Input component.
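Alternatively, if the target columns have already been switched to character varying as described in the question, a one-off cleanup of the rows that were already loaded is possible; a sketch, using the column names from the question:
-- strip the padding copied over from the DB2 char(5) columns
-- (only sensible once the Postgres columns are character varying)
UPDATE VMRCTTA1.VMRRCUST_SUMM
SET cust_gssn_cd  = trim(cust_gssn_cd),
    cust_cntry_cd = trim(cust_cntry_cd);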

I am getting an error while using 'mysql' on Google cloud shell : ERROR 2020 (HY000): Got packet bigger than 'max_allowed_packet' bytes

Basically, I am trying to import a CSV file into a table in a Google Cloud SQL database. The query I used to create the table is:
CREATE TABLE Mobike(
orderid INT(30),
bikeid INT(30),
userid INT(30),
start_time DATETIME,
start_location_x FLOAT(30,30),
start_location_y FLOAT(30,30),
end_time DATETIME,
end_location_x FLOAT(30,30),
end_location_y FLOAT(30,30),
track TEXT(65000));
Like the table, the CSV file contains the same columns with the same data types. Kindly help me work out how I can solve this problem, especially since I am doing it in Google Cloud Shell. How can I remove this error?
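For reference, the current limit on the instance can be inspected from the mysql client; raising it, if I understand correctly, has to be done through the Cloud SQL instance's database flags rather than SET GLOBAL:
-- show the current packet limit configured on the Cloud SQL instance
SHOW VARIABLES LIKE 'max_allowed_packet';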