Azure Data Factory, MalformedInputException on a copy data activity - azure-data-factory

My copy data activity reads data in .snappy.parquet format from Azure Data Lake Storage Gen2 and loads it into Azure Synapse Analytics.
I keep receiving this error:
Copy Command operation failed with error 'HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: MalformedInputException: Input length = 1'
I use a pre-copy script that has this structure:
IF OBJECT_ID('[SCHEMA].[TABLE]') IS NOT NULL
BEGIN
    DROP TABLE [SCHEMA].[TABLE]
END
CREATE TABLE [SCHEMA].[TABLE] (
    [FIELD] VARCHAR(4386),
    [FIELD] DECIMAL(18,8),
    ...
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
What is the problem?

Related

Azure Synapse Upsert Record into Dedicated Sql Pool

We have a requirement to fetch JSON data from Data Lake storage and insert/update data in Synapse tables, based on the lastmodified field in the source JSON and the corresponding table column.
We need to either insert or update a record based on the following conditions:
if (sourceJson.id == table.id)   // assume record already exists
{
    if (sourceJson.lastmodified > table.lastmodified)
    {
        // update existing record
    }
    else if (sourceJson.lastmodified < table.lastmodified)
    {
        // ignore record
    }
}
else
{
    // insert record
}
Is there any way to achieve this? If so, please help me by sharing a sample flow.
Thanks
The copy data activity and Azure data flows both have an Upsert option, but they would not meet your requirement.
Since you have a key column id and also a special condition that decides whether to insert or ignore a record, you can first create a stored procedure in your Azure Synapse dedicated pool.
The following is the data available in my table:
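Based on the OPENJSON mapping used in the stored procedure below, the demo1 table presumably has this shape (a minimal sketch; the actual table contents were shown in a screenshot):
CREATE TABLE demo1 (
    id int,
    first_name varchar(30),
    lastmodified datetime
);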
The following is the data available in my JSON:
[
    {
        "id": 1,
        "first_name": "Ana",
        "lastmodified": "2022-09-10 07:00:00"
    },
    {
        "id": 2,
        "first_name": "Cassy",
        "lastmodified": "2022-09-07 07:00:00"
    },
    {
        "id": 5,
        "first_name": "Topson",
        "lastmodified": "2022-09-10 07:00:00"
    }
]
Use a lookup activity to read the input JSON file. Create a dataset, uncheck First row only, and run it; the lookup then returns all records as an array under output.value. The following is my debug output:
Now, create a stored procedure. I created it directly in my Synapse pool (you can use a script activity to create it).
CREATE PROCEDURE mymerge
    @array varchar(max)
AS
BEGIN
    -- insert records whose id is not yet present in the table
    INSERT INTO demo1
    SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
    WHERE id NOT IN (SELECT id FROM demo1);
    -- use MERGE to update records based on the matching id and the lastmodified condition
    MERGE INTO demo1 AS tgt
    USING (SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
           WHERE id IN (SELECT id FROM demo1)) AS ip
    ON (tgt.id = ip.id AND ip.lastmodified > tgt.lastmodified)
    WHEN MATCHED THEN
        UPDATE SET tgt.first_name = ip.first_name, tgt.lastmodified = ip.lastmodified;
END
Create a stored procedure activity. Select the stored procedure created above and pass the lookup output array as a string parameter to the stored procedure to get the required result:
@string(activity('Lookup1').output.value)
Running this would give the required result.
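To sanity-check the procedure outside the pipeline, it can also be executed directly with a JSON array string, for example one of the records from the file above:
EXEC mymerge @array = '[{"id":5,"first_name":"Topson","lastmodified":"2022-09-10 07:00:00"}]';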

ADF Lookup query create schema and select

I am trying to run a create schema/table query in the ADF lookup activity, with a dummy SELECT at the end.
CREATE SCHEMA [schemax] AUTHORIZATION [auth1];
SELECT 0 AS dummyValue
but I got the below error
A database operation failed with the following error: 'Parse error at line: 2, column: 1: Incorrect syntax near 'SELECT'.',Source=,''Type=System.Data.SqlClient.SqlException,Message=Parse error at line: 2, column: 1: Incorrect syntax near 'SELECT'.,Source=.Net SqlClient Data Provider,SqlErrorNumber=103010,Class=16,ErrorCode=-2146232060,State=1
Data factory pipeline
I was able to run a similar query without the SELECT at the end, but got another error saying that a lookup must return a value.
You can only write SELECT statements in the lookup activity's query settings.
To create a schema or table, use the copy data activity's pre-copy script in the sink settings. You can select a dummy table for the source and sink datasets and write your create script in the pre-copy script as below.
Source settings: (using a dummy table that pulls 0 records)
Sink settings:
Pre-copy script: CREATE SCHEMA test1 AUTHORIZATION [user]
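If the pipeline may run more than once, the pre-copy script can be made idempotent. A minimal sketch, assuming the same test1 schema and [user] principal (CREATE SCHEMA is wrapped in EXEC because it must be the only statement in its batch):
IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE name = 'test1')
    EXEC('CREATE SCHEMA test1 AUTHORIZATION [user]');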

How to query parquet data files from Azure Synapse when data may be structured and exceed 8000 bytes in length

I am having trouble reading, querying, and creating external tables from Parquet files stored in Data Lake Storage Gen2 from Azure Synapse.
Specifically I see this error while trying to create an external table through the UI:
"Error details
New external table
Previewing the file data failed. Details: Failed to execute query. Error: Column 'members' of type 'NVARCHAR' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'. File/External table name: [DELETED] Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes.
. If the issue persists, contact support and provide the following id :"
My main hunch is that, since a couple of columns were originally JSON types and some of the rows are quite long (up to 9000 characters right now, which could grow at any point during my ETL), this is some kind of conflict with the default limits I have seen referenced in the documentation. The data inside these columns looks like the following example; bear in mind it can sometimes be much longer:
["100.001", "100.002", "100.003", "100.004", "100.005", "100.006", "100.023"]
If I try to manually create the external table (which has worked every other time I have tried), using code similar to this:
CREATE EXTERNAL TABLE example1(
[id] bigint,
[column1] nvarchar(4000),
[column2] nvarchar(4000),
[column3] datetime2(7)
)
WITH (
LOCATION = 'location/**',
DATA_SOURCE = [datasource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
the table is created with no errors or warnings, but when I run a very simple select
SELECT TOP (100) [id],
    [column1],
    [column2],
    [column3]
FROM [schema1].[example1]
The following error is shown:
"External table 'dbo' is not accessible because content of directory cannot be listed."
It can also show the equivalent:
"External table 'schema1' is not accessible because content of directory cannot be listed."
This error persists even when creating the external table with the argument "max", as it appears in this doc.
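For reference, the "max" variant referred to above would presumably look something like the following; since the error text hints that the nested parquet column must be read as VARCHAR or CHAR, varchar(max) is used for the long columns (example1_max is a placeholder name):
CREATE EXTERNAL TABLE example1_max (
    [id] bigint,
    [column1] varchar(max),
    [column2] varchar(max),
    [column3] datetime2(7)
)
WITH (
    LOCATION = 'location/**',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [SynapseParquetFormat]
)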
Summary: how can I create an external table from parquet files with fields exceeding 4000 or 8000 bytes, or even up to 2 GB, which would be the maximum size according to this?
Thank you all in advance

Unable to ingest JSON data with MemSQL PIPELINE INTO PROCEDURE

I am facing an issue while ingesting JSON data via a PIPELINE into a table using a stored procedure.
I see NULL values getting inserted into the table.
Stored Procedure SQL:
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, GENERIC_JSON::USERNAME
FROM GENERIC_BATCH;
END //
DELIMITER ;
MemSQL Pipeline Command used:
CREATE OR REPLACE PIPELINE ops.tweet_pipeline_with_sp AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>:9092/user-topic'
INTO PROCEDURE ops.process_users FORMAT JSON;
JSON data pushed to the Kafka topic: {"USER_ID":"111","USERNAME":"Test_User"}
Table DDL Statement: CREATE TABLE ops.USER (USER_ID INTEGER, USERNAME VARCHAR(255));
It looks like you're getting help in the MemSQL Forums at https://www.memsql.com/forum/t/unable-to-ingest-json-data-with-pipeline-into-procedure/1702/3. In particular, it looks like a difference of :: (which yields JSON) and ::$ (which converts to SQL types).
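A minimal illustration of that difference, using the JSON_EXTRACT_* functions the shorthands map to:
-- :: is shorthand for JSON_EXTRACT_JSON and returns JSON, e.g. "Test_User" (with quotes)
-- ::$ is shorthand for JSON_EXTRACT_STRING and returns a SQL string, e.g. Test_User
SELECT JSON_EXTRACT_JSON('{"USERNAME":"Test_User"}', 'USERNAME')   AS as_json,
       JSON_EXTRACT_STRING('{"USERNAME":"Test_User"}', 'USERNAME') AS as_string;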
Got the solution from the MemSQL forum!
Below are the pipeline and stored procedure scripts that worked for me:
CREATE OR REPLACE PIPELINE OPS.TEST_PIPELINE_WITH_SP
AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>/TEST-TOPIC'
INTO PROCEDURE OPS.PROCESS_USERS(GENERIC_JSON <- %) FORMAT JSON ;
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, json_extract_string(GENERIC_JSON,'USERNAME')
FROM GENERIC_BATCH;
END //
DELIMITER ;
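Note that after recreating them, the pipeline still has to be started (or test-run first), for example:
TEST PIPELINE OPS.TEST_PIPELINE_WITH_SP;   -- dry run against the Kafka topic
START PIPELINE OPS.TEST_PIPELINE_WITH_SP;  -- begin continuous ingestion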

Bulk import into Azure

For a bulk insert, I have a data file and a format file (XML):
File.dat
File.xml
This works on-premises with a BULK INSERT statement; however, in Azure it seems to have a problem with the format file. Below are the steps I have taken:
Set Storage Access
Created a Shared Access Signature
Set the container access policy to 'Blob (anonymous read access for blobs only)'
Create a database scoped credential for the storage
CREATE DATABASE SCOPED CREDENTIAL StorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'This is my secret'; -- the Shared Access Signature key
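One step not shown above: a database scoped credential needs a database master key to protect the secret, so if the database does not have one yet, something like this is required first (the password is a placeholder):
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';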
Create an external Data Source
CREATE EXTERNAL DATA SOURCE Storage
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://<storagename>.blob.core.windows.net/<containername>',
CREDENTIAL = StorageCredential
);
File Query (Bulk insert or Openrowset)
BULK INSERT <Schema>.<Table>
FROM 'File.dat'
WITH (
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml'
)
or
SELECT * FROM OPENROWSET(
BULK 'File.dat',
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml'
) AS DataFile;
Neither of them works; both fail with the error:
'Cannot bulk load because the file is incomplete or could not be read'
However, I can successfully run the following query:
SELECT * FROM OPENROWSET(
BULK 'File.xml',
DATA_SOURCE = 'Storage',
SINGLE_NCLOB) AS DataFile;
I have found the answer and will post it myself, in case other people also run into this problem.
The data source of the format file has to be specified separately. I tried the approach given in Microsoft's Bulk Insert documentation; however, there is an error in the parameter name there. It states that the correct parameter is 'FORMATFILE_DATASOURCE', whereas it should be 'FORMATFILE_DATA_SOURCE'. (This is noted in the comments at the bottom of that page.)
BULK INSERT <Schema>.<Table>
FROM 'File.dat'
WITH (
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml',
FORMATFILE_DATA_SOURCE = 'Storage'
)
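The same extra parameter should presumably be added to the OPENROWSET form as well:
SELECT * FROM OPENROWSET(
    BULK 'File.dat',
    DATA_SOURCE = 'Storage',
    FORMATFILE = 'File.xml',
    FORMATFILE_DATA_SOURCE = 'Storage'
) AS DataFile;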