Bulk import into Azure - tsql

For a bulk insert, I have a data file and a format file (XML):
File.dat
File.xml
This works on-premises with a BULK INSERT statement; however, in Azure it seems to have a problem with the format file. Below are the steps I have taken.
Set storage access:
Created a Shared Access Signature
Set the container access policy to 'Blob (anonymous read access for blobs only)'
Create a database scoped credential for the storage account:
CREATE DATABASE SCOPED CREDENTIAL StorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'This is my secret'; -- the Shared Access Signature key
Create an external data source:
CREATE EXTERNAL DATA SOURCE Storage
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://<storagename>.blob.core.windows.net/<containername>',
CREDENTIAL = StorageCredential
);
Query the file (BULK INSERT or OPENROWSET):
BULK INSERT <Schema>.<Table>
FROM 'File.dat'
WITH (
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml'
)
or
SELECT * FROM OPENROWSET(
BULK 'File.dat',
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml'
) AS DataFile;
Neither of them works; both fail with the error:
'Cannot bulk load because the file is incomplete or could not be read'
However, I can successfully run the following query:
SELECT * FROM OPENROWSET(
BULK 'File.xml',
DATA_SOURCE = 'Storage',
SINGLE_NCLOB) AS DataFile;

I have found the answer and will post it myself, in case other people also run into this problem.
The data source of the format file has to be specified separately. I tried the way specified in Microsoft's documentation for BULK INSERT; however, there is an error in the parameter name there. It states that the correct parameter is 'FORMATFILE_DATASOURCE', but it should be 'FORMATFILE_DATA_SOURCE'. (This is noted in a comment at the bottom of that page.)
BULK INSERT <Schema>.<Table>
FROM 'File.dat'
WITH (
DATA_SOURCE = 'Storage',
FORMATFILE = 'File.xml',
FORMATFILE_DATA_SOURCE = 'Storage'
)
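The OPENROWSET form from the question needs the same extra argument. A minimal sketch, assuming the same external data source and file names as above (FORMATFILE_DATA_SOURCE is also accepted by OPENROWSET, as in the samples further down on this page):
SELECT *
FROM OPENROWSET(
BULK 'File.dat',
DATA_SOURCE = 'Storage', -- external data source holding the data file
FORMATFILE = 'File.xml',
FORMATFILE_DATA_SOURCE = 'Storage' -- external data source holding the format file
) AS DataFile;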

Related

Azure Synapse Upsert Record into Dedicated Sql Pool

We have a requirement to fetch JSON data from the Data Lake storage and insert/update data in Synapse tables based on the lastmodified field in the source JSON and the table column.
We need to either insert or update a record based on the following conditions:
if (sourceJson.id == table.id) // assume record already exists
{
if (sourceJson.lastmodified > table.lastmodified) {
// update existing record
}
else if (sourceJson.lastmodified < table.lastmodified) {
// ignore record
}
}
else {
// insert record
}
Is there any way to achieve this? If there is, please help me by sharing a sample flow.
Thanks
The Copy data activity and Azure data flows both have an option to upsert, but they would not help with your requirement because of the lastmodified condition.
Since you have a key column id and also a special condition based on which you want to either update or ignore a record, you can first create a stored procedure in your Azure Synapse dedicated pool.
The following is the data available in my table:
The following is the data available in my JSON:
[
{
"id":1,
"first_name":"Ana",
"lastmodified":"2022-09-10 07:00:00"
},
{
"id":2,
"first_name":"Cassy",
"lastmodified":"2022-09-07 07:00:00"
},
{
"id":5,
"first_name":"Topson",
"lastmodified":"2022-09-10 07:00:00"
}
]
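A minimal sketch of a demo1 table matching the columns the stored procedure below expects (assumed for illustration; the actual table data referenced above isn't reproduced here):
CREATE TABLE demo1 (
id int,
first_name varchar(30),
lastmodified datetime
);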
Use a Lookup activity to read the input JSON file. Create a dataset, uncheck 'First row only' and run it. The following is my debug output:
Now, create a stored procedure. I have created it directly in my Synapse pool (you can use a Script activity to create it instead).
CREATE PROCEDURE mymerge
@array varchar(max)
AS
BEGIN
--inserting records whose id is not present in the table
insert into demo1 SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime) where id not in (select id from demo1);
--using MERGE to update records based on the matching id and the lastmodified column condition
MERGE into demo1 as tgt
USING (SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime) where id in (select id from demo1)) as ip
ON (tgt.id = ip.id and ip.lastmodified > tgt.lastmodified)
WHEN MATCHED THEN
UPDATE SET tgt.first_name = ip.first_name, tgt.lastmodified = ip.lastmodified;
END
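For a quick test outside the pipeline, the procedure can be called directly with the JSON passed as a string. A minimal sketch using the sample JSON above:
EXEC mymerge @array = N'[
{"id":1,"first_name":"Ana","lastmodified":"2022-09-10 07:00:00"},
{"id":2,"first_name":"Cassy","lastmodified":"2022-09-07 07:00:00"},
{"id":5,"first_name":"Topson","lastmodified":"2022-09-10 07:00:00"}
]';
-- check the result
SELECT * FROM demo1;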
Create a Stored procedure activity. Select the stored procedure created above and pass the Lookup output array as a string parameter to the stored procedure to get the required result:
@string(activity('Lookup1').output.value)
Running this would give the required result.

Azure Data Factory, MalformedInputException on a copy data activity

My copy data activity gets data in .snappy.parquet format (from Azure Data Lake Storage Gen2) and brings it to Azure Synapse Analytics.
I keep receiving this:
Copy Command operation failed with error 'HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: MalformedInputException: Input length = 1'
I use a pre-copy script that has this structure:
IF OBJECT_ID('[SCHEMA].[TABLE]') IS NOT NULL BEGIN
DROP TABLE [SCHEMA].[TABLE] END
CREATE TABLE [SCHEMA].[TABLE] ( [FIELD] VARCHAR(4386) ,[FIELD] DECIMAL(18,8), ...)
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
What is the problem?

How to use a very long JSON as text parameter in Power Shell?

TL;DR: I'm overflowing a string parameter with an over 500k character JSON.
I'm using an Azure based solution to:
1. In Logic Apps, go through a list of SharePoint lists stored in over 200 SharePoint subsites.
2. Send an HTTP request to the SharePoint API and download each list as JSON.
3. Call a stored procedure on the SQL Database that transforms and loads the data into the database.
After having some issues with step 3, namely timeout issues with the Logic Apps connection, I've added a step:
2.5: Call an Automation Runbook that calls the stored procedure without timing out. This is based on this solution. Basically, it's a PowerShell script that creates an ADO.NET connection to the Azure SQL Database and then executes the stored procedure, with the SP parameters in turn requested as parameters in Logic Apps.
But with a few of the lists I'm getting an error indicating that I've busted the character limit on a PowerShell string variable:
{
"code": "BadRequest",
"message": "{\"Message\":\"The request is invalid.\",\"ModelState\":{\"job.properties.parameters\":[\"Job parameter values too long. Max allowed length:524288. Parameter names: Json\"]}}"
}
Here's the core of it: "Job parameter values too long. Max allowed length:524288. Parameter names: Json". This parameter is declared in PowerShell as follows:
[parameter(Mandatory=$True)]
[string] $Json,
Is there another data type I could declare for this that would not run into this limitation?
Following up on the suggestion by David Browne in the comments, I passed my large JSON responses from the SharePoint API into Blob Storage, and then passed the blob SAS URI as the parameter for the stored procedure. I also had some authorization issues executing this procedure, and the solution my boss pointed me to was to create a master key. This was executed once:
USE ***DataBase***
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '123'
GO
Then, in the stored procedure, the part concerned with opening the blob content and reading it into a table variable and finally into a text variable was wrapped in an OPEN MASTER KEY ... CLOSE MASTER KEY block:
SET @vURI = '***Json BLOB URI***'
SET @SplitChar = CHARINDEX('?', @vURI)
-- strip the SAS query string and the container prefix to get the blob file name
SET @vFileName = REPLACE(SUBSTRING(@vURI, 1, @SplitChar - 1), 'https://***StorageAcc***.blob.core.windows.net/***ContainerName***/', '')
OPEN MASTER KEY DECRYPTION BY PASSWORD = '123'
-- refresh the SAS secret on the database scoped credential
SET @vSQL = 'ALTER DATABASE SCOPED CREDENTIAL dbscopedcredential WITH IDENTITY = ''SHARED ACCESS SIGNATURE'',
SECRET = ''***BLOB AUTHENTICATION***'';'
EXEC sp_executesql @stmt = @vSQL
-- read the whole blob as a single CLOB and return it as a one-row result set
SET @vSQL = '
DECLARE @vTable TABLE (BulkColumn NVARCHAR(MAX));
INSERT INTO @vTable
SELECT * FROM OPENROWSET(
BULK ''' + @vFileName + ''',
DATA_SOURCE = ''externaldatasource_import'',
SINGLE_CLOB) AS DataFile;
SELECT * FROM @vTable
'
INSERT INTO @vTable
EXEC sp_executesql @stmt = @vSQL
SELECT @vJson = [BulkColumn] FROM @vTable
CLOSE MASTER KEY
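The snippet assumes the database scoped credential (dbscopedcredential) and the external data source (externaldatasource_import) already exist. A minimal sketch of that one-time setup, with placeholder names and secret:
-- one-time setup (placeholders; run once per database)
CREATE DATABASE SCOPED CREDENTIAL dbscopedcredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '***BLOB AUTHENTICATION***'; -- SAS token without the leading '?'
CREATE EXTERNAL DATA SOURCE externaldatasource_import
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://***StorageAcc***.blob.core.windows.net/***ContainerName***',
CREDENTIAL = dbscopedcredential
);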

Importing a BCP file in Azure database

I have an Azure Function that retrieves a zip file containing multiple BCP files, unzips them, and adds them as blobs.
I now want to import the BCP files into my SQL database but I'm not sure how to go about it. I know I can use the following script and run it with a SqlCommand:
BULK INSERT RegPlusExtract.dbo.extract_class
FROM 'D:\local\data\extract_class.bsp'
WITH ( FIELDTERMINATOR = '#**#',ROWTERMINATOR = '*##*')
But this obviously does not work as the SQL server doesn't have access to the local function's D: drive.
How should I go about loading the data? Is it possible to load the BCP file into memory and then pass it with the SqlCommand? Or can I pass the file directly to SQL Server?
I've found out that for backup/restore I can do FROM URL = ''. If I could use this for bulk insert then I could just reference the blob URL, but it doesn't look like I can?
You will need to use blob storage. Below are the steps; they are documented in Microsoft/sql-server-samples.
--create an external data source
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://sqlchoice.blob.core.windows.net/sqlchoice/samples/load-from-azure-blob-storage',
-- CREDENTIAL= MyAzureBlobStorageCredential --> CREDENTIAL is not required if a blob storage is public!
);
You can also upload files to a container and reference it as below. Here, week3 is a container:
CREATE EXTERNAL DATA SOURCE MyAzureInvoicesContainer
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://newinvoices.blob.core.windows.net/week3',
CREDENTIAL = UploadInvoices
);
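The UploadInvoices credential referenced above has to exist first. A minimal sketch, assuming a SAS token for the newinvoices storage account (placeholder secret):
CREATE DATABASE SCOPED CREDENTIAL UploadInvoices
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<SAS token without the leading ?>'; -- placeholder; use your own SAS token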
Now you can use OPENROWSET and BULK INSERT as shown below.
-- 2.1. INSERT CSV file into Product table
BULK INSERT Product
FROM 'product.csv'
WITH ( DATA_SOURCE = 'MyAzureBlobStorage',
FORMAT='CSV', CODEPAGE = 65001, --UTF-8 encoding
FIRSTROW=2,
TABLOCK);
-- 2.2. INSERT file exported using bcp.exe into Product table
BULK INSERT Product
FROM 'product.bcp'
WITH ( DATA_SOURCE = 'MyAzureBlobStorage',
FORMATFILE='product.fmt',
FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage',
TABLOCK);
-- 2.3. Read rows from the product.bcp file using a format file and insert them into the Product table
INSERT INTO Product WITH (TABLOCK) (Name, Color, Price, Size, Quantity, Data, Tags)
SELECT Name, Color, Price, Size, Quantity, Data, Tags
FROM OPENROWSET(BULK 'product.bcp',
DATA_SOURCE = 'MyAzureBlobStorage',
FORMATFILE='product.fmt',
FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') as products;
-- 2.4. Query remote file
SELECT Color, count(*)
FROM OPENROWSET(BULK 'product.bcp',
DATA_SOURCE = 'MyAzureBlobStorage',
FORMATFILE='data/product.fmt',
FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') as data
GROUP BY Color;

SSIS Import Files with changing layouts

I'm using SSIS 2008 and trying to build a package for importing a specified file into a table created for its layout. It will take in the destination table and source file as package variables.
The main problem I'm running into is that the file layouts are subject to change; they're not consistent. The table I'd be importing into will match the file, though. I had initial success, but soon after changing the source file/destination it throws the VS_NEEDSNEWMETADATA error.
Are there any workarounds that could potentially be used here for files not fitting the layout the package was designed with?
Edit: These are .txt files, tab-delimited.
Edit 2: I tried fiddling with OPENROWSET as well, but hit a security error on our server.
I am assuming here that said file is a CSV file.
I was faced with the exact same problem a couple of weeks ago. You need to use dynamic SQL to achieve this.
Create a stored procedure on your database with the code below (change the two "C:\Folder\" locations to the location of your file):
CREATE PROCEDURE [dbo].[CreateAndImportCSVs] (@FILENAME NVARCHAR(200))
AS
BEGIN
SET NOCOUNT ON;
DECLARE @PATH NVARCHAR(4000) = N'C:\Folder\' + @FILENAME + '' -- not used below; DefaultDir is hardcoded in the connection string
-- derive the table name from the file name (everything before the extension)
DECLARE @TABLE NVARCHAR(50) = SUBSTRING(@FILENAME, 0, CHARINDEX('.', @FILENAME))
-- drop and recreate the table from the file via the Microsoft Access Text Driver
DECLARE @SQL NVARCHAR(4000) = N'IF OBJECT_ID(''dbo.' + @TABLE + ''' , ''U'') IS NOT NULL DROP TABLE dbo.[' + @TABLE + ']
SELECT * INTO [' + @TABLE + ']
FROM OPENROWSET(''MSDASQL''
,''Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Folder;''
,''SELECT * FROM ' + @FILENAME + ''')'
EXEC(@SQL)
END
You might need to download the Microsoft Access Database Engine from:
https://www.microsoft.com/en-gb/download/details.aspx?id=13255
and install it on your machine/server for the Microsoft Access Text Driver to work.
Then create an Execute SQL Task in SSIS with the relevant connection details for your SQL Server database, and pass the file name to the stored procedure you created:
EXEC dbo.CreateAndImportCSVs 'filename.csv'
It will then create the table based on the structure and data contained within the CSV; it also names the table after the CSV file name.
*This stored procedure can also be used to run through a list of files.
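For the list-of-files case, one option is to drive the same procedure from T-SQL. A minimal sketch, with illustrative file names (not from the original post):
-- hypothetical driver: loop over a list of file names and import each one
DECLARE @files TABLE (FileName NVARCHAR(200));
INSERT INTO @files VALUES ('customers.csv'), ('orders.csv'), ('products.csv');
DECLARE @f NVARCHAR(200);
DECLARE c CURSOR LOCAL FAST_FORWARD FOR SELECT FileName FROM @files;
OPEN c;
FETCH NEXT FROM c INTO @f;
WHILE @@FETCH_STATUS = 0
BEGIN
EXEC dbo.CreateAndImportCSVs @f; -- creates and loads a table named after each file
FETCH NEXT FROM c INTO @f;
END
CLOSE c;
DEALLOCATE c;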
Hope this helps!