We have JSON files that contain timestamps in formats such as:
2016-11-03T03:05:21.673Z
2016-11-03T03:05:21.63Z
So the appropriate format string to parse the data is yyyy-MM-ddTHH:mm:ss.FFF\Z
I tried all of the following variants to tell ADF how to parse it:
"structure": [
{
"name": "data_event_time",
"type": "DateTime",
"format": "yyyy-MM-ddTHH:mm:ss.FFF\\Z"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTimeOffset",
"format": "yyyy-MM-ddTHH:mm:ss.FFFZ"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTimeOffset"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTime"
},
...
]
In all of the cases above, ADF fails with the error:
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'data_event_time' contains an invalid value '2016-11-13T00:44:50.573Z'. Cannot convert '2016-11-13T00:44:50.573Z' to type 'DateTimeOffset' with format 'yyyy-MM-dd HH:mm:ss.fffffff zzz'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
What am I doing wrong, and how can I fix it?
The previous issue has been fixed. Thanks, wBob.
But now I have a new issue, at the sink level.
I'm trying to load data from Azure Blob Storage to Azure DWH via ADF + PolyBase:
"sink": {
"type": "SqlDWSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM [stage].[events] WHERE data_event_time >= \\'{0:yyyy-MM-dd HH:mm}\\' AND data_event_time < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
"writeBatchSize": 6000000,
"writeBatchTimeout": "00:15:00",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "AppInsight-Stage-BlobStorage-LinkedService"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "..."
}
But the process fails with the error:
Database operation failed. Error message from database execution : ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=107091;Query aborted-- the maximum reject threshold (10 %) was reached while reading from an external source: 6602 rows rejected out of total 6602 rows processed. Rows were rejected while reading from external source(s). 52168 rows rejected from external table [ADFCopyGeneratedExternalTable_0530887f-f870-4624-af46-249a39472bf3] in plan step 2 of query execution: Location: '/13/2cd1d10f-4f62-4983-a38d-685fc25c40a2_20161102_135850.blob' Column ordinal: 0, Expected data type: DATETIMEOFFSET(7) NOT NULL, Offending value: 2016-11-02T13:56:19.317Z (Column Conversion Error), Error: Conversion failed when converting the NVARCHAR value '2016-11-02T13:56:19.317Z' to data type DATETIMEOFFSET. Location: '/13/2cd1d10f-4f62-4983-a38d-685fc25c40a2_20161102_135850.blob' Column ordinal: 0, Expected ...
I read the Azure SQL Data Warehouse loading patterns and strategies article, which says:
If the DATE_FORMAT argument isn’t designated, the following default formats are used:
DateTime: ‘yyyy-MM-dd HH:mm:ss’
SmallDateTime: ‘yyyy-MM-dd HH:mm’
Date: ‘yyyy-MM-dd’
DateTime2: ‘yyyy-MM-dd HH:mm:ss’
DateTimeOffset: ‘yyyy-MM-dd HH:mm:ss’
Time: ‘HH:mm:ss’
It looks like I have no way at the ADF level to specify the datetime format for PolyBase.
Does anyone know a workaround?
We looked at a similar issue recently here:
What's reformatting my input data before I get to it?
JSON does not have a DateTime type as such, so leave the type and format elements out. Then your challenge is with the sink. Inserting these values into an Azure SQL Database, for example, should work.
"structure": [
{
"name": "data_event_time"
},
...
Looking at your error message, I would expect that to work when inserting into a DATETIME column in SQL Data Warehouse (or SQL Database, or SQL Server on a VM), as it is ordinary DATETIME data, not DATETIMEOFFSET.
If you have issues inserting into the target sink, you may have to work around it by not using the PolyBase checkbox and coding that side of the process yourself, e.g.:
Copy raw files to blob storage or Azure Data Lake (now Polybase supports ADLS)
Create external tables over the files, with the datetime data set as a varchar data type
CTAS the data into an internal table, also converting the string datetime format to a proper DATETIME using T-SQL
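Steps 2 and 3 can be sketched as below (all object names, the data source, and the file format are illustrative, not from the original pipeline; the inner CAST to DATETIMEOFFSET is what handles the trailing Z):

```sql
-- 1. External table over the staged files; the timestamp stays varchar:
CREATE EXTERNAL TABLE ext.events (
    data_event_time varchar(30) NOT NULL
)
WITH (LOCATION = '/events/', DATA_SOURCE = MyBlobStore, FILE_FORMAT = MyTextFormat);

-- 2. CTAS into an internal table, converting the ISO 8601 string
--    (e.g. '2016-11-03T03:05:21.673Z') to DATETIME2 via DATETIMEOFFSET:
CREATE TABLE stage.events
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT CAST(CAST(data_event_time AS datetimeoffset) AS datetime2) AS data_event_time
FROM ext.events;
```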
I'm building a recommender system using AWS Personalize. The user-personalization recipe has 3 dataset inputs: interactions, user_metadata, and item_metadata. I am having trouble importing the user metadata, which contains a boolean field.
I created the following schema:
user_schema = {
"type": "record",
"name": "Users",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "type",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "lang",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "is_active",
"type": "boolean"
}
],
"version": "1.0"
}
dataset csv file content looks like:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,True
01027061015#mail.ru ,facebook,eng,True
03dadahda#gmail.com ,facebook,geo,True
040168fadw#gmail.com ,facebook,geo,False
I uploaded the given csv file to an S3 bucket.
When I try to create a dataset import job, it gives me the following exception:
InvalidInputException: An error occurred (InvalidInputException) when calling the CreateDatasetImportJob operation: Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
I tested it, and it works without the boolean field is_active. There are no NaN values in the given column!
It'd be nice to have the ability to directly test whether your pandas dataframe or csv file conforms to a given schema, and possibly get a more detailed error message.
Does anybody know how to format the boolean field to fix this issue?
I found a solution after many trials. I checked the AWS Personalize documentation (https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html#dataset-requirements), which says: boolean (values true and false must be lower case in your data).
I then tried several things, and one of them worked, but it was still a hard way to find a solution and took hours.
Solution:
Convert the column in the pandas DataFrame into string (object) format.
Lowercase the True and False string values to get true and false.
Store the pandas DataFrame as a csv file.
This results in lowercase values of true and false:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,true
01027061015#mail.ru ,facebook,eng,true
03dadahda#gmail.com ,facebook,geo,true
040168fadw#gmail.com ,facebook,geo,false
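The steps above boil down to lowercasing the stringified booleans before the CSV is written. The post uses pandas; here is a dependency-free sketch of the same idea with Python's standard library (the row values are illustrative):

```python
import csv
import io

# Rows mirroring the question's data (values are illustrative).
rows = [
    {"USER_ID": "1234#gmail.com", "type": "", "lang": "geo", "is_active": True},
    {"USER_ID": "040168fadw#gmail.com", "type": "facebook", "lang": "geo", "is_active": False},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["USER_ID", "type", "lang", "is_active"])
writer.writeheader()
for row in rows:
    # str(True) gives 'True'; lowercase it so Personalize sees 'true'/'false'.
    row["is_active"] = str(row["is_active"]).lower()
    writer.writerow(row)

csv_text = buf.getvalue()
```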
That's all! There is no need to change the "boolean" type in the schema to "string"!
Hopefully they'll fix this soon; I have contacted AWS technical support about the same issue.
I am using Azure Data Factory and a data transformation flow. I have a CSV that contains a column with a JSON object string; below is an example including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
I want to store the data in a JSON document like this:
{
"id": "99c9347ab7c34733a4fe0623e1496ffd",
"name": "data1",
"values": [
{
"timestamp": "2021-03-18 05:53:00.0000000",
"value": "0",
"metadata": {
"unit": "%"
}
},
{
"timestamp": "2021-03-19 05:53:00.0000000",
"value": "4",
"metadata": {
"jobName": "RecipeB"
}
}
....
]
}
The challenge is that the metadata has dynamic content, meaning that it will always be a JSON object, but its content can vary. Therefore I cannot define a schema for it. Currently the column "metadata" in the sink schema is defined as object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined
We can get the output you expect; we need an expression to extract the object from the Metadata column.
Please refer to my steps; here's my source:
Derived column expression: create a JSON schema to convert the data:
#(id=Id,
name=Name,
values=#(timestamp=Timestamp,
value=Value,
metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The key point is that your metadata value is an object and may have a different schema and content each time ('unit', 'jobName', or another key). We can only build the schema manually; it doesn't support expressions. That's the limitation.
We can't achieve that within Data Factory.
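Outside of Data Factory this transformation is straightforward, because a general JSON parser needs no fixed schema for the metadata object. As a workaround, a minimal Python sketch over the sample rows from the question (which could run in, say, an Azure Function; the grouping by Id and Name is an assumption):

```python
import csv
import io
import json

# Two rows from the question's CSV; Metadata holds a JSON object string.
raw = '''"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
'''

grouped = {}
for row in csv.DictReader(io.StringIO(raw)):
    # json.loads accepts any keys, so the dynamic metadata needs no schema.
    grouped.setdefault((row["Id"], row["Name"]), []).append({
        "timestamp": row["Timestamp"],
        "value": row["Value"],
        "metadata": json.loads(row["Metadata"]),
    })

docs = [{"id": i, "name": n, "values": v} for (i, n), v in grouped.items()]
```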
HTH.
I have tried to read a JSON file using a Copy Activity and write the data to SQL Server.
My JSON file is available in blob storage.
I have set the file format to JSON format.
When I try to import the schema I get the error: Error occurred when
deserializing source JSON data. Please check if the data is in valid
JSON object format.. Activity ID:2f799221-f037-4f72-8e6c-385778929110
My JSON data:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}
Regards,
Manish
Based on your description and your sample source data, you could import the schema directly; however, the column is nested.
If you want to flatten the nested JSON before storing it in a SQL Server database as rows, you could execute an Azure Function Activity before the Copy Activity.
Or you could execute a stored procedure in the SQL Server dataset.
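For the Azure Function route, the flattening itself is only a few lines. A sketch over a trimmed version of the sample document (the dotted column names are an assumption; map them to your SQL columns as needed):

```python
doc = {
    "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
    "context": {
        "device": {"type": "PC"},
        "custom": {
            "dimensions": [
                {"TargetResourceType": "Microsoft.Compute/virtualMachines"},
                {"OccurrenceTime": "1/13/2017 11:24:37 AM"},
            ]
        },
    },
}

def flatten(obj, prefix=""):
    """Collapse nested dicts/lists into one flat row with dotted names."""
    row = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        elif isinstance(value, list):
            for item in value:
                row.update(flatten(item, name + "."))
        else:
            row[name] = value
    return row

row = flatten(doc)
```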
Good Morning,
Objective: I am trying to add new columns to an SSAS Tabular Model table, with the long-term aim of programmatically making large-batch changes when needed.
Resources I've found:
https://learn.microsoft.com/en-us/sql/analysis-services/tabular-models-scripting-language-commands/create-command-tmsl
This gives the template I've been following, but it doesn't seem to work.
What I have tried so far:
{
"create": {
"parentObject": {
"database": "TabularModel_1_dev"
, "table": "TableABC"
},
"columns": [
{
"name": "New Column"
, "dataType": "string"
, "sourceColumn": "Column from SQL Source"
}
]
}
}
This first one is the truest to the example, but it returns the following error:
"The JSON DDL request failed with the following error: Unrecognized JSON property: columns. Check path 'create.columns', line 7, position 15.."
Attempt Two:
{
"create": {
"parentObject": {
"database": "TabularModel_1_dev"
, "table": "TableABC"
},
"table": {
"name": "Item Details by Branch",
"columns": [
{
"name": "New Column"
, "dataType": "string"
, "sourceColumn": "New Column"
}
]
}
}
}
Adding the table as a child object returns an error too:
"...Cannot execute the Create command: the specified parent object cannot have a child object of type Table.."
Omitting the table within the parentObject is unsuccessful as well.
I know it's been three years since the post, but I was attempting the same thing and stumbled across this post in my quest. I ended up reaching out to Microsoft and was told that the Add Column example in their documentation was a "doc bug". In fact, you can't add just a column; you have to feed it an entire table definition via createOrReplace.
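Based on that, the working shape is a createOrReplace that carries the full table definition. A rough sketch (the existing column is a placeholder; a real script must also repeat the table's partitions and any measures, since createOrReplace replaces the whole table object):

```json
{
  "createOrReplace": {
    "object": {
      "database": "TabularModel_1_dev",
      "table": "TableABC"
    },
    "table": {
      "name": "TableABC",
      "columns": [
        {
          "name": "Existing Column",
          "dataType": "string",
          "sourceColumn": "Existing Column"
        },
        {
          "name": "New Column",
          "dataType": "string",
          "sourceColumn": "Column from SQL Source"
        }
      ]
    }
  }
}
```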
SSAS Error Creating Column with TMSL
I have created a simple table called test3:
create table if not exists test3(
Studies varchar(300) not null,
Series varchar(500) not null
);
I have some JSON data:
{
"Studies": [{
"studyinstanceuid": "2.16.840.1.114151",
"studydescription": "Some study",
"studydatetime": "2014-10-03 08:36:00"
}],
"Series": [{
"SeriesKey": "abc",
"SeriesInstanceUid": "xyz",
"studyinstanceuid": "2.16.840.1.114151",
"SeriesDateTime": "2014-10-03 09:05:09"
}, {
"SeriesKey": "efg",
"SeriesInstanceUid": "stw",
"studyinstanceuid": "2.16.840.1.114151",
"SeriesDateTime": "0001-01-01 00:00:00"
}],
"ExamKey": "exam-key",
}
and here is my jsonpaths file:
{
"jsonpaths": [
"$['Studies']",
"$['Series']"
]
}
Both the JSON data and the jsonpaths file are uploaded to S3.
I try to execute the following copy command in the Redshift console:
copy test3
from 's3://mybucket/redshift_demo/input.json'
credentials 'aws_access_key_id=my_key;aws_secret_access_key=my_access'
json 's3://mybucket/redsift_demo/json_path.json'
I get the following error. Can anyone please help? I've been stuck on this for some time now.
[Amazon](500310) Invalid operation: Number of jsonpaths and the number of columns should match. JSONPath size: 1, Number of columns in table or column list: 2
Details:
-----------------------------------------------
error: Number of jsonpaths and the number of columns should match. JSONPath size: 1, Number of columns in table or column list: 2
code: 8001
context:
query: 1125432
location: s3_utility.cpp:670
process: padbmaster [pid=83747]
-----------------------------------------------;
1 statement failed.
Execution time: 1.58s
Redshift's error is misleading. The issue is that your input file is wrongly formatted: you have an extra comma after the last JSON entry.
Copy succeeds if you change "ExamKey": "exam-key", to "ExamKey": "exam-key"
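A quick way to catch this class of problem before running COPY is to validate the file locally with any JSON parser; a small Python sketch:

```python
import json

# Trimmed version of the question's input: note the trailing comma
# after "ExamKey", which is not legal JSON.
payload = '{"Studies": [{"studyinstanceuid": "2.16.840.1.114151"}], "ExamKey": "exam-key",}'

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

broken = is_valid_json(payload)                                       # trailing comma
fixed = is_valid_json(payload.replace('"exam-key",}', '"exam-key"}'))  # comma removed
```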