InvalidInputException: AWS Personalize error importing boolean fields in user or item metadata

I'm building a recommender system using AWS Personalize. The user-personalization recipe has three dataset inputs: interactions, user metadata, and item metadata. I am having trouble importing user metadata that contains a boolean field.
I created the following schema:
user_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "type",
            "type": [
                "null",
                "string"
            ],
            "categorical": True
        },
        {
            "name": "lang",
            "type": [
                "null",
                "string"
            ],
            "categorical": True
        },
        {
            "name": "is_active",
            "type": "boolean"
        }
    ],
    "version": "1.0"
}
The dataset CSV file content looks like this:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,True
01027061015#mail.ru ,facebook,eng,True
03dadahda#gmail.com ,facebook,geo,True
040168fadw#gmail.com ,facebook,geo,False
I uploaded the given CSV file to an S3 bucket.
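For reference, a minimal boto3 sketch of registering a schema like this and creating the dataset import job (the names, ARNs, and bucket below are placeholders):

import json
import boto3

personalize = boto3.client("personalize")

# Register the Avro schema defined above (the schema name is a placeholder)
schema_response = personalize.create_schema(
    name="users-schema",
    schema=json.dumps(user_schema),
)

# Import the CSV uploaded to S3 (dataset ARN, bucket, and role ARN are placeholders)
import_job_response = personalize.create_dataset_import_job(
    jobName="users-import-job",
    datasetArn="arn:aws:personalize:us-east-1:123456789012:dataset/my-dataset-group/USERS",
    dataSource={"dataLocation": "s3://my-bucket/user_metadata.csv"},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
)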
When I try to create a dataset import job, it gives me the following exception:
InvalidInputException: An error occurred (InvalidInputException) when calling the CreateDatasetImportJob operation: Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
I tested it, and it works without the boolean field is_active. There are no NaN values in the given column!
It would be nice to have the ability to directly test whether your pandas DataFrame or CSV file conforms to a given schema, and possibly get a more detailed error message.
Does anybody know how to format the boolean field to fix this issue?

I found a solution after many trials. I checked the AWS Personalize documentation (https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html#dataset-requirements), which says: boolean (values true and false must be lower case in your data).
Then I tried several things, and one of them worked. It was still a hard way to find a solution, and it took hours.
Solution:
Convert the column in the pandas DataFrame to string (object) format.
Lowercase the True and False string values to get true and false.
Store the pandas DataFrame as a CSV file (a short sketch follows the sample output below).
This results in lowercase values of true and false:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,true
01027061015#mail.ru ,facebook,eng,true
03dadahda#gmail.com ,facebook,geo,true
040168fadw#gmail.com ,facebook,geo,false
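A minimal pandas sketch of those three steps (the file names are placeholders):

import pandas as pd

# Load the user metadata CSV (file name is a placeholder)
df = pd.read_csv("user_metadata.csv")

# Convert the boolean column to string and lowercase it: True -> "true", False -> "false"
df["is_active"] = df["is_active"].astype(str).str.lower()

# Write the CSV back out without the pandas index column
df.to_csv("user_metadata_lowercase.csv", index=False)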
That's all! There is no need to change the "boolean" type in the schema to "string"!
Hopefully they'll fix this soon, since I contacted AWS technical support about the same issue.

Related

Apache Kafka & JSON Schema

I am starting to get into Apache Kafka (Confluent) and have some questions regarding the use of schemas.
First, is my general understanding correct that a schema is used for validating the data? My understanding of schemas is that when the data is "produced", it checks whether the keys and values fit the predefined structure and splits them accordingly.
My current technical setup is as follows:
Python:
from confluent_kafka import Producer
from config import conf
import json
# create producer
producer = Producer(conf)
producer.produce("datagen-topic", json.dumps({"product":"table","brand":"abc"}))
producer.flush()
In Confluent, I set up a JSON key schema for my topic:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "properties": {
        "brand": {
            "type": "string"
        },
        "product": {
            "type": "string"
        }
    },
    "required": [
        "product",
        "brand"
    ],
    "type": "object"
}
Now, when I produce the data, the message in Confluent contains only content in "Value". Key and Header are null:
{
    "product": "table",
    "brand": "abc"
}
Basically, it doesn't make a difference whether I have this schema set up or not, so I guess it's just not working as I set it up. Can you help me see where my way of thinking is wrong or where my code is missing something?
The Confluent Python library Producer class doesn't interact with the Registry in any way, so your message wouldn't be validated.
You'll want to use SerializingProducer like in the example - https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/json_producer.py
If you want non-null keys and headers, you'll need to pass those to the produce call as well (the key and headers arguments).
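A rough sketch of that approach, based on the linked example (the broker address, Schema Registry URL, key, and headers below are placeholders; the value is serialized against the JSON schema, while the key is a plain string):

from confluent_kafka import SerializingProducer
from confluent_kafka.serialization import StringSerializer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer

schema_str = """
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "brand": {"type": "string"},
    "product": {"type": "string"}
  },
  "required": ["product", "brand"]
}
"""

# Schema Registry URL is a placeholder; substitute your own
schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})
json_serializer = JSONSerializer(schema_str, schema_registry_client)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",  # placeholder; use your broker
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": json_serializer,
})

# key and headers are passed explicitly so they are not null in the message
producer.produce(
    topic="datagen-topic",
    key="table-abc",
    value={"product": "table", "brand": "abc"},
    headers=[("source", b"example")],
)
producer.flush()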

How to map a json string into object type in sink transformation

I am using Azure Data Factory and a data flow transformation. I have a CSV that contains a column with a JSON object string; below is an example including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
I want to store the data in a JSON like this:
{
    "id": "99c9347ab7c34733a4fe0623e1496ffd",
    "name": "data1",
    "values": [
        {
            "timestamp": "2021-03-18 05:53:00.0000000",
            "value": "0",
            "metadata": {
                "unit": "%"
            }
        },
        {
            "timestamp": "2021-03-19 05:53:00.0000000",
            "value": "4",
            "metadata": {
                "jobName": "RecipeB"
            }
        }
        ....
    ]
}
The challenge is that metadata has dynamic content, meaning it will always be a JSON object but its content can vary. Therefore I cannot define a schema for it. Currently the column "metadata" in the sink schema is defined as object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined
We can get the output you expected; we need the expression to get the object Metadata.value.
Please refer to my steps. Here's my source:
Derived column expression, creating a JSON structure to convert the data:
#(id=Id,
name=Name,
values=#(timestamp=Timestamp,
value=Value,
metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The key point is that your metadata value is an object and may have a different schema and content; it may be 'value' or another key. We can only build the schema manually; it doesn't support expressions. That's the limitation.
We can't achieve that within Data Factory.
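For comparison, if the reshaping is done outside Data Factory, a minimal Python sketch along these lines produces the target shape (the file name is a placeholder):

import csv
import json
from collections import defaultdict

# Group rows by Id/Name and parse the dynamic Metadata JSON string per row
grouped = defaultdict(lambda: {"values": []})

with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["Id"], row["Name"])
        grouped[key]["id"] = row["Id"]
        grouped[key]["name"] = row["Name"]
        grouped[key]["values"].append({
            "timestamp": row["Timestamp"],
            "value": row["Value"],
            "metadata": json.loads(row["Metadata"]),  # content varies per row
        })

print(json.dumps(list(grouped.values()), indent=2))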
HTH.

SSAS Tabular Add Column via TMSL

Good Morning,
Objective: I am trying to add new columns to an SSAS Tabular Model table, with the long-term aim of programmatically making large-batch changes when needed.
Resources I've found:
https://learn.microsoft.com/en-us/sql/analysis-services/tabular-models-scripting-language-commands/create-command-tmsl
This one gives the template I've been following, but it does not seem to work.
What I have tried so far:
{
    "create": {
        "parentObject": {
            "database": "TabularModel_1_dev",
            "table": "TableABC"
        },
        "columns": [
            {
                "name": "New Column",
                "dataType": "string",
                "sourceColumn": "Column from SQL Source"
            }
        ]
    }
}
This first one is the most true to the example but returns the following error:
"The JSON DDL request failed with the following error: Unrecognized JSON property: columns. Check path 'create.columns', line 7, position 15.."
Attempt Two:
{
    "create": {
        "parentObject": {
            "database": "TabularModel_1_dev",
            "table": "TableABC"
        },
        "table": {
            "name": "Item Details by Branch",
            "columns": [
                {
                    "name": "New Column",
                    "dataType": "string",
                    "sourceColumn": "New Column"
                }
            ]
        }
    }
}
Adding a table within the child list returns an error too:
"...Cannot execute the Create command: the specified parent object cannot have a child object of type Table.."
Omitting the table within the parentObject is unsuccessful as well.
I know it's been three years since the post, but I too was attempting the same thing and stumbled across this post in my quest. I ended up reaching out to Microsoft and was told that the Add Column example in their documentation was a "doc bug". In fact, you can't add just a column; you have to feed it an entire table definition via createOrReplace.
SSAS Error Creating Column with TMSL
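A rough sketch of what that full-table createOrReplace might look like, built here in Python (the column names and partition details are placeholders; the real script must include every existing column and partition of the table):

import json

# Build a createOrReplace TMSL script for the whole table; TMSL cannot add a
# single column, so the full table definition has to be supplied.
tmsl_script = {
    "createOrReplace": {
        "object": {
            "database": "TabularModel_1_dev",
            "table": "TableABC"
        },
        "table": {
            "name": "TableABC",
            "columns": [
                {"name": "Existing Column", "dataType": "string", "sourceColumn": "Existing Column"},
                {"name": "New Column", "dataType": "string", "sourceColumn": "Column from SQL Source"}
            ],
            "partitions": [
                # the table's existing partition definitions go here
            ]
        }
    }
}

# Paste the generated script into an XMLA query window in SSMS (or send it with
# your preferred client) to replace the table definition.
print(json.dumps(tmsl_script, indent=2))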

ADF cannot parse DateTimeOffset

We have JSONs that contain timestamps in the format:
2016-11-03T03:05:21.673Z
2016-11-03T03:05:21.63Z
So the appropriate format to parse the data is yyyy-MM-ddTHH:mm:ss.FFF\Z
I tried all of these variants to explain to ADF how to parse it:
"structure": [
{
"name": "data_event_time",
"type": "DateTime",
"format": "yyyy-MM-ddTHH:mm:ss.FFF\\Z"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTimeOffset",
"format": "yyyy-MM-ddTHH:mm:ss.FFFZ"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTimeOffset"
},
...
]
"structure": [
{
"name": "data_event_time",
"type": "DateTime"
},
...
]
In all of these cases above ADF fails with the error:
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'data_event_time' contains an invalid value '2016-11-13T00:44:50.573Z'. Cannot convert '2016-11-13T00:44:50.573Z' to type 'DateTimeOffset' with format 'yyyy-MM-dd HH:mm:ss.fffffff zzz'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
What am I doing wrong? How can I fix it?
The previous issue has been fixed. Thanks, wBob.
But now I have a new issue at the sink level.
I'm trying to load data from Azure Blob Storage to Azure DWH via ADF + PolyBase:
"sink": {
"type": "SqlDWSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM [stage].[events] WHERE data_event_time >= \\'{0:yyyy-MM-dd HH:mm}\\' AND data_event_time < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
"writeBatchSize": 6000000,
"writeBatchTimeout": "00:15:00",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "AppInsight-Stage-BlobStorage-LinkedService"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "..."
}
But the process fails with error:
Database operation failed. Error message from database execution : ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=107091;Query aborted-- the maximum reject threshold (10 %) was reached while reading from an external source: 6602 rows rejected out of total 6602 rows processed. Rows were rejected while reading from external source(s). 52168 rows rejected from external table [ADFCopyGeneratedExternalTable_0530887f-f870-4624-af46-249a39472bf3] in plan step 2 of query execution: Location: '/13/2cd1d10f-4f62-4983-a38d-685fc25c40a2_20161102_135850.blob' Column ordinal: 0, Expected data type: DATETIMEOFFSET(7) NOT NULL, Offending value: 2016-11-02T13:56:19.317Z (Column Conversion Error), Error: Conversion failed when converting the NVARCHAR value '2016-11-02T13:56:19.317Z' to data type DATETIMEOFFSET. Location: '/13/2cd1d10f-4f62-4983-a38d-685fc25c40a2_20161102_135850.blob' Column ordinal: 0, Expected ...
I read the article Azure SQL Data Warehouse loading patterns and strategies, which says:
If the DATE_FORMAT argument isn’t designated, the following default formats are used:
DateTime: ‘yyyy-MM-dd HH:mm:ss’
SmallDateTime: ‘yyyy-MM-dd HH:mm’
Date: ‘yyyy-MM-dd’
DateTime2: ‘yyyy-MM-dd HH:mm:ss’
DateTimeOffset: ‘yyyy-MM-dd HH:mm:ss’
Time: ‘HH:mm:ss’
It looks like I have no way at the ADF level to specify the datetime format for PolyBase.
Does anyone know of a workaround?
We looked at a similar issue recently here:
What's reformatting my input data before I get to it?
JSON does not have a datetime format as such, so leave the type and format elements out. Then your challenge is with the sink. Inserting these values into an Azure SQL Database, for example, should work.
"structure": [
{
"name": "data_event_time"
},
...
Looking at your error message, I would expect that to work when inserting into a DATETIME column in SQL Data Warehouse (or SQL Database, or SQL Server on a VM), but it is ordinary DATETIME data, not DATETIMEOFFSET.
If you have issues inserting into the target sink, you may have to work around it by not using the PolyBase checkbox and coding that side of the process yourself, e.g. (a rough sketch follows these steps):
Copy raw files to blob storage or Azure Data Lake (PolyBase now supports ADLS)
Create external tables over the files where the datetime data is set as a varchar data type
CTAS the data into an internal table, converting the string datetime format to a proper DATETIME using T-SQL
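A hedged sketch of steps 2 and 3, driven from Python via pyodbc (the connection string, external data source, file format, table names, and column list are all placeholders; only the datetime column is shown):

import pyodbc

# Connect to the SQL Data Warehouse (connection string is a placeholder)
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=yourserver.database.windows.net;Database=yourdwh;Uid=youruser;Pwd=yourpassword",
    autocommit=True,
)
cursor = conn.cursor()

# Step 2: external table over the blob files, with the datetime column as varchar
cursor.execute("""
CREATE EXTERNAL TABLE stage.events_external (
    data_event_time varchar(30)
    -- ... remaining columns ...
)
WITH (
    LOCATION = '/events/',
    DATA_SOURCE = AzureBlobStore,   -- assumed pre-existing external data source
    FILE_FORMAT = TextFileFormat    -- assumed pre-existing file format
);
""")

# Step 3: CTAS into an internal table, converting the ISO 8601 string (with the
# trailing Z removed) to DATETIME2
cursor.execute("""
CREATE TABLE stage.events
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT
    CONVERT(datetime2, REPLACE(data_event_time, 'Z', ''), 126) AS data_event_time
    -- ... remaining columns ...
FROM stage.events_external;
""")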

datatype of complextype entity is null when returning an array of complex types

We have created a complex type field "carriers", which is an array of Carrier objects. See the metadata below:
"dataProperties": [
{
"name": "carriers",
"complexTypeName":"Carrier#Test",
"isScalar":false
}]
The Carrier entity is defined as below:
{
    "shortName": "Carrier",
    "namespace": "Test",
    "isComplexType": true,
    "dataProperties": [
        {
            "name": "Testing",
            "isScalar": true,
            "dataType": "String"
        }
    ]
}
We are trying to return an array of complex types in Breeze from a REST service call. We get an error in breeze.debug.js in the method proto._updateTargetFromRaw. The error occurs because the dataType is null.
Any idea how to fix this issue?
I'm guessing the problem is in your "complexTypeName". You wrote "Carrier#Test" when I think you meant to write "Carrier:#Test". The ":#" combination separates the "short name" from the namespace; you omitted the colon.
Hope that's the explanation.
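With the colon added, the dataProperties fragment from the question would read:

"dataProperties": [
    {
        "name": "carriers",
        "complexTypeName": "Carrier:#Test",
        "isScalar": false
    }
]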