How to change a dataset schema without deleting the dataset and having to remove filters on Personalize? - amazon-personalize

I noticed that one of the schema fields I need to filter on is a boolean, and since you can't filter on a boolean, I need to change the schema.
I was able to create a new schema using the new Python SDK, but I can't see how I can update the schema on the existing dataset.
You can delete the dataset, but that would mean having to delete all the filters, which would mean our service has to go down (everything in the API uses a filter).
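For reference, the replacement schema I created with the Python SDK looks roughly like the sketch below; the Items dataset type and the IS_ELIGIBLE field are only stand-ins for my real schema:

    import json
    import boto3

    personalize = boto3.client("personalize")

    # Replacement schema: the old boolean field is re-typed as a categorical string
    # ("IS_ELIGIBLE" and its values are placeholders for the real field).
    items_schema = {
        "type": "record",
        "name": "Items",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {"name": "ITEM_ID", "type": "string"},
            # was {"name": "IS_ELIGIBLE", "type": "boolean"} -- booleans can't be
            # used in filters, so the flag becomes a string holding "YES"/"NO"
            {"name": "IS_ELIGIBLE", "type": "string", "categorical": True},
        ],
        "version": "1.0",
    }

    new_schema = personalize.create_schema(
        name="items-schema-v2",
        schema=json.dumps(items_schema),
    )
    print(new_schema["schemaArn"])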

Unfortunately, you cannot change the schema for an existing dataset since (as you found) it impacts existing immutable resources like filters. A workaround is to create a new dataset group to hold your new dataset. This means you'll need to import your data (conforming to the new schema) into the new dataset, train models, create new filters, and create a new campaign. Once the campaign is ready in the new dataset group, you can switch your app to use the new campaign and tear down the old dataset group. This sort of blue/green dataset group approach is somewhat common with Personalize for reasons such as this.
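A rough boto3 sequence for that blue/green swap might look like the following. This is a sketch only, not a drop-in script: every name, ARN, S3 path, and the filter expression are placeholders, and each create_* call is asynchronous, so in practice you wait for each resource to become ACTIVE before the next step.

    import boto3

    personalize = boto3.client("personalize")

    new_schema_arn = "arn:aws:personalize:us-east-1:111122223333:schema/items-schema-v2"  # placeholder
    import_role_arn = "arn:aws:iam::111122223333:role/PersonalizeImportRole"              # placeholder

    # 1. New ("green") dataset group to hold the dataset with the revised schema
    dsg = personalize.create_dataset_group(name="my-dsg-green")
    dsg_arn = dsg["datasetGroupArn"]

    # 2. New dataset bound to the new schema
    dataset = personalize.create_dataset(
        name="items-green",
        datasetGroupArn=dsg_arn,
        datasetType="Items",
        schemaArn=new_schema_arn,
    )

    # 3. Re-import the data (conforming to the new schema) from S3
    personalize.create_dataset_import_job(
        jobName="items-green-import",
        datasetArn=dataset["datasetArn"],
        dataSource={"dataLocation": "s3://my-bucket/items-v2.csv"},
        roleArn=import_role_arn,
    )

    # 4. Re-create each filter against the new dataset group
    personalize.create_filter(
        name="exclude-ineligible",
        datasetGroupArn=dsg_arn,
        filterExpression='EXCLUDE ItemID WHERE Items.IS_ELIGIBLE IN ("NO")',
    )

    # 5. Then create a solution, solution version, and campaign in the new dataset
    #    group, point the application at the new campaign ARN, and delete the old
    #    dataset group once traffic has switched over.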

Related

QuickSight Multiple Datasets Join in CloudFormation Template

I'm trying to create a dataset from CloudFormation by joining 3 datasets (AxB, AxC) to create a base dataset (I can create these through the web GUI), using LogicalTableMap and JoinInstruction in the DataSet template. However, I cannot create the dataset; my deploys fail with errors like "LogicalTableMap must have a single root" or "Circular Dependency" when joining from the base dataset. Are there any suggestions?
Thanks in advance.
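For reference, the shape the error seems to be asking for is a join tree in which exactly one logical table is not referenced by any JoinInstruction. In boto3/JSON terms (the property names are the same in the CloudFormation template), a single-root layout of A joined to B, then that result joined to C, would look roughly like the sketch below; all IDs, aliases, and join conditions are placeholders:

    # Sketch of a single-root LogicalTableMap: A joined to B, and (AxB) joined to C.
    # "tblA"/"tblB"/"tblC" are keys from PhysicalTableMap; columns are placeholders.
    logical_table_map = {
        "a": {"Alias": "A", "Source": {"PhysicalTableId": "tblA"}},
        "b": {"Alias": "B", "Source": {"PhysicalTableId": "tblB"}},
        "c": {"Alias": "C", "Source": {"PhysicalTableId": "tblC"}},
        "a-b": {
            "Alias": "AxB",
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "a",
                    "RightOperand": "b",
                    "Type": "LEFT",
                    "OnClause": "a_id = b_a_id",  # placeholder join condition
                }
            },
        },
        # The only logical table that no JoinInstruction refers to -> the single root
        "root": {
            "Alias": "AxBxC",
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "a-b",
                    "RightOperand": "c",
                    "Type": "LEFT",
                    "OnClause": "a_id = c_a_id",  # placeholder join condition
                }
            },
        },
    }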

Can we create a CloudFormation template for a QuickSight composite dataset?

(composite dataset meaning an existing dataset that is directly joined with another existing dataset)
In QuickSight, I have created a dataset that is made out of two existing datasets, and I want to create a CloudFormation template for it. But in the CloudFormation AWS::QuickSight::DataSet syntax (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-quicksight-dataset.html) I am not able to find a way to pass those two dataset IDs as input. Can someone please help me with this?
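One shape that may be expressible at the API/template level is to source each logical table from an existing dataset's ARN and join them. The sketch below assumes the DataSet's LogicalTableSource accepts a DataSetArn; all ARNs, IDs, and columns are placeholders:

    # Sketch: two existing datasets referenced by ARN and joined into one composite dataset.
    logical_table_map = {
        "orders": {
            "Alias": "Orders",
            "Source": {"DataSetArn": "arn:aws:quicksight:us-east-1:111122223333:dataset/orders-dataset-id"},
        },
        "customers": {
            "Alias": "Customers",
            "Source": {"DataSetArn": "arn:aws:quicksight:us-east-1:111122223333:dataset/customers-dataset-id"},
        },
        "root": {  # the single root logical table
            "Alias": "OrdersWithCustomers",
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "orders",
                    "RightOperand": "customers",
                    "Type": "INNER",
                    "OnClause": "customer_id = id",  # placeholder join condition
                }
            },
        },
    }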

How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist

I am trying to build a generic Mapping Data Flow for some basic cleansing on tables in my Data Lake. I need it to work both on an ongoing basis, after data already exists in my cleansed tables, and when new tables are added (it would detect them automatically and create and populate the destination). Both the Source and Destination tables will be Delta tables.
The approach I have taken is to have Sources configured for both my actual source and the target, and to use either JOIN transformations or EXISTS transformations to identify the new, updated, and removed rows.
This works fine for INSERTs and UPDATEs; however, my issue is dealing with DELETEs when there is no data currently in the destination. Obviously there will be nothing to DELETE - that is as expected. However, because I reference the key column that will exist once data is loaded to the table, I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
The overall process looks as follows:
Has anyone developed a pattern that works for a generic flow (this one is parameter-driven and ensures schema drift is accommodated), or a way for the Data Flow to think that there IS a column in the destination that it can refer to, so it can get past this issue?
In Source options, check Allow no files found.
You can also provide the date dynamically in the Filter by last modified option.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
  Customer A
    Employees.csv
    PayPeriods.csv
    etc
  Customer B
    Employees.csv
    PayPeriods.csv
    etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right dataflow: e.g. if it's "employees.csv" it runs one dataflow; if it's "payperiods.csv" it loads a different dataflow. They'd all use the same generic target sink datasource, with the table name parameterized and those parameters set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may have data flow execution failures if those columns are not present in your customer source files.
The way to protect against that in ADF Data Flow is to use column patterns. These allow you to define mappings that are generic and more flexible.

Create price rules in WCS using OOB commands

Currently I have a pricerule that has only one action element to fetch a price from a price list.
In order to achieve this pricerule I'm adding entries into required tables like
PRICERULE,
PRELEMENT,
PRELEMENTATTR
and other tables.
Now I need to add more conditions and branches to this pricerule in order to fit the requirements (something like this).
But I found that forming this pricerule by inserting entries directly into tables (as I did for the simple pricerule) is quite complex, because after forming the pricerule it has to be updated on a weekly basis. Updates would be things like changing the markup/markdown percentage or changing the start and end dates of this markup, etc.
So my question is:
Instead of directly updating the tables, is there any IBM WCS OOB functionality to achieve this?