Can we create a CloudFormation template for a QuickSight composite dataset? - aws-cloudformation

(A composite dataset here means an existing dataset that is directly joined with another existing dataset.)
In QuickSight, I have created a dataset that is built from two existing datasets, and I want to create a CloudFormation template for it. But in the CloudFormation AWS::QuickSight::DataSet syntax (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-quicksight-dataset.html) I cannot find a place to supply those two dataset IDs as input. Can someone please help me with this?

Related

QuickSight Multiple Datasets Join In Cloudformation Template

I'm trying to create a dataset with CloudFormation by joining 3 datasets (A x B and A x C) into a base dataset, using LogicalTableMap and JoinInstruction in the DataSet template. I can create these joins through the web GUI. However, the deployments keep failing with errors such as “LogicalTableMap must have a single root” or “Circular Dependency” when joining from the base dataset. Are there any suggestions?
Thanks in advance
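One reading of the “single root” error is that every logical table must feed, directly or through other joins, into exactly one final join. The Python sketch below builds such a map as a plain dictionary and checks it before it is pasted into a template. The key names (Alias, Source, DataSetArn, JoinInstruction, LeftOperand, RightOperand, Type, OnClause) follow the AWS::QuickSight::DataSet resource; the logical table IDs, ARNs, column names, and on-clause text are placeholders, and chaining the joins as (A x B) x C is an assumption about what resolves the error, not a confirmed fix.

```python
import json


def build_logical_table_map(arn_a, arn_b, arn_c):
    """Return a LogicalTableMap that joins A with B, then that result with C."""
    return {
        # Leaf tables: each one points at an existing dataset by ARN.
        "a": {"Alias": "A", "Source": {"DataSetArn": arn_a}},
        "b": {"Alias": "B", "Source": {"DataSetArn": arn_b}},
        "c": {"Alias": "C", "Source": {"DataSetArn": arn_c}},
        # First join: operands are logical table IDs from this same map.
        "a-b": {
            "Alias": "AxB",
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "a",
                    "RightOperand": "b",
                    "Type": "INNER",
                    # Placeholder expression with hypothetical column names.
                    "OnClause": "{a_id} = {b_a_id}",
                }
            },
        },
        # Second join reuses the first join's result, so only one root remains.
        "a-b-c": {
            "Alias": "AxBxC",
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "a-b",
                    "RightOperand": "c",
                    "Type": "INNER",
                    "OnClause": "{a_id} = {c_a_id}",
                }
            },
        },
    }


def find_roots(table_map):
    """Return the logical table IDs that no join uses as an operand."""
    referenced = set()
    for entry in table_map.values():
        join = entry["Source"].get("JoinInstruction")
        if join:
            referenced.update((join["LeftOperand"], join["RightOperand"]))
    return [key for key in table_map if key not in referenced]


if __name__ == "__main__":
    # The ARNs below are made-up examples.
    table_map = build_logical_table_map(
        "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-a",
        "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-b",
        "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-c",
    )
    print(find_roots(table_map))
    print(json.dumps(table_map, indent=2))
```

If find_roots returns more than one ID, the map contains separate join trees (for example A x B and A x C side by side), which may be what the error message is complaining about.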

How to change a dataset schema without deleting the dataset and having to remove filters on Personalize?

I noticed that one of the schema fields I need to filter on is a boolean, and as you can't filter on a boolean, I need to change the schema.
I was able to create a new schema using the new Python SDK but can't see how I can update the schema?
You can delete the dataset, but then that would mean having to delete all the filters, which would mean our service has to go down? (Everything in the API uses a filter).
Unfortunately you cannot change a schema for an existing dataset since (as you found) it impacts existing immutable resources like filters. A workaround is to create a new dataset group to hold your new dataset. This means you'll need to import your data (conforming to the new schema) in the new dataset, train models, create new filters, and a new campaign. Once the campaign is ready in the new dataset group, you can switch your app to use the new campaign and tear down the old dataset group. This sort of blue/green dataset group approach is somewhat common with Personalize for reasons such as this.

Azure Data Flow generic curation framework

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10-100) and write to sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.)
Using Python SDK, create Data Factory and mapping data flows pipelines using the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues into a Log Analytics workspace and tears down the resources, at specific schedules.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time needed to validate/curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts that are scheduled to run it on generic data pipelines at specific times of the day.
many thanks
CK
To validate the schema, you can keep a reference dataset that has the same schema (first row) as your main dataset. Then use a “Get Metadata” activity for each dataset to retrieve its structure.
You can then use an “If Condition” activity to compare the structures of both datasets with the equals logical function.
If the structures of both datasets match, the next required activity (such as copying the dataset to another container) will run.
The script that you want to run on your ingested dataset can be executed with a “Custom” activity. You again need to create the linked service and its corresponding dataset for the script that validates the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling the pipeline is handled by triggers in Azure Data Factory. A schedule trigger will run your pipeline automatically at the times you specify.
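Since the question mentions driving this from Python, the structure comparison described above can also be sketched outside the pipeline. The example assumes each structure is a list of objects with "name" and "type" keys, similar to what a Get Metadata activity returns for its structure field; the sample columns and the function names are hypothetical.

```python
def structures_match(reference, actual):
    """True when both structures list the same columns and types in the same order."""
    return reference == actual


def describe_differences(reference, actual):
    """Summarize how the actual structure deviates from the reference one."""
    ref = {col["name"]: col["type"] for col in reference}
    act = {col["name"]: col["type"] for col in actual}
    shared = set(ref) & set(act)
    return {
        "missing": sorted(set(ref) - set(act)),        # expected but absent
        "unexpected": sorted(set(act) - set(ref)),     # present but not expected
        "type_mismatch": sorted(n for n in shared if ref[n] != act[n]),
    }


if __name__ == "__main__":
    # Hypothetical structures for one raw feed.
    reference = [
        {"name": "id", "type": "Int64"},
        {"name": "amount", "type": "Decimal"},
        {"name": "created", "type": "DateTime"},
    ]
    actual = [
        {"name": "id", "type": "String"},
        {"name": "amount", "type": "Decimal"},
        {"name": "comment", "type": "String"},
    ]
    print(structures_match(reference, actual))
    print(describe_differences(reference, actual))
```

The difference summary is the kind of record that could be written to a log for each feed; a plain equality check is enough if the pipeline only needs a pass/fail branch.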

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on file name, there is a branching conditional that will fire the right dataflow. E.g. if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it loads a different dataflow. Also, they'd all be using the same generic target sink datasource, the table name being parameterized and those parameters being set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may have data flow execution failures when those columns are missing from a customer's source files.
The way to protect against that in ADF Data Flow is to use column patterns. These allow you to define mappings that are generic and more flexible.
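The routing described in the question (schema name taken from the folder, data flow chosen by file name) can be sketched in plain Python to check the logic before it is built as pipeline expressions. The data flow names, the parameter names, and the path layout are hypothetical; this is not ADF code.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file name to the data flow that loads it.
DATA_FLOWS = {
    "employees.csv": "LoadEmployees",
    "payperiods.csv": "LoadPayPeriods",
}


def route(trigger_path):
    """Derive the pipeline parameters from the path of the file that arrived."""
    path = PurePosixPath(trigger_path)
    schema_name = path.parent.name               # customer folder, e.g. "CustomerA"
    data_flow = DATA_FLOWS.get(path.name.lower())
    if data_flow is None:
        # An unknown file should fail loudly rather than load into the wrong table.
        raise ValueError(f"No data flow registered for {path.name}")
    return {
        "schemaName": schema_name,
        "tableName": path.stem,                  # file name without extension
        "dataFlow": data_flow,
    }


if __name__ == "__main__":
    print(route("Platform/CustomerA/Employees.csv"))
    print(route("Platform/CustomerB/PayPeriods.csv"))
```

With this shape, a new customer needs no change at all, and a new file type only needs one more entry in the mapping.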

BIRT: Using information from one Dataset as parameter of an other

I'm creating some BIRT reports with Eclipse and have run into the following problem.
I've got two datasets (one named diag, the other named risk). In my report I produce a region with a diag_id for every row in diag. Now I'm trying to use this diag_id as an input parameter for the second dataset (risk). Is this possible, and if so, how?
To link one dataset to another in BIRT, you can either:
Create a subreport within your report that links one dataset to another via an input parameter - see this Eclipse tutorial.
or:
Create a joint dataset that explicitly links the two datasets together - see the answer to this StackOverflow question.
Alternatively, if both datasets come from the same relational database, you could simply combine the two queries into a single query.
If you are using scripted data sources, you could use variables.
Add a variable through the Eclipse UI called "diag_id".
In the fetch script of diag, set diag_id:
vars["diag_id"] = ...; // store value in Variable.
Then, in the open script of risk, use the diag_id however you need to.
diag_id = vars["diag_id"];
This requires that the risk report elements are nested inside the diag repeating element, so that diag.fetch happens before each risk.open.