QuickSight Multiple Datasets Join In CloudFormation Template - aws-cloudformation

I'm trying to create a dataset with CloudFormation by joining three datasets (A×B, A×C) into one base dataset, using LogicalTableMap and JoinInstruction in the Dataset template. I can create these joins through the web GUI, but every CloudFormation deploy fails with errors such as "LogicalTableMap must have a single root" or "Circular Dependency" when joining from the base dataset. Are there any suggestions?
Thanks in advance.
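For reference, the LogicalTableMap has to describe a single join tree: each JoinInstruction's operands point at other logical table IDs, and the chain has to converge on exactly one final logical table (the "root"). Below is a minimal, hypothetical boto3 sketch of that shape; the same property structure maps onto the AWS::QuickSight::DataSet resource in CloudFormation. All IDs, ARNs, and join conditions are placeholders, and it assumes the sources are existing datasets referenced by DataSetArn.

```python
import boto3

quicksight = boto3.client("quicksight")

# Placeholder account and dataset identifiers for illustration only.
ACCOUNT_ID = "111122223333"
ARN_A = "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-a"
ARN_B = "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-b"
ARN_C = "arn:aws:quicksight:us-east-1:111122223333:dataset/dataset-c"

logical_table_map = {
    # Leaf logical tables, each backed by an existing dataset.
    "a": {"Alias": "A", "Source": {"DataSetArn": ARN_A}},
    "b": {"Alias": "B", "Source": {"DataSetArn": ARN_B}},
    "c": {"Alias": "C", "Source": {"DataSetArn": ARN_C}},
    # First join: A x B.
    "a-join-b": {
        "Alias": "A_B",
        "Source": {
            "JoinInstruction": {
                "LeftOperand": "a",
                "RightOperand": "b",
                "Type": "LEFT",
                "OnClause": "a_key = b_key",  # placeholder join condition
            }
        },
    },
    # Second join: (A x B) x C. Chaining the joins like this leaves exactly one
    # logical table ("a-join-b-join-c") as the root of the tree; two independent
    # joins that never converge are one way to hit the "single root" error.
    "a-join-b-join-c": {
        "Alias": "A_B_C",
        "Source": {
            "JoinInstruction": {
                "LeftOperand": "a-join-b",
                "RightOperand": "c",
                "Type": "LEFT",
                "OnClause": "a_key = c_key",  # placeholder join condition
            }
        },
    },
}

quicksight.create_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId="base-dataset",
    Name="Base dataset",
    PhysicalTableMap={},  # assumed empty here because every source is an existing dataset
    LogicalTableMap=logical_table_map,
    ImportMode="DIRECT_QUERY",
)
```

In the CloudFormation template the same shape appears as the DataSetArn and JoinInstruction properties of LogicalTableSource; the important part is that all joins funnel into one final logical table.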

Related

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in account storage. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source, and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contacts, etc.) that essentially do the following:
access all of the correct files, using a wildcard path along the lines of: Raw/CRM/*/*/*/campaign.parquet
select the columns that I need
rename column headings
in some cases, take only the most recent data (using the RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory, but when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but this dataset has all 12 entities in it, and the data flow activity produces the error: "No value provided for Parameter 'cw_fileName'" - but I don't see any place when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect)
Using Azure Data Factory, I tried to add a script - but when trying to connect to my SQL script in Synapse, I don't know the Service Principal Key for the Synapse workspace
Using a Databricks notebook, I tried to mount my container but got an error along the lines of "adding secret to Databricks scope doesn't work in Standard Tier", so I couldn't proceed
Using Synapse, but as expected, it wants things in SQL, whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction? What's the best approach I should take? And if it's one that I've described above, how do I go about getting past the issues I've described?
Pass the data flow's dataset parameter values from the pipeline's Data Flow activity settings; dataset parameters such as 'cw_fileName' are supplied on the activity rather than inside the data flow's source configuration.
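Separately, if the per-entity processing ends up in Spark (the Databricks or Synapse Spark route mentioned in the question), the select / rename / keep-latest-by-RowStartDate logic might look roughly like the sketch below. The paths and column names are placeholder assumptions, not the asker's actual schema.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder container paths; in ADLS these would typically be abfss:// URLs.
raw_path = "Raw/CRM/*/*/*/campaign.parquet"
staging_path = "Staging/CRM/campaign"

df = spark.read.parquet(raw_path)

# Select and rename only the columns needed downstream (placeholder names).
df = (df.select("campaignid", "name", "RowStartDate")
        .withColumnRenamed("campaignid", "CampaignId")
        .withColumnRenamed("name", "CampaignName"))

# Keep only the most recent version of each row, using RowStartDate.
latest = Window.partitionBy("CampaignId").orderBy(F.col("RowStartDate").desc())
df = (df.withColumn("_rn", F.row_number().over(latest))
        .filter(F.col("_rn") == 1)
        .drop("_rn"))

# Write the curated output to a different folder in the container.
df.write.mode("overwrite").parquet(staging_path)
```

A slowly changing dimension variant would extend this by deriving RowEndDate from the next RowStartDate per key, for example with a lead() window function.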

Can we create a CloudFormation template for a QuickSight composite dataset?

(composite dataset meaning an existing dataset joined directly with another existing dataset)
In QuickSight, I have created a dataset that is made out of two existing datasets, and I want to create a CloudFormation template for it. But in the CloudFormation AWS::QuickSight::DataSet syntax (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-quicksight-dataset.html) I am not able to find a way to pass those two dataset IDs as input. Can someone please help me with this?

How to change a dataset schema without deleting the dataset and having to remove filters on Personalize?

I noticed that one of my schema fields that I need to filter on is a boolean, and as you can't filter on a boolean, I need to change the schema.
I was able to create a new schema using the new Python SDK, but I can't see how to update the schema of the existing dataset.
I could delete the dataset, but that would mean having to delete all the filters, which would mean our service has to go down (everything in the API uses a filter).
Unfortunately you cannot change a schema for an existing dataset since (as you found) it impacts existing immutable resources like filters. A workaround is to create a new dataset group to hold your new dataset. This means you'll need to import your data (conforming to the new schema) in the new dataset, train models, create new filters, and a new campaign. Once the campaign is ready in the new dataset group, you can switch your app to use the new campaign and tear down the old dataset group. This sort of blue/green dataset group approach is somewhat common with Personalize for reasons such as this.
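As a rough illustration of that blue/green flow with boto3: all names, the schema fields, the S3 location, the IAM role, and the filter expression below are placeholder assumptions, and each Personalize resource is created asynchronously, so each step waits for ACTIVE status before the next one.

```python
import json
import time
import boto3

personalize = boto3.client("personalize")

def wait_until_active(describe_fn, key, **kwargs):
    # Personalize resources are created asynchronously; poll until ACTIVE.
    while describe_fn(**kwargs)[key]["status"] != "ACTIVE":
        time.sleep(30)

# New schema: the boolean field is replaced by a filterable categorical string.
schema_arn = personalize.create_schema(
    name="items-schema-v2",
    schema=json.dumps({
        "type": "record",
        "name": "Items",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {"name": "ITEM_ID", "type": "string"},
            {"name": "IS_ACTIVE", "type": "string", "categorical": True},  # was boolean
        ],
        "version": "1.0",
    }),
)["schemaArn"]

# New ("green") dataset group alongside the existing one.
group_arn = personalize.create_dataset_group(name="my-app-green")["datasetGroupArn"]
wait_until_active(personalize.describe_dataset_group, "datasetGroup",
                  datasetGroupArn=group_arn)

# New dataset using the new schema.
dataset_arn = personalize.create_dataset(
    name="items-v2",
    schemaArn=schema_arn,
    datasetGroupArn=group_arn,
    datasetType="Items",
)["datasetArn"]
wait_until_active(personalize.describe_dataset, "dataset", datasetArn=dataset_arn)

# Re-import data conforming to the new schema (placeholder bucket and role).
personalize.create_dataset_import_job(
    jobName="items-v2-import",
    datasetArn=dataset_arn,
    dataSource={"dataLocation": "s3://my-bucket/items_v2.csv"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeImportRole",
)

# Recreate filters in the green group (in practice, after the import completes);
# solution versions and a campaign follow, and traffic is switched over once the
# new campaign is ready, after which the old dataset group can be torn down.
personalize.create_filter(
    name="active-items",
    datasetGroupArn=group_arn,
    filterExpression='INCLUDE ItemID WHERE Items.IS_ACTIVE IN ("true")',
)
```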

Azure Data Flow generic curation framework

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (between 10 and 100 raw tables) to validate and write to the sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.)
Using Python SDK, create Data Factory and mapping data flows pipelines using the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues to a Log Analytics workspace, and tears down the resources on a specific schedule.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time to validate/curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts scheduled to run on generic data pipelines at specific times of the day.
many thanks
CK
To validate the schema, you can have a reference dataset that has the same schema (first row) as your main dataset. Then use a Get Metadata activity on each dataset to retrieve its structure.
You can then use an If Condition activity to compare the structure of both datasets with the equals logical function - for example, an expression along the lines of equals(activity('Get Metadata1').output.structure, activity('Get Metadata2').output.structure), using the names of your two Get Metadata activities.
If both datasets' structures match, your next required activity (such as copying the dataset to another container) will be performed.
The script that you want to run on the ingested dataset can be executed using a Custom activity. You again need to create the linked service and its corresponding dataset for the script that validates the raw data. Please refer: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling the pipeline is handled by triggers in Azure Data Factory; a schedule trigger will take care of the requirement to automatically run your pipeline at any specific time.
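If some of the checks end up in the Python layer mentioned in the question rather than inside Data Flow, a minimal per-feed schema check with pyarrow could look like the sketch below; the expected columns and the file path are placeholder assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical expected schema for one feed; adjust per raw table.
EXPECTED = pa.schema([
    ("campaign_id", pa.string()),
    ("name", pa.string()),
    ("created_on", pa.timestamp("us")),
])

def validate_feed(path: str) -> list[str]:
    """Return a list of schema problems for one parquet feed (empty list means OK)."""
    actual = pq.read_schema(path)  # reads only the parquet footer metadata, not the data
    problems = []
    for field in EXPECTED:
        if field.name not in actual.names:
            problems.append(f"missing column: {field.name}")
        elif actual.field(field.name).type != field.type:
            problems.append(
                f"type mismatch on {field.name}: "
                f"expected {field.type}, got {actual.field(field.name).type}"
            )
    return problems

# Example usage against a local copy of a raw feed (placeholder path):
# print(validate_feed("Raw/CRM/2022/05/28/campaign.parquet"))
```

The same check can be wrapped per feed and pointed at whichever copy of the raw file the orchestration makes available, with the problem list forwarded to Log Analytics.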

Two pipelines writing into a single dataset in Azure Data Factory

I'm trying to point two different copy activity pipelines at a single output dataset. All pipelines and the dataset have frequency/availability set to Day. I've tried configuring pipeline1 as "style": "StartOfInterval" and pipeline2 as "style": "EndOfInterval", but with that setup I'm getting an error on publish:
The Activity schedule does not match the schedule of the output Dataset. Activity: 'MyCopyActivity'. Dataset: 'MyDataset'. (code: ActivityDataSetSchedulerMismatch)
As a workaround I could create two different datasets, and point them to the same resource.
Is it possible to achieve this with a single output dataset?
If the reason is to merge multiple inputs into one output, you could instead have a single copy activity pipeline with two separate inputs.
The input datasets could have different availability windows, and the copy activity could then combine them into one output dataset.
The pipeline's and output dataset's availability/scheduling properties should be the same in all cases.
In your case, you have different "style" values for the pipelines, but you are referring to a single output dataset, which has only one style (the default is EndOfInterval).
For one pipeline it will match, but the other pipeline will throw the error.
To overcome this, you have to create two output datasets with the same linked service. Don't forget to match the "style" of the output datasets with the corresponding pipelines.
No, it is not possible to use a single output dataset in two copy activities. You need to create two datasets and point them to the same resource.