If I create a dataset with no schema and then use it in a flow where I have a Select schema modifier, when is the retrieval from the database done? Is the full dataset, including the complete schema, retrieved first and the Select projection applied afterwards, or is it done in a just-in-time fashion, where retrieval is deferred as long as possible in order to minimize the amount of data retrieved?
When using a data flow in Azure Data Factory to move data, I've noticed that the data (at the sink) is missing columns that contain NULL values. When using the copy activity to copy the same data, the columns are present in the sink with their NULL values.
Record after a copy activity:
Record after a data flow:
Source is Parquet, sink is Azure Cosmos DB. My goal is to avoid defining any schemas, as I simply want to copy all of the data "as is". I've used the "allow schema drift" option on the source and sink.
I would just use the copy activity, but it doesn't appear to have the ability to define a maximum speed (RU consumption) like the data flow does, so the copy activity ends up consuming all of the Cosmos DB's RUs very quickly (as described here).
EDIT:
sink data preview shows all columns
sink inspect tab shows all columns
Data flows always skip writing JSON tags with NULL values. There is currently no workaround other than the copy activity.
This is really not a good design or behavior on Microsoft's part, because you can't standardize in Cosmos whether to "keep" or "remove" null fields in your JSON.
When querying Cosmos, WHERE field1 = NULL is completely different from WHERE NOT IS_DEFINED(field1) and will yield an entirely different result set.
And if your users don't know whether the ADF developer used a data flow with a sink or a copy activity in a pipeline, then they may get erroneous results in a query. The only way to ensure you get all the data is to always use:
WHERE field1 = NULL OR NOT IS_DEFINED(field1)
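For illustration, a minimal Cosmos DB SQL sketch of that defensive pattern, assuming the usual container alias c and a hypothetical property field1 that may or may not have been written:

-- Returns documents where field1 was written as null as well as documents
-- where the tag was dropped entirely (c and field1 are assumptions).
SELECT *
FROM c
WHERE c.field1 = null OR NOT IS_DEFINED(c.field1)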
Users should not have to know what kind of ADF functionality was chosen for a specific JSON document in a Cosmos NoSQL collection in order to query it. Plus, you can't standardize on "keeping" nulls across all Cosmos documents or "removing" nulls across all Cosmos documents unless you force everyone to use pipelines only or data flows only. Depending on the complexity, using pipelines only is not always possible, but using data flows only is also not always needed.
I am trying to get my data into Amazon Redshift using Fivetran, but I have some questions in general about the ELT/ETL process. My source database is Mongo, but I want to perform deep analysis on the data using a 3rd-party BI tool like Looker, and those tools integrate with SQL. I am new to the ELT/ETL process and was wondering whether it would look like this:
Extract data from Mongo (handled by Fivetran)
Load into Amazon Redshift (handled by Fivetran)
Perform Transformation - This is where my biggest knowledge gap is. I obviously have to convert objects and arrays into compatible SQL types. I can perform a transformation on all objects to extract them into columns and transform all arrays into tables. Is this the right idea? Should I design a MySQL schema and write all the transformations according to that schema design?
As you state, Fivetran will load your data into Redshift, putting individual fields into columns where it can and putting everything else into VARCHAR columns as JSON. So at that point you basically have a data lake: all your data is in an analytical platform, still in its source format, and available for you to do whatever you want with.
Initially, if you don't know much about your data and just want to investigate it, you can probably leave it as it is. Redshift has SQL functions that allow you to query the elements of a JSON structure so there is no need to build additional tables and more ETL just to allow you to investigate your data - especially as these tables may get thrown away once you understand your data and decide what you want to do with it.
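For illustration only, this is the kind of Redshift query you could run against such a raw table; the table raw_orders and its VARCHAR column payload (holding the original Mongo document as JSON) are hypothetical names:

-- Pull individual elements out of the JSON payload without building new tables.
SELECT
    json_extract_path_text(payload, 'customer', 'name') AS customer_name,
    json_extract_path_text(payload, 'status')           AS order_status,
    json_extract_array_element_text(
        json_extract_path_text(payload, 'items'), 0)    AS first_item_json
FROM raw_orders;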
If you have proper reporting requirements, then that is the point where you can start to design a schema that will support those requirements (I'm not sure why you suggested a MySQL schema, as MySQL is a database product?). Traditionally an analytical schema would be designed as a Kimball dimensional model (facts and dimensions), but the type of schema you decide to design will depend on:
The database platform you are using (in your case, Redshift) and the type of structures it works best with e.g. star schema or "flat" tables
The BI tool you are using and how it expects to have data presented to it
For example (and I'm not saying this is a real-world example), if Redshift works OK with star schemas but better with flat tables, and Looker has to have a star schema, then it probably makes more sense to build star schemas in Redshift, as this is a single modelling exercise - rather than model flat tables in Redshift and then have to model star schemas in Looker.
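As a rough sketch of what a Kimball-style model in Redshift might look like (all table names, columns and key choices here are assumptions, not a definitive design):

-- One dimension and one fact table, flattened out of the raw JSON.
CREATE TABLE dim_customer (
    customer_key  BIGINT IDENTITY(1,1),
    customer_id   VARCHAR(64),      -- original Mongo _id
    customer_name VARCHAR(256)
);

CREATE TABLE fact_orders (
    order_id      VARCHAR(64),
    customer_key  BIGINT,           -- references dim_customer.customer_key
    order_date    DATE,
    order_amount  DECIMAL(18,2)
)
DISTKEY (customer_key)
SORTKEY (order_date);

The BI tool then only ever sees fact_orders joined to dim_customer, regardless of how messy the raw JSON underneath is.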
Hope this helps?
It depends on how you need the final stage of your data analysis presented, and what the purpose of your data analysis is. As stated by NickW, assuming you need to integrate your data into a BI tool the schema should be adapted according to the tool's data format requirements.
A MongoDB ETL/ELT process might look like this:
Select Connection: select the connection you have set up.
Collection Name: choose the collection using the [database].[collection] format.
If you are pulling data from your authentication database, only the [collection] name can be determined. Example: sample.products.
Extract Method:
All: pull all of the data in the table.
Incremental: pull data by incremental value.
Incremental Attributes: set the name of the incremental attribute to run by, e.g. UpdateTime.
Incremental Type: Timestamp | Epoch. Choose the type of incremental attribute.
Choose Range:
In Timestamp, choose your date increment range to run by.
In Epoch, choose the value increment range to run by.
If no End Date/Value is entered, the default is the last date/value in the table.
The increment will be managed automatically.
Include End Value: whether the increment process should include the end value or not.
Interval Chunks: the chunks the data will be pulled in; split the data by minutes, hours, days, months, or years.
Filter: filter the data to pull. The filter format is MongoDB Extended JSON.
Limit: Limit the rows to pull.
Auto Mapping: You can choose the set of columns you want to bring, add a new column or leave it as it is.
Converting Entire Key Data As a STRING
In cases where the data is not in the form expected by the target, such as key names starting with numbers or flexible and inconsistent object data, you can convert attributes to STRING format by setting their data types to STRING in the mapping section.
The conversion applies to any value under that key.
Arrays and objects will be converted to JSON strings.
Use cases:
Here are a few filtering examples:
{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}
date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")
I'm on GCP, and I have a use case where I want to ingest a large volume of events streaming from remote machines.
To compose a final event, I need to ingest and "combine" an event of type X with events of types Y and Z.
event type X schema:
SrcPort
ProcessID
event type Y schema:
DstPort
ProcessID
event type Z schema:
ProcessID
ProcessName
I'm currently using Cloud SQL (PostgreSQL) to store most of my relational data.
I'm wondering whether I should use BigQuery for this use case, since I'm expecting a large volume of these kinds of events, and I may have future plans for running analysis on this data.
I'm also wondering about how to model these events.
What I care about is the "JOIN" between these events, so the "JOIN"ed event will be:
SrcPort, SrcProcessID, SrcProcessName, DstPort, DstProcessID, DstProcessName
When the "final event" is complete, I want to publish it to PubSub.
I can create a de-normalized table and just update partially upon event (how is BigQuery doing in terms of update performance?), and then publish to pubsub when complete.
Or, I can store these as raw events in separate "tables", and then JOIN periodically complete events, then publish to pubsub.
I'm not sure how good PostgreSQL is in terms of storing and handling a large volume of events.
The thing that attracted me to BigQuery is how easily it handles large volumes.
If you already have this on Postgres, I advise you to see BigQuery as a complementary system that stores a duplicate of the data for analysis purposes.
BigQuery offers you different ways to reduce costs and improve query performance:
Read about partitioning and clustering; with this in mind, you "scan" only the partitions that you are interested in to perform the "event completion".
You can use scheduled queries to run MERGE statements periodically to maintain a materialized table (you can schedule this as often as you want); see the sketch after this list.
You can use materialized views for some situations.
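A minimal sketch of such a scheduled MERGE; the table and column names (final_events, events_x, ConnectionID, ingest_time) are assumptions for illustration:

-- Periodically upsert newly arrived X events into the partitioned final table,
-- filling in the source-side fields.
MERGE final_events AS t
USING (
    SELECT ConnectionID, SrcPort, ProcessID AS SrcProcessID
    FROM events_x
    WHERE ingest_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
) AS s
ON t.ConnectionID = s.ConnectionID
WHEN MATCHED THEN
    UPDATE SET SrcPort = s.SrcPort, SrcProcessID = s.SrcProcessID
WHEN NOT MATCHED THEN
    INSERT (ConnectionID, SrcPort, SrcProcessID)
    VALUES (s.ConnectionID, s.SrcPort, s.SrcProcessID);

Rows that end up with all fields populated can then be published to Pub/Sub.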
BigQuery works well with bulk imports and frequent inserts such as HTTP logging. Inserting into BigQuery in batches of ~100 or ~1,000 rows every few seconds works well.
Your idea of creating a final view will definitely help. Storing data in BigQuery is cheaper than processing it so it won't hurt to keep a raw set of your data.
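A minimal sketch of such a final view, assuming the raw events land in three tables (events_x, events_y, events_z) and that the X and Y events share a correlation key, here a hypothetical ConnectionID:

-- Joins the three raw event types into the "final event" shape from the question.
CREATE OR REPLACE VIEW my_dataset.final_events_view AS
SELECT
    x.SrcPort,
    x.ProcessID    AS SrcProcessID,
    zs.ProcessName AS SrcProcessName,
    y.DstPort,
    y.ProcessID    AS DstProcessID,
    zd.ProcessName AS DstProcessName
FROM my_dataset.events_x AS x
JOIN my_dataset.events_y AS y  ON y.ConnectionID = x.ConnectionID
JOIN my_dataset.events_z AS zs ON zs.ProcessID = x.ProcessID
JOIN my_dataset.events_z AS zd ON zd.ProcessID = y.ProcessID;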
How you model or structure your events is up to you.
I have a copy data activity with an on-premises SQL Server as source and ADLS Gen2 as sink. There is a control table to pick up tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After data is pulled/loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra Lookup activity to query the source table for getting the max(watermarkDateColumn) in the pipeline.
The copy activity can only be used for data transmission, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid your requirement isn't achievable so far.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible. There is a built-in aggregation feature in the Data Flow activity.
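For illustration only, the aggregation that data flow would perform is equivalent to the following query; sourceTable and the @previousWatermark placeholder are assumptions based on the control-table setup described in the question:

-- The new high-water mark over the rows that were just copied.
SELECT MAX(watermarkDateColumn) AS newWatermark
FROM sourceTable
WHERE watermarkDateColumn > @previousWatermark;  -- previous value from the control table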
We are working on an audit system where auditors are given access to transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid/erroneous transactions that have some exceptions.
Generally, these analyses require the data to be presented on charts to spot the outliers, or sometimes duplication detection is done based on multiple columns.
Sometimes the exception detection algorithms are quite involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over huge numbers of rows.
Occasionally, they can change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL and NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or for a row-based store, like NoSQL or some RDBMS?
In short, requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several processing steps, like creating a temp table in step 1, joining it with another table in step 2, deleting some rows, etc. (a short sketch follows this list)
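For example, a typical run might look something like this sketch (all table and column names, and the date range, are made up for illustration):

-- Step 1: stage last quarter's transactions in a temp table
CREATE TEMP TABLE quarter_txn AS
SELECT *
FROM transactions
WHERE txn_date >= DATE '2024-01-01'   -- placeholder quarter boundaries
  AND txn_date <  DATE '2024-04-01';

-- Step 2: join with a rules table to flag exceptions
CREATE TEMP TABLE flagged_txn AS
SELECT t.*, r.exception_code
FROM quarter_txn t
JOIN exception_rules r
  ON t.txn_type = r.txn_type
 AND t.amount   > r.amount_limit;

-- Step 3: drop rows already handled in an earlier review
DELETE FROM flagged_txn
WHERE txn_id IN (SELECT txn_id FROM reviewed_transactions);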
Thanks
For your task, it does not really matter how the data is stored. You need to think instead about how to create a solid dimensional model, populate it with data properly, and decide what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
SQL Server for data storage
SSIS for data ETL (or write your own stored procedures if you know what you are doing)
Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
PostgreSQL for data storage
Use stored procedures and/or Python to process data
Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
Use Tableau or Power BI for interactive reporting, or build your own reporting interface.
I think a NoSQL database is the wrong choice here because the audit will require highly structured data.