Process more than 5000 rows from a Lookup activity in Azure Data Factory

I need to pass an Id to another activity in Data Factory. The Id is stored in blob storage in JSON format.
I am using a Lookup activity to fetch the data, but my pipeline fails when the data has more than 5000 rows. I need a solution for this; I didn't understand the existing solution on Stack Overflow.

Ah OK. Well, you cannot use OFFSET/LIMIT pagination sensibly in Cosmos DB, and ADF cannot use continuation tokens. You also cannot Lookup more than 5000 results from blob storage or paginate the blob output.
If I had this problem I would try the following, based on the idea in Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount:
Use a data flow to get the data from Cosmos DB and write it to several JSON files using partitioning, each with fewer than 5000 rows (using the method described in the comment on the link above: add a surrogate key and partition on it with the MOD operator; a sketch of this split is shown after this list).
ForEach over those blobs.
Have a nested pipeline that does the Lookup and calls the API, as you have now; each Lookup will now return at most 5000 items.
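To make step 1 concrete, here is a minimal Python sketch of the surrogate-key-plus-MOD split done outside ADF. The input layout (one JSON array of objects), the file names and the 5000 limit are assumptions for illustration only:

```python
import json
import math

MAX_ROWS = 5000  # Lookup activity limit we need to stay under

# Hypothetical input: one large JSON array of objects, each carrying the Id to pass on.
with open("ids.json") as f:
    rows = json.load(f)

# Surrogate key = row index; partition = surrogate key MOD partition count.
# This mirrors the data flow trick and keeps every output file at or below 5000 rows.
partition_count = max(1, math.ceil(len(rows) / MAX_ROWS))

for p in range(partition_count):
    chunk = [row for i, row in enumerate(rows) if i % partition_count == p]
    with open(f"ids_part_{p}.json", "w") as f:  # one blob per partition
        json.dump(chunk, f)
```

Each resulting file can then be handed to the nested pipeline's Lookup by the ForEach.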

Related

Azure Synapse Pipeline copy data from BigQuery, where the source schema is hierarchical with nested columns

Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.
I am using Azure Synapse's Copy data pipeline. The issue is that I cannot figure out how to handle a source table from BigQuery with a hierarchical schema. This results in missing columns and inaccurate datetime values at the sink.
The source is a Google BigQuery table built from the Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id; service.description; sku.id; sku.description; Project.labels.key; Project.labels.value, etc.
When I click Preview data on the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy. For example, it would only show the column name [service] with a value of {"v":{"f":[{"v":"[service.id]"},{"v":"[service.description]"}]}}
Image description: source with nested columns results in issues with the Synapse Copy data pipeline.
I have tried to configure the Copy pipeline with the following:
Source Tab:
Use query - I think the solution lies here, but I cannot figure out the syntax for selecting the proper columns. I watched a YouTube video from TechBrothersIT, How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021, but I am still unable to do it.
Sink Tab:
1. Sink dataset in various formats (CSV, JSON and Parquet) - CSV and Parquet give a similar result, and the JSON format failed.
2. Sink dataset to Azure SQL Database - failed because it is not supported with Serverless SQL Pool.
Mapping Tab (note: edited on Jan 22 with a screenshot to show the issue):
Tried Import schemas with Sink tab copy behavior of None, Flatten Hierarchy and Preserve Hierarchy, but I am still unable to get the source columns recognized as hierarchical. I am unable to get the Collection reference or the Advanced editor configurations to show up. Ref: screenshot of source columns not detected as hierarchical; MS doc on schema and data type mapping in copy activity.
I have also tried with a Data flow pipeline, but it does not support Google BigQuery (Data flow sources do not support BigQuery yet).
Here are the steps to reproduce / get to my situation:
Register Google Cloud and set up billing export (of standard usage cost) to BigQuery.
In Azure Synapse Analytics, create a linked service with user authentication. Please follow Data Tech's YouTube video "Google BigQuery connection (or linked service) in Azure Synapse analytics".
In Azure Synapse Analytics, under Integrate, click on the "+" sign -> Copy Data tool.
I believe the answer lies in the Source tab with a query and functions. Please help me figure this out, or point me in the right direction.
Looking forward to your input. Thanks in advance!
ADF allows you to write a query against the Google BigQuery source dataset. Therefore, write a query that unnests the nested columns using the UNNEST operator and then map the result to the sink.
I tried to repro this with a sample nested table.
img:1 nested table
img:2 sample data of nested table
Script to flatten the nested table:
select
  user_id,
  a.post_id,
  a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a
img:3 flattened table.
Use this query in the copy activity source dataset.
img:4 Source settings of copy activity.
Then take the sink dataset, do the mapping and execute the ADF pipeline.
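For the billing export in the question, a query along the same lines should work. The field names below come from the question itself; the fully qualified table name is a placeholder. As a sketch, the query can be tested with the BigQuery Python client before pasting the SELECT into the copy activity source:

```python
from google.cloud import bigquery

# Placeholder: replace with your billing export table.
TABLE = "my-project.my_dataset.gcp_billing_export_standard_usage"

# Flatten the nested/repeated columns mentioned in the question with UNNEST,
# so the copy activity only sees plain scalar columns.
sql = f"""
SELECT
  service.id          AS service_id,
  service.description AS service_description,
  sku.id              AS sku_id,
  sku.description     AS sku_description,
  label.key           AS project_label_key,
  label.value         AS project_label_value
FROM `{TABLE}`
LEFT JOIN UNNEST(project.labels) AS label
"""

client = bigquery.Client()  # uses your default Google Cloud credentials
for row in client.query(sql).result():
    print(dict(row))
```

The SELECT statement itself is what goes into the Use query box of the copy activity source.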
Reference:
MS document on google bigquery as a source - ADF
GC document on unnest operator

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge, ~100 GB. The data has many columns, but two important ones are d_type and d_date.
The unique values of d_type are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using a copy activity or data flow, but in a partitioned fashion, like the following layout:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried with copy activity:
The copy activity can only use one partition key.
If I partition by d_type, it creates Parquet files with arbitrary bins, e.g. 1-20 (which contains only data for d_type=1), while another file could have bins 20-30 (which contains no data).
Data flow allows multiple partition keys, but I cannot use it since I'd have to copy the entire data from on-prem SQL to Azure first and then process it (as data flow can only work with source linked services connected via Azure IR and not SHIR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the copy activity doesn't support partitioning on multiple keys and we couldn't use a data flow for the business reasons explained in the question. A sketch of what such a script might look like is shown below.
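As an illustration only, a script of that kind could look roughly like the following, assuming each d_type/month slice fits in memory (for ~100 GB you would likely read in smaller chunks); the connection string, table name and output root are placeholders:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholders: adjust the connection string, table and output root to your environment.
engine = create_engine(
    "mssql+pyodbc://user:password@onprem-sql/db?driver=ODBC+Driver+17+for+SQL+Server"
)
OUT_ROOT = Path("someDir")

# Read one d_type at a time to keep memory bounded, then split each slice by month.
for d_type in (1, 10, 100):
    df = pd.read_sql(
        text("SELECT * FROM dbo.big_table WHERE d_type = :d_type"),
        engine,
        params={"d_type": d_type},
    )
    months = pd.to_datetime(df["d_date"]).dt.strftime("%Y-%m")
    for month, part in df.groupby(months):
        out_dir = OUT_ROOT / f"d_type={d_type}" / month
        out_dir.mkdir(parents=True, exist_ok=True)
        part.to_parquet(out_dir / "somedata.parquet", index=False)
```

This produces the someDir/d_type=1/2022-01/somedata.parquet layout from the question.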

Azure Data Factory data flow - drops null columns

When using a data flow in Azure Data Factory to move data, I've noticed that the data (at the sink) is missing the columns that contain NULL values. When using the copy activity to copy the same data, the columns are present in the sink with their NULL values.
Record after a copy activity:
Record after a data flow:
Source is parquet, sink is azure cosmos db. My goal is to avoid defining any schemas, as I simply want to copy all of the data "as is". I've used the "allow schema drift" option on the source and sink.
I would just use the copy activity, but it doesn't appear to have the ability to define a maximum speed (RU consumption) like the data flow does, so the copy activity ends up consuming all of the Cosmos DB's RUs very quickly (as described here).
EDIT:
sink data preview shows all columns
sink inspect tab shows all columns
Dataflows always skip writing JSON tags with NULLs. There is no workaround currently other than copy activity.
This is really not good design or behavior on Microsoft's part, because you cannot standardize in Cosmos DB whether to "keep" or "remove" null fields in your JSON.
When querying Cosmos DB,
WHERE field1 = NULL is completely different from WHERE NOT IS_DEFINED(field1) and will yield an entirely different result set.
And if your users don't know whether the ADF developer used a data flow with a sink or a copy activity in a pipeline, they may get erroneous results from a query. The only way to ensure you get all the data is to always use:
WHERE field1 = NULL OR NOT IS_DEFINED(field1)
Users should not have to know what kind of ADF functionality was chosen for a specific JSON document in a Cosmos DB NoSQL collection in order to query it. Plus, you cannot standardize on either "keeping" or "removing" nulls across all Cosmos DB documents unless you force everyone to use only pipelines or only data flows. Depending on the complexity, using pipelines only is not always possible, and using data flows only is not always needed.
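For reference, here is the defensive pattern above as a minimal sketch using the azure-cosmos Python SDK; the endpoint, key, database, container and field name are all placeholders:

```python
from azure.cosmos import CosmosClient

# Placeholders: endpoint, key, database, container and field name.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Catches both cases: an explicit null written by a copy activity and a field
# dropped entirely by a data flow sink.
query = "SELECT * FROM c WHERE c.field1 = null OR NOT IS_DEFINED(c.field1)"
items = list(container.query_items(query=query, enable_cross_partition_query=True))
print(len(items))
```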

Filter MongoDB source dataset within copy activity in Azure Data Factory

I have created a pipeline which uses a MongoDB JSON file as the source dataset and need to sink it into a SQL Database.
My problem is that the JSON file contains too many rows, so I am trying to only retrieve rows from the last n days.
Is it possible to filter a source dataset within the copy activity, so in other words without using the filter activity?
Yes, you can filter the source dataset within the copy activity.
You can refer to this ticket: Azure Data Factory - filter Mongodb source dataset by date.
It covers the same problem, and the answer shows you how to filter the data by date.
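For illustration, this is the shape of a "last n days" filter, sketched here with pymongo; the connection string, collection and date field names are placeholders, and the linked answer shows how the equivalent filter document is set on the copy activity's MongoDB source:

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

# Placeholders: connection string, database, collection and date field.
client = MongoClient("mongodb://<host>:27017")
collection = client["<database>"]["<collection>"]

n_days = 7  # keep only documents from the last n days
cutoff = datetime.now(timezone.utc) - timedelta(days=n_days)

# A $gte comparison on the date field expresses "last n days".
for doc in collection.find({"createdDate": {"$gte": cutoff}}):
    print(doc["_id"])
```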

How to get max of a given column from ADF Copy Data activity

I have a copy data activity with on-premises SQL Server as the source and ADLS Gen2 as the sink. There is a control table to pick up tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After data is pulled/loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra Lookup activity to query the source table to get the max(watermarkDateColumn) in the pipeline.
The copy activity can only be used for data transfer, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid your requirement cannot be met so far.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible. There is a built-in aggregation feature in the Data Flow activity.