I am using Azure Data Factory to copy data from tables in one database to tables in another database. I am using a Lookup activity to get the list of tables that need to be copied, and after that a ForEach iterator to copy the data.
I am using the below table to get the list of tables that need to be copied.
The problem: I want to update the flag to 1 when a table is successfully copied. I tried using the log that is generated after the pipeline runs, but I am unable to use it effectively.
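Conceptually, what I want to run after each successful copy is something like this (a sketch only; dbo.TableList, TableName and CopyFlag stand in for my actual control table and columns, and @{item().TABLE_NAME} assumes the ForEach item exposes the table name under that property):

-- mark the table that was just copied
update dbo.TableList
set CopyFlag = 1
where TableName = '@{item().TABLE_NAME}';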
Related
We are executing the pipelines one by one in a sequential manner and loading the data into on-premises SQL.
But we want to run all the copy activities from a single trigger, which means we have to load the data for 15 tables into the on-premises DB. If tomorrow we have to add one more table, we should not have to change the pipeline; we would like dynamic table inserts. Kindly advise.
Thanks to all.
I reproduced the above scenario and got the below results.
Use two lookups: one for your source database and one for the on-prem SQL.
Here I have used Azure SQL Database for both source and target. You can use your database with a SQL Server linked service in the lookup.
Use the below query in both lookups to get the list of tables.
SELECT TABLE_NAME
FROM
information_schema.tables;
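As a small optional refinement (not required for the scenario): information_schema.tables also lists views, so if only base tables should be copied, the query can be narrowed like this:

SELECT TABLE_NAME
FROM information_schema.tables
WHERE TABLE_TYPE = 'BASE TABLE';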
Lookup list of source tables:
Give the same query in the second lookup.
List of target tables with another lookup:
Use a Filter activity to get the list of new tables which are not yet copied to the target.
Items: @activity('sql source lookup').output.value
Condition: @not(contains(activity('on-prem lookup').output.value,item()))
Filter result:
Give this Value array to a ForEach activity and use a Copy activity inside the ForEach.
Copy Source:
Copy sink:
You can see the new tables are copied to my target. Schedule this pipeline every day so that every new table gets copied to the target.
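As a rough sketch of keeping the Copy activity dynamic (the dataset parameter name TableName is an assumption, not something from the screenshots): parameterize the source and sink datasets, pass @item().TABLE_NAME from the Filter output into both, and use a source query like the one below. Enabling the sink's auto create table option means a brand-new table does not have to exist on the target first.

select * from [@{item().TABLE_NAME}];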
I am interested in learning how to manage BigQuery after Firestore backfills.
First off, I use the firebase/firestore-bigquery-export@0.1.22 extension with a table named 'n'. After creating this table, 2 tables are generated: n_raw_changelog and n_raw_latest.
Can I delete either of the tables, and why are the names generated automatically?
Then I ran a backfill, because the collection existed before the BigQuery table was created, using:
npx @firebaseextensions/fs-bq-import-collection \
--non-interactive \
--project blah \
--source-collection-path users \
--dataset n_raw_latest \
--table-name-prefix pre \
--batch-size 300 \
--query-collection-group true
And now the script adds 2 more tables with extra suffixes,
i.e. n_raw_latest_raw_latest and n_raw_latest_raw_changelog.
Am I supposed to send these records to the previous tables, and delete them post-backfill?
Is there a pointer, did I use incorrect naming conventions?
As shown in this tutorial, those two tables are part of the dataset generated by the extension.
For example, suppose we have a collection in Firebase called orders, like this:
When we install the extension, the configuration panel shows as follows:
Then,
As soon as we create the first document in the collection, the extension creates the firebase_orders dataset in BigQuery with two resources:
A table of raw data that stores a full change history of the documents within the collection... Note that the table is named orders_raw_changelog using the prefix we configured before.
A view, named orders_raw_latest, which represents the current state of the data within the collection.
So, these are generated by the extension.
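For completeness, the view is what you would normally query to read the current documents, for example (firebase_orders is the dataset from the tutorial above; document_name and data are columns from the extension's standard schema):

SELECT document_name, data
FROM firebase_orders.orders_raw_latest;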
From the command you posted in your question, I see that you used the fs-bq-import-collection script with the --non-interactive flag and passed the --dataset parameter with the n_raw_latest value.
The --dataset parameter corresponds to the Dataset ID parameter that is shown in the configuration panel above. Therefore, you are creating a new dataset named n_raw_latest which will contain the n_raw_latest_raw_changelog table and the n_raw_latest_raw_latest view. In fact, you are creating a new dataset with your current records, not updating the dataset that was created for your extension instance.
To avoid this, as stated in the documentation, you must use the same Dataset ID that you set when configuring the extension:
${DATASET_ID}: the ID that you specified for your dataset during extension installation
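For example, if the Dataset ID configured at installation was firestore_export (replace it with whatever you actually used) and the extension's table prefix was n, the import command would look roughly like this:

npx @firebaseextensions/fs-bq-import-collection \
--non-interactive \
--project blah \
--source-collection-path users \
--dataset firestore_export \
--table-name-prefix n \
--batch-size 300 \
--query-collection-group true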
See also:
Automated Firestore Replication to BigQuery
Stream Collections to BigQuery - GitHub
Import existing documents - GitHub
I have pipelines that copy files from on-premises to different sinks, such as on-premises and SFTP.
I would like to save a list of all files that were copied in each run for reporting.
I tried using Get Metadata and ForEach, but I am not sure how to save the output to a flat file or even a database table.
Alternatively, is it possible to find the list of objects that were copied somewhere in the Data Factory logs?
Thank you
Update:
Items: @activity('Get Metadata1').output.childItems
If you want to record the source file names, yes we can. As you said, we need to use the Get Metadata and ForEach activities.
I've created a test to save the source file names of the Copy activity into a SQL table.
As we all know, we can get the file list via Child items in Get metadata activity.
The dataset of the Get Metadata1 activity specifies the container which contains several files.
The list of files in the test container is as follows:
Inside the ForEach activity, we can traverse this array. I set a Copy activity named Copy-Files to copy files from source to destination.
@item().name represents each file in the test container. I key in the dynamic content @item().name to specify the file name, so it will pass in the file names from the test container one by one. This executes the copy task in batches; each batch passes in one file name to be copied, so that we can record each file name into the database table later.
Then I set another Copy activity to save the file names into a SQL table. Here I'm using Azure SQL and I've created a simple table.
create table dbo.File_Names(
Copy_File_Name varchar(max)
);
As this post also said, we can use similar syntax select '@{item().name}' as Copy_File_Name to access activity data in ADF. Note: the alias should be the same as the column name in the SQL table.
Then we can sink the file names into the SQL table.
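If the saved list needs to be tied to a specific run for reporting, the same source query can also record the pipeline run id, something like this (a sketch; @{pipeline().RunId} is an ADF system variable, and the extra Run_Id column is assumed to exist in the table):

select '@{item().name}' as Copy_File_Name,
       '@{pipeline().RunId}' as Run_Id;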
Select the table created previously.
After I run a debug run, I can see all the file names saved into the table.
If you want to add more information, you can reference the post I mentioned previously.
I have a dataset based on a csv file. This exposes data as follows:
Name,Age
John,23
I have an Azure SQL Server instance with a table named: [People]
This has columns
Name, Age
I am using the Copy Data activity and trying to copy data from the csv dataset into the Azure SQL table.
There is no option to indicate the target table name. Instead I have a space to input a Stored Procedure name?
How does this work? Where do I put the target table name in the image below?
You should DEFINITELY have a table name to write to. If you don't have a table, something is wrong with your setup. Make sure you have a table to write to and that the field names in your table match the fields in the CSV file. Then follow the steps outlined in the link below. There are several steps to click through, but all are pretty intuitive, so just follow the instructions step by step and you should be fine.
http://normalian.hatenablog.com/entry/2017/09/04/233320
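For example, a minimal table matching the Name,Age csv could look like this (the data types are assumptions, adjust as needed):

create table dbo.People(
    Name varchar(100),
    Age int
);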
You can add records into the SQL Database table directly without stored procedures, by configuring the table value on the Sink Dataset rather than on the Copy activity, which is what is happening here.
Have a look at the below screenshot which shows the Table field within my dataset.
I have over 100 nested JSON files and I am trying to load them via Data Factory V2 into SQL Data Warehouse. I have created the Data Factory V2 and everything seems fine: the connection below seems fine and the Data Preview seems fine also.
When I run the Data Factory I get this error:
I am not sure what the issue is. I have tried to re-create the Data Factory several times.
The error message is clear enough when it says "All columns of the table must be specified...". This means that the table in the data warehouse has more columns than what you are seeing in the preview of the json file. You will need to create a table in the data warehouse with the same columns that are shown in the preview of the json files.
If you need to insert them into a table with more fields, create a "staging" table with the same columns as the json file, and then call a stored procedure to insert the content of this staging table into the corresponding table.
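A rough sketch of that staging pattern (all table, column, and procedure names here are made up for illustration; the staging columns would be whatever appears in the json preview):

-- staging table with only the columns present in the json files
create table stg.Orders(
    OrderId int,
    CustomerName varchar(100)
);

-- procedure that moves staged rows into the wider target table,
-- filling the extra columns with defaults
create procedure dbo.usp_LoadOrders
as
begin
    insert into dbo.Orders(OrderId, CustomerName, LoadDate)
    select OrderId, CustomerName, getdate()
    from stg.Orders;

    truncate table stg.Orders;
end;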
Hope this helped!