I have 4 csv data files which I would to load into 4 different clusters of a single class in OrientDB: a class named demo with 4 clusters named respectively demo14q4, demo15q1, demo15q2 and demo15q3. I have tried to create 4 ETL modules in which I define a loader with an option as follow (in this case for demo14q4 cluster):
"cluster": "demo14q4",
Other options of loader (classes, indexes, etc.) are defined as usual.
My problem is that if I haven't previously created classes, all data from single csv will loaded in single class named demo with only one cluster; but if I previously create the clusters, all data from single csv will be distribuited in all 4 clusters created.
I would to create a JSON ETL module where I can define in which cluster I can load my csv and only in that, if possible.
Any ideas?
Thanks in advance.
Related
Im trying to create dataset from cfn in the way of joining 3 datasets like AxB, AxC to create base dataset (I can create these through web GUI) through cloudformation in LogicalTableMap and JoinInstruction in Dataset template. However, I cannot create the dataset with many errors in many deploys showing like “LogicalTableMap must have a single root” or “Circular Dependency” joining from base dataset. it there any suggests?
Thank in advance
I have a point cloud file (ply format) from a forest scene which is not annotated. The information about the class of the vegetation is inside a csv file. I was wondering how can I extract the information about the classes from the csv file in order to acquire the annotated point cloud.
In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in account storage. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, dataset for the source and dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing)
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contact etc.) that essentially does the following:
accesses all of the correct files, using a wildcard path along the lines of: Raw/CRM/ * / * / * /campaign.parquet
selects the columns that I need
Rename column headings
in some cases, just take the most recent data (using the RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM///*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory, but when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but this dataset has all 12 entities in the dataset and the data flow activity produces the error: "No value provided for Parameter 'cw_fileName" - but I don't see any place when configuring the data flow to specify a parameter (its not under source settings, source options, projection, optimize or inspect)
using Azure Data Factory, tried to add a script - but in trying to connect to my SQL script in Synapse - I don't know my Service Principal Key for the synapse workspace
using a notebook Databricks, I tried to mount my container but got an error along the lines that "adding secret to Databricks scope doesn't work in Standard Tier" so couldn't proceed
using Synapse, but as expected, it wants things in SQL whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction. What's the best approach that I should take? And if its one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow dataset parameter values from the pipeline data flow activity settings.
Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example I need to read files from all LAND folders present in different sub directories and need to process them and place them in PRO folders with in same sub folders.
For this I have taken two GCS nodes one from source and another from sink.
in the GCS source i have provided path gs://sameer/ , it is reading files from all sub folders and merging them into one file placing it in sink path.
Excepted output all files should be placed in sub directories where i have fetched from.
It can achieve the excepted output by running pipeline separately for each folder
I am expecting is this can be possible by a single pipeline run
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several uncompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to use something like the following schema, the outputs would be aggregated and replicated over all of the sinks:
Finally, I would say that the only case in which you can parallelise a task between several sources and several links is when using multiple database tables. By means of the following plug-ins (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all available plugins for Data fusion, please check the following link (4).
I want to take data from 2 databases and copy(coalesce) it into 1 using Data factory.
The issue is: It seems that multiple inputs is not allowed for copy activities.
So i resorted to having 2 different datasets which are exact copies but with a different name... and then putting 2 different activities into the 1 pipeline which use their specific output dataset.
It just seems odd and wrong to do it this way.
Can i have some help.
This is what my diagram currently looks like:
Is there no way of just copying data from 2 seperate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy type activity. Microsoft of course charges per activity execution in ADF, so they aren't going to allow you to take shortcuts having many inputs and output per single copy activity (single charge).
The approach you show above is ok and to pass the ADF validation as you've found you simply need to have the output datasets created separately and called different things. Even if they still refer to the same underlying target table etc. This is really only a problem for the copy activity. What you could do is land the data firstly into separate staging tables in the Azure target database just for the copy (1:1). Then have a third downstream activity that executes a stored procedure that does the union of tables. In this case you could have 2 inputs to 1 output in the activity if you want to have that level of control in ADF.
Like this:
Final point, if you don't want the activities to execute in parallel you could chain the datasets to enforce a fake dependency or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
Hope this helps