Reuse the same recipe for multiple datasets - google-cloud-dataprep

I want to use the same recipe that I use for one dataset for the rest of my datasets. The structure/headers of all the datasets are the same. Is there a way to import or reuse the same recipe without redoing all the steps?

I'm just getting started with Dataprep, but my understanding is that you could feed all your sources into the recipe at the start, then fork them back out at the end and use a schedule to run each one.
Say you have five input files with identical structure but representing different sales markets. Import all five, and if there's no market column then use a recipe to derive a new column with a static value.
UNION all of these into the recipe (so the core recipe receives one file).
At the end of the recipe, add a new recipe for each output which runs KEEP, keeping only the data for that market. This will generate five outputs.
Schedule each of these recipes, and when the schedule runs you will get five different outputs - one for each input.

In the flow view page, you can "swap" the datasource for a recipe. If you want to use different follow-on steps for different data sources, you can "make a copy" of the recipe and then swap the data source of the copied recipe.
For more details, see https://cloud.google.com/dataprep/docs/html/Flow-View-Page_57344806

Related

In Power Query, when duplicating the source query should I duplicate the Transform File folder as well?

My apologies in advance if this question has already been asked; if so, I could not find it.
So, I have this huge database divided by country, where I need to import each country's database individually and then, in Power Query, append the queries into one.
When I imported the US files, Power Query automatically generated a Transform File folder with 4 helper queries:
Then I just duplicated the query US - Sales, named it UK - Sales, and pointed it to the UK sales folder:
The Transform File folder didn't duplicate, though.
Everything seems to be working just fine right now; however, I'd like to know if this could be a problem in the near future, because I still have several countries to go. Should I manually import new queries as new connections instead of just duplicating them, or does it just not matter?
Many thanks!
The Transform File folder group contains the code that is called to transform a list of files. It is reusable code. You can see the Sample File query, which serves as the template for the transform actions.
As long as the file selected as the Sample File has the same structure as the files you are feeding into the command, you can use any query with any list of files.
One thing you need to make sure of is that the Sample File is not removed from your data source. You may want to create a dummy file just for that purpose, make sure it won't be deleted, and then point the Sample File query to pull just that file.
The Transform helper queries are special: you may edit them, but you cannot delete them and recreate your own manually. They are created automatically by Power Query when combining a list of contents and are inherently linked to the parent query.
In other words, you cannot replicate them; you must use the Combine function provided by Power Query to create the helper queries.
You may, however, avoid duplicating the queries: instead, replicate your steps in the parent query and use a table union to join the lists before combining the contents with the same helper queries.

Multiple agent arrivals based on a variable and a database column

In my Source block I want the number of agents to be based on two different factors, namely the number of beds and the visitors per bed. The visitors per bed is just a variable (e.g. visitors = 3) and the number of beds is loaded from the database table, which comes from an Excel file (see first image). Now I want to code this in the code field as shown in the example in image 2, but I do not know the correct code and do not know if it is even possible.
The simplest solution is just to do the pre-calculations in the input file and have the result in the database.
The more complex solution is to set the Source's arrivals to be defined by calls of the inject() function.
Now, you read your database at the start of the model using SQL (i.e. the query constructor), make the necessary computations, and create a Dynamic Event for each arrival at the time you want it to happen, relative to the model start. Each Dynamic Event then calls the source.inject(1) method.
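As a rough sketch (not the exact code, and assuming a database table called beds_table with a column num_beds, an integer variable visitors, and a Dynamic Event named Arrival whose only job is to inject one agent - all illustrative names), the start-up logic in Main's "On startup" action could look something like this:

// Read the beds table from the internal database and schedule one Dynamic Event per arrival.
// Table, column, variable and Dynamic Event names here are illustrative assumptions.
List<Tuple> rows = selectFrom(beds_table).list();
for (Tuple row : rows) {
    int beds = row.get(beds_table.num_beds);   // beds for this record
    int arrivals = beds * visitors;            // visitors-per-bed variable
    for (int i = 0; i < arrivals; i++) {
        create_Arrival(uniform(0, 24));        // spread arrivals over, say, the first 24 time units
    }
}

The Action of the Dynamic Event Arrival would then simply be source.inject(1);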
Better still is to not use a Source at all but a simple Enter block. The Dynamic Event creates the agent with all relevant properties from your database and pushes it into the Enter block using enter.take(myNewAgent).
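A minimal sketch of that Dynamic Event's Action, assuming an agent population visitors (of a hypothetical agent type Visitor), a parameter ward carried over from the database row, and an Enter block named enter - all assumed names, not part of the original answer:

// Create the agent manually and push it straight into the flowchart via the Enter block.
// Population, agent type, parameter and block names are illustrative assumptions.
Visitor v = add_visitors();   // create the new agent in the population on Main
v.ward = ward;                // copy any properties read from the database row
enter.take(v);                // inject the agent into the process flow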
But as I said: this is not trivial.

Parameter Variation: Tracking the Metadata

I am trying to use parameter variation in AnyLogic. My inputs are 3 parameters, each varying 5 times. My output is water demand. What I need from parameter variation is the way in which demand changes according to the different combinations of the three parameters. I imagine something like: there are 10,950 rows (one for each day), the first column is time (in days), the second column contains the values for the first combination, the third column the second combination, and so on and so forth. What would be the best way to track this metadata to then be able to export it to Excel? I have added a "dataset" to my Main to track demand through each simulation, but I am not sure what to add to the parameter variation experiment interface to track the output across the different iterations. It would also be helpful to have a way to know which combination of inputs produced a given output (for example, having the combination be the name for each column). I see that there are Java Actions, but I haven't been able to figure out the code to do what I need. I appreciate any help with this matter.
The easiest approach is just to track this in output database tables which are then exported to Excel at the end of your run. As long as these tables include outputs from multiple runs (and are, for example, only cleared at the start of the experiment not the run), your Parameter Variation experiment will end up with an Excel file having outcomes from all the runs. (You will probably need to turn off parallel execution in the PV experiment so you don't run into issues trying to write to the same Excel file in parallel.)
So, for example, you might have tables:
run_details with columns id, parm1, parm2 and parm3 (with proper column names given your actual parameters and some unique ID generated for each run)
output_demand with columns run_id, sim_time_hrs and demand_value (if, say, you're storing some demand value each hour of simulated time) where run_id cross-references the run's ID in run_details
(There is extra complexity in how you could allocate a unique run ID and how and when you write to/clear those tables, but I'm just presenting the core design. You can also get round the need-serial-execution point by programmatically controlling when you export to Excel, rather than using the built-in "Export tables at the end of model execution" capability, but that's also more complicated.)
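To make the core design concrete, here is a minimal sketch in AnyLogic's Java, assuming internal database tables named run_details and output_demand with the columns listed above, a variable runId and a variable demandValue on Main, and getEngine().getRunNumber() as one possible way to obtain a unique run ID; all of these identifiers are assumptions, and clearing the tables once per experiment (the "cleared at the start of the experiment" point above) is left out:

// Main's "On startup" action: register this run and its parameter combination.
runId = (int) getEngine().getRunNumber();   // one possible unique-ID scheme
insertInto(run_details)
    .columns(run_details.id, run_details.parm1, run_details.parm2, run_details.parm3)
    .values(runId, parm1, parm2, parm3)
    .execute();

// A cyclic event on Main (e.g. once per simulated hour): store the current demand value.
insertInto(output_demand)
    .columns(output_demand.run_id, output_demand.sim_time_hrs, output_demand.demand_value)
    .values(runId, time(HOUR), demandValue)
    .execute();

With the built-in "Export tables at the end of model execution" option (or a programmatic export), both tables then end up in one Excel file covering every run of the Parameter Variation experiment.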

Google Data Fusion: reading files from multiple subfolders in a bucket and placing them in another folder inside each subfolder

Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example I need to read files from all LAND folders present in the different subdirectories, process them, and place them in the PRO folders within the same subdirectories.
For this I have taken two GCS nodes, one as the source and another as the sink.
In the GCS source I have provided the path gs://sameer/; it is reading files from all subfolders and merging them into one file, which it places in the sink path.
Expected output: all files should be placed in the subdirectories they were fetched from.
I can achieve the expected output by running the pipeline separately for each folder.
What I am asking is whether this is possible in a single pipeline run.
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several uncompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to use something like the following schema, the outputs would be aggregated and replicated over all of the sinks:
Finally, I would say that the only case in which you can parallelise a task between several sources and several sinks is when using multiple database tables. By means of the following plugins (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all available plugins for Data Fusion, please check the following link (4).

How to take data from 2 databases (with the same schema) and copy it into 1 database using Data Factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is: it seems that multiple inputs are not allowed for copy activities.
So I resorted to having 2 different datasets which are exact copies but with different names, and then putting 2 different activities into the 1 pipeline, each using its specific output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
This is what my diagram currently looks like:
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy type activity. Microsoft of course charges per activity execution in ADF, so they aren't going to let you take shortcuts by having many inputs and outputs per single copy activity (a single charge).
The approach you show above is OK, and to pass the ADF validation, as you've found, you simply need to have the output datasets created separately and called different things, even if they still refer to the same underlying target table etc. This is really only a problem for the copy activity. What you could do is land the data first into separate staging tables in the Azure target database just for the copy (1:1). Then have a third downstream activity that executes a stored procedure that does the union of the tables. In this case you could have 2 inputs to 1 output in the activity, if you want that level of control in ADF.
Like this:
Final point: if you don't want the activities to execute in parallel, you could chain the datasets to enforce a fake dependency, or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
Hope this helps