Let’s take an example. I’m working on a large dataset and want to split my processing on a weekly basis. My process is currently divided into multiple chunks/commands.
My question is: is it possible to loop over the whole notebook, or should I regroup all my code/processing into a single chunk?
For instance, working on January 2021, I want the code to run on a weekly basis: give it a starting date, run from that date to day+7, apply all the processing and store the results, update the start variable to day+8, and repeat until it reaches a fixed limit, for instance 31 January.
Is there a way to do this without regrouping all of the code in the same chunk, such as a "run all above / run all below" command inline?
You can implement this by changing your notebook to accept parameter(s) via widgets, and then triggering that notebook, for example, as a Databricks job or via dbutils.notebook.run from another notebook that implements the loop (see the docs), passing the necessary dates as parameters.
It will look like this:
in your original notebook:
starting_date = dbutils.widgets.get("starting_date")
# ... the rest of your code, using starting_date
in the calling notebook (60 is the timeout in seconds; it may need to be higher depending on the amount of transformations):
dbutils.notebook.run("path_to_original_notebook", 60,
                     {"starting_date": "2021-01-01"})
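For example, the calling notebook's loop could look like the sketch below (the notebook path, the date range and the weekly step are illustrative assumptions; adjust them to your actual window):

from datetime import date, timedelta

start = date(2021, 1, 1)   # first starting date
limit = date(2021, 1, 31)  # fixed limit

while start <= limit:
    # 60 is the timeout in seconds; raise it if a week of processing takes longer
    dbutils.notebook.run("path_to_original_notebook", 60,
                         {"starting_date": start.isoformat()})
    start += timedelta(days=7)  # move on to the next week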
When I create a ParametersVariation experiment, the main model does not run. All I see is the default UI with iterations completed and replications. My end goal (as with most people) is to have the model go through a certain number of replications, but nothing even seems to be running. There is limited documentation available on this. Please advise.
This is how Parameters Variation is intended to work: if you're running 1000 runs and multiple replications in parallel, how could you see what's happening in Main for each of them?
Typically, the best way to benefit from such an experiment is to track the results of each run using elements from the Analysis palette or, even better, to export results to Excel or similar.
To be able to collect data, you need to write your code in the experiment's Java action fields, using root. to access elements in Main (or whatever your top-level agent is).
Check the example below, where after each run a variable from Main is added to a dataset in the Parameters Variation experiment. After 100 runs, for example, the dataset will hold 100 values of that variable, one per run.
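For instance, assuming a dataset named runResults defined on the experiment and a variable named throughput in Main (both names are illustrative), the experiment's "After simulation run" action could be as simple as:

// add one data point per run: iteration number vs. the value from Main
runResults.add(getCurrentIteration(), root.throughput);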
In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach in which a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume the data flows run in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 minutes, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? I have a TTL set on the cluster, but that did not help much. If it is intended, what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I will see an increase in efficiency.
I have not been able to make the parallel data flows actually execute in parallel; however, I have managed to change the solution in a way that increases performance.
What I had before: a Lookup activity that gets a list of files to process, passed on to a ForEach loop containing a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it to a text file in ADLS; then the data flow activity that was previously inside the ForEach loop, with its source changed to use "List of files" and pointed at that list.
The result was a large increase in efficiency (on the same cluster, 40 files took around 80 minutes with ForEach and only 2-3 minutes with "List of files"). However, debugging is harder now that everything happens in one data flow.
You can overwrite the list-of-files file on each run, or use dynamic expressions and name the file after the pipeline run ID or something else.
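For example, a dynamic file name based on the pipeline run ID could use an expression along these lines (the prefix and extension are just illustrations):

@concat('filelist-', pipeline().RunId, '.txt')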
I've currently got this Cron expression that I'm using to trigger a process in UiPath Orchestrator:
0 0 15 21W * ? *
It runs at 3pm on the working day closest to the 21st of each month.
However, I need it to run at 3pm on the next working day if the 21st is a non-working day.
I tried searching for an answer, and nothing quite fit the brief.
I used this website to build my expression (it is a great tool), but it only has an option for 'nearest day', not 'next working day' after a specific day of the month: https://www.freeformatter.com/cron-expression-generator-quartz.html
As you don't want the nearest working day but the next one, you can't use the built-in cron functionality of Orchestrator. I would recommend creating a wrapper process as follows:
Create a new process, let's call it StartJobByCheckingDate
Now create a trigger that starts StartJobByCheckingDate every day at 3pm
That process is now the manager of your desired process
Now we need to check whether it is the 21st day
There are different ways to solve this:
You could create a DataTable, or even a file, in the StartJobByCheckingDate process that contains all the dates on which your desired process should fire (but this is very manual; you probably don't want to update it every year, so it is the easiest but not the smartest solution)
The other idea is to check whether the current day is the 21st. If so, check whether it is a Saturday/Sunday (non-working day).
If true: create an empty dummy file somewhere to record that the 21st was a non-working day. On each following day, check whether that file exists; if it does and the current day is a working day, delete the file again and start your desired process (see the sketch below)
If false: just start your desired process directly
I think the second idea is the best. Sure, you get 365 job runs per year, but if you keep that helper process lean, each run takes just seconds.
Another idea, instead of using the dummy file, would be to use Entities: smarter, but it needs some more time to get familiar with.
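To make the dummy-file idea concrete, here is a minimal sketch of the daily check in Python (in UiPath you would build the same logic with If, Path Exists, Create File, Delete and Start Job activities; the file name, the working-day check and start_desired_process are all placeholder assumptions):

from datetime import date
from pathlib import Path

FLAG = Path("21st_was_non_working.flag")  # hypothetical marker file

def is_working_day(d):
    return d.weekday() < 5  # extend with your bank-holiday list

def start_desired_process():
    print("start the real job here")  # stands in for the Start Job activity

today = date.today()
if today.day == 21:
    if is_working_day(today):
        start_desired_process()
    else:
        FLAG.touch()  # remember that the 21st was a non-working day
elif FLAG.exists() and is_working_day(today):
    FLAG.unlink()  # clean up the marker
    start_desired_process()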
We have (had) the exact same issue. Since UiPath doesn't offer a feasible solution out of the box, we work around the restriction using the following strategy: we trigger the actual job daily, using a custom-built, static non-working-day list that simply suppresses the execution of the robot on every day we don't want it to run.
These steps are needed:
Get a list of all known bank holidays, Saturdays and Sundays until 2053 or so...
Build the static exclusion list using a script that does something like this (pseudocode; a runnable sketch follows these steps. I will update the answer once we have actually implemented the solution):
1. Get all valid execution dates:
   loop through every 28th of the month until the end of 2053
       if the date is in the bankHolidayList then
           loop until the next bank day is found
           add it to the list of valid execution dates
       else
           add the date to the validExecutionDate list
2. Build the exclusion list:
   loop through every day until the end of 2053
       if the date is not in the validExecutionDate list
           add it to the exclusionDate list
Format the CSV accordingly and upload it to the Orchestrator tenant as a NonWorkingDay list
Update your trigger to run daily at your desired time, using the uploaded NonWorkingDay calendar
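A runnable version of the pseudocode above could look like this in Python (the 28th-of-month example and the 2053 horizon come from the steps above; the empty holiday set is a placeholder you would fill from your own source):

from datetime import date, timedelta

bank_holidays = set()  # fill from your bank-holiday source

def is_non_working(d):
    return d.weekday() >= 5 or d in bank_holidays  # weekend or bank holiday

end = date(2053, 12, 31)

# 1. Collect the valid execution dates: every 28th of the month,
#    or the next working day if the 28th is a non-working day.
valid_execution_dates = set()
d = date.today().replace(day=28)
while d <= end:
    run_day = d
    while is_non_working(run_day):
        run_day += timedelta(days=1)  # slide to the next working day
    valid_execution_dates.add(run_day)
    d = (d.replace(day=1) + timedelta(days=32)).replace(day=28)  # next month's 28th

# 2. Every other day goes on the exclusion list.
exclusion_dates = []
day = date.today()
while day <= end:
    if day not in valid_execution_dates:
        exclusion_dates.append(day)
    day += timedelta(days=1)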
While the accepted answer will surely work as well, we preferred this approach because having a separate robot that does nothing but execute a UiPath trigger just doesn't seem right to me. With this approach we have no additional code that we would potentially need to maintain.
In my opinion, not having an out-of-the-box solution for this concern is a missing feature that UiPath will (hopefully) fix by the end of 2053 ;-)
Cheers
You can configure your trigger to launch more often, and then manage the dates at the start of your process, but you must set up a list of holidays or check them in some way.
You can also use the calendar option of Orchestrator (+info)
I am running a model multiple times with different input values, and it produces different output on each run. I am trying to write code that will get AnyLogic to write each run's output to a different cell in an Excel sheet, i.e. throughput vs. time. I am using a dataset. I wonder if there is any script or hint that could help solve this?
Currently I am using the following commands; they keep overwriting the output in the same cells:
Out_excelFile1.setCellValue("Sink1 Out",2,2,2);
Out_excelFile1.writeDataSet(Sink1_D,2,3,2);
It would be best to actually use the built-in database for outputs and only write to Excel at the end of all runs, tbh.
But in your case, you need to vary the row number with your replication/iteration number. Use getCurrentIteration() or getCurrentReplication() in the experiment's "After simulation run", "After replication" or "After iteration" code sections to get this right.
Then, it would look something like Out_excelFile1.setCellValue("Sink1 Out",2,getCurrentIteration(),2);
(Details depend on your actual implementation, check the help for further info on replications, iterations and those functions)
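For instance, keeping the asker's calls, the experiment's "After simulation run" action could offset each run's output by iteration (the sheet index is taken from the question; the two-columns-per-run layout is an assumption, since writeDataSet writes a dataset's X and Y values into adjacent columns):

// write each run's label and dataset into its own pair of columns
int col = 2 * getCurrentIteration();  // assumed layout: 2 columns per run
Out_excelFile1.setCellValue("Sink1 Out run " + getCurrentIteration(), 2, 2, col);
Out_excelFile1.writeDataSet(Sink1_D, 2, 3, col);  // starts one row below the label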
I have a JMeter project with multiple GET and POST requests and assertions for them. I use the Aggregate Report and View Results Tree listeners, but neither of them can store results on an hourly basis. I tried the JMeterPlugins-Standard and JMeterPlugins-Extras packages and the jp@gc - Graphs Generator listener, but all of them use aggregated data instead of hourly data. So I would like to get the number of successful and failed requests/assertions per hour; maybe a bar chart would be most suitable for this purpose.
I'm going to suggest a non-conventional, design-level solution: name your samplers dynamically with the hour (or date and hour), so that each hour the name changes and the results land in a different category, i.e.:
The code for such name is:
${__time(dd:HH,)} the rest of the sampler name
(HH is the 24-hour hour; with the 12-hour hh, e.g. 3am and 3pm would fall into the same bucket)
Such a sampler will appear as follows in the Aggregate Report (here I simulated it with minutes/seconds, but the same will happen with days/hours, just on a larger scale):
Pros and cons of this approach:
It's simple: you can aggregate anything by hour, minute, or any other time slice while the test is running, rather than by analysis after execution.
It's not listener-dependent; it can be used with pretty much any listener or visualizer.
If you also want overall stats, you will have to sum up every sub-category. So it alters the data, but in a way that can still be added back to the original relatively easily.
Calculating __time before every sampler will not go completely unnoticed from a performance perspective, but I don't think it adds visible overhead to a script.
You could get the same data by properly aggregating the JTL or CSV (whichever you use) after execution (see the sketch after this list), so it doesn't give you anything that is impossible to achieve using standard methods.
The script needs altering to make this happen; if you have hundreds of samplers, it's going to take a while. And the same again if you want to change back...
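As an illustration of such post-execution aggregation, a short Python/pandas sketch (this assumes a CSV-format JTL with the default headers, including a timeStamp column in epoch milliseconds and a success column):

import pandas as pd

df = pd.read_csv("results.jtl")
# bucket each sample into its hour
df["hour"] = pd.to_datetime(df["timeStamp"], unit="ms").dt.floor("h")
# count passed and failed samples per hour
summary = df.groupby(["hour", "success"]).size().unstack(fill_value=0)
print(summary)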
You might want to use the Filter Results Tool, which has --start-offset and --end-offset parameters: you can "cut" your results file into the "interesting" pieces and plot them according to your requirements.
You can install Filter Results Tool using JMeter Plugins Manager
Also be aware that according to JMeter Best Practices you should
Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
Don't use "View Results Tree" or "View Results in Table" listeners during the load test, use them only during scripting phase to debug your scripts.
You can get whatever information you need from the .jtl results file; you can specify the test results location via the -l command-line argument.
To get summarized results per hour, add a Generate Summary Results listener to your test plan:
Generates a summary of the test run so far to the log file and/or standard output
Update the interval in jmeter.properties to your needs (1 hour = 3600 seconds):
summariser.interval=3600
You will get a summary of your requests per hour.
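You can also override the property for a single non-GUI run from the command line (the file names here are placeholders):

jmeter -n -t test_plan.jmx -l results.jtl -Jsummariser.interval=3600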
You can try the JMeter Backend Listener. It has integrations with Graphite and InfluxDB. After storing the results in one of these time-series databases, you can display them in a Grafana dashboard. Grafana has its own filtering for showing results on an hourly, daily, monthly basis and so on.