I am currently looking into SAS Viya 3.4 to replace SAS 9.4.
Now I was curious to see what the Environment Manager can do for scheduling jobs and for creating and maintaining job flows. However, I noticed that I could only drag and drop jobs into a flow and connect them, with very few configurable options. Also, as a trigger to start a job flow I was only able to select a time event. I am wondering if there are other trigger types to choose from, for example triggering a job if a specific table exists or a file exists [or ...]. I also did not see a way to trigger/start a job based on the return code of the previous job.
Also it does not seem to be smart enough to make sure two jobs don't access a library with write access at the same time.
I can't see how SAS Viya could replace a Job Orchestration Tool. However, I feel like the tool was built to replace such an Orchestration Tool. Did I miss something or is it just not possible to do so with the Environment Manager in SAS Viya?
Any help/insights are highly appreciated. I already searched through the documentation but could not find anything. Maybe I was just looking in the wrong place?
Why 3.4 and not 3.5 (or Viya 4)?
If you want to use Viya with your own Job Orchestration software you can consider this tool (built by my team): https://cli.sasjs.io/job/
We deployed it on Jenkins for this customer: https://www.sas.com/en_us/news/press-releases/2021/july/sas-partnership-with-lloyds-list-intelligence.html
I have a relatively big project in Azure Databricks that will soon go to production. The code is currently organized in a few folders in a repository and the tasks are triggered using ADF and job clusters executing notebooks one after another.
The notebooks have some hardcoded values like input path, output path etc.
I don't think it is the best approach.
I would like to get rid of hardcoded values and rely on some environment variables/environment file/environment class or something like that.
I was thinking about creating a few classes whose methods hold the individual transformations, with the save operations kept outside of the transformations.
Can you give me some tips? How do I reference one scala script from another in Databricks? Should I create a JAR?
Or can you refer me to some documentation/good public repositories where I can see how it should be done?
It's hard to write a comprehensive guide on how to go to prod, but here are some things I wish I had known earlier.
When going to production:
Try to migrate to jar jobs once you have a well established flow.
Notebooks are for exploratory tasks and are not recommended for long-running jobs.
You can pass params to your main, read environment vars or read the spark config. It's up to you how to pass the config.
Choose New Job Cluster and avoid All Purpose Cluster.
In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
The pricing is different for New Job Cluster. I would say it ends up cheaper.
Here is how to deal with secrets
... and a few other off-topic ideas:
I would recommend taking a look at CI/CD Jenkins recipes
Automate deployments with the Databricks CLI
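To make the configuration point above more concrete (pass params to your main, or read environment variables), here is a minimal Python sketch that resolves each setting from command-line arguments first and environment variables second. The setting names (`--input-path`, `--output-path`, `INPUT_PATH`, `OUTPUT_PATH`) and the example paths are made up for illustration; a Scala main in a jar job could follow the same shape.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class JobConfig:
    input_path: str
    output_path: str


def load_config(argv, env):
    """Resolve each setting from CLI args first, then environment variables."""
    parser = argparse.ArgumentParser()
    # Hypothetical setting names; the environment value is only a fallback.
    parser.add_argument("--input-path", default=env.get("INPUT_PATH"))
    parser.add_argument("--output-path", default=env.get("OUTPUT_PATH"))
    args = parser.parse_args(argv)
    missing = [name for name, value in vars(args).items() if value is None]
    if missing:
        raise ValueError(f"missing settings: {', '.join(sorted(missing))}")
    return JobConfig(args.input_path, args.output_path)


# Example: one value from the arguments, one from the environment mapping.
config = load_config(["--input-path", "/mnt/raw"], {"OUTPUT_PATH": "/mnt/curated"})
print(config)
```

In a real job you would pass `sys.argv[1:]` and `os.environ`; taking them as arguments keeps the function easy to test.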
If you're using notebooks for your code, then it's better to split code into following pieces:
Notebooks with "library functions" ("library notebooks") - only defining functions that will transform data. These functions usually just receive a DataFrame + some parameters, perform transformation(s) and return a new DataFrame. These functions shouldn't read/write data, or at least shouldn't have hardcoded paths.
Notebooks that are entry points of jobs (let's call them "main") - they may receive some parameters via widgets; for example, you can pass the environment name (prod/dev/staging), file paths, etc. These "main" notebooks may include "library notebooks" using %run with relative paths, like %run ./Library1, %run folder/Library2 (see doc)
Notebooks that are used for testing - they also include "library notebooks", but add code that calls the functions and checks the results. Usually you need specialized libraries, such as spark-testing-base (Scala & Python), chispa (Python only), spark-fast-tests (Scala only), etc. to compare the content of DataFrames, schemas, etc. (here are examples of using different libraries). These test notebooks can be triggered either as regular jobs or from a CI/CD pipeline. For that you can use the Databricks CLI or the dbx tool (a wrapper around the Databricks CLI). I have a demo of a CI/CD pipeline with notebooks, although it's for Python.
For notebooks it's recommended to use the Repos functionality, which lets you perform version control operations on multiple notebooks at once.
Depending on the size of your code and how often it changes, you can also package it as a library that is attached to a cluster and used from the "main notebooks". In this case it could be a bit easier to test the library functions - you can just use standard tooling, like Maven, SBT, etc.
P.S. You can also reach out to the solutions architect assigned to your account (if there is one) and discuss the topic in more detail.
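The split between "library", "main" and "test" code described above can be sketched without Spark. In this Python sketch, lists of dicts stand in for DataFrames, and the function names, column names and parameter keys are all hypothetical. It only shows the shape: transformations take data plus parameters and return new data, while reading, writing and paths live in the entry point.

```python
def add_total(rows, price_col, qty_col):
    """'Library' function: takes data plus parameters, returns new data, no I/O."""
    return [{**row, "total": row[price_col] * row[qty_col]} for row in rows]


def only_positive(rows, col):
    """Another pure transformation: keep rows whose value in `col` is positive."""
    return [row for row in rows if row[col] > 0]


def run_job(read, write, params):
    """'Main' entry point: the only place that reads, writes, and knows paths."""
    rows = read(params["input_path"])
    result = only_positive(add_total(rows, "price", "qty"), "total")
    write(params["output_path"], result)


# 'Test' code: call the library functions on tiny in-memory data and compare.
sample = [{"price": 2, "qty": 3}, {"price": 5, "qty": 0}]
assert only_positive(add_total(sample, "price", "qty"), "total") == [
    {"price": 2, "qty": 3, "total": 6}
]
```

Because `run_job` receives its reader and writer as arguments, the same transformations can run against test data or against production paths without any change.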
I've been able to create a Compute Environment, a Job Queue and about a dozen Job Definitions using CloudFormation. Great!
Unless I'm missing something, there doesn't seem to be an element to actually submit my Job Definitions using CloudFormation. :(
At first, I thought I had it figured out because you can create CloudWatch Events that trigger a Job Submission. However, I notice that the Event Rule in CloudFormation does not have support for Batch like the CLI/SDK does. Lame!
Anyone else deploying Batch with CloudFormation? How are you submitting jobs? I guess I can create a Custom Resource, but that seems harder than it should be.
Does https://docs.aws.amazon.com/batch/latest/userguide/batch-cwe-target.html solve your problem?
AWS Batch jobs are available as CloudWatch Events targets. Using simple rules that you can quickly set up, you can match events and submit AWS Batch jobs in response to them.
When you create a new rule, add the batch job as a target.
The easiest way would be to create a Lambda function. You can create it via CF and capture your requirement in the function code.
Or like you mentioned, you can create a custom resource.
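A minimal sketch of the Lambda option, assuming a Python runtime and the boto3 Batch client's `submit_job` call. The environment variable names (`JOB_QUEUE`, `JOB_DEFINITION`), the default job name and the optional `batch_client` parameter (there only to make the handler testable) are assumptions for illustration, not something from the question.

```python
import os


def build_submit_args(event, env):
    """Build the keyword arguments for the Batch SubmitJob call."""
    args = {
        "jobName": event.get("jobName", "scheduled-job"),  # hypothetical default
        "jobQueue": env["JOB_QUEUE"],            # assumed environment variable
        "jobDefinition": env["JOB_DEFINITION"],  # assumed environment variable
    }
    if event.get("parameters"):
        args["parameters"] = event["parameters"]
    return args


def handler(event, context, batch_client=None):
    """Lambda entry point: submit one Batch job and return its id."""
    if batch_client is None:
        import boto3  # imported lazily so the module loads without boto3
        batch_client = boto3.client("batch")
    response = batch_client.submit_job(**build_submit_args(event, os.environ))
    return {"jobId": response["jobId"]}
```

The function's execution role would need permission for `batch:SubmitJob`, and the function itself can be triggered by a scheduled rule or any other event source you define in the template.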
My company has a couple of joblets that we put in new jobs to do things like initializing variables, getting system information from the database, and sending out error/warning emails. The issue we are running into is that if we start creating the components of a job and then realize that we forgot to include these 3 joblets, we basically have to re-create the job to ensure that the joblets are added first so they run first.
Is there any way to force these joblets to run first and possibly also in a certain order before moving on to the contents of the job being created? Please let me know if there is any information you may need that I'm missing as I have only been using Talend for a few days. The rest of the team has not been using it too much longer than I have, so they do not have the answer I'm looking for either. Thanks in advance!
In Joblets you can use the components Trigger_Input and Trigger_Output as connection points for "on subjob OK" triggers. This lets you connect joblets and other components in a job with triggers, thus enforcing the execution order.
But you cannot get an "on subjob OK" trigger from a tPreJob. I am thinking of triggering from a tPreJob to a tWarn (on component OK) and then from the tWarn to the joblet (on subjob OK).
I am exploring Talend at work, and I was asked if Talend supports batch processing, as in running a job in multiple threads. After going through the user guide I understood that threading is possible with subjobs. I would like to know if it is possible to run a job with a single action in parallel.
Talend has excellent multithreading support. There are two basic methods for this. One method gives you more control and is implemented using components. The other method is implemented as a job setting.
For the first method see my screenshot. I use tParallelize to load three files into three tables at the same time. Then when all three files are successfully loaded I use the same tParallelize to set the values of a control table. tParallelize can also be connected to tRunJob as easily as a subjob.
The other method is described very well in Talend Help: Run Jobs in Parallel
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.
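As an analogy only (this is plain Python, not Talend), the first method's pattern of running several loads at the same time and then running a follow-up step once all of them have succeeded looks roughly like this. The function names and file names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load_file(name):
    """Stand-in for loading one file into one table."""
    return f"loaded {name}"


def run_parallel_then(tasks, follow_up):
    """Run all tasks concurrently; call follow_up only if every task succeeded."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        # result() re-raises a task's exception, so a failure skips follow_up.
        results = [future.result() for future in futures]
    return follow_up(results)


files = ["a.csv", "b.csv", "c.csv"]
tasks = [lambda name=name: load_file(name) for name in files]
print(run_parallel_then(tasks, lambda done: f"control table set after {len(done)} loads"))
```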
I want to run a plug-in every 30 minutes, to poll an external system for changes. I am in CRM Online, so I don't have ready access to a scheduling engine.
To run the plug-in, I have a 'trigger' entity with a timezone independent date-
Updating the field also triggers a workflow, which in pseudocode has this logic:
If (Trigger_WaitUntil >= [Process-Execution Time])
{
    Timeout until Trigger:WaitUntil
    {
        Set Trigger_WaitUntil to [Process-Execution Time] + 30 minutes
        Stop Workflow with status of: Succeeded
    }
}
If (Trigger_WaitUntil < [Process-Execution Time])
{
    Send email //Tell an admin that the recurring task has self-terminated
    Stop Workflow with status of: Canceled
}
So, the behaviour I expect is that every 30 minutes, the 'WaitUntil' field gets updated (and the Plug-in and workflow get triggered again); unless the WaitUntil date is before the Execution time, in which case stop the workflow.
However, 4 hours or so later (probably 8 executions, although I haven't verified that yet) I get an infinite loop warning "This workflow job was canceled because the workflow that started it included an infinite loop. Correct the workflow logic and try again. For information about workflow".
My question is why? Do workflows have a correlation id like plug-ins, which is being carried through to the child workflow? If so, is there any way I can prevent this, whilst maintaining the current basic mechanism of using a single trigger record to manage the schedule? (I've seen other solutions in which workflows create new records, but then you've got to go round tidying up the old trigger records as well.)
Yes, this behavior is well-known. The only way to implement recurring workflows in Dynamics CRM without infinite loop issues, using only OOB features, is to use the Bulk Deletion functionality. This article describes how to implement it - http://www.crmsoftwareblog.com/2012/08/using-the-bulk-deletion-process-to-schedule-recurring-workflows/
UPD: If you want to run your code every 30 mins then you will have to create 48 bulk delete jobs with corresponding start date/times like 12:00, 12:30, 1:00 ...
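The count of 48 follows from 24 hours × 2 runs per hour, assuming each bulk delete job recurs once a day. A small Python sketch of that arithmetic, listing the start times you would enter (the date used is arbitrary and the function name is made up):

```python
from datetime import datetime, timedelta


def daily_start_times(first_start, interval_minutes):
    """Return one start time per job so that daily jobs cover every interval."""
    jobs_per_day = (24 * 60) // interval_minutes
    step = timedelta(minutes=interval_minutes)
    return [first_start + i * step for i in range(jobs_per_day)]


starts = daily_start_times(datetime(2024, 1, 1, 0, 0), 30)
print(len(starts))                                # 48
print([t.strftime("%H:%M") for t in starts[:3]])  # ['00:00', '00:30', '01:00']
```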
The current supported method for CRM is to use the Azure Scheduler.
Excerpt:
create a Web API application to communicate with CRM and our external
provider running on a shared (free) Azure web site and also utilize
the Azure Scheduler to manage the recurrence pattern.
The free version of the Azure Scheduler limits us to execution no more
than once an hour and a maximum of 5 jobs. If you have a lot going on
$20 a month will get you executions every minute and up to 50 jobs -
which sounds like a pretty good deal.
So if you wanted every 30 minutes, you could create two jobs, one on the half hour and one on the hour.
The Bulk Deletion is an interesting workaround and something we've used before. It creates extra work and maintenance though, so I try to avoid it if possible.
I would generally recommend building a Windows application and using the Windows scheduling feature (I know you said you don't have a scheduler available, but this is often forgotten). This approach works really well and is very easy to troubleshoot. Writing to logs and sending error email alerts is pretty easy, which makes it robust. The server doesn't need to be accessible externally; it only needs to reach CRM. If you had CRM on-prem, you could just use the same server.
Azure Scheduler is a great suggestion. This keeps you in the cloud which is nice.
SSIS is another option if you already have KingswaySoft or COZYROC in place.
You could build a workflow that creates another record and cleans up after itself; however, this is really using the wrong tool for the job. Also, it's very easy for it to fail and then not initiate the next record.
There is a solution called "Scheduled Workflow Runner". You create a FetchXML query to create a record set to run against, and point it at an on-demand workflow that you want it to run on each record.
http://alexanderdevelopment.net/post/2013/05/18/scheduling-recurring-dynamics-crm-workflows-with-fetchxml/