How to properly structure Spark code in Databricks? - scala

I have a relatively big project in Azure Databricks that will soon go to production. The code is currently organized in a few folders in a repository and the tasks are triggered using ADF and job clusters executing notebooks one after another.
The notebooks have some hardcoded values like input path, output path etc.
I don't think it is the best approach.
I would like to get rid of hardcoded values and rely on some environment variables/environment file/environment class or something like that.
I was thinking about creating a few classes whose methods would contain the individual transformations, with the save operations kept outside of the transformations.
Can you give me some tips? How do I reference one scala script from another in Databricks? Should I create a JAR?
Or can you refer me to some documentation/good public repositories where I can see how it should be done?

It's hard to write a comprehensive guide on how to go to prod, but here are some things I wish I knew earlier.
When going to production:
Try to migrate to jar jobs once you have a well-established flow.
Notebooks are for exploratory tasks and not recommended for long running jobs.
You can pass params to your main, read environment variables, or read the Spark config - it's up to you how you pass the configuration in (see the sketch after this list).
Choose New Job Cluster and avoid All Purpose Cluster.
In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
Pricing is different for a New Job Cluster; I would say it usually ends up cheaper.
For credentials and other secrets, use Databricks secret scopes (see the secrets documentation).
...and a few other, somewhat off-topic, ideas:
I would recommend taking a look at CI/CD recipes for Jenkins.
Automate deployments with the Databricks CLI.
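To make the "pass params to your main" point concrete, here is a minimal sketch of a jar-job entry point. The config keys (spark.myapp.*), the environment variable names and the paths are all made up - adapt them to whatever config mechanism you settle on:

import org.apache.spark.sql.SparkSession

object IngestJob {
  def main(args: Array[String]): Unit = {
    // Environment name passed as the first program argument, e.g. "dev" / "staging" / "prod".
    val env = args.headOption.getOrElse("dev")

    val spark = SparkSession.builder().appName(s"ingest-$env").getOrCreate()

    // Hypothetical config keys: resolve paths from the Spark conf first,
    // then from environment variables, then fall back to a naming convention.
    val inputPath = spark.conf.get("spark.myapp.inputPath",
      sys.env.getOrElse("INPUT_PATH", s"/mnt/$env/raw/events"))
    val outputPath = spark.conf.get("spark.myapp.outputPath",
      sys.env.getOrElse("OUTPUT_PATH", s"/mnt/$env/curated/events"))

    val df = spark.read.format("delta").load(inputPath)
    // ... transformations ...
    df.write.format("delta").mode("overwrite").save(outputPath)
  }
}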

If you're using notebooks for your code, then it's better to split the code into the following pieces:
Notebooks with "library functions" ("library notebooks") - these only define functions that transform data. Such functions usually just receive a DataFrame plus some parameters, perform the transformation(s), and return a new DataFrame (see the sketch after this list). These functions shouldn't read or write data, or at least shouldn't have hardcoded paths.
Notebooks that are the entry points of jobs (let's call them "main") - they may receive some parameters via widgets; for example, you can pass the environment name (prod/dev/staging), file paths, etc. These "main" notebooks may include "library notebooks" using %run with relative paths, like %run ./Library1 or %run folder/Library2 (see the documentation for %run).
Notebooks that are used for testing - they also include "library notebooks", but add code that calls the functions and checks the results. Usually you need specialized libraries, like spark-testing-base (Scala & Python), chispa (Python only), spark-fast-tests (Scala only), etc., to compare the content of the DataFrames, schemas, etc. (here are examples of using the different libraries). These test notebooks could be triggered either as regular jobs or from a CI/CD pipeline. For that you can use the Databricks CLI or the dbx tool (a wrapper around the Databricks CLI). I have a demo of a CI/CD pipeline with notebooks, although it's for Python.
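As a rough illustration of that split (the object name Transformations, the column names event_date/amount, the widget names and the paths are all invented):

// "Library notebook" (e.g. ./Transformations): only defines functions, no I/O.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Transformations {
  // Pure transformation: DataFrame + parameters in, new DataFrame out.
  def withDailyTotals(df: DataFrame, amountCol: String): DataFrame =
    df.groupBy(col("event_date")).agg(sum(col(amountCol)).as("daily_total"))
}

// "Main" notebook: pulls in the library notebook (the %run must be in its own cell)
// and keeps all read/write logic here, driven by widget parameters.
// %run ./Transformations
val env        = dbutils.widgets.get("env")        // e.g. "dev" / "staging" / "prod"
val inputPath  = dbutils.widgets.get("inputPath")
val outputPath = dbutils.widgets.get("outputPath")

val raw    = spark.read.format("delta").load(inputPath)
val result = Transformations.withDailyTotals(raw, amountCol = "amount")
result.write.format("delta").mode("overwrite").save(outputPath)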
For notebooks it's recommended to use the Repos functionality, which allows you to perform version control operations on multiple notebooks at once.
Depending on the size of your code, and how often it changes, you can also package it as a library that is attached to a cluster and used from the "main" notebooks. In this case it could be a bit easier to test the library functions - you can just use standard tooling like Maven, SBT, etc. (see the test sketch below).
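For example, with spark-fast-tests a test for the hypothetical withDailyTotals function sketched above could look roughly like this, whether it lives in an SBT/Maven project or is adapted into a test notebook:

import org.apache.spark.sql.SparkSession
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.scalatest.funsuite.AnyFunSuite

class TransformationsSpec extends AnyFunSuite with DatasetComparer {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("transformations-tests").getOrCreate()
  import spark.implicits._

  test("withDailyTotals sums amounts per date") {
    val input    = Seq(("2024-01-01", 10L), ("2024-01-01", 5L)).toDF("event_date", "amount")
    val expected = Seq(("2024-01-01", 15L)).toDF("event_date", "daily_total")

    val actual = Transformations.withDailyTotals(input, amountCol = "amount")
    // ignoreNullable avoids failures caused purely by nullability differences in the schema.
    assertSmallDatasetEquality(actual, expected, ignoreNullable = true)
  }
}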
P.S. You can also reach out to the solutions architect assigned to your account (if there is one) and discuss this topic in more detail.

Related

Working with multiple data warehouses in dbt

I'm building an application where each of our clients needs their own data warehouse (for security, compliance, and maintainability reasons). For each client we pull in data from multiple third party integrations and then merge them into a unified view, which we use to perform analytics and report metrics for the data across those integrations. These transformations and all relevant schemas are the same for all clients. We would need this to scale to 1000s of clients.
From what I gather, dbt is designed so that each project corresponds to one warehouse. I see two options:
Use one project and create a separate environment target for each client (and maybe a single dev environment). Given that environments aren't designed for this, are there any catches? Will scheduling, orchestrating, or querying the outputs be painful or unscalable for some reason?
profiles.yml:
example_project:
  target: dev
  outputs:
    dev:
      type: redshift
      ...
    client_1:
      type: redshift
      ...
    client_2:
      type: redshift
      ...
    ...
Create multiple projects, and create a shared dbt package containing most of the logic. This seems very unwieldy - it means maintaining a separate repo for each client - and less developer-friendly.
profiles.yml:
client_1_project:
  target: dev
  outputs:
    client_1:
      type: redshift
      ...

client_2_project:
  target: dev
  outputs:
    client_2:
      type: redshift
      ...
Thoughts?
I think you captured both options.
If you have a single database connection, and your client data is logically separated in that connection, I would definitely pick #2 (one package, many client projects) over #1. Some reasons:
Selecting data from a different source (within a single connection), depending on the target, is a bit hacky, and wouldn't scale well for 1000s of clients.
The developer experience for packages isn't so bad. You will want a developer data source, but depending on your business you could maybe get away with using one client's data (or an anonymized version of that). It will be good to keep this developer environment logically separate from any individual client's implementation, and packages allow you to do that.
I would consider generating the client projects programmatically, probably using a Python CLI to set up, dbt run, and tear down the required files for each client project (I'm assuming you're not going to use dbt Cloud and have another orchestrator or compute environment that you control). It's easy to write YAML from Python with pyyaml (each file is just a dict), and your individual projects probably only need separate profiles.yml, sources.yml, and (maybe) dbt_project.yml files. I wouldn't check these generated files for each client into source control -- just check in the script and generate the files you need with each invocation of dbt.
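The answer above describes a Python CLI with pyyaml; purely to illustrate the generate, dbt run, tear down loop, here is the same idea sketched in Scala with plain string templating. The client names, profile name, schema layout and omitted connection fields are all placeholders:

import java.nio.file.{Files, Paths}
import scala.sys.process._

object RunAllClients {
  // Render a minimal profiles.yml for one client (most connection fields omitted).
  def profileFor(client: String): String =
    s"""example_project:
       |  target: $client
       |  outputs:
       |    $client:
       |      type: redshift
       |      schema: $client
       |      # host, user, password, port, dbname, etc. would come from a secrets store
       |""".stripMargin

  def main(args: Array[String]): Unit = {
    val clients = Seq("client_1", "client_2") // in practice, load this list from config
    clients.foreach { client =>
      Files.write(Paths.get("profiles.yml"), profileFor(client).getBytes("UTF-8"))
      Seq("dbt", "run", "--profiles-dir", ".", "--target", client).!  // run dbt for this client
      Files.deleteIfExists(Paths.get("profiles.yml"))                  // tear down
    }
  }
}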
On the other hand, if your clients each have their own physical database with separate connections and credentials, and those databases are absolutely identical, you could get away with #1 (one project, many profiles). The "hardest" parts of that approach would likely be managing secrets and generating/maintaining a list of targets that you could iterate over (ideally in a parallel fashion).

SAS Viya - Environment Manager: Job triggers

I am currently looking into SAS Viya 3.4 to replace SAS 9.4.
Now I was curious to see the possibilities of the Environment Manager for scheduling jobs and creating and maintaining job flows. However, I noticed that I could only drag and drop jobs into a flow and connect them, with very few configurable options. Also, as a trigger to start a job flow I was only able to select a time event. I am wondering if there are other trigger types to choose from, like a job being triggered when a specific table or file exists [or ...]. Nor did I see the possibility to trigger/start a job based on the return code of the previous job.
Also, it does not seem to be smart enough to make sure two jobs don't access a library with write access at the same time.
I can't see how SAS Viya could replace a Job Orchestration Tool. However, I feel like the tool was built to replace such an Orchestration Tool. Did I miss something or is it just not possible to do so with the Environment Manager in SAS Viya?
Any help/insights are highly appreciated. I have already searched through the documentation but could not find anything. Maybe I was just looking in the wrong place?
Why 3.4 and not 3.5 (or Viya 4)?
If you want to use Viya with your own Job Orchestration software you can consider this tool (built by my team): https://cli.sasjs.io/job/
We deployed it on Jenkins for this customer: https://www.sas.com/en_us/news/press-releases/2021/july/sas-partnership-with-lloyds-list-intelligence.html

Running Netlogo headless on the cloud

I've written a NetLogo model to model agent movement in a landscape. I'd like to run this model from the command prompt, using AWS/Google Compute. The model uses about 500MB worth of input rasters and shapefiles and writes rasters and csv files. It also uses the extensions gis, rnd, cf, table and csv.
Would this be possible using the Controlling API? (https://github.com/NetLogo/NetLogo/wiki/Controlling-API). Can I just use the steps listed in the link? I have not tried running NetLogo from the command prompt before.
Also, I do not want to run BehaviorSpace as it is not relevant to this model.
A BehaviorSpace experiment can consist of only a single run, so BehaviorSpace may actually be relevant to you here. You only need to write one short XML file (or no new files at all, if the experiment setup you want is already part of the model) to do it this way.
Whereas if you go through the controlling API, you will have to write and compile Java (or Scala) code, which is a substantially more complex task.
But if you decide to go the controlling API route: yes, that works too, and it is documented, as you've already noticed.
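If you do go down the Controlling API road, the headless entry point is quite small. A rough Scala sketch follows; the model path, commands and reporter are placeholders, and the gis/rnd/cf/table/csv extensions are loaded from the model as usual:

import org.nlogo.headless.HeadlessWorkspace

object RunModel {
  def main(args: Array[String]): Unit = {
    val workspace = HeadlessWorkspace.newInstance
    try {
      workspace.open("/path/to/your-model.nlogo")   // placeholder path
      workspace.command("setup")
      workspace.command("repeat 1000 [ go ]")       // or loop until your own stop condition
      println(workspace.report("count turtles"))    // any reporter whose value you want to log
    } finally {
      workspace.dispose()
    }
  }
}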

Can I create a simple Talend Job using its Java Source Code?

Can we generate a Talend Job from its Java code? For the sake of simplicity, consider a simple Talend Job without any context variables for different environments or other complications.
The Java code for a job is generated by the Studio, but there is no reverse operation that generates a job from its corresponding Java code.
So, the short answer is "No, you can't".
You may consider generating other files, such as job.properties and job.item, which contain the configuration for each component used by a job, but I'm afraid it would be very hazardous and, at the very least, a very long trip to try to do that.

Run a single job in parallel

I need to know how we can run a single job in parallel with different parameters in Talend.
The answer is straightforward, but it rather depends on what you want, and whether you are using the free or commercial version of Talend.
As far as parameters go, make sure that your jobs are using context variables - this is the preferred way of passing parameters in.
As for running in parallel, there are a few options.
Talend's Studio is a Java code generator, so you can export your job (it's just Java code) and run it wherever you want. How you invoke it is up to you - schedule it, invoke it N times manually, your call (see the sketch at the end of this answer). Obviously, if your job touches shared resources then making it safe to run in parallel is up to you - the usual concurrency issues apply.
If you have the commercial product, then you can use the Talend Administration Center (TAC). The TAC allows you to schedule a job more than once with different contexts. Or, if you want to keep the parallelization logic inside your job, consider using the tParallelize component in one job to run another job N times.
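For the export route, the launcher script that Talend generates accepts context parameters on the command line, so running the same job in parallel with different parameters can be as simple as launching it several times. A rough Scala sketch - the script path, the "region" context variable and its values are placeholders:

import scala.sys.process._

object ParallelRuns {
  def main(args: Array[String]): Unit = {
    val regions = Seq("EU", "US", "APAC") // hypothetical parameter values

    // One thread per run; each run passes a different value for the "region" context variable.
    val threads = regions.map { region =>
      new Thread(() => {
        val exitCode = Seq("./my_job/my_job_run.sh", "--context_param", s"region=$region").!
        println(s"run for $region finished with exit code $exitCode")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}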