How to validate a ksql script? - apache-kafka

I would like to know if there is a way to check whether a .ksql script is syntactically correct.
I know that you can send a POST request to the server; this, however, would also execute the contained ksql commands. I would love to have some kind of endpoint where you can pass your statement and it returns either an error code or an OK, like:
curl -XPOST <ksqldb-server>/ksql/validate -d '{"ksql": "<ksql-statement>"}'
My question aims at a way to check the syntax in an automated fashion, without needing to clean everything up afterwards.
Thanks for your help!
Note: I am also well aware that I could run everything separately using, e.g., a docker-compose file and tear everything down again. This, however, is quite resource heavy and harder to maintain.

One option could be to use the ksql test runner (see here: https://docs.ksqldb.io/en/latest/how-to-guides/test-an-app/) and look at the errors to check whether the statement is valid. Let me know if it works for your scenario.
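If you want to automate that check, a small wrapper around the test runner might look like the sketch below. This is only a minimal sketch: it assumes ksql-test-runner is on the PATH and accepts the -s/-i/-o flags described in the linked guide, and the file names are placeholders for your own test fixtures.

import subprocess

def ksql_script_is_valid(statements_file, input_file, output_file):
    """Run ksql-test-runner and report whether the statements were accepted."""
    result = subprocess.run(
        ["ksql-test-runner", "-s", statements_file, "-i", input_file, "-o", output_file],
        capture_output=True,
        text=True,
    )
    # Treat a non-zero exit code or error output as "invalid"; adjust this check
    # to whatever your ksqldb version actually prints.
    if result.returncode != 0 or "Error" in result.stderr:
        print(result.stderr)
        return False
    return True

print(ksql_script_is_valid("statements.sql", "input.json", "output.json"))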

By now I've found a way to test for my use case. I already had a ksqldb cluster in place with all the other systems needed for the Kafka ecosystem (Zookeeper, Broker, ...). I had to compromise a little bit and go through the effort of deploying everything, but here is my approach:
1. Use proper naming (let it be prefixed with test or whatever suits your use case) for your streams, tables, etc. The queries' sink property should include the prefixed topic so that it can be found easily, or you simply assign a QUERY_ID (https://github.com/confluentinc/ksql/pull/6553).
2. Deploy the streams, tables, etc. to your ksqldb server using its REST API. Since I was programming in Python, I made use of the ksql package from pip (https://github.com/bryanyang0528/ksql-python).
3. Clean up the ksqldb server by filtering for the naming that you assigned to the ksql resources and running the corresponding DROP or TERMINATE statements. Consider also that you will have dependencies that result in multiple streams/tables reusing a topic. The statements can be looked up in the official developer guide (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/quick-reference/).
If you had errors in step 2, step 3 should have cleaned up the leftovers, so you can adjust your ksql scripts until they run through smoothly.
Note: I could not make any assumptions about what the streams, etc. look like. If you can, I would prefer the ksql-test-runner.
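For reference, here is a rough Python sketch of steps 2 and 3, using plain requests against the server's /ksql REST endpoint instead of the ksql-python package. The server URL, the TEST_ prefix and the example statement are assumptions, and the exact shape of the SHOW STREAMS response may differ between ksqldb versions.

import requests

KSQL_URL = "http://localhost:8088/ksql"  # assumed server address
PREFIX = "TEST_"                         # naming convention from step 1

def run_ksql(statement):
    """POST a single statement to the ksqldb server and return the parsed JSON response."""
    resp = requests.post(KSQL_URL, json={"ksql": statement, "streamsProperties": {}})
    resp.raise_for_status()  # an invalid statement typically comes back as an HTTP error
    return resp.json()

def cleanup_test_streams():
    """Step 3: drop every stream whose name carries the test prefix."""
    listing = run_ksql("SHOW STREAMS;")
    for entry in listing[0].get("streams", []):
        if entry["name"].startswith(PREFIX):
            run_ksql(f"DROP STREAM IF EXISTS {entry['name']} DELETE TOPIC;")

# Step 2: deploy the statements under test (example statement only).
run_ksql("CREATE STREAM TEST_pageviews (id INT) "
         "WITH (kafka_topic='TEST_pageviews', value_format='JSON', partitions=1);")
# Step 3: clean up the prefixed resources again.
cleanup_test_streams()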

Related

SAS Viya - Environment Manager: Job triggers

I am currently looking into SAS Viya 3.4 to replace SAS 9.4.
Now I was curious to see the possibilities of the Environment Manager for scheduling jobs and maintaining and creating job flows. However, I noticed that I could only drag and drop jobs into a flow and connect them, with very few configurable options. Also, as a trigger to start a job flow I was only able to select a time event. I am wondering if there are other trigger types to choose from, e.g. a job being triggered if a specific table exists or a file exists [or ...]. Neither did I see the possibility to trigger/start a job based on the return code of the previous job.
Also it does not seem to be smart enough to make sure two jobs don't access a library with write access at the same time.
I can't see how SAS Viya could replace a Job Orchestration Tool. However, I feel like the tool was built to replace such an Orchestration Tool. Did I miss something or is it just not possible to do so with the Environment Manager in SAS Viya?
Any help/insights are highly appreciated. I already searched through the documentation but could not find anything. Maybe I was just looking in the wrong place?
Why 3.4 and not 3.5 (or Viya 4)?
If you want to use Viya with your own Job Orchestration software you can consider this tool (built by my team): https://cli.sasjs.io/job/
We deployed it on Jenkins for this customer: https://www.sas.com/en_us/news/press-releases/2021/july/sas-partnership-with-lloyds-list-intelligence.html

Datastage - Loop through the file to read email ID and send email

I have to read an input file to get the email IDs of employees and send each employee an email.
How can I do this using a Datastage job?
The file looks like this:
PERSON_ID|FName|LName|Email_ID
DataStage itself offers a Notification Stage, which is only available at the Sequence level.
As your information is in the data stream of a job, you could use a Wrapped Stage in order to send the mail from within the job.
A Wrapped Stage allows you to call an OS command for each row in your stream. sendmail etc. could be used to send the mails as you wish.
I have implemented this recently. The Wrapped Stage is tricky, so I would recommend using it in a very simple way: use it to call bash (or any other shell), prepare the mail command upfront, and simply send it to that stage.
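Not DataStage code, but to make the "prepare the mail command upfront" idea concrete, here is a hedged Python sketch of the kind of per-row command you would build in the Transformer and hand to the wrapped shell. The file name, subject, body and the use of mailx are assumptions, and mail has to be configured on the host for any of this to work.

import csv
import subprocess

with open("employees.txt", newline="") as f:
    # Input header: PERSON_ID|FName|LName|Email_ID
    reader = csv.DictReader(f, delimiter="|")
    for row in reader:
        body = f"Hello {row['FName']} {row['LName']}, this is your notification."
        # This string is the equivalent of the prepared command handed to the wrapped bash stage.
        cmd = f'echo "{body}" | mailx -s "Notification" {row["Email_ID"]}'
        subprocess.run(["bash", "-c", cmd], check=False)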
There are some more options.
First is using the Wrapped Stage, as Michael mentioned. Another method is writing a Parallel Routine to use in an ordinary parallel Transformer, which is quite similar.
The simplest way of sending an email per row that I know of is using a server routine in a transformer.
The drawback is that server routines are deprecated and we're not yet sure how well they can be migrated to future versions of DataStage (CP4D). This should be considered when going this route.
In each project you should have a folder Routines/Built-In/Utilities containing the server routines DSSendMailAttachmentTester and DSSendMailTester. These are originally meant to be used in the Routine Editor just for testing whether the backend is actually able to send mail.
But you can use them in a Transformer as well, as long as it's a BASIC Transformer. That means you can either write a server job using all the old-school stuff (which is probably not what you want), or you can use the BASIC Transformer in a parallel job. (Follow the link on how to enable it.) It gives access to BASIC transforms and functions.
I suggest copying the mentioned server routines to make your own custom one and modifying it to your needs.

How to properly structure Spark code in Databricks?

I have a relatively big project in Azure Databricks that will soon go to production. The code is currently organized in a few folders in a repository and the tasks are triggered using ADF and job clusters executing notebooks one after another.
The notebooks have some hardcoded values like input path, output path etc.
I don't think it is the best approach.
I would like to get rid of hardcoded values and rely on some environment variables/environment file/environment class or something like that.
I was thinking about creating a few classes that would have methods with individual transformations, with the save operations kept outside of the transformations.
Can you give me some tips? How do I reference one scala script from another in Databricks? Should I create a JAR?
Or can you refer me to some documentation/good public repositories where I can see how it should be done?
It's hard to write a comprehensive guide on how to go to prod, but here are some things I wish I knew earlier.
When going to production:
Try to migrate to jar jobs once you have a well established flow.
Notebooks are for exploratory tasks and not recommended for long running jobs.
You can pass params to your main, read environment vars, or read the Spark config. It's up to you how to pass the config (see the sketch after this list).
Choose New Job Cluster and avoid All Purpose Cluster.
In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
The pricing is different for New Job Cluster. I would say it ends up cheaper.
Here is how to deal with secrets
... and a few other off-topic ideas:
I would recommend taking a look into CI/CD Jenkins recipes
Automate deployments with the Databricks cli
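As a rough illustration of the "pass params to your main / read environment vars / read the Spark config" point above, a Python job entry point might look like the sketch below. The argument names, the DEPLOY_ENV variable and the spark.myapp.basePath key are made up for the example; the same pattern applies to a Scala jar main.

import argparse
import os
from pyspark.sql import SparkSession

def main():
    parser = argparse.ArgumentParser()
    # Parameters passed by the job definition (e.g. as task arguments).
    parser.add_argument("--env", default=os.environ.get("DEPLOY_ENV", "dev"))
    parser.add_argument("--input-path", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.getOrCreate()
    # Alternatively, read a value injected via the cluster's Spark config:
    # base_path = spark.conf.get("spark.myapp.basePath", "/mnt/dev")
    df = spark.read.parquet(args.input_path)
    print(f"env={args.env}, rows={df.count()}")

if __name__ == "__main__":
    main()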
If you're using notebooks for your code, then it's better to split the code into the following pieces:
Notebooks with "library functions" ("library notebooks") - they only define functions that transform data. These functions usually just receive a DataFrame plus some parameters, perform the transformation(s), and return a new DataFrame. They shouldn't read/write data, or at least shouldn't have hardcoded paths (see the sketch after this list).
Notebooks that are the entry point of jobs (let's call them "main") - they may receive some parameters via widgets, for example, you can pass the environment name (prod/dev/staging), file paths, etc. These "main" notebooks may include "library notebooks" using %run with relative paths, like %run ./Library1 or %run folder/Library2 (see doc).
Notebooks that are used for testing - they also include "library notebooks", but add code that calls the functions & checks the results. Usually you need specialized libraries, like spark-testing-base (Scala & Python), chispa (Python only), spark-fast-tests (Scala only), etc., to compare the content of the DataFrames, schemas, etc. (here are examples of using different libraries). These test notebooks can be triggered either as regular jobs or from a CI/CD pipeline. For that you can use the Databricks CLI or the dbx tool (a wrapper around the Databricks CLI). I have a demo of a CI/CD pipeline with notebooks, although it's for Python.
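A minimal Python sketch of the split described in this list, with the library part and the main part shown together; in Databricks the first block would live in its own notebook included via %run, and all names, columns and paths here are illustrative only.

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

# --- "library notebook": pure transformations, no I/O, no hardcoded paths ---
def add_full_name(df: DataFrame) -> DataFrame:
    """Return a new DataFrame with a derived column; reads nothing, writes nothing."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

# --- "main notebook": resolves parameters, reads, transforms, writes ---
def main(env: str, input_path: str, output_path: str) -> None:
    spark = SparkSession.builder.getOrCreate()
    # In a real notebook these parameters would come from widgets, e.g.
    #   input_path = dbutils.widgets.get("input_path")
    df = spark.read.parquet(input_path)
    add_full_name(df).write.mode("overwrite").parquet(output_path)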
For notebooks it's recommended to use the Repos functionality, which allows you to perform version control operations on multiple notebooks at once.
Depending on the size of your code and how often it changes, you can also package it as a library that is attached to a cluster and used from the "main notebooks". In this case it could be a bit easier to test the library functions - you can just use standard tooling, like Maven, SBT, etc.
P.S. You can also reach solutions architect assigned to your account (if there is one), and discuss that topic in more details.

IBM Datastage reports failure code 262148

I realize this is a bad question, but I don't know where else to turn.
Can someone point me to where I can find the list of "reports failure" codes for IBM? I've tried searching for it in the IBM documentation and in a general Google search, but this particular error is unique and I've never seen it before.
I'm trying to find out what code 262148 means.
Background:
I built a datastage job that has:
ORACLE CONNECTOR --> TRANSFORMER -> HIERARCHICAL DATA
The intent is to pull data from an ORACLE table and output the response of the select statement into a JSON file. I'm using the HIERARCHICAL stage to set it up. When tested in the stage, no problems; I see the JSON output.
However, when I run the job, it squawks:
reports failure code 262148
then the job aborts. There are no warnings, no signs, no errors prior to this line.
Until I know what it is, I can't troubleshoot.
If someone can point me to where the list of failure codes is, I can proceed.
Thanks!
can someone point me to where I can find the list of reports failure codes for IBM?
Here you go:
https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rzahb/rzahbsrclist.htm
While this list does not include your specific error code, it does categorize many other codes and explains how the code breakdown works. While it is not specifically for DataStage, in my experience IBM standards are generally consistent across different products. In this list every code that starts with a 2 is a disk failure, so maybe run a disk checker. That's the best I've got as far as error codes go.
Without knowledge of the inner workings of the product, there is not much more you can do beyond checking system health in general (especially disk, network and permissions in this case). Personally, I prefer to go after internal knowledge whenever exterior knowledge proves insufficient. I would start with a network capture, as I'm sure there's a socket involved in the connection between the layers. Compare a capture taken when the select statement is run from the Hierarchical stage with one taken when it is run from the job. There may be clues in there, like reset or refused connections.

Jmeter - Can I change a variable halfway through runtime?

I am fairly new to JMeter so please bear with me.
I need to understand whether, while running a JMeter script, I can change the variable holding the details of "DB1" so that it then points to "DB2".
The reason for this is that I want to throw load at a MongoDB instance and then switch to another DB at a certain time (hotdb/colddb).
The easiest way is just defining 2 MongoDB Source Config elements pointing to separate database instances and giving them 2 different MongoDB Source names.
Then in your script you will be able to manipulate the MongoDB Source parameter value in the MongoDB Script test element or in JSR223 Samplers, so your queries will hit either hotdb or colddb.
See the How to Load Test MongoDB with JMeter article for detailed information.
How about reading the value from a file in a Beanshell/JavaScript sampler each iteration and storing it in a variable, then editing/saving the file when you want to switch? It's ugly but it would work.
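To make that concrete: the JMeter side would read a small file each iteration (e.g. in a JSR223/Beanshell sampler) and put its content into a variable, while an external script flips the file content at the moment you want to switch. Below is a hedged Python sketch of that external switch script only; the path, database names and delay are assumptions.

import time
from pathlib import Path

SWITCH_FILE = Path("/tmp/active_db.txt")

SWITCH_FILE.write_text("hotdb")    # start the test against the hot database
time.sleep(600)                    # let the load run for 10 minutes...
SWITCH_FILE.write_text("colddb")   # ...then point subsequent iterations at the cold database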