Deploy DataFlow job using Scio - scala

I've started developing my first DataFlow job using Scio, the Scala SDK. The dataflow job will run in streaming mode.
Can anyone advise the best way to deploy this? I have read in the Scio docs they use sbt-pack and then deploy this within a Docker container. I have also read about using DataFlow templates (but not in great detail).
What's best?

As with the Java and Python SDKs, you can run your code directly on Dataflow by using the Dataflow runner and launching it from your computer (or from a VM or function).
If you want to package it for reuse, you can create a template.
You can't run a custom container on Dataflow.

Related

Spring Cloud Integration test, embedded kafka vs testcontainers

I have a Spring Cloud Stream application that I need to write an integration test for (specifically, using Cucumber). The application communicates with other services through a Kafka message broker. As far as I know, I could make this work using either a Kafka Testcontainers container or the embedded Kafka provided by Spring. What I don't know is which would be the better solution: is there anything that Testcontainers can do but embedded Kafka can't, or the other way around? (Use cases or examples would be appreciated!)
P.S. This integration test should be able to run in a CI/CD pipeline.
It is called embedded for a reason: it can only be accessed from the process that spawned it. With Testcontainers you can reuse an existing container and access it from another process, but that's probably too exotic.
I guess that with properly configured Testcontainers you can get as close as possible to the production environment you'd deploy your solution to. The embedded Kafka might be limited in some areas, e.g. SSL configuration.

Is it possible to run a cell of a databricks notebook via REST API?

I would like to run a notebook cell automatically via REST API to make the usability of a dev tool we created better. Is that possible in databricks?
Yes, it's possible by using the older API, version 1.2. You need to create an execution context with the /api/1.2/contexts/create API (it requires the cluster ID and the language to use), then submit code with the /api/1.2/commands/execute API, and get the command execution status with the /api/1.2/commands/status API. Please note that you need to keep the same context to execute multiple commands that depend on each other...
You can find an example of such execution, written in Go, in the source code of the Databricks Terraform provider.
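The create/execute/status sequence described above can be sketched in Python as follows. This is a minimal, untested-against-a-live-workspace sketch using only the standard library. The helper names (build_request, call, run_command), the bearer-token authentication, the JSON field names (clusterId, contextId, commandId, language, command) and the terminal status values are my assumptions based on my reading of the 1.2 API, so check them against the Databricks documentation before relying on them.

```python
import json
import time
import urllib.parse
import urllib.request


def build_request(host, token, path, payload=None, params=None):
    """Build (but do not send) a request for an /api/1.2/ endpoint.

    A JSON payload makes it a POST; otherwise it is a GET with query params.
    """
    url = f"{host.rstrip('/')}/api/1.2/{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, method="POST" if data else "GET")
    req.add_header("Authorization", f"Bearer {token}")  # assumed auth scheme
    if data:
        req.add_header("Content-Type", "application/json")
    return req


def call(req):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_command(host, token, cluster_id, code, language="python", poll_seconds=2.0):
    """Create a context, execute one command in it, and poll until it ends."""
    # 1. Execution context: needs the cluster ID and the language.
    ctx = call(build_request(host, token, "contexts/create",
                             {"clusterId": cluster_id, "language": language}))
    context_id = ctx["id"]

    # 2. Submit the code. Reuse context_id for later commands that depend
    #    on state created by this one.
    cmd = call(build_request(host, token, "commands/execute",
                             {"clusterId": cluster_id, "contextId": context_id,
                              "language": language, "command": code}))
    command_id = cmd["id"]

    # 3. Poll the status endpoint until a terminal state is reported
    #    (state names assumed from the API documentation).
    while True:
        status = call(build_request(host, token, "commands/status",
                                    params={"clusterId": cluster_id,
                                            "contextId": context_id,
                                            "commandId": command_id}))
        if status.get("status") in ("Finished", "Error", "Cancelled"):
            return status
        time.sleep(poll_seconds)
```

A call would look like run_command("https://<workspace-host>", "<token>", "<cluster-id>", "print(1 + 1)"), where all three placeholders are values you supply. To run several dependent cells, you would split run_command so that the context is created once and reused.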

Spring Task in Spring Cloud Dataflow on PCF can't find java

I have a Spring Cloud Task fat jar that I have successfully deployed to SCDF running on PCF. I have created a definition for it and can therefore run it from the dashboard. FWIW, it reads from and writes to a database using Spring JDBC.
I'm now trying to set it up to run on a schedule and am having issues. I created a stream with a triggertask source and a task-launcher-local sink, and have configured the triggertask URI to point to the fat jar (via HTTP, using a staticfile app pushed to PCF).
The dashboard shows the two PCF apps (one for triggertask, one for task-launcher-local) both starting successfully, and it all runs, but the task fails every time with the error:
Caused by: java.io.IOException: Cannot run program "java" (in directory "/home/vcap/tmp/spring-cloud-dataflow-5903184636016162160/Task--582903409-1502669137014/Task--582903409"): error=2, No such file or directory
From what I can tell, the PCF app running the stream tries to fork and exec a java call, but since java is not on the path in PCF app containers, I get the error.
Am I right? Either way, how can I get the Spring Cloud Task (jar) to run successfully?
Spring Cloud Data Flow Server: 1.2.3 (using built spring-cloud-dataflow-server-cloudfoundry-1.2.3.BUILD-SNAPSHOT.jar)
Spring Cloud Data Flow Shell: 1.2.3 (using downloaded spring-cloud-dataflow-shell-1.2.3.RELEASE.jar)
Deployment Environment: PCF v1.11.6 (on Azure), pcf dev v0.26.0 (on mac)
App Starters: http://bit-dot-ly/1-0-4-GA-stream-applications-rabbit-maven
Logs: link to log
The stream definition is missing from the post. It is possible that you're using the tasklauncher-local sink, which is compatible only with SCDF's local server and will fail with the attached error when running in CF. Please make sure you're using the tasklauncher-cloudfoundry sink. This application was added in the latest release of app-starters.
As pointed out in the previous SO thread, it is highly recommended that you use the latest release of app-starters (1.0.4 is at least 10 months old). The latest releases can be found at the project site.

Can Eclipse/IntelliJ Idea be used to execute code on the cluster

Production system : HDP-2.5.0.0 using Ambari 2.4.0.1
Plenty of requests are coming in to execute a range of code (Java MR etc., Scala, Spark, R) on top of HDP, but from an IDE on a Windows desktop machine.
For Spark and R, we have R-Studio set up.
The challenge lies with Java, Scala and so on. Also, people use a range of IDEs, from Eclipse to IntelliJ Idea.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and has plenty of bugs when working with the latest versions of Hadoop. For IntelliJ Idea, I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse etc., but I am skeptical about executing MR or other custom Java/Scala code.
I referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ Idea have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction, and it didn't work as expected 2 years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; it's about connecting to the prod cluster directly.
I don't think there is a general solution that would work equally well for all Hadoop services. Each has its own development, testing and deployment scenarios, as they are different standalone products. For MR, you can use MRUnit to simulate your job locally from the IDE. Another option is LocalJobRunner. Both let you check your MR logic directly from the IDE. For Storm, you can use the backtype.storm.Testing library to simulate a topology's workflow. But all of these are used from the IDE without direct communication with the cluster, unlike the Spark and RStudio integration.
As for the MR recommendation, your job should ideally go through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with some custom counters, which help you locate malformed data and maintain the job properly.

How to create a Spark Streaming jar that would work in AWS EMR?

I've been developing a Spark Streaming application with Eclipse, and I'm using sbt to run it locally.
Now I want to deploy the application on AWS as a jar, but when I use sbt's package command it creates a jar without the dependencies, so when I upload it to AWS it doesn't work because Scala is missing.
Is there a way to create an uber-jar with sbt? Am I doing something wrong with the deployment of Spark on AWS?
To create an uber-jar with sbt, use the sbt-assembly plugin. For more details about creating an uber-jar with sbt-assembly, refer to the blog post.
After creating it, you can run the assembly jar with the java -jar command.
But from Spark 1.0.0 onwards, the spark-submit script in Spark's bin directory is used to launch applications on a cluster. For more details, refer here.
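As a rough illustration of launching the assembly jar through spark-submit, here is a small Python sketch that assembles the command line and, optionally, runs it. The jar path, main class name and the choice of yarn/cluster mode are hypothetical values chosen for illustration; only the --class, --master and --deploy-mode flags are standard spark-submit options.

```python
import subprocess


def spark_submit_command(jar, main_class, master="yarn",
                         deploy_mode="cluster", app_args=()):
    """Return the argv list for launching an assembly jar with spark-submit."""
    return [
        "spark-submit",
        "--class", main_class,        # entry point inside the uber-jar
        "--master", master,           # e.g. "yarn" on a YARN-managed cluster
        "--deploy-mode", deploy_mode,
        jar,                          # the jar produced by sbt-assembly
        *app_args,                    # arguments passed to the application
    ]


def submit(cmd):
    """Run the command; requires spark-submit to be on the PATH."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical jar path and class name, for illustration only.
    cmd = spark_submit_command(
        "target/scala-2.11/my-streaming-app-assembly-0.1.jar",
        "com.example.StreamingApp",
    )
    print(" ".join(cmd))
```

When building the uber-jar for this purpose, the Spark dependencies themselves are usually marked as provided, since the cluster supplies them at runtime.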
You should really be following Running Spark on EC2, which reads:
The spark-ec2 script, located in Spark’s ec2 directory, allows you to
launch, manage and shut down Spark clusters on Amazon EC2. It
automatically sets up Spark, Shark and HDFS on the cluster for you.
This guide describes how to use spark-ec2 to launch clusters, how to
run jobs on them, and how to shut them down. It assumes you’ve already
signed up for an EC2 account on the Amazon Web Services site.
I've only partially followed the document so I can't comment on how well it's written.
Moreover, according to the Shipping Code to the Cluster chapter in the other document:
The recommended way to ship your code to the cluster is to pass it
through SparkContext’s constructor, which takes a list of JAR files
(Java/Scala) or .egg and .zip libraries (Python) to disseminate to
worker nodes. You can also dynamically add new files to be sent to
executors with SparkContext.addJar and addFile.