I have a Play2 app which serves an API.
In that same app, I added code that runs an ETL process based on Alpakka. Under the ETL hood there is a Future[Done] that never completes. Currently I trigger the ETL process with a web request to a separate route.
To deploy my ETL service, I want to be able to run my Play2 app with a special command that ideally does not open a server, but just runs that single action of the ETL controller. If that's not achievable and a server has to be open, I'd like to trigger the ETL process but isolate my ETL box from incoming web connections. All of this feels very hackish, and there is probably a better way.
The way I ended up doing it is with scheduled tasks.
A task is scheduled once, and only if a config variable is set:
class ProtonConnector @Inject()(
  config: TypeSafeConfig
)(
  implicit actorSystem: ActorSystem
) {
  // Run the stream
  def listen: Future[Done] =
    itemsProtonSource.via(mapFlow).via(solrIndexFlow).runWith(Sink.ignore)

  // Schedule ETL execution shortly after startup
  if (config.scheduleProtonEtlTask) {
    actorSystem.scheduler.scheduleOnce(delay = 1.seconds) {
      logger.debug("Executing proton ETL task...")
      val _ = listen
    }
  }
}
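For the scheduling block in the constructor to fire at application start, the connector has to be instantiated eagerly. A minimal sketch of how that wiring might look with Play's default Guice DI (the module name is just an illustration):

import com.google.inject.AbstractModule

class EtlModule extends AbstractModule {
  // Eagerly instantiate ProtonConnector so its constructor (and the
  // scheduleOnce call inside it) runs as soon as the application starts.
  override def configure(): Unit =
    bind(classOf[ProtonConnector]).asEagerSingleton()
}

The module is then enabled via play.modules.enabled in application.conf.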
Then we pick up config.scheduleProtonEtlTask from an environment variable (which is set to true, so the code block inside the if executes).
Then, in our CI/CD pipeline, we define a new deployment where this env variable is supplied by the task configuration. That deployment can also be isolated from the outside world; that way, Play2 still spins up a server, but it is inaccessible from outside the local network.
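For completeness, a hedged sketch of how the flag can be wired from the environment variable, assuming plain Typesafe/HOCON config; the key and variable names here are made up for illustration:

// In application.conf:
//   etl.scheduleProtonEtlTask = false
//   etl.scheduleProtonEtlTask = ${?SCHEDULE_PROTON_ETL_TASK}

// Reading it with the raw Typesafe Config API:
import com.typesafe.config.Config

def scheduleProtonEtlTask(config: Config): Boolean =
  config.getBoolean("etl.scheduleProtonEtlTask")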
Related
I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note that I'm using Scala, but this is very close to the equivalent Java code:
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("/path/to/jar")
      .withArgs(
        "spark-submit",
        "etc..."
      )
  )

// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
  new RunJobFlowRequest()
    .withName(clusterName)
    .withReleaseLabel(Configs.EMR_RELEASE_LABEL)
    .withSteps(enableDebugging, runSparkJob)
    .withApplications(new Application().withName("Spark"))
    .withLogUri(Configs.LOG_URI_PREFIX)
    .withServiceRole(Configs.SERVICE_ROLE)
    .withJobFlowRole(Configs.JOB_FLOW_ROLE)
    .withInstances(
      new JobFlowInstancesConfig()
        .withEc2SubnetId(Configs.SUBNET)
        .withInstanceCount(Configs.INSTANCE_COUNT)
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
        .withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
    )

val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately after submitting the request. Is there any way I can make it block until the cluster is shut down, or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There are waiters that will block execution until the cluster (i.e. job flow) gets to a certain state.
val newCluster = emr.runJobFlow(createClusterRequest)

val describeRequest = new DescribeClusterRequest()
  .withClusterId(newCluster.getClusterId())

// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation; you can get state and status information about the cluster to determine whether it succeeded or failed.
val result = emr.describeCluster(describeRequest)
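For the VALIDATION_ERROR case in particular, the status carries a state-change reason. A hedged sketch (based on the SDK documentation, untested) of pulling it out:

// DescribeClusterResult -> Cluster -> ClusterStatus -> ClusterStateChangeReason
val status = result.getCluster().getStatus()
val reason = status.getStateChangeReason()
println(s"State: ${status.getState}, code: ${reason.getCode}, message: ${reason.getMessage}")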
Note: I'm not the best Java-er, so the above is my best guess at how it would work based on the documentation, but I have not tested it.
Background:
Our project is built on the Play Framework.
Front-end language: JavaScript
Back-end language: Scala
We are developing a web application; the server is a cluster.
What we want to achieve:
In the web UI, the user first inputs some query parameters and clicks a button such as "submit". These parameters are then sent to the backend. (This is easy, obviously.)
When the backend gets the parameters, it starts reading and processing the data stored in HDFS. The data processing includes cleaning, filtering, and other operations such as clustering algorithms, not just a Spark SQL query. All of these operations need to run on the Spark cluster.
We don't want to manually pack a fat jar, submit it to the cluster, and ship the result to the front-end ourselves. (This is what's bothering me!)
What we have done:
We built a Spark project separately in IDEA. When we get the parameters, we manually assign them to variables in the Spark project.
Then we use "Build Artifacts" -> "Build" to get a fat jar.
Then we submit it in one of two ways:
"spark-submit --class main.scala.Test --master yarn /path.jar"
running the Scala code directly in IDEA in local mode (if we change to YARN, it throws exceptions).
When the program finishes, we get the processed data and store it.
Then we read the processed data's path and pass it to the front-end.
None of this is interactively submitted by the user. Very clumsy!
So, as a user, I want to conveniently query or process data on the cluster and get feedback on the front-end.
What should I do?
Which tools or libraries could we use?
Thanks!
Here are multiple ways to submit a Spark job:
using the spark-submit command in a terminal.
using Spark's built-in REST API.
providing a REST API yourself inside your program and setting that API as the Main-Class when running the jar on your Spark cluster's master. By doing so, your API should dispatch incoming job-submission requests to the action you want. In your API you should instantiate the class where your SparkContext is instantiated; this is the equivalent of the spark-submit action. When the REST API receives a job-submission request and does the above, you can see the job's progress on the master web UI, and after the job terminates your API is still up and waiting for the next request.
The third approach is what I have used myself to run different types of algorithms in a web crawler; a minimal sketch of the idea follows.
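This sketch uses the JDK's built-in HttpServer so it stays dependency-free; the port, path, and parameter name are made up for illustration, and the "pipeline" is reduced to a trivial count:

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession

object JobServerMain {
  def main(args: Array[String]): Unit = {
    // One long-lived SparkSession shared by all requests
    val spark = SparkSession.builder().appName("on-demand-jobs").getOrCreate()

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/run", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // e.g. /run?path=hdfs://...  (query parsing kept trivial for the sketch)
        val path = Option(exchange.getRequestURI.getQuery)
          .map(_.stripPrefix("path=")).getOrElse("hdfs:///data/input")
        val count = spark.read.textFile(path).count()   // the "pipeline"
        val body  = s"""{"count":$count}""".getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length.toLong)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()   // the driver stays alive between requests
  }
}

You submit this object once with spark-submit (it is the Main-Class of the jar), and from then on the web front-end talks to it over HTTP instead of packaging a new jar per query.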
So generally you have two approaches:
Create Spark application that will also be a web service
Create Spark application that will be called by a web service
The first approach (the Spark app is also a web service) is not a good approach, because for as long as your web service is running you will also be using resources on the cluster (except if you run Spark on Mesos with a specific configuration); read more about cluster managers in the Spark documentation.
The second approach, with the service and the Spark app separated, is better. In this approach you can create one or multiple Spark applications that are launched by calling spark-submit from the web service. There are also two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The results of the queries in this approach could simply be saved to a file, sent to the web server over the network, or passed using any other inter-process communication approach.
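As an illustration of the second approach, here is a hedged sketch using SparkLauncher (from Spark's spark-launcher module) to fire a separate Spark application from the web service; the jar path, main class, and master URL are placeholders:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/query-job-assembly.jar")   // fat jar with the Spark app
  .setMainClass("com.example.QueryJob")                // hypothetical main class
  .setMaster("spark://spark-master:7077")              // or "yarn"
  .addAppArgs("--query", "user supplied parameters")   // what the web UI sent
  .startApplication(new SparkAppHandle.Listener {
    // Invoked whenever the application changes state (SUBMITTED, RUNNING, FINISHED, FAILED, ...)
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"Spark app state: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })

The web service can then poll handle.getState (or rely on the listener) to learn when the job has finished, and read the result from wherever the job wrote it.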
What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing time to, let's say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing, and saves the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?
You could try spark-jobserver. Upload your spark.jar to the server, and from your application you can call the job in your spark.jar using the REST interface. To know whether your job has completed, you can keep polling the REST interface. When your job completes, if the result is very small you could get it from the REST interface itself; but if the result is huge, it is better to save it to some DB.
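A hedged sketch of what the submit-and-poll loop could look like from Scala, using the JDK 11 HTTP client; the host, app name, and class path are placeholders, and the endpoint paths should be double-checked against the spark-jobserver documentation for your version:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val client = HttpClient.newHttpClient()
val base   = "http://jobserver-host:8090"   // placeholder host and port

// Submit a job against a jar previously uploaded under the app name "etl"
val submit = HttpRequest.newBuilder()
  .uri(URI.create(s"$base/jobs?appName=etl&classPath=com.example.EtlJob"))
  .POST(HttpRequest.BodyPublishers.ofString("input.param = value"))   // job config
  .build()
println(client.send(submit, HttpResponse.BodyHandlers.ofString()).body())   // contains the jobId

// Poll the job status until it reports FINISHED or ERROR
def jobStatus(jobId: String): String = {
  val req = HttpRequest.newBuilder().uri(URI.create(s"$base/jobs/$jobId")).GET().build()
  client.send(req, HttpResponse.BodyHandlers.ofString()).body()
}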
We want to write a Service Worker that performs source code transformation on the loaded files. In order to test this functionality, we use Karma.
Our tests import source files, on which the source code transformation is performed. The tests only succeed if the Service Worker performs the transformation and fail when the Service Worker is not active.
Locally, we can start Karma with singleRun: false and watch for changed files to restart the tests. However, Service Workers are not active for the page that originally loaded them. Therefore, every test case succeeds but the first one.
However, for continuous integration we need a single-run mode. So our Service Worker is not active during the test run, and the tests fail accordingly.
Also, two consecutive runs do not solve this issue, as Karma restarts the used browser (so we lose the Service Worker).
So, the question is: how do we make the Service Worker available during the test run?
E.g., by preserving the browser instance used by Karma.
Calling self.clients.claim() within your service worker's activate handler signals to the browser that you'd like your service worker to take control on the initial page load in which the service worker is first registered. You can see an example of this in action in Service Worker Sample: Immediate Control.
I would recommend that in the JavaScript of your controlled page, you wait for the navigator.serviceWorker.ready promise to resolve before running your test code. Once that promise does resolve, you'll know that there's an active service worker controlling your page. The test for the <platinum-sw-register> Polymer element uses this technique.
I have a Linux server with three Play 2 Framework instances on it, and I would like to regularly execute an external Scala script that has access to the whole application environment (models) and that is executed only once at a time.
I would like to call this script from crontab, but I cannot find any documentation on how to do it. I know that we can schedule asynchronous tasks from the Global object, but I want the script to be executed only once across the three Play instances.
Actually, I would like to do the same kind of thing as Ruby on Rails rake tasks, for those who know them.
Create a regular action for this task which will be accessible via HTTP; then you can use e.g. curl in a Unix crontab to call that action, and it will hit the first available instance.
Another possibility is using the Global object to schedule the task with Akka support. In this case, to make sure that only one instance schedules the task, you need to determine somehow which one it should be. If you start all 3 instances with a specified port (always the same per instance), you can read http.port to allow or skip the execution.
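A hedged sketch of the http.port idea, assuming each instance is started with -Dhttp.port=<port> and that the instance on port 9000 is the designated one; the scheduler call and task entry point are illustrative:

import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Only the "chosen" instance schedules the recurring task
// (actorSystem and runExternalScript are assumed to be in scope).
val httpPort = sys.props.getOrElse("http.port", "9000")
if (httpPort == "9000") {
  actorSystem.scheduler.schedule(initialDelay = 1.minute, interval = 1.hour) {
    runExternalScript()
  }
}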
Finally, you can use a database to inform the other instances that the task is already being executed: all 3 instances try to run the Akka scheduler, but before executing the task they check whether it still has its TODO flag set. If it does, the instance sets the TODO flag to false and continues execution; otherwise it just skips the execution this time.
You can also use the filesystem for a similar approach: at the beginning of the execution, create a flag file to inform the other instances that, this time, they can skip the task.
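A hedged sketch of the flag-file variant: the first instance that manages to create the lock file runs the task, the others skip it (the path and task entry point are just examples):

import java.nio.file.{FileAlreadyExistsException, Files, Paths}

val flag = Paths.get("/tmp/play-scheduled-task.lock")
val iRunIt =
  try { Files.createFile(flag); true }   // atomic: fails if the file already exists
  catch { case _: FileAlreadyExistsException => false }

if (iRunIt) {
  try runExternalScript()                // hypothetical entry point for the task
  finally Files.deleteIfExists(flag)     // release the lock for the next run
}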