Using Spark 1.3.1, when a master node is started with ./sbin/start-master.sh, a RESTful webservice is started on that machine (for me port 6066). Is there any documentation on how to interact with that service?
I found this code, but I was not able to find the corresponding Scaladoc let alone some sort of guide.
Here's the JIRA ticket, contains the Design Doc that motived this feature.
The goal is to create a new submission gateway that is stable across Spark versions
Additionally,
It is also not a goal to expose the new gateway as a general mechanism
for users of Spark to submit their applications. The new gateway will
be used strictly internally between Spark submit and the standalone
Master.
Related
I am trying to implement a Rest client for Flink to send jobs via Restful Flink services. And also I want to integrate Flink and Kubernetes natively. I have decided to use “Application Mode” as deployment mode according to Flink documentation .
I have already implemented a job and packaged it as jar. And I have tested it on Standalone Flink. But my aim is to move on Kubernetes and deploy my application in Application mode via Rest API of Flink.
I have already investigated the samples at Flink documentation - Native Kubernetes. But I cannot find a sample for executing same samples via Restful services (esp. how to set --target kubernetes-application/kubernetes-session or other parameters).
In addition to samples, I checked out the Flink sources from GitHub and tried to find some sample implementation or get some clue.
I think the below ones are related with my case.
org.apache.flink.client.program.rest. RestClusterClient
org.apache.flink.kubernetes. KubernetesClusterDescriptorTest. testDeployApplicationCluster
But they are all so complicated for me to understand below points.
For application mode, are there any need to initialize a container to serve Flink Rest services before submitting job? If so, is it JobManager?
For application mode, how can I set the same command line parameters via Rest services?
For session mode, in command line samples, kubernetes-session.sh is executed before job submission to initialize a JobManager container. How sould I do this step via Rest client?
For session mode, how can I set the same command line parameters via Rest services? Although the command line samples send .jar job as parameter, should I upload jar before submitting job?
Could you please provide me some clue/sample to continue my implementation?
Best regards,
Burcu
I suspect that if you study the implementation of the Apache Flink Kubernetes Operator you'll find some clues.
Problem -> I want to deploy JanusGraph as separate service on Kubernetes. which storage backend should i use for cassandra. Is it CQL or cassandrathrift?? Cassandra is running as stateful service on Kubernetes.
Detailed Description-> As per JanusGraph doc, in case of Remote Server Mode, storage backend should be cql.
JanusGraph graph = JanusGraphFactory.build().
set("storage.backend", "cql").
set("storage.hostname", "77.77.77.77").
open();
Even they mentioned that Thrift is deprecated going ahead with Cassandra 2.1 & I am using Cassandra 3.
But in some blog, they have mentioned that rest api call from JanusGraph to Cassandra is possible only through Thrift.
Is Thrift really required? Can't we use CQL as storage backend for rest api call as well?
Yes, you absolutely should use the cql storage backend.
Thrift is deprecated, disabled by default in the current version of Cassandra (version 3), and has been removed from Cassandra version 4.
I would also be interested in reading the blog post you referenced. Are you talking about IBM's Rest API mentioned in their JanusGraph-utils Git repo? That confuses me as well, because I see both Thrift and CQL config happening there. In any case, I would go with the cql settings and give it a shot.
tl;dr;
Avoid Thrift at all costs!
I am investigating options for monitoring our installation in Swisscom's cloud-foundry. My objectives are the following:
monitor performance indicators for deployed application (such as cpu, disk, memory)
monitor performance indicators for services (slow queries, number of queries, ideally also some metrics on hitting quotas)
So far, I understand the options are the following (including some BUTs):
I used a very nice TOP cf-plugin (github)
This works very well. It seems that it registers itself to get the required firehose nozzles and consume data.
That is very useful for tracing / ad-hoc monitoring, but not very good for a serious infrastructure monitoring.
Another way I found is to use firehose-syslog solution.
This can be deployed as an app to (as far as I understand) do the job in similar way, as the TOP cf plugin.
The problem is, that it requires registered client, so it can authenticate with the doppler endpoint. For some reason, the top-cf-plugin does that automatically / in another way.
Last option i am considering is to build the monitoring itself to the App (using a special buildpack)
That can be for example done with Datadog. But it seems to also require a dedicated uaa client to register the Nozzle.
I would like to check, if somebody is (was) on the similar road, has some findings.
Eventually I would like to raise the following questions towards the swisscom community support:
is it possible to register uaac client to be able to ingest events through the firehose nozzle from external service? (this requires admin credentials if I was reading correctly)
is there an alternative way to authenticate with the nozzle (for example using a special user and his authentication token?)
is there any alternative to monitor the CF deployments in Swisscom? Eventually, is there a paper, blogpost or other form of documentation, that would be helpful in this respect (also for other users of AppCloud)?
Since it requires admin permissions, we can not give out UAA clients for the firehose.
However, there are different ways to get metrics in context of a user.
CF API
You can obtain basic metrics of a specific app by polling the CF API:
https://apidocs.cloudfoundry.org/5.0.0/apps/get_detailed_stats_for_a_started_app.html
However, since you have to poll (and for each app), it's not the recommended way.
Metrics in syslog drain
CF allows devs to forward their logs to syslog drains; in more recent versions, CF also sends metrics to this syslog drain (see https://docs.cloudfoundry.org/devguide/deploy-apps/streaming-logs.html#container-metrics).
For example, you could use Swisscom's Elasticsearch service to store these metrics and then analyze it using Kibana.
Metrics using loggregator (firehose)
The firehose allows streaming logs to clients for two types of roles:
Streaming all logs to admins (which requires a UAA client with admin permissions) and streaming app logs and metrics to devs with permissions in the app's space. This is also what the cf logs command uses. cf top also works this way (it enumerates all apps and streams the logs of each app).
However, you will find out that most open source tools that leverage the firehose only work in admin mode, since they're written for the platform operator.
Of course you also have the possibility to monitor your app by instrumenting it (white box approach), for example by configuring Spring actuator in a Spring boot app or by including an agent of your favourite APM vendor (Dynatrace, AppDynamics, ...)
I guess this is the most common approach; we've seen a lot of teams having success by instrumenting their applications. Especially since advanced monitoring anyway requires you to create your own metrics as the firehose provided cpu/memory metrics are not that powerful in a microservice world.
However, option 2. would be worth a try as well, especially since the ELK's stack metric support is getting better and better.
I want to deploy a trained machine learning model as a REST API. The API would take a file and first decompose it into features. The problem is that this step depends on other libraries (e.g., FFTW). The API would then query the model with the features from the previous step.
Theoretically I can spin up a virtual machine in the cloud, install all the dependencies there, and point the end point to that VM. But this won't scale if we have concurrent requests.
Ideally I'd love to put everything in a API gateway and leverage serverless paradigm so I don't have to worry about scalability.
First of all, you need to decompose your model into different steps. From your question I see preprocessing and model inference steps.
Your preprocessing includes dependencies such as a FFTW.
You didn't specify what kind of model do you have, but I assume that it also requires some sort of environment and/or dependencies.
Having said that, what do you need to do is to implement 2 services for each step.
It's better pack them into docker images in order to keep each container isolated and you will be able to easily deploy them.
Scalability on a docker lever could be achieved by deployment into cloud providers and docker orchestration with AWS ECS or Kubernetes.
There is an open-source project hydro-serving that could help you with this task.
In this case you just need to focus on the models themselves. hydro-serving takes care of the infrastructure.
If preprocessing stage is implemented as Python script -- we can deploy it with all deps from requirements.txt in individual containers.
The same is also true for the model -- it has have out-of-box of Tensorflow and Spark models.
Otherwise it's easy to adapt existing mechanism to satisfy your requirements (other language/toolkit)
Then, assuming that you already have a hydro-serving instance somewhere, you upload your steps with hs upload --host $HOST --port $PORT
and compose an application pipeline with your models.
You can access your application via HTTP api, GRPC api or Kafka topic.
It would be great if you'd specify what the files you are trying to send to REST API.
Possibly you will need to encode them somehow, in order to send them through REST API. On the other hand you could just send them as-is via GRPC api.
Disclosure: I'm a developer of hydro-serving
BackGround:
Our project is build on PlayFrameWork.
Front-end language: JavaScript
Back-end language: Scala
we are develope a web application,the server is a cluster.
Want to achieve:
In the web UI, User first input some parameters which about query, and click the button such as "submit".Then these parameters will be sent to backend. (This is easy,obviously)
When backend get parameters, backend start reading and process the data which store in HDFS. Data processing include data-cleaning,filtering and other operations such as clustering algorithms,not just a spark-sql query. All These operations need to run on spark cluster
We needn't manually pack a fat jar and submit it to cluster and send the result to front-end(These are what bothering me!)
What we have done:
We build a spark-project separately in IDEA. When we get parameters, we manually assign these parameters to variables in spark-project.
Then "Build Artifacts"->"Bulid" to get a fat jar.
Then submit by two approaches:
"spark-submit --class main.scala.Test --master yarn /path.jar"
run scala code directly in IDEA on local mode (if change to Yarn, will throw Exceptions).
When program execution finished, we get the processed_data and store it.
Then read the processed_data's path and pass it to front-end.
All are not user interactively submit. Very stupid!
So if I am a user, I want to query or process data on cluster and get feedback on front-end conveniently.
What should i do?
Which tools or lib could use?
Thanks!
Here is multiple ways to submit a spark job:
using spark-submit command on terminal.
using spark built-in rest API. you can click to find out how to use it.
providing a rest API in yourself in your program and set the api as the Main-Class to run the jar on your spark cluster master. By doing so, your api should dispatch the input job submit requests to the certain action you want. At your api you should instantiate the class where your SparkContext is instantiated. This action is the equivalent of the spark-submit action. It means that when rest api receives the job submission request and do as mentioned above you can see the job progression on the master web ui and then your job termination your api is up and waits for your next request.
**The 3rd solution is my own experience to run different types of algorithms in a web crawler. **
So generally you have two approaches:
Create Spark application that will also be a web service
Create Spark application that will be called by a web service
First approach - spark app is a web service, is not good approach, because for as long as your web service will be running you will also use resources on a cluster (except if you run spark on mesos with specific configuration) - read more about cluster managers here.
Second approach - service and spark app separated is better. In this approach you can create one or multiple spark applications that will be launched by calling spark submit from web service. There are also two options - create single spark app that will be called with parameters that will specify what to do, or create one spark app for one query. The result of the queries in this approach could be just saved to a file or sent to a web server via network or any using any other inter process communication approach.