How to get YARN "Memory Total" and "VCores Total" metrics programmatically in pyspark - pyspark

I've been looking through this:
https://docs.actian.com/vectorhadoop/5.0/index.html#page/User/YARN_Configuration_Settings.htm
but none of those configs are what I need.
"yarn.nodemanager.resource.memory-mb" looked promising, but it seems to apply only to a single NodeManager, so it returns the master node's memory and CPU, not the cluster totals.
int(hl.spark_context()._jsc.hadoopConfiguration().get('yarn.nodemanager.resource.memory-mb'))

You can access those metrics from the YARN ResourceManager REST API (Cluster Metrics endpoint).
URL: http://rm-http-address:port/ws/v1/cluster/metrics
metrics:
totalMB
totalVirtualCores
Example response (can also be XML):
{ "clusterMetrics": {
"appsSubmitted":0,
"appsCompleted":0,
"appsPending":0,
"appsRunning":0,
"appsFailed":0,
"appsKilled":0,
"reservedMB":0,
"availableMB":17408,
"allocatedMB":0,
"reservedVirtualCores":0,
"availableVirtualCores":7,
"allocatedVirtualCores":1,
"containersAllocated":0,
"containersReserved":0,
"containersPending":0,
"totalMB":17408,
"totalVirtualCores":8,
"totalNodes":1,
"lostNodes":0,
"unhealthyNodes":0,
"decommissioningNodes":0,
"decommissionedNodes":0,
"rebootedNodes":0,
"activeNodes":1,
"shutdownNodes":0 } }
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Metrics_API
All you need is to figure out your ResourceManager's web address and port. Check your configuration files; I can't help you with that since I don't know how your YARN is managed (a sketch for discovering it from the Hadoop configuration follows at the end of this answer).
When you have the URL, access it with Python:
import requests
url = 'http://rm-http-address:port/ws/v1/cluster/metrics'
response = requests.get(url)
# Parse the response JSON and pull out the relevant metrics
metrics = response.json()['clusterMetrics']
total_mb = metrics['totalMB']
total_vcores = metrics['totalVirtualCores']
Of course, no Hadoop or Spark context is needed in this solution.
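If you do still have a SparkContext handy, you can usually discover the ResourceManager web address from the Hadoop configuration instead of hard-coding it. This is only a minimal sketch, assuming the standard yarn.resourcemanager.webapp.address property is populated in your cluster's yarn-site.xml and plain HTTP is enabled (adjust for HTTPS or an HA ResourceManager setup):
import requests
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# host:port of the ResourceManager web UI / REST API (assumed to be set on the cluster)
rm_address = sc._jsc.hadoopConfiguration().get('yarn.resourcemanager.webapp.address')
cluster = requests.get('http://%s/ws/v1/cluster/metrics' % rm_address).json()['clusterMetrics']
print(cluster['totalMB'], cluster['totalVirtualCores'])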

Related

Writing log to gcloud Vertex AI Endpoint using gcloud client fails with google.api_core.exceptions.MethodNotImplemented: 501

I'm trying to use the Google logging client library to write logs into gcloud; specifically, I'm interested in writing logs that will be attached to a managed resource, in this case a Vertex AI endpoint.
Code sample:
import json
import logging

from google.api_core.client_options import ClientOptions
import google.cloud.logging_v2 as logging_v2
from google.cloud.logging_v2.resource import Resource  # for the Resource(...) below
from google.oauth2 import service_account


def init_module_logger(module_name: str) -> logging.Logger:
    module_logger = logging.getLogger(module_name)
    module_logger.setLevel(settings.LOG_LEVEL)
    credentials = service_account.Credentials.from_service_account_info(json.loads(SA_KEY_JSON))
    client = logging_v2.client.Client(
        credentials=credentials,
        client_options=ClientOptions(api_endpoint="us-east1-aiplatform.googleapis.com"),
    )
    handler = client.get_default_handler(
        resource=Resource(
            type="aiplatform.googleapis.com/Endpoint",
            labels={"endpoint_id": "ENDPOINT_NUMBER_ID",
                    "location": "us-east1"},
        )
    )
    # Assume we have the formatter
    handler.setFormatter(ENRICHED_FORMATTER)
    module_logger.addHandler(handler)
    return module_logger


logger = init_module_logger(__name__)
logger.info("This Fails with 501")
And I am getting:
google.api_core.exceptions.MethodNotImplemented: 501 The GRPC target
is not implemented on the server, host:
us-east1-aiplatform.googleapis.com, method:
/google.logging.v2.LoggingServiceV2/WriteLogEntries. Sent all pending
logs.
I thought we needed to enable the API, but was told it's enabled and that we have the https://www.googleapis.com/auth/logging.write scope.
What could be causing the error?
As mentioned by @DazWilkin in the comment, the error occurs because the API endpoint us-east1-aiplatform.googleapis.com does not have a method called WriteLogEntries.
That endpoint is used to send requests to Vertex AI services, not to Cloud Logging. The API endpoint to use is logging.googleapis.com, as shown in the entries.write method. Refer to this documentation for more info.
The ClientOptions() function should have logging.googleapis.com as the api_endpoint parameter. If the client_options parameter is not specified, logging.googleapis.com is used by default.
After changing the api_endpoint parameter, I was able to successfully write the log entries. The ClientOptions() is as follows:
client = logging_v2.client.Client(
    credentials=credentials,
    client_options=ClientOptions(api_endpoint="logging.googleapis.com"),
)
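Since logging.googleapis.com is also the default, an equivalent sketch simply drops the client_options argument (credentials here is the same service-account object built in the question):
client = logging_v2.client.Client(credentials=credentials)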

Spark Session returned an error: Apache NiFi

We are trying to run a Spark program using NiFi. This is the basic example we tried to follow.
We have configured an Apache Livy server at 127.0.0.1:8998.
The ExecuteSparkInteractive processor is used to run the sample Spark code.
val gdpDF = spark.read.json("gdp.json")
val gdpRDD = gdpDF.rdd
gdpRDD.count()
LivyController is configured for 127.0.0.1, port 8998, and Session Type: spark.
When we run the processor we get following error :
Spark Session returned an error, sending the output JSON object as the flow file content to failure (after penalizing)
We just want to output the line count as JSON. How do we redirect it to the flow file?
NiFi user log:
2020-04-13 21:50:49,955 INFO [NiFi Web Server-85]
org.apache.nifi.web.filter.RequestLogger Attempting request for
(anonymous) GET
http://localhost:9090/nifi-api/flow/controller/bulletins (source ip:
127.0.0.1)
NiFi app.log
ERROR [Timer-Driven Process Thread-3]
o.a.n.p.livy.ExecuteSparkInteractive
ExecuteSparkInteractive[id=9a338053-0173-1000-fbe9-e613558ad33b] Spark
Session returned an error, sending the output JSON object as the flow
file content to failure (after penalizing)
I have seen several people struggling with this example. I recommend following this example from the Cloudera Community (especially note part 2):
https://community.cloudera.com/t5/Community-Articles/HDF-3-1-Executing-Apache-Spark-via-ExecuteSparkInteractive/ta-p/247772
The key points I would be concerned with:
Does your Spark work in general?
Does your Livy work in general?
Is the Spark sample code good?

Prometheus statsd-exporter - how to tag status code in request duration metric (histogram)

I have set up statsd-exporter to scrape metrics from a gunicorn web server. My goal is to filter the request duration metric to successful requests only (non-5xx); however, in statsd-exporter there is no way to tag the status code on the duration metric. Can anyone suggest a way to add the status code to the request duration metric, or a way to filter only successful request durations in Prometheus?
In particular, I want to extract a successful-request duration histogram from statsd-exporter into Prometheus.
To export successful-request duration histogram metrics from the gunicorn web server to Prometheus, you would need to add this functionality to the gunicorn source code.
First take a look at the code that exports statsd metrics here.
You should see this piece of code:
status = resp.status
...
self.histogram("gunicorn.request.duration", duration_in_ms)
By changing the code to something like this:
self.histogram("gunicorn.request.duration.%d" % status, duration_in_ms)
from that moment on you will have metric names exported with the status code embedded, like gunicorn_request_duration_200 or gunicorn_request_duration_404.
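If you would rather not patch gunicorn itself, roughly the same effect can be achieved with a custom logger class passed via --logger-class. This is only a sketch, under the assumption that gunicorn's statsd instrumentation (gunicorn.instrument.statsd.Statsd) keeps the access()/histogram() hooks shown in the source linked above; custom_statsd and StatusCodeStatsd are hypothetical names:
# custom_statsd.py (hypothetical module)
from gunicorn.instrument.statsd import Statsd


class StatusCodeStatsd(Statsd):
    """Statsd logger that also emits a per-status-code duration metric."""

    def access(self, resp, req, environ, request_time):
        # Keep the stock gunicorn.request.duration / gunicorn.requests metrics
        super().access(resp, req, environ, request_time)

        status = resp.status
        if isinstance(status, str):
            status = int(status.split(None, 1)[0])
        duration_in_ms = (request_time.seconds * 1000
                          + float(request_time.microseconds) / 1000)
        # e.g. gunicorn.request.duration.200, which the statsd_exporter mapping
        # shown below can then turn into a status label
        self.histogram("gunicorn.request.duration.%d" % status, duration_in_ms)
Start gunicorn with something like: gunicorn --statsd-host localhost:9125 --logger-class custom_statsd.StatusCodeStatsd myapp:app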
You can also modify it a little and move the status code into a label by adding a configuration like the one below to your statsd_exporter:
mappings:
  - match: gunicorn.request.duration.*
    name: "gunicorn_http_request_duration"
    labels:
      status: "$1"
      job: "gunicorn_request_duration"
So your metrics will now look like this:
# HELP gunicorn_http_request_duration Metric autogenerated by statsd_exporter.
# TYPE gunicorn_http_request_duration summary
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.5"} 2.4610000000000002e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.9"} 2.4610000000000002e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.99"} 2.4610000000000002e-06
gunicorn_http_request_duration_sum{job="gunicorn_request_duration",status="200"} 2.4610000000000002e-06
gunicorn_http_request_duration_count{job="gunicorn_request_duration",status="200"} 1
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.5"} 3.056e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.9"} 3.056e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.99"} 3.056e-06
gunicorn_http_request_duration_sum{job="gunicorn_request_duration",status="404"} 3.056e-06
gunicorn_http_request_duration_count{job="gunicorn_request_duration",status="404"} 1
And now, to query all metrics except those with a 5xx status in Prometheus, you can run:
gunicorn_http_request_duration{status=~"[^5].*"}
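If you want to run that selector from code rather than the Prometheus UI, here is a minimal sketch against the standard Prometheus HTTP API (the prometheus:9090 address is a placeholder):
import requests

# /api/v1/query evaluates an instant query; the selector drops all 5xx series
resp = requests.get(
    'http://prometheus:9090/api/v1/query',
    params={'query': 'gunicorn_http_request_duration{status=~"[^5].*"}'},
)
for series in resp.json()['data']['result']:
    print(series['metric'], series['value'])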
Let me know if it was helpful.

Authenticate with ECE ElasticSearch Sink from Apache Fink (Scala code)

I get a compiler error when using the example provided in the Flink documentation. The Flink documentation provides sample Scala code to set the REST client factory parameters when talking to Elasticsearch: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/elasticsearch.html.
When trying out this code I get a compiler error in IntelliJ which says "Cannot resolve symbol restClientBuilder".
I found the following SO question which is EXACTLY my problem, except that it is in Java and I am doing this in Scala:
Apache Flink (v1.6.0) authenticate Elasticsearch Sink (v6.4)
I tried copy-pasting the solution code from that question into IntelliJ; the auto-converted code also has compiler errors.
// provide a RestClientFactory for custom configuration on the internally created REST client
// I only show setMaxRetryTimeoutMillis for illustration purposes; the actual code will use a custom HTTP callback
esSinkBuilder.setRestClientFactory(
    restClientBuilder -> {
        restClientBuilder.setMaxRetryTimeoutMillis(10)
    }
)
Then I tried the Java code auto-converted to Scala by IntelliJ:
import org.apache.http.auth.AuthScope
import org.apache.http.auth.UsernamePasswordCredentials
import org.apache.http.client.CredentialsProvider
import org.apache.http.impl.client.BasicCredentialsProvider
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder
import org.elasticsearch.client.RestClientBuilder

// provide a RestClientFactory for custom configuration on the internally created REST client
esSinkBuilder.setRestClientFactory((restClientBuilder) => {
  def foo(restClientBuilder) = restClientBuilder.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
    override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder = { // elasticsearch username and password
      val credentialsProvider = new BasicCredentialsProvider
      credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(es_user, es_password))
      httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
    }
  })
  foo(restClientBuilder)
})
The original code snippet produces the error "cannot resolve RestClientFactory", and the Java-to-Scala conversion shows several other errors.
So basically I need to find a Scala version of the solution described in Apache Flink (v1.6.0) authenticate Elasticsearch Sink (v6.4).
Update 1: I was able to make some progress with some help from IntelliJ. The following code compiles and runs, but there is another problem.
esSinkBuilder.setRestClientFactory(
  new RestClientFactory {
    override def configureRestClientBuilder(restClientBuilder: RestClientBuilder): Unit = {
      restClientBuilder.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
        override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder = {
          // elasticsearch username and password
          val credentialsProvider = new BasicCredentialsProvider
          credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(es_user, es_password))
          httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
          httpClientBuilder.setSSLContext(trustfulSslContext)
        }
      })
    }
  }
)
The problem is that I am not sure whether I should be creating a new instance of RestClientFactory at all. What happens is that the application connects to the Elasticsearch cluster but then discovers that the SSL cert is not valid, so I had to add the trustfulSslContext (as described here: https://gist.github.com/iRevive/4a3c7cb96374da5da80d4538f3da17cb). That got me past the SSL issue, but now the ES REST client does a ping test, the ping fails, it throws an exception and the app shuts down. I suspect the ping fails because of the SSL error and that it may not be using the trustfulSslContext I set up as part of the new RestClientFactory, which makes me suspect I should not have used new and that there should be a simple way to update the existing RestClientFactory object. Basically, this is all happening because of my lack of Scala knowledge.
Happy to report that this is resolved. The code I posted in Update 1 is correct. The ping to ECE was not working for two reasons:
The certificate needs to include the complete chain: the root CA, the intermediate CA, and the cert for the ECE. This got rid of the whole trustfulSslContext workaround.
The ECE was sitting behind an ha-proxy, and the proxy mapped the hostname in the HTTP request to the actual deployment cluster name in ECE. That mapping logic did not take into account that the Java REST high-level client uses the org.apache.http.HttpHost class, which renders the host as hostname:port_number even when the port number is 443. Since the lookup did not find a mapping because of the :443, the ECE returned a 404 error instead of 200 OK (the only way to find this was to look at unencrypted packets at the ha-proxy). Once the mapping logic in ha-proxy was fixed, the mapping was found and the pings are now successful.

Ganglia No matching metrics detected

We are getting the error "No matching metrics detected". Cluster-level metrics are visible.
ganglia core 3.6.0
ganglia web 3.5.12
Please help to resolve this issue.
Somewhere in a .conf file (or .pyconf, etc.) you must specify a collection_group with a list of the metrics you want to collect. From the default gmond.conf, it should look similar to this:
collection_group {
  collect_once = yes
  time_threshold = 1200
  metric {
    name = "cpu_num"
    title = "CPU Count"
  }
  metric {
    name = "cpu_speed"
    title = "CPU Speed"
  }
  metric {
    name = "mem_total"
    title = "Memory Total"
  }
}
You may use wildcards to match the name.
You'll also need to include the module that provides the metrics you are looking to collect. Again, the example gmond.conf contains something like this:
modules {
  module {
    name = "core_metrics"
  }
  module {
    name = "cpu_module"
    path = "modcpu.so"
  }
}
among others.
You can generate an example gmond.conf by typing
gmond -t > /usr/local/etc/gmond.conf
This path is correct for ganglia-3.6.0; I know that many file paths have changed several times since 3.0.
A good reference book is 'Monitoring with Ganglia.' I'd recommend getting a copy if you're going to be getting very deeply involved with configuring / maintaining a ganglia installation.
When summary/cluster graphs are visible, but individual host graph data is not, this might be caused by a mismatch of hostname case (between reported hostname and rrd graph directory names).
Check /var/lib/ganglia/rrds/CLUSTER-NAME/HOSTNAME
This will show you what case the hostnames are getting their graphs generated as.
If the case does not match the reported hostname, edit /etc/ganglia/conf.php (this allows overriding the defaults in /usr/share/ganglia/conf_default.php).
Add the following line:
$conf['case_sensitive_hostnames'] = false;
Another place to check for case sensitivity is the gmetad settings at /etc/ganglia/gmetad:
case_sensitive_hostnames 0
Versions This Was Fixed On:
OS: CentOS 6
Ganglia Core: 3.7.2-2
Ganglia Web: 3.7.1-2
Installed via EPEL