How do we capture all container logs on google Vertex AI? - gcloud

I am having an endpoint for online predictions on AI platform (unified)
and only logs with severity >= ERROR can be found..
Model was deployed using: --enable-container-logging
Logger code within container:
module_logger = logging.getLogger("MODULE_NAME")
module_logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter("%(asctime)s — %(name)s — %(levelname)s — %(funcName)s:%(lineno)d — " "%(message)s")
module_logger.addHandler(handler)
Query: resource.type="aiplatform.googleapis.com/Endpoint" resource.labels.endpoint_id="ENDPOINT_ID" resource.labels.location="us-central1"
Two questions:
How do we make sure all logs logged by the container are logged and seen in the logs viewer?
What's that severity? how is it deduced by the console/platform?

Answering myself:
Container logs that are logged to stdout or stderr are captured by the gcloud logger
There isn't seem to be clear documentation but it seem that stderr logs are being interpreted to have Severity ERROR while stdout are INFO

Related

Program stalls on "DEBUG Starting new HTTPS connection (1): login.microsoftonline.com:443" only when executed via cron

I'm using the python-O365 library in order to access my O365 mailbox. The project requires me to execute the program in a docker container. If I start the program manually (as root), everything works fine, but if I try to start it via cron, it stalls on DEBUG Starting new HTTPS connection (1): login.microsoftonline.com:443, which I found out after activating logging.
The minimal code example that reproduces the error (with log):
import O365
from utils.credentials import get_credentials
import logging # We want to get additional information
logging.basicConfig(
filename='./easy_log_file.log',
filemode='a',
format='%(levelname)s %(message)s', # %(asctime)s %(pathname)s %(lineno)d
level=logging.DEBUG
)
filename = "o365_token.txt"
token_backend = O365.FileSystemTokenBackend(token_path = filename)
account = O365.Account(get_credentials(), token_backend=token_backend)
inbox = account.mailbox().inbox_folder()
messages = inbox.get_messages()
for message in messages:
logging.info(message)
logging.info("finished")
To start it via cron, I used the following command:
echo "15 21 * * * bash /workspace/daemon_start.sh >> /workspace/cronlogs/logs_daemon_mail.log" | crontab. If I start the program manually, the log continues like this:
DEBUG Starting new HTTPS connection (1): graph.microsoft.com:443
DEBUG https://graph.microsoft.com:443 "GET /v1.0/me/mailFolders/Inbox/messages?%24top=100&%24filter=isRead+eq+false HTTP/1.1" 200 None
DEBUG Received response (200) from URL https://graph.microsoft.com/v1.0/me/mailFolders/Inbox/messages?%24top=100&%24filter=isRead+eq+false
If the program is started via cron, sometimes the log continues like this:
DEBUG Incremented Retry for (url='/common/oauth2/v2.0/token'): Retry(total=2, connect=3, read=3, redirect=None, status=None)
WARNING Retrying (Retry(total=2, connect=3, read=3, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))': /common/oauth2/v2.0/token
In order to resolve the issue, I added my proxy by using account = Account(credentials, token_backend=token_backend, proxy_server="proxy.my_proxy.com"). It's strange, that I would have to add it, for the container is already configured to use this proxy. When I tried it with this setting, I encountered the same issue, only that the log when started with cron was continued always and much faster.
Since I think, that cron simply starts the program and does not meddle with the connections, it doesn't make sense to me, that I get different outcomes by starting it manually or with cron.

Writing log to gcloud Vertex AI Endpoint using gcloud client fails with google.api_core.exceptions.MethodNotImplemented: 501

Trying to use google logging client library for writing logs into gcloud, specifically, i'm interested in writing logs that will be attached to a managed resource, in this case, a Vertex AI endpoint:
Code sample:
import logging
from google.api_core.client_options import ClientOptions
import google.cloud.logging_v2 as logging_v2
from google.oauth2 import service_account
def init_module_logger(module_name: str) -> logging.Logger:
module_logger = logging.getLogger(module_name)
module_logger.setLevel(settings.LOG_LEVEL)
credentials= service_account.Credentials.from_service_account_info(json.loads(SA_KEY_JSON))
client = logging_v2.client.Client(
credentials=credentials,
client_options=ClientOptions(api_endpoint="us-east1-aiplatform.googleapis.com"),
)
handler = client.get_default_handler(
resource=Resource(
type="aiplatform.googleapis.com/Endpoint",
labels={"endpoint_id": "ENDPOINT_NUMBER_ID",
"location": "us-east1"},
)
)
#Assume we have the formatter
handler.setFormatter(ENRICHED_FORMATTER)
module_logger.addHandler(handler)
return module_logger
logger = init_module_logger(__name__)
logger.info("This Fails with 501")
And i am getting:
google.api_core.exceptions.MethodNotImplemented: 501 The GRPC target
is not implemented on the server, host:
us-east1-aiplatform.googleapis.com, method:
/google.logging.v2.LoggingServiceV2/WriteLogEntries. Sent all pending
logs.
I thought we need to enable api and was told it's enabled, and that we have: https://www.googleapis.com/auth/logging.write
what could be causing the error?
As mentioned by #DazWilkin in the comment, the error is because the API endpoint us-east1-aiplatform.googleapis.com does not have a method called WriteLogEntries.
The above endpoint is used to send requests to Vertex AI services and not to Cloud Logging. The API endpoint to be used is the logging.googleapis.com as shown in the entries.write method. Refer to this documentation for more info.
The ClientOptions() function should have logging.googleapis.com as the api_endpoint parameter. If the client_options parameter is not specified, logging.googleapis.com is used by default.
After changing the api_endpoint parameter, I was able to successfully write the log entries. The ClientOptions() is as follows:
client = logging_v2.client.Client(
credentials=credentials,
client_options=ClientOptions(api_endpoint="logging.googleapis.com"),
)

How to resolve "Invalid Sequence Token" when using cloudwatch agent?

I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But there is no mention about which file is really the one that it's failing. Not even when I add "debug": true to the /opt/aws/amazon-cloudwatch-agent/bin/config.json.
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
"metrics_collection_interval": 60,
"debug": true,
"run_as_user": "root"
}
I have many (28) files in my .logs.logs_collected.files.collect_list section of the config.json file, so how can I find which file is exactly causing trouble?
As of 2021-11-29 a PR to improve the log messages has been merged to the cloudwatch-agent but a new version of the cloudwatch-agent has not been released yet, the next version after v1.247349.0 will likely include a fix for this.
The fix will change the log statements to say
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy: This is an INFO message, as this behaviour is expected at startup for example.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxxxv: This on the other hand is not expected and may mean that someone else is writing to the loggroup/logstream concurrently.
If those warnings come right after a restart of the cloudwatch agent (cwagent) then you can safely ignore them, it's expected behaviour . The cloudwatch agent does not save the next sequence token in its persistent state so on restart it will "learn" the correct sequence number by issuing a PutLogEvent with no sequence token at all, that returns an InvalidSequenceTokenException with the next sequence token to use. So it's expected to see those at startup, anyway I proposed a PR to amazon-cloudwatch-agent to improve those log messages.
If the "Invalid SequenceToken used" is seen long after the restart then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream as mentioned in 2 (which is really for the old awslogs agent but still useful):
Caught exception: An error occurred (InvalidSequenceTokenException)
when calling the PutLogEvents operation: The given sequenceToken is
invalid[…] -or- Multiple agents might be sending log events to log
stream[…] – You can't push logs from multiple log files to a single
log stream. Update your configuration to push each log to a log
stream-log group combination.
I could be that the amazon cloudwatch agent itself it's trying to upload the same file twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output it means that you have duplicates in your config.json collect_list.
I personally think that cwagent itself should print the "offending" loggroup/logstream in the logs so I opened in issue in amazon-cloudwatch-agent GitHub page.

How to get a log out of run command in a remote node that is triggered by NodeExecutorResult

I am making a Rundeck plugin in where it uses a node executor to trigger
a another module in remote. Then I want to verify whether or not this remote command is finished successfully so that I need the log message of the command in remote that the node executor triggered.
NodeExecutorResult ner = jne.executeCommand(context.getExecutionContext(), command, ine);
But the NodeExecutorResult above doesn't have any log data but only have a result code and a result data and failure message and code. How can I get the log? I am sure that the api exists. Because I can see the log output in the console of rundeck. Thanks for reading.
The NodeExecutorResult interface has other methods inherited from other interfaces, try the following :
NodeExecutorResult ner = jne.executeCommand(context.getExecutionContext(), command, ine);
String failure message = ner.getFailureMessage();
FailureReason freas = ner.getFailureReason();
Map<java.lang.String,java.lang.Object> fdata = ner.getFailureData() ;
For the FailureReason it rather can be a NodeStepFailureReason or StepFailureReason.
Check their API and the implementations enums :
FailureReason
NodeExecutorResult

wsadmin script timing out when executing against DMGR via SOAP

I'm attempting to start and stop an application on a single JVM via the wsadmin console since the Web UI for IBM BPM PS Adv. doesn't allow for that kind of operation. So, I have the following script:
https://gist.github.com/predatorian3/b8661c949617727630152cbe04f78d7e
and when I run it against the DMGR from the Cell Host, I receive the following errors.
[wasadmin#server01 ~]$ cat /usr/local/bin/Run_wsadmin.sh
#!/bin/bash
#
#
#
/opt/IBM/WebSphere/AppServer/bin/wsadmin.sh -lang jython -user serviceAccount -password password $*
[wasadmin#cessoapscrt00 ~]$ time Run_wsadmin.sh -f /opt/IBM/wsadmin/wsadmin_Restart_Application.py WPS00 CRT00WPS01 redirectResource_war
WASX7209I: Connected to process "dmgr" on node CRTDMGR using SOAP connector; The type of process is: DeploymentManager
WASX7303I: The following options are passed to the scripting environment and are available as arguments that are stored in the argv variable: "[WPS00, CRT00WPS01, redirectResource_war]"
WASX7017E: Exception received while running file "/opt/IBM/wsadmin/wsadmin_Restart_Application.py"; exception information: com.ibm.websphere.management.exception.ConnectorException
org.apache.soap.SOAPException: [SOAPException: faultCode=SOAP-ENV:Client; msg=Read timed out; targetException=java.net.SocketTimeoutException: Read timed out]
real 3m21.275s
user 0m17.411s
sys 0m0.796s
So, I'm not specifying the connection types, and using the default, which is SOAP. However, upon reading about the other Connection Types, none of them seem any better, but I attribute that to IBM Documentation vagueness. Is there an option to increase the timeout wait periods, or turn it off, or is there a better connection type?
Also running this directly on the wsadmin console, it seems that it is hanging up on gathering the application manager string.
[wasadmin#server01 ~]$ Run_wsadmin.sh
WASX7209I: Connected to process "dmgr" on node CRTDMGR using SOAP connector; The type of process is: DeploymentManager WASX7031I: For help, enter: "print Help.help()"
wsadmin>appManager = AdminControl.queryNames('cell=CRTCELL,node=WPS00,type=ApplicatoinManager,process=CRT00WPS01,*')
WASX7015E: Exception running command: "appManager = AdminControl.queryNames('cell=CRTCELL,node=WPS00,type=ApplicationManager,process=CRT00WPS01,*')"; exception information:
com.ibm.websphere.management.exception.ConnectorException
org.apache.soap.SOAPException: [SOAPException: faultCode=SOAP-ENV:Client; msg=Read timed out; targetException=java.net.SocketTimeoutException: Read timed out]
wsadmin>
You can increase timeout value in {profile}/properties/soap.client.props
com.ibm.SOAP.requestTimeout=180
If you want to turn off timeout, modify com.ibm.SOAP.requestTimeout=0
Or if you want longer timeout you can modify the value 180 to something else.
Also about your query command, I noticed that you have a typo on the MBean type, you had type=ApplicatoinManager, it should be type=ApplicationManager
HERE YOU GO -- I had the same issue. I want to override the timeout prop temporarily. This worked like a champ. Make sure you follow below steps exactly.I did some mistakes and the prop did not passed, I figured out and it works.
Copy the soap.client.props file from /properties and give it a new name such as mysoap.client.props.
Edit mysoap.client.props and update the value of com.ibm.SOAP.requestTimeout as required
Create a new Java properties file soap_override.props and enter the following line:
com.ibm.SOAP.ConfigURL=file:/mysoap.client.props
Pass soap_override.props into wsadmin using the -p option: wsadmin -p soap_override.props...
REFERENCE:
https://www.ibm.com/developerworks/community/blogs/timdp/entry/avoiding_wsadmin_request_timeouts_the_neat_way32?lang=en