AWS CloudWatch: How to add the instance name / custom fields to the log?

We currently have multiple CloudWatch log streams per EC2 instance. This is horrible to debug; queries for "ERROR XY" across all instances would involve either digging into each log stream (time-consuming) or using the AWS CLI (time-consuming queries).
I would prefer to have a log stream combining the log data of all instances of a specific type, let's say all "webserver" instances log their "apache2" log data to one central stream and "php" log data to another central stream.
Obviously, I still want to be able to figure out which log entry stems from which instance, as I would be able to with central logging via syslogd.
How can I add the custom field "instance id" to the logs in cloudwatch?

The best way to organize logs in CloudWatch Logs is as follows:
The log group represents the log type. For example: webserver/prod.
The log stream represents the instance id (i.e. the source).
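Purely to illustrate that layout, here is a minimal boto3 sketch that writes straight to such a group/stream pair (in practice the CloudWatch agent usually handles this; the group and stream names below are made up):

import time
import boto3

logs = boto3.client("logs")

group = "webserver/prod"           # log group = log type (made-up name)
stream = "i-0123456789abcdef0"     # log stream = instance id (made-up id)

# create the group and stream once; both calls raise ResourceAlreadyExistsException
# on later runs, so wrap them accordingly in real code
logs.create_log_group(logGroupName=group)
logs.create_log_stream(logGroupName=group, logStreamName=stream)

# write a log event; the instance is identified purely by the stream name
logs.put_log_events(
    logGroupName=group,
    logStreamName=stream,
    logEvents=[{"timestamp": int(time.time() * 1000), "message": "ERROR XY something failed"}],
)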
For querying, I highly recommend using the Insights feature (I helped build it when I worked at AWS). The log stream name will be available with each log record as a special @logStream field.
You can query across all instances like this:
filter @message like /ERROR XY/
Or inside one instance like this:
filter @message like /ERROR XY/ and @logStream = "instance_id"
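If you want to run the same query outside the console, a rough boto3 sketch of it (log group name and time range are placeholders) could look like this:

import time
import boto3

logs = boto3.client("logs")

query = 'filter @message like /ERROR XY/ | fields @timestamp, @logStream, @message'
start = logs.start_query(
    logGroupName="webserver/prod",        # placeholder log group
    startTime=int(time.time()) - 3600,    # last hour
    endTime=int(time.time()),
    queryString=query,
)

# poll until the query finishes, then print which instance (stream) each hit came from
results = logs.get_query_results(queryId=start["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=start["queryId"])

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})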

Related

Passing Cloud Storage custom metadata into Cloud Storage Notification

We have a Python script that copies/creates files in a GCS bucket.
# let me know if my setting of the custom-metadata is correct
blob.metadata = { "file_capture_time": some_timestamp_var }
blob.upload(...)
We want to configure the bucket such that it generates Cloud Storage notifications whenever an object is created. We also want the custom metadata above to be passed along with the Pub/Sub message to the topic, and to use that as an ordering key on the subscription side. How can we do this?
The recommended way to receive notifications when an event occurs on the intended GCS bucket is to create a Cloud Pub/Sub topic for new objects and to configure your GCS bucket to publish messages to that topic when new objects are created.
First, make sure you've activated the Cloud Pub/Sub API, then use a gsutil command similar to the one below:
gsutil notification create -f json -e OBJECT_FINALIZE gs://example-bucket
The -e specifies that you're only interested in OBJECT_FINALIZE messages (objects being created)
The -f specifies that you want the payload of the messages to be the object metadata for the JSON API
The -m specifies a key:value attribute that is appended to the set of attributes sent to Cloud Pub/Sub for all events associated with this notification config.
You may specify this parameter multiple times to set multiple attributes.
There is a full Firebase example that explains how to parse the filename and other information from the event's context/data.
Here is a good example with a similar context.
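To tie this back to the question, here is a hedged end-to-end sketch (bucket, project, subscription and field names are all assumptions) of setting the custom metadata with google-cloud-storage and reading it back on the subscriber side; with -f json the message body is the object resource, so the custom metadata appears under its "metadata" key:

import json
from google.cloud import pubsub_v1, storage

# upload side: set the custom metadata before uploading
client = storage.Client()
blob = client.bucket("example-bucket").blob("data/file.csv")                   # assumed names
blob.metadata = {"file_capture_time": "2022-01-01T00:00:00Z"}                  # assumed value
blob.upload_from_filename("file.csv")

# subscriber side: parse the object resource out of each notification
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-subscription")   # assumed names

def callback(message):
    obj = json.loads(message.data.decode("utf-8"))
    capture_time = obj.get("metadata", {}).get("file_capture_time")
    print(obj.get("name"), capture_time, message.attributes.get("eventType"))
    message.ack()

future = subscriber.subscribe(subscription, callback=callback)
try:
    future.result(timeout=30)   # listen briefly for demo purposes
except Exception:
    future.cancel()

The ordering-key part of the question is not covered here; the sketch only shows how the metadata travels in the notification payload.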

Can I update Apache Atlas metadata by adding a message directly into the Kafka topic?

I am trying to add a message to Entities_Topic to update kafka_topic type metadata in Apache Atlas. I wrote the data according to the JSON format of the message, but it didn't work.
application.log shows the following:
graph rollback due to exception AtlasBaseException: Instance kafka_topic with unique attribute {qualifiedName=atlas_test00@primary # clusterName to use in qualified name of entities. Default: primary} does not exist (GraphTransactionInterceptor:202)
graph rollback due to exception AtlasBaseException: Instance __AtlasUserProfile with unique attribute {name=admin} does not exist (GraphTransactionInterceptor:202)
And here is the message I passed into the Kafka topic earlier:
{"version":{"version":"1.0.0","versionParts":[1]},"msgCompressionKind":"NONE","msgSplitIdx":1,"msgSplitCount":1,"msgSourceIP":"192.168.1.110","msgCreatedBy":"","msgCreationTime":1664440029000,"spooled":false,"message":{"type":"ENTITY_NOTIFICATION_V2","entity":{"typeName":"kafka_topic","attributes":{"qualifiedName":"atlas_test_k1#primary # clusterName to use in qualifiedName of entities. Default: primary","name":"atlas_test01","description":"atlas_test_k1"},"displayText":"atlas_test_k1","isIncomplete":false},"operationType":"ENTITY_CREATE","eventTime":1664440028000}}
It is worth noting that there is no GUID in the message and I do not know how to create one manually. Also, I set the timestamps to the current time. The JSON was passed in through the Kafka tool Offset Explorer.
My team leader wants to update the metadata by sending messages directly into Kafka, and I'm just trying to see if that's possible.
How can I implement this idea, or what am I doing wrong?
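For completeness, publishing that same JSON from code instead of Offset Explorer would look roughly like this (a kafka-python sketch; the broker address and topic name are assumptions, and this only reproduces the publish step, it does not by itself resolve the AtlasBaseException above):

import json
from kafka import KafkaProducer

# the notification JSON from the question, saved to a local file for brevity
with open("atlas_message.json") as f:
    payload = json.load(f)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("ATLAS_ENTITIES", payload)   # assumed name of the entities topic
producer.flush()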

GCP Pub/Sub logs for internal cloudsqladmin user show query fragments / partial data

When exporting logs to Pub/Sub (via sink, topic and subscription) from a GCP PostgreSQL server (v11), some lines auditing the internal cloudsqladmin user return what seem to be fragments of SQL queries run on the server. I am looking at them in the Logs Viewer.
Examples:
db=cloudsqladmin,user=cloudsqladmin LOG: 00000: statement: WITH max_age AS ("
textPayload: "2020-11-10 23:30:01.188 UTC [*****]: [5-1] db=*********,user=cloudsqladmin LOG: 00000: statement: ;"
timestamp: "2020-11-10T23:30:01.188675Z"
It seems to be part of a longer query, but I can't logically attach it to any other adjacent log line.
Does this look like a bug on the GCP side, or am I missing something else here?
I tried several flags, even the pgAudit beta flags, and I didn't find a way to flatten the logs. In fact, when a query is written over multiple lines (which is the case here with Cloud SQL, but also if your app writes multi-line log messages), it is logged as multiple single-line log entries.
If you sink the logs into Pub/Sub, you get one message per entry, so a multi-line log becomes multiple messages.
You can mitigate the issue by removing useless log traces (drop system traces, increase the log level, ...) and by writing your business queries on a single line so the logs stay readable.
It's a built-in issue; it's worth filing a case here.
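As an illustration of "one message per entry", a small sketch of pulling from the sink's subscription (project and subscription names are placeholders) shows each log line arriving as its own message whose data is a LogEntry in JSON:

import json
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# placeholder names for the sink's subscription
subscription = subscriber.subscription_path("my-project", "logs-subscription")

# pull a handful of messages synchronously; each one carries a single LogEntry
response = subscriber.pull(request={"subscription": subscription, "max_messages": 10})
for received in response.received_messages:
    entry = json.loads(received.message.data.decode("utf-8"))
    # a multi-line SQL statement therefore shows up as several of these entries
    print(entry.get("timestamp"), entry.get("textPayload"))

if response.received_messages:
    subscriber.acknowledge(request={
        "subscription": subscription,
        "ack_ids": [r.ack_id for r in response.received_messages],
    })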
If you click the left arrow on any of the log entries you are interested in, and then click on "Expand nested fields", you will see your complete query in that log entry (among other information).

Same data read by PCF instances for Spring Batch application

I am working on a Spring Batch application which reads data from a database using JdbcCursorItemReader. The application works as expected when I run a single instance.
I deployed this application to PCF and used the autoscale feature, but multiple instances are retrieving the same records from the database.
How can I prevent duplicate reads of the same data by other instances?
This is normally handled by applying the processed indicator pattern. In this pattern, you have an additional field on each row that you mark with a status as each record is processed. You then use your query to filter only the records that match the status you care about. In this case, the status could be node specific so that each node only selects the records it has tagged.
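The pattern itself is not tied to Spring Batch; a minimal sketch of the idea in plain SQL driven from Python (table and column names are made up) looks like this:

import sqlite3

NODE_ID = "instance-0"   # e.g. derived from CF_INSTANCE_INDEX on PCF

conn = sqlite3.connect("batch.db")
conn.execute("""CREATE TABLE IF NOT EXISTS orders (
    id INTEGER PRIMARY KEY, payload TEXT,
    status TEXT DEFAULT 'NEW', claimed_by TEXT)""")

# 1. claim a chunk of unprocessed rows by tagging them with this node's id
conn.execute(
    "UPDATE orders SET status = 'PROCESSING', claimed_by = ? "
    "WHERE id IN (SELECT id FROM orders WHERE status = 'NEW' LIMIT 100)",
    (NODE_ID,),
)
conn.commit()

# 2. the reader query only returns rows this node has claimed
rows = conn.execute(
    "SELECT id, payload FROM orders WHERE status = 'PROCESSING' AND claimed_by = ?",
    (NODE_ID,),
).fetchall()

# 3. mark the rows as done once they have been processed
for row_id, _payload in rows:
    conn.execute("UPDATE orders SET status = 'DONE' WHERE id = ?", (row_id,))
conn.commit()

In Spring Batch the same idea usually lives in the reader's SQL plus a writer or listener that flips the status column.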

What is the role of Logstash Shipper and Logstash Indexer in ELK stack?

I have been studying online about the ELK stack for my new project.
Most of the tech blogs are about how to set ELK up, but I need more background to begin with.
What is Logstash? And further, what are the Logstash Shipper and Indexer?
What is Elasticsearch's role?
Any leads will be appreciated, even if not a proper answer.
I will try to explain the ELK stack to you with an example.
Applications generate logs which all have the same format (timestamp | loglevel | message) on any machine in our cluster and write those logs to some file.
Filebeat (a log shipper from the Elastic stack) tracks that file, gathers any updates to it periodically and forwards them to Logstash over the network. Unlike Logstash, Filebeat is a lightweight application that uses very few resources, so I don't mind running it on every machine in the cluster. It notices when Logstash is down and holds off transferring data until Logstash is running again (no logs are lost).
Logstash receives messages from all log shippers over the network and applies filters to the messages. In our case it splits up each entry into timestamp, loglevel and message. These become separate fields that can later be searched easily. Any message that does not conform to that format gets a field: invalid logformat. The messages with their fields are then forwarded to Elasticsearch at a speed Elasticsearch can handle.
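Roughly speaking, that filter step does something like the following (expressed as a Python sketch rather than real Logstash configuration; the field names just follow the format above):

def parse_log_line(line):
    """Split 'timestamp | loglevel | message' into fields, tagging bad lines."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return {"message": line, "tags": ["invalid logformat"]}
    timestamp, loglevel, message = parts
    return {"@timestamp": timestamp, "loglevel": loglevel, "message": message}

print(parse_log_line("2018-10-26T15:22:35Z | ERROR | MyClass blew up"))
print(parse_log_line("a line that does not match"))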
Elasticsearch stores all messages and indexes (prepares for quick search) all the fields in the messages. It is our database.
We then use Kibana (also part of the Elastic stack) as a GUI for accessing the logs. In Kibana I can do something like: show me all logs from between 3 and 5 pm today with loglevel error whose message contains MyClass. Kibana will ask Elasticsearch for the results and display them.
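Behind the scenes that Kibana question becomes an Elasticsearch query; a rough sketch with the Python client (elasticsearch-py 8.x style; index and field names are assumptions based on the format above) could look like this:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

# all logs between 3 and 5 pm with loglevel error whose message contains MyClass
response = es.search(
    index="logs-*",                            # assumed index pattern
    query={
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "2018-10-26T15:00:00Z", "lt": "2018-10-26T17:00:00Z"}}},
                {"match": {"loglevel": "error"}},
                {"match": {"message": "MyClass"}},
            ]
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"])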
I don't know if this helps, but... let's take a really silly example: I want to do statistics about squirrels in my neighborhood. Every squirrel has a name and we know what they look like. Each neighbor makes a log entry whenever he sees a squirrel eating a nut.
Elasticsearch is a document database that structures data in so-called indices. It is able to save pieces (shards) of those indices redundantly on multiple servers and gives you great search functionality, so you can access huge amounts of data very quickly.
Here we might have finished events that look like this:
{
  "_index": "squirrels-2018",
  "_id": "zr7zejfhs7fzfud",
  "_version": 1,
  "_source": {
    "squirrel": "Bethany",
    "neighbor": "A",
    "@timestamp": "2018-10-26T15:22:35.613Z",
    "meal": "hazelnut"
  }
}
Logstash is the data collector and transformer. It's able to accept data from many different sources (files, databases, transport protocols, ...) with its input plugins. After using one of those input plugins, all the data is stored in an Event object that can be manipulated with filters (add data, remove data, load additional data from other sources). When the data has the desired format, it can be distributed to many different outputs.
If neighbor A provides a MySQL database with the columns 'squirrel', 'time' and 'ate', but neighbor B likes to write CSVs with the columns 'name', 'nut' and 'when', we can use Logstash to accept both inputs. Then we rename the fields and parse the different datetime formats those neighbors might be using. If one of them likes to call Bethany 'Beth' we can change the data here to make it consistent. Eventually we send the result to Elasticsearch (and maybe other outputs as well).
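In Python terms, the normalization Logstash performs for those two neighbors might look something like this (column names come from the example above, everything else is assumed):

def from_neighbor_a(row):
    """Row from neighbor A's MySQL table with columns squirrel, time, ate."""
    squirrel, time, ate = row
    # neighbor A calls Bethany 'Beth', so make the name consistent
    return {"squirrel": "Bethany" if squirrel == "Beth" else squirrel,
            "@timestamp": time, "meal": ate, "neighbor": "A"}

def from_neighbor_b(record):
    """Record from neighbor B's CSV with columns name, nut, when."""
    return {"squirrel": record["name"], "meal": record["nut"],
            "@timestamp": record["when"], "neighbor": "B"}

events = [
    from_neighbor_a(("Beth", "2018-10-26T15:22:35Z", "hazelnut")),
    from_neighbor_b({"name": "Bethany", "nut": "walnut", "when": "2018-10-26T16:02:00Z"}),
]
# both events now have the same shape and could be sent on to Elasticsearch
print(events)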
Kibana is a visualization tool. It allows you to get an overview of your index structures and server status, and to create diagrams from your Elasticsearch data.
Here we can do funny diagrams like 'Squirrel Sightings Per Minute' or 'Fattest Squirrel (based on nut intake)'