What is the role of Logstash Shipper and Logstash Indexer in ELK stack? - elastic-stack

I have been studying online about the ELK stack for my new project. Most of the tech blogs are about how to set ELK up, but I need more basic information to begin with.
What is Logstash? And further, what are the Logstash Shipper and the Logstash Indexer?
What is Elasticsearch's role?
Any leads will be appreciated too, even if not a proper answer.

I will try to explain the ELK stack to you with an example.
Applications on any machine in our cluster generate logs that all have the same format (timestamp | loglevel | message) and write those logs to some file.
Filebeat (a log shipper from the Elastic stack) tracks that file, gathers any updates to it periodically and forwards them to Logstash over the network. Unlike Logstash, Filebeat is a lightweight application that uses very few resources, so I don't mind running it on every machine in the cluster. It notices when Logstash is down and holds off transferring data until Logstash is running again (no logs are lost).
Logstash receives messages from all log shippers over the network and applies filters to them. In our case it splits each entry into timestamp, loglevel and message. These become separate fields that can later be searched easily. Any message that does not conform to that format gets marked as having an invalid log format. The messages with their fields are then forwarded to Elasticsearch at a speed that Elasticsearch can handle.
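Roughly, such a pipeline definition could look like the sketch below (the port, grok pattern and index name are illustrative, not taken from the question):

input {
  beats {
    port => 5044                                 # Filebeat instances ship their log lines here
  }
}

filter {
  grok {
    # split "timestamp | loglevel | message" into separate fields
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} \| %{LOGLEVEL:loglevel} \| %{GREEDYDATA:message}" }
    overwrite => [ "message" ]                   # keep only the parsed message part
    tag_on_failure => [ "invalid_logformat" ]    # mark entries that do not match the format
  }
  date {
    match => [ "timestamp", "ISO8601" ]          # use the parsed timestamp as the event time
  }
}

output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "applogs-%{+YYYY.MM.dd}"
  }
}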
Elasticsearch stores all messages and indexes (prepares for quick search) all the fields in the messages. It is our database.
We then use Kibana (also from the Elastic stack) as a GUI for accessing the logs. In Kibana I can do something like: show me all logs from between 3 and 5 pm today with loglevel ERROR whose message contains MyClass. Kibana will ask Elasticsearch for the results and display them.
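In the Kibana search bar such a query might look roughly like this, using the field names from above (the 3-5 pm window would be set with Kibana's time picker):

loglevel: "ERROR" and message: MyClass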

I don't know if this helps, but... whatever... Let's take a really stupid example: I want to do statistics about squirrels in my neighborhood. Every squirrel has a name and we know what they look like. Each neighbor makes a log entry whenever he sees a squirrel eating a nut.
ElasticSearch is a document database that structures data in so-called indices. It is able to save pieces (shards) of those indices redundantly on multiple servers, and it gives you great search functionality, so you can access huge amounts of data very quickly.
Here we might have finished events that look like this:
{
  "_index": "squirrels-2018",
  "_id": "zr7zejfhs7fzfud",
  "_version": 1,
  "_source": {
    "squirrel": "Bethany",
    "neighbor": "A",
    "@timestamp": "2018-10-26T15:22:35.613Z",
    "meal": "hazelnut"
  }
}
Logstash is the data collector and transformer. It is able to accept data from many different sources (files, databases, transport protocols, ...) with its input plugins. After one of those input plugins has read the data, it is stored in an Event object that can be manipulated with filters (add data, remove data, load additional data from other sources). When the data has the desired format, it can be distributed to many different outputs.
If neighbor A provides a MySQL database with the columns 'squirrel', 'time' and 'ate', but neighbor B likes to write CSVs with the columns 'name', 'nut' and 'when', we can use Logstash to accept both inputs. Then we rename the fields and parse the different datetime formats those neighbors might be using. If one of them likes to call Bethany 'Beth', we can change the data here to make it consistent. Eventually we send the result to ElasticSearch (and maybe to other outputs as well); a sketch for neighbor B's CSVs is shown below.
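The Logstash side for neighbor B could look roughly like this (the file path, column names and datetime pattern are made up for illustration):

input {
  file {
    path => "/data/neighbor_b/*.csv"            # neighbor B drops CSV files here
  }
}

filter {
  csv {
    columns => [ "name", "nut", "when" ]        # neighbor B's column names
  }
  mutate {
    rename => { "name" => "squirrel" "nut" => "meal" }   # unify the field names
  }
  if [squirrel] == "Beth" {
    mutate { replace => { "squirrel" => "Bethany" } }    # make the squirrel names consistent
  }
  date {
    match  => [ "when", "dd.MM.yyyy HH:mm" ]    # parse neighbor B's datetime format
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "squirrels-%{+YYYY}"
  }
}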
Kibana is a visualization tool. It allows you to get an overview of your index structures and server status, and to create diagrams for your ElasticSearch data.
Here we can build funny diagrams like 'Squirrel Sightings Per Minute' or 'Fattest Squirrel (based on nut intake)'.

Related

Fluent Bit prometheus_scrape input is not record

I expose kube-state-metrics on an endpoint and scrape it using the prometheus_scrape input plugin (using Fluent Bit 2.0). I want to select some of these metrics and send them to an Azure Log Analytics workspace as logs, but it seems like the scraped data is not a record: not the whole dump, nor the entries individually. When I write a regex parser and apply it via a filter, it gets applied no matter what key I specify in the filter, which is weird. But it seems like they are still not records, because even a Lua script can't operate on them; it can't even print them to stdout.
2022-12-02T15:48:19.388264036Z kube_pod_container_status_restarts_total{namespace="kube-system",pod="ama-logs-t6smx",uid="99825b27-919d-4943-bc7d-b87b56081297",container="ama-logs"} = 0
2022-12-02T15:48:19.388264036Z kube_pod_container_status_restarts_total{namespace="kube-system",pod="ama-logs-t6smx",uid="16195b27-915d-3963-bc7d-b86b56557297",container="ama-logs-prometheus"} = 0
2022-12-02T15:48:19.388264036Z kube_pod_container_status_restarts_total{namespace="kube-system",pod="aks-secrets-store-csi-driver-mk47n",uid="d7924927-caf4-39f3-a28b-356af3144f50",container="liveness-probe"} = 0
I tried dropping or altering the records with a Lua script, but it simply does not do anything to them, and they still get printed on screen as if I had done nothing to them with the script.
Is there any way to make them records? Why is this not working?

how to collect all information about the current Job in Talend data studio

When I run any job, I want to log information like:
job name
source details and destination details (file name / table name)
number of records input and number of records processed or saved.
I want to log all of the above information and insert it into MongoDB using Talend Open Studio components. Please also explain which components I need to perform that task. Thanks.
You can use a tJava component as below. Get the row counts of the source and destination, and the details of the source and target names, then redirect those details to a file in tJava.
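As a rough sketch of the kind of code that could go into such a tJava component (the globalMap keys depend on your actual component names; tFileInputDelimited_1 and tDBOutput_1 are just placeholders):

// jobName is a variable that Talend generates for every job
String job = jobName;

// components publish their row counts into globalMap under <componentName>_NB_LINE
Integer rowsRead    = (Integer) globalMap.get("tFileInputDelimited_1_NB_LINE");
Integer rowsWritten = (Integer) globalMap.get("tDBOutput_1_NB_LINE_INSERTED");

String source = context.sourceFile;   // assuming the source/target names are kept in context variables
String target = context.targetTable;

System.out.println(job + ";" + source + ";" + target + ";" + rowsRead + ";" + rowsWritten);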
For more about logging functionality, go through the tutorial below:
https://www.youtube.com/watch?v=SSi8BC58v3k&list=PL2eC8CR2B2qfgDaQtUs4Wad5u-70ala35&index=2
I'd consider using log4j, which already carries most of this information. Using MDC you could enrich the log messages with custom attributes. Log4j does have a JSON format, and there seems to be a MongoDB appender as well.
It might take a bit more time to configure (I'd suggest adding the dependencies via a routine), but once configured it will require absolutely no configuration in the job. Using log4j you can create filters, etc.
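As a hedged sketch of the MDC idea (shown here with the SLF4J MDC API; the attribute names are made up for illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class JobLogging {
    private static final Logger LOG = LoggerFactory.getLogger(JobLogging.class);

    public static void logJobRun(String jobName, String source, String target, int rowsProcessed) {
        // MDC attributes travel with every log event on this thread and can be
        // written out by a JSON layout or a MongoDB appender
        MDC.put("jobName", jobName);
        MDC.put("source", source);
        MDC.put("target", target);
        try {
            LOG.info("Processed {} records", rowsProcessed);
        } finally {
            MDC.clear();
        }
    }
}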

Dynamic routing to IO sink in Apache Beam

Looking at the example for ClickHouseIO for Apache Beam, the name of the output table is hard-coded:
pipeline
    .apply(...)
    .apply(
        ClickHouseIO.<POJO>write("jdbc:clickhouse:localhost:8123/default", "my_table"));
Is there a way to dynamically route a record to a table based on its content?
I.e. if the record contains table=1, it is routed to my_table_1, table=2 to my_table_2 etc.
Unfortunately, ClickHouseIO is still in development and does not support this. BigQueryIO does support dynamic destinations, so it is possible with Beam.
The limitation in the current ClickHouseIO is around transforming data to match the destination table schema. As a workaround, if your destination tables are known at pipeline creation time, you could create one ClickHouseIO per table and then use the data to route each record to the correct instance of the IO, as sketched below.
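A rough sketch of that workaround with Beam's Partition transform (records stands for the PCollection<POJO> built earlier in the pipeline; the table names and the getTable() accessor are placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollectionList;

// one known table per partition
final List<String> tables = Arrays.asList("my_table_1", "my_table_2");

// route each record to a partition based on its content
PCollectionList<POJO> partitions = records.apply(
    Partition.of(tables.size(), (POJO row, int numPartitions) ->
        (row.getTable() - 1) % numPartitions));   // assumes the record carries table=1..N

// attach a separate ClickHouseIO.write for every partition
for (int i = 0; i < tables.size(); i++) {
    partitions.get(i).apply(
        "WriteTo_" + tables.get(i),
        ClickHouseIO.<POJO>write("jdbc:clickhouse:localhost:8123/default", tables.get(i)));
}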
You might want to file a feature request in the Beam bug tracker for this.

Where can I find a complete list about replication slot options in PostgreSQL?

I am working on PostgreSQL logical replication from Java and found a demo in the JDBC driver docs:
PGReplicationStream stream =
    replConnection.getReplicationAPI()
        .replicationStream()
        .logical()
        .withSlotName("demo_logical_slot")
        .withSlotOption("include-xids", false)
        .withSlotOption("skip-empty-xacts", true)
        .start();
Then I can parse the messages from the stream.
This is enough for some daily needs, but now I want to know the transaction commit time.
With the help of a question on Stack Overflow, I added .withSlotOption("include-timestamp", "on") and it works.
My question is: where can I find a complete list of these slot options, so that we can look them up conveniently instead of searching Google or Stack Overflow?
The available options depend on the logical decoding plugin of the replication slot, which is specified when the replication slot is created.
The example must be using the test_decoding plugin, which is included with PostgreSQL as a contrib module for testing and playing.
The available options for that plugin are not documented, but can be found in the source code:
include-xids: include the transaction number in BEGIN and COMMIT output
include-timestamp: include timestamp information with COMMIT output
force-binary: specifies that the output mode is binary
skip-empty-xacts: don't output anything for transactions that didn't modify the database
only-local: output only data whose replication origin is not set
include-rewrites: include information from table rewrites caused by DDL statements
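Put together with the snippet from the question, these options are passed in the same way via withSlotOption (the values shown here are just one possible combination):

PGReplicationStream stream =
    replConnection.getReplicationAPI()
        .replicationStream()
        .logical()
        .withSlotName("demo_logical_slot")
        .withSlotOption("include-xids", true)          // transaction ids in BEGIN/COMMIT output
        .withSlotOption("include-timestamp", "on")     // commit timestamp in COMMIT output
        .withSlotOption("skip-empty-xacts", true)      // suppress transactions that changed nothing
        .start();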

AWS Cloudwatch: How to add the instance name / custom fields to the log?

We currently have multiple CloudWatch log streams per EC2 instance. This is horrible to debug; querying for "ERROR XY" across all instances involves either digging into each log stream (time consuming) or using the AWS CLI (time-consuming queries).
I would prefer to have a log stream combining the log data of all instances of a specific type; let's say all "webserver" instances log their "apache2" log data to one central stream and their "php" log data to another central stream.
Obviously, I still want to be able to figure out which log entry stems from which instance, as I would be able to with central logging via syslogd.
How can I add the custom field "instance id" to the logs in CloudWatch?
The best way to organize logs in CloudWatch Logs is as follows:
The log group represents the log type. For example: webserver/prod.
The log stream represents the instance id (i.e. the source).
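With the unified CloudWatch agent, for example, that layout can be configured roughly like this (the file path and log group name are illustrative; {instance_id} is a placeholder the agent resolves on each instance):

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/apache2/error.log",
            "log_group_name": "webserver/prod",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}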
For querying, I highly recommend using the Insights feature (I helped build it when I worked at AWS). The log stream name is available with each log record as the special @logStream field.
You can query across all instances like this:
filter @message like /ERROR XY/
Or inside one instance like this:
filter @message like /ERROR XY/ and @logStream = "instance_id"
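To get an overview across all instances, the same filter can be combined with an aggregation, for example counting matches per instance:

filter @message like /ERROR XY/
| stats count(*) as errors by @logStream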