Running Logstash 2.4 on CentOS 7
So there seems to be zero documentation for this on Elastic's website, but I have found in this tutorial, as well as some other links, that you can break up your configs in /etc/logstash/conf.d and Logstash, when run as a service, will aggregate them.
The thing is, in the article I linked to they are creating separate configs for inputs, outputs, and filters.
Ideally I'd prefer to have a single config for each log type that includes the inputs, outputs, and filters for that log type. Is this possible, and is there any official documentation explaining how to run Logstash as a daemon this way? If so, where is Elastic hiding it?
Basically, a Logstash conf file contains input, filter, and output sections. The input section is where you give the path to your input data, such as log files, a database, or any other document type. The filter section is where you can mutate fields or do some grok parsing if the input is a log file. The output section is where you specify which index the documents should be stored in, and set other output parameters such as host, document_id, etc. according to your needs.
You could either create the index in Elasticsearch up front and push all the documents to it, or let Logstash create the index when you run the conf file.
You can have more than one Logstash conf file if you need to, and Logstash also accepts a number of command-line flags when you run a config. If you're handling more than one type of document, it is easier to keep a separate conf file for each type.
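For example, a minimal per-type config (the paths, grok pattern, and index name are placeholders) could live in a single file such as /etc/logstash/conf.d/apache-access.conf. Note that Logstash concatenates every file in conf.d into one pipeline, so guard the filter and output with a conditional on the type (or a tag), otherwise events from your other configs will also flow through them:

# /etc/logstash/conf.d/apache-access.conf  (hypothetical log type)
input {
  file {
    path => "/var/log/httpd/access_log"
    type => "apache_access"
    start_position => "beginning"
  }
}

filter {
  if [type] == "apache_access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}

output {
  if [type] == "apache_access" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "apache-access-%{+YYYY.MM.dd}"
    }
  }
}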
I want to install MongoDB Spark Connector in Azure Synapse, so I can write code like this in my Notebooks:
df = spark.read.format("mongodb").option("spark.synapse.linkedService", "MongoDB_MyCollection").load()
At the moment this fails saying "Failed to find data source: mongodb"
Getting started guide for MongoDB Spark Connector says I should use the --packages org.mongodb.spark:mongo-spark-connector:10.0.2 flag when invoking ./bin/pyspark, but I'm running PySpark code in Azure Synapse Notebooks, so I don't invoke ./bin/pyspark myself.
Question - how can I install MongoDB Connector for Spark when running in Azure Synapse?
Version 10.x of the MongoDB Connector for Spark is an all-new connector based on the latest Spark API. Version 10.x uses the new namespace:
com.mongodb.spark.sql.connector.MongoTableProvider
Various configuration options are available; the following apply when writing to MongoDB:
Note: If you use SparkConf to set the connector's write configurations, prefix spark.mongodb.write. to each property.
mongoClientFactory: MongoClientFactory configuration key. You can specify a custom implementation, which must implement the com.mongodb.spark.sql.connector.connection.MongoClientFactory interface. Default: com.mongodb.spark.sql.connector.connection.DefaultMongoClientFactory
connection.uri: Required. The connection string configuration key. Default: mongodb://localhost:27017/
database: Required. The database name configuration.
collection: Required. The collection name configuration.
maxBatchSize: The maximum number of operations to batch in bulk operations. Default: 512
ordered: Whether to perform ordered bulk operations. Default: true
operationType: The type of write operation to perform. One of insert (insert the data), replace (replace an existing document that matches the idFieldList value with the new data, or insert the data if no match exists), or update (update an existing document that matches the idFieldList value with the new data, or insert the data if no match exists). Default: replace
idFieldList: Field or list of comma-separated fields to use to identify a document. Default: _id
writeConcern.w: Specifies w, a write concern option that controls the level to which the change must propagate in the MongoDB replica set before it is acknowledged. One of MAJORITY; W1, W2, or W3; ACKNOWLEDGED; UNACKNOWLEDGED. Default: ACKNOWLEDGED
writeConcern.journal: Specifies j, a write concern option that requests acknowledgment that the data has been committed to the on-disk journal for the criteria specified in the w option. Either true or false.
writeConcern.wTimeoutMS: Specifies wTimeoutMS, a write concern option that returns an error when a write operation exceeds this number of milliseconds. If you use this optional setting, you must specify a non-negative integer.
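As a rough illustration of how these options fit together (shown with Spark's Java API; the same option keys work from a PySpark notebook via .option(...)), assuming the 10.x connector jar is already available to the Spark pool, for example as a Synapse workspace package, and with placeholder URI, database, and collection names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MongoWriteExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mongo-write-example")
                // Write options set on the SparkConf take the spark.mongodb.write. prefix
                .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
                .config("spark.mongodb.write.database", "mydb")        // placeholder
                .config("spark.mongodb.write.collection", "mycoll")    // placeholder
                .getOrCreate();

        Dataset<Row> df = spark.read().json("/tmp/input.json");        // placeholder input

        df.write()
          .format("mongodb")                   // resolved by MongoTableProvider in 10.x
          .option("operationType", "replace")  // insert | replace | update
          .option("idFieldList", "_id")
          .mode("append")
          .save();
    }
}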
You can refer to the PySpark code that reads a CSV file into a stream, computes a moving average, and streams the results into MongoDB here.
Useful links: Configuration options, Write Configuration Options
I am load testing using JMeter containers inside a k8s cluster. Right now the jmx and the csv files are copied to all the containers. Is there a way to split the data file so that each JMeter instance in a container gets its own subset of the original file?
Are you looking for the split command? The number of lines in the file and the number of pods in the cluster can be obtained using the wc command.
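A rough sketch of that approach (the file name and the pod label selector are assumptions):

PODS=$(kubectl get pods -l app=jmeter-slave --no-headers | wc -l)
LINES=$(( ( $(wc -l < data.csv) + PODS - 1 ) / PODS ))   # ceil(total lines / pods)
split -l "$LINES" data.csv chunk_                        # produces chunk_aa, chunk_ab, ...
# copy one chunk into each pod, e.g. kubectl cp chunk_aa <pod-name>:/path/to/data.csv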
Also, there might be better solutions, like using the HTTP Simple Table Server or Redis Data Set, so the test data would be stored in a "central" location and you won't have to bother about splitting it and copying the parts to the slaves.
I am trying to read a file from my local EMR file system. It is there at /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file://// prefix to the file path as well, because I have seen that work for others, but that still did not work. So I also tried to read directly from the Hadoop file system, where it is stored under /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it might be checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, your application master by default gets created on one of your worker nodes (CORE nodes), not on the master.
When you use file:///emr/myFile.csv, the file exists on your local file system (I'm assuming that means the master node), but your program will look for it on the node where the application master is running. That is evidently not your master node, otherwise you wouldn't be getting this error.
2nd problem:
When you try to access a file in HDFS using java.io.File, it won't work; that class only sees the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, then you don't need to put the namenode and port info; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
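A minimal sketch of what that looks like with the Java API (the same calls work from Scala); the path is the one from the question, and core-site.xml is assumed to be on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up fs.defaultFS from core-site.xml
        Path path = new Path("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv");
        FileSystem fs = path.getFileSystem(conf);      // resolves the FileSystem for that URI

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);              // process each CSV line here
            }
        }
    }
}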
So what's the better option for accessing a file in a Hadoop cluster?
It depends on your use case, but in most cases putting the file in HDFS is much better, because you don't have to worry about where your application master is; every node has access to HDFS.
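For example, to copy the local file into HDFS first (paths taken from the question):

hdfs dfs -mkdir -p /emr/CNSMR_ACCNT_BAL
hdfs dfs -put /emr/myFile.csv /emr/CNSMR_ACCNT_BAL/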
Hope that resolves your problem.
I'm trying to build an Apache Spark application that normalizes CSV files from HDFS (changes the delimiter, fixes broken lines). I use log4j for logging, but all the logs just print on the executors, so the only way I can check them is using the yarn logs -applicationId command. Is there any way I can redirect all logs (from the driver and from the executors) to my gateway node (the one which launches the Spark job) so I can check them during execution?
You should have the executors' log4j props configured to write files local to themselves. Streaming everything back to the driver would cause unnecessary latency in processing.
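A rough sketch of such an executor-side log4j.properties (the file name, log path, and levels are assumptions to adjust for your environment):

# writes each container's logs to a local file under its YARN log directory
log4j.rootCategory=INFO, RollingAppender
log4j.appender.RollingAppender=org.apache.log4j.RollingFileAppender
log4j.appender.RollingAppender.File=${spark.yarn.app.container.log.dir}/my-app.log
log4j.appender.RollingAppender.MaxFileSize=50MB
log4j.appender.RollingAppender.MaxBackupIndex=5
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# one common way to ship it (the file name is an assumption):
# spark-submit --files my-log4j.properties \
#   --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
#   --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
#   ...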
If you plan on being able to 'tail" the logs in near real-time, you would need to instrument a solution like Splunk or Elasticsearch, and use tools like Splunk Forwarders, Fluentd, or Filebeat that are agents on each box that specifically watch for all configured log paths, and push that data to a destination indexer, that'll parse and extract log field data.
Now, there are other alternatives like Streamsets or Nifi or Knime (all open source), which offer more instrumentation for collecting event processing failures, and effectively allow for "dead letter queues" to handle errors in a specific way. The part I like about those tools - no programming required.
I think it is not possible. When you execute Spark in local mode you can see the logs in the console. Otherwise you have to alter the log4j properties to set the log file path.
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml file), container logs are copied to HDFS and deleted on the local machine.
You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from worker nodes happens in real time, though.
There is an indirect way to achieve this. Enable the following property in yarn-site.xml:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
This will store all the logs of your submitted applications in an HDFS location. Then, using the following command, you can download the logs into a single aggregated file:
yarn logs -applicationId application_id_example > app_logs.txt
I came across this GitHub repo, which downloads the driver and container logs separately. Clone the repository: https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs, nicely segregated into driver and container logs, with this command:
yarn-container-logs application_example_id
How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is a sample of the TextIO read that I am using:
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files it has already processed, you could do either of the following:
Define a Streaming job, and use TextIO's watchForNewFiles functionality. You would have to leave your job to run for as long as you want to keep processing files.
Find a way to provide your pipeline with the files that have already been analyzed. For this, every time you run your pipeline you could generate the list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list as a blacklist of files that you don't need to process again; a rough sketch of this approach follows.
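A rough sketch of that second option; listMatchingFiles and loadProcessedFiles are hypothetical helpers, not part of Beam (the first would expand gs://bucketName/TrafficData*.txt, e.g. with the Cloud Storage client library, and the second would load the record kept by previous runs):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ProcessNewFilesOnly {
    public static void main(String[] args) {
        List<String> allFiles = listMatchingFiles("gs://bucketName/TrafficData*.txt");
        List<String> processed = loadProcessedFiles();
        allFiles.removeAll(processed);                 // keep only files not seen before

        Pipeline pipeline = Pipeline.create();
        PCollection<String> lines = pipeline
                .apply("Files to read", Create.of(allFiles).withCoder(StringUtf8Coder.of()))
                .apply("Read each file", TextIO.readAll());
        // ... rest of the existing pipeline, plus a step that appends allFiles
        // to the processed-files record for the next run.
        pipeline.run().waitUntilFinish();
    }

    // hypothetical helper: expand the wildcard, e.g. via the google-cloud-storage client
    static List<String> listMatchingFiles(String pattern) { return new ArrayList<>(); }

    // hypothetical helper: load the list recorded by previous runs (GCS object, database, ...)
    static List<String> loadProcessedFiles() { return new ArrayList<>(); }
}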
Let me know in the comments if you want to work out a solution around one of these two options.