I am load testing using JMeter containers inside a k8s cluster. Right now the .jmx and .csv files are copied to all the containers. Is there a way to split the data file so that each JMeter instance in a container gets its own subset of the original file?
Are you looking for the split command? You can get the number of lines in the file with wc and the number of pods in the cluster with kubectl, then cut the CSV into that many pieces.
Also, there might be better solutions, like using the HTTP Simple Table Server or Redis Data Set plugins, so the test data is stored in a "central" location and you don't have to bother splitting the file and copying the parts to the slaves.
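If you do go the splitting route, here is a minimal sketch of the idea in Scala, purely for illustration; the file name testdata.csv and the pod count of 4 are hypothetical, and in practice a plain split -l on the CSV before building the images does the same job:

import java.io.{File, PrintWriter}
import scala.io.Source

object SplitCsv {
  def main(args: Array[String]): Unit = {
    // Hypothetical inputs: the shared data file and the number of JMeter pods.
    val lines = Source.fromFile("testdata.csv").getLines().toVector
    val pods  = 4

    // Round-robin the rows so every pod gets a roughly equal, disjoint subset.
    lines.zipWithIndex.groupBy { case (_, idx) => idx % pods }.foreach {
      case (pod, chunk) =>
        val out = new PrintWriter(new File(s"testdata-$pod.csv"))
        try chunk.foreach { case (line, _) => out.println(line) }
        finally out.close()
    }
  }
}

Each container would then mount or copy only its own testdata-<n>.csv.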
I am trying to read a file from my local EMR file system. It is there as a file under the folder /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file://// prefix to the file path as well because I have seen that work for others, but that still did not work. So I also try to read directly from the hadoop file system where it is stored in the folder: /emr/CNSMR_ACCNT_BAL/myFile.csv because I thought it was maybe checking by default in hdfs. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, by default the application master is created on one of your worker (CORE) nodes, not on the master.
When you say file:///emr/myFile.csv exists on your local file system (I'm assuming that means on the master node), your program will actually look for this file on the node where the application master is running, and that is clearly not your master node, because if it were, you wouldn't be getting any error.
2nd problem:
When you try to access a file in HDFS using the Java File class, it won't be able to reach it, because java.io.File only works with the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI form: hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, then you don't need the namenode and port info; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
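A minimal sketch of reading the file through that API, assuming fs.defaultFS is configured so the plain path resolves to HDFS (the path is the one from the question):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Picks up core-site.xml / hdfs-site.xml from the classpath, so fs.defaultFS applies.
val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Open the HDFS file and read it line by line instead of going through java.io.File.
val in = fs.open(new Path("/emr/CNSMR_ACCNT_BAL/myFile.csv"))
try {
  Source.fromInputStream(in).getLines().foreach(println)
} finally {
  in.close()
}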
So what's the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting the file in HDFS is much better, because you don't have to worry about where your application master is; every node has access to HDFS.
Hope that resolves your problem.
How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs it re-reads all the files.
This is a batch job, and the following is the sample TextIO read that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, which is the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files you could do either of the following:
Define a streaming job, and use TextIO's watchForNewFiles functionality (see the sketch after these options). You would have to leave your job running for as long as you want to keep processing files.
Find a way to provide your pipeline with files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again you can use this list as a blacklist for files that you don't need to run again.
Let me know in the comments if you want to work out a solution around one of these two options.
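A rough sketch of the first option, calling the Beam Java SDK from Scala purely for illustration; the 30-second poll interval and the never-ending watch are arbitrary choices, and the pipeline must run as a streaming job:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Watch
import org.joda.time.Duration

val pipeline = Pipeline.create(PipelineOptionsFactory.create())

// Continuously pick up files matching the pattern; each file is read once as it appears.
val lines = pipeline.apply(
  "Watch for new traffic files",
  TextIO.read()
    .from("gs://bucketName/TrafficData*.txt")
    .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.never[String]())
)

pipeline.run()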
Running Logstash 2.4 on CentOS 7
So there seems to be zero documentation for this on Elastic's web site, but I have found in this tutorial as well as some other links that you can break up your configs in /etc/logstash/conf.d and logstash, when run as a service, will aggregate them.
The thing is, in the article I linked to they create separate configs for inputs, outputs, and filters.
Ideally I'd prefer to have a single config for each log type that includes the inputs, outputs, and filters for that log type. Is this possible, and is there any official documentation explaining how to run Logstash as a daemon this way? If so, where is Elastic hiding it?
Basically, a Logstash conf file contains the input, filter and output sections. Input is where you give the path to your input data, such as log files, a database or any other document type. Filter is where you can mutate DB fields or do some grok filtering if the input is a log file. Output is where you specify which index to store all the docs in and assign other output parameters such as host, document_id etc. according to your need.
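For example, a single-file pipeline for one log type could look like the sketch below; the log path, grok pattern and index name are placeholders rather than anything from the question:

input {
  file {
    path => "/var/log/httpd/access_log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "apache-access-%{+YYYY.MM.dd}"
  }
}

One caveat: the service concatenates every file under /etc/logstash/conf.d into a single pipeline, so if you keep one such file per log type, guard the filters and outputs with conditionals (for example on type or tags) so one log type's sections don't apply to another's events.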
You could either create the index initially using ES and push all the documents to it, or you could create it through logstash conf itself, where it creates the index when you run the conf file.
You can have more than one Logstash conf file if you need to. If you're handling more than one type of document, it's easier to keep a separate conf file for each. The Logstash documentation lists the flags you can use when you run the config, along with sample configs.
I am using a standalone cluster to run the ALS algorithm. The predictions are being stored to a text file using:
saveAsTextFile(path)
But the text file ends up stored on the cluster nodes. I want to store the text file on the master.
That is expected behavior: path is resolved on the machine where the code is executed, i.e. the worker nodes. I'd recommend either using a cluster file system (e.g. HDFS) or calling .collect() on your data so you can save it locally on the master. Beware of OOM if your data is large.
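A minimal sketch of both options, assuming the predictions are already formatted as strings; the paths and the saveOnMaster name are made up for illustration:

import java.io.PrintWriter
import org.apache.spark.rdd.RDD

def saveOnMaster(predictions: RDD[String]): Unit = {
  // Option 1: a cluster-wide file system such as HDFS is visible from every node.
  predictions.saveAsTextFile("hdfs:///user/me/als-predictions")

  // Option 2: collect to the driver (running on the master) and write one local file.
  // Only safe when the result fits in the driver's memory.
  val writer = new PrintWriter("/home/me/als-predictions.txt")
  try {
    predictions.collect().foreach(line => writer.println(line))
  } finally {
    writer.close()
  }
}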
I have a distributed application, and I use ZooKeeper to manage configuration data across all the distributed servers. My service on each server needs some DLLs to run. I am trying to build a centralized system from which I can copy my DLLs to all the servers.
Can I achieve that using ZooKeeper?
I am aware that "ZooKeeper is generally not designed for large size storage". My DLL files are less than 3 MB each.
There is a 1 MB soft limit on how large znode data can get. According to the docs you can increase the maximum data size:
jute.maxbuffer:
(Java system property: jute.maxbuffer)
This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.
I would not recommend using ZooKeeper for this purpose (you could much more easily host the binaries on a web server instead), but it does seem possible in theory.
ZooKeeper is designed to pass small coordination messages inside the cluster. The best thing you can do is create a znode (say Znode_A) that contains child znodes, and watch Znode_A for changes. Each child znode in Znode_A represents a DLL and contains that DLL's path. Every node in the cluster watches Znode_A for child changes, so when a new DLL (znode) is created, the nodes know to copy the DLL from a main repository.
In order to transfer the files themselves you can use SCP. As the znode data you can store the file path of each DLL; using SCP you can then pull the files from the base repository.
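A rough sketch of the watching side in Scala, using the standard ZooKeeper client; the /dlls path, the host name and the znode layout are assumptions, and the actual scp copy is left out:

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import org.apache.zookeeper.Watcher.Event.EventType
import scala.jdk.CollectionConverters._

object DllWatcher {
  // /dlls plays the role of Znode_A; each child znode holds the path of one dll.
  val zk = new ZooKeeper("zk-host:2181", 30000, null)

  def watchDlls(): Unit = {
    val children = zk.getChildren("/dlls", new Watcher {
      override def process(event: WatchedEvent): Unit =
        // A dll znode was added or removed: re-register the watch and re-read the list.
        if (event.getType == EventType.NodeChildrenChanged) watchDlls()
    })
    children.asScala.foreach { child =>
      val dllPath = new String(zk.getData(s"/dlls/$child", false, null))
      // Here you would scp the file at dllPath from the central repository.
      println(s"dll available at: $dllPath")
    }
  }
}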