Implementing a distributed grep - distributed-computing

I'm trying to implement a distributed grep. How can I access the log files on different systems? I know I need to use the network, but I don't know whether to use ssh, telnet, or something else. What information do I need about the machines I am going to connect to from my machine? I want to be able to connect to different Linux machines, read their log files, and pipe the contents back to my machine.

Your system contains a number of Linux machines that produce log data (SERVERs) and one machine that you operate (CLIENT). Right?
Issue 1) file to be accessed.
In general, the log file is held open (and possibly locked) by the software that produces the log data, because that software has to be able to write to the log file at any time.
To read the log from another program, you may need to make an unlocked copy of the log data available.
This may require changing the software's configuration and/or the software (program) itself.
Issue 2) program to serve log files.
To get log data from the SERVERs, each SERVER has to run some server program.
For remote shell access, a remote shell daemon such as rshd is needed (ssh provides the same kind of remote shell over an encrypted connection and is served by sshd).
For FTP access, ftpd (file transfer protocol daemon) is needed.
Which software is needed depends on how the CLIENT accesses the SERVERs.
Issue 3) distributed grep.
You use the words 'distributed grep'. What do you mean by them?
What is distributed in your 'distributed grep'?
Several scenarios come to mind.
a) Log files are distributed across the SERVERs. All log data is collected on the CLIENT, and the grep program runs on the collected log data at the CLIENT.
b) Log files are distributed across the SERVERs, and grep is also run on each SERVER. The CLIENT asks each SERVER for the result of grep applied to its log data, and the results are collected on the CLIENT.
etc.
What is your plan?
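Scenario b) could be sketched in Python roughly as follows. This is only a minimal illustration, assuming key-based ssh login to every SERVER and that grep is installed there. The function names (build_ssh_grep, grep_host, distributed_grep) and the runner parameter are hypothetical and exist only for this example.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_grep(host, pattern, path):
    """Build the ssh command that runs grep on one SERVER."""
    # Quote pattern and path because ssh hands the command to a remote shell.
    remote = "grep -- {} {}".format(shlex.quote(pattern), shlex.quote(path))
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


def grep_host(host, pattern, path, runner=subprocess.run):
    """Run grep on one SERVER and return its matches prefixed with the host."""
    proc = runner(build_ssh_grep(host, pattern, path),
                  capture_output=True, text=True)
    # grep exits with 0 (match) or 1 (no match); anything higher is an
    # error reported by grep itself or by ssh (255).
    if proc.returncode > 1:
        raise RuntimeError(
            "grep failed on {}: {}".format(host, proc.stderr.strip()))
    return ["{}: {}".format(host, line) for line in proc.stdout.splitlines()]


def distributed_grep(hosts, pattern, path, runner=subprocess.run):
    """Query all SERVERs in parallel and collect the results on the CLIENT."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        per_host = pool.map(
            lambda h: grep_host(h, pattern, path, runner), hosts)
        # Flatten the per-host lists, keeping the order of `hosts`.
        return [line for lines in per_host for line in lines]
```

With placeholder host names, a call such as distributed_grep(["server1", "server2"], "ERROR", "/var/log/syslog") would return the matching lines from both machines, each prefixed with the host it came from. Only the matching lines cross the network, which is the main difference from scenario a).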
Issue 4) access to SERVERs.
Whether secure communication is necessary depends on where the machines are located and on the networks between them.
If all machines are in one room/house and the networks between them are not connected to the Internet, secure communication may not be necessary.
If the log data is top secret, you may need to encrypt it before sending it over the network.
How important is your log data?
You should decide the things described above at a very early stage of development.
This is my advice.

Related

How does NFS process requests for data?

When I used someone else's framework, I found that it uses NFS to share a specified folder before performing distributed computing.
For example, suppose this folder contains two parts, 'part1' and 'part2', and machine 1 reads 'part1' while machine 2 reads 'part2'. If machine 1 then wants the content of 'part2', does it make a request to machine 2, or does it read a local copy of 'part2' directly?
My understanding is that NFS can synchronize each machine under the corresponding folder, and the file will be stored in each machine, rather than a link to the corresponding location of a certain machine. I'm not sure if this understanding is correct.
NFS makes files available over a network. Using your example, if machine 1 and machine 2 are clients of the NFS server, they won't refer to each other when attempting to retrieve data. As such, when machine 1 wants 'part2', it will make the request to the NFS server rather than to machine 2 (despite the fact machine 2 has read 'part2').
The reasoning for this is that the version of 'part2' on the NFS server may have changed since machine 2 read it, making machine 2's copy of 'part2' out of date. By making all requests to the NFS server, clients can ensure that they are getting the most recent version of a file at any given time.
The behaviour you're describing is more akin to the behaviour of BitTorrent (https://en.wikipedia.org/wiki/BitTorrent). BitTorrent solves the out-of-date file problem by not allowing files to ever change and distributing hashes of the files. Knowing this, your torrent client can request parts of a folder or file from anyone in a 'swarm' and independently verify that the parts you received are correct.
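The hash check described above can be illustrated with a short Python sketch. It is not how a real torrent client is implemented: the piece size is tiny for readability, and the helper names (split_pieces, piece_hashes, verify_piece) are made up for this example. It only shows the idea that a piece received from any peer can be checked against a previously distributed SHA-1 hash.

```python
import hashlib

PIECE_SIZE = 4  # unrealistically small, for illustration only


def split_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file content into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data, piece_size=PIECE_SIZE):
    """Compute the list of hashes that is distributed ahead of the data."""
    return [hashlib.sha1(p).digest() for p in split_pieces(data, piece_size)]


def verify_piece(index, piece, hashes):
    """Check a piece received from an untrusted peer against the known hash."""
    return hashlib.sha1(piece).digest() == hashes[index]
```

Because the hashes are fixed once they have been published, a piece that was modified or is out of date simply fails verify_piece and can be requested again from another peer.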

Redirect logged information from one computer to a process in another?

My setup: I have one computer running an application, foo, which logs information using rsyslog to a file on another, remote machine. The remote machine is running a different application, bar. bar reads the log files it has received and does some processing with this information; however, this is slow.
What I'm trying to do: I would like to pipe the information from foo's log file into the process bar directly. I suppose I theoretically could alter foo's source code to support something like this to even bypass rsyslog or writing to a file locally, but it is a massive enterprise-level software and would be a last resort.

Two master instances on same database

I want to use PostgreSQL on Windows Server 2012 R2 for one of our projects, which needs 24/7 uptime.
I would like to ask the community whether I can have 2 master instances on 2 different servers, A and B, that 'work' on the same DB located on shared file storage in the LAN. The master instance on server A will always be online, and when it goes offline for some reason, (I suppose) a PowerShell script will recognize that the PostgreSQL service has stopped and will start the service on server B. The same script will continuously check that only one service on servers A and B is running, to avoid conflicts.
I'd like to ask whether this is possible, or whether there is a better approach for my configuration.
(I can't use replication because when server A shuts down, server B is in read-only mode, which I don't want.)
If you manage to start two instances of PostgreSQL on the same data directory, serious data corruption will happen.
Normally there is a postmaster.pid file that prevents that, but a PostgreSQL server process on a different machine that accesses the same file system will happily unlink that after spewing some log messages, thinking it was left behind from a crash.
So you are really walking on thin ice with a solution like that.
One other issue that you didn't think of is the script that is supposed to check if the server is still running. What if that script fails, because for example the network connection between the two servers is down, but the server is still up and running happily? Such a "split brain" scenario will cause data corruption with your setup.
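The ambiguity behind that scenario can be made concrete with a small Python sketch. It is an illustration only, not a failover solution: the function names (probe, may_consider_failover) are hypothetical, and a plain TCP connection attempt stands in for whatever check the monitoring script would really perform.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify one TCP connection attempt as 'up', 'refused', or 'unknown'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # The machine answered, but nothing is listening on that port.
        return "refused"
    except OSError:
        # Timeout, unreachable network, DNS failure, ...: the server may
        # well still be running; the monitor just cannot see it.
        return "unknown"


def may_consider_failover(result):
    """Never treat 'unknown' as 'down'; that is the split-brain case."""
    # Even 'refused' is only a hint, not proof that the other instance
    # has stopped touching the shared data directory.
    return result == "refused"
```

A script that starts the second instance whenever the probe fails for any reason treats "unknown" the same as "down", which is exactly the case where both instances can end up running on the same data directory.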
Another word of caution: since you seem to be using Windows (PowerShell?), you probably envision a CIFS file system when you are talking of shared storage. A Windows "network share" is not a reliable file system: last time I checked, it did not honor _commit.
Creating a reliable failover cluster is harder than you think, and I'd recommend that you check existing solutions before you try to roll your own.

Remotely execute PowerShell scripts to collect data

I am looking to collect data snapshots at random intervals from various machines in our network that we don't own, but on which we may get access to install an agent to collect this data.
These machines are either in a domain or in a workgroup, and the kind of data I get depends on the role they play and the information they hold. The machines run "Windows Server 2003" and above, and I do not want to install anything on them before I get started, so I thought I could use PowerShell scripts that I invoke remotely from my server, passing each machine the script it has to run to return the data.
I was wondering whether this is possible with PowerShell scripts and, as this is supposed to run in a secure environment, whether there are any major security implications with this approach, i.e. do I need to do anything on the client machines that could make them vulnerable to security threats?
BTW, these machines are not exposed to the Internet and are behind a firewall.
I would appreciate it if you could point me to any other alternatives that could be useful for my analysis.
Regards
Kiran

Need an opinion on a method for pulling data from a file with Perl

I am having a conflict of ideas with a script I am working on. The conflict is that I have to read a bunch of lines from a VMware file. As of now I just use SSH to probe every file for each virtual machine while the file stays on the server. The reason I now think this is a problem is that I have 10 virtual machines and about 4 files that I probe for file paths and such. This opens a new SSH channel every time I refer to the ssh object I have created using Net::OpenSSH. When all is said and done, I have probably opened about 16-20 ssh objects. Would it be easier in a lot of ways if I SCP'd the files over to the machine that needs to process them and then had most of the work done on the local side? The script I am making is a backup script for ESXi, and it will end up storing the files that I need to read from anyway.
Any opinion would be most helpful.
If the VMs do the work locally, it's probably better in the long run.
In the short term, roughly the same amount of resources will be used, but if you were to migrate these instances to other hardware, then of course you'd see gains from distributing the processing.
Also, from a maintenance perspective, it's probably more convenient for each VM to host the local process, since I'd imagine that if you need to tweak it for a specific box, it would make more sense to keep it there.
Aside from the scalability benefits, there aren't really any other pros or cons.