How does NFS process requests for data? - distributed-computing

When using someone else's framework, I noticed that it uses NFS to share a specified folder before performing distributed computing.
For example, suppose the folder contains two parts, 'part1' and 'part2', and my machine 1 reads 'part1' while machine 2 reads 'part2'. If machine 1 then wants the content of 'part2', should it make a request directly to machine 2, or simply read a local 'part2' file?
My understanding is that NFS synchronizes the folder across the machines, so that the files are stored on every machine rather than being links to a location on one particular machine. I'm not sure whether this understanding is correct.

NFS makes files available over a network. Using your example, if machine 1 and machine 2 are clients of the NFS server, they won't refer to each other when attempting to retrieve data. As such, when machine 1 wants 'part2', it will make the request to the NFS server rather than to machine 2 (despite the fact machine 2 has read 'part2').
The reasoning for this is that the version of 'part2' on the NFS server may have changed since machine 2 read it, making machine 2's copy of 'part2' out of date. By making all requests to the NFS server, clients can ensure that they are getting the most recent version of a file at any given time.
The behaviour you're describing is more akin to the behaviour of BitTorrent (https://en.wikipedia.org/wiki/BitTorrent). BitTorrent solves the out-of-date file problem by not allowing files to ever change and distributing hashes of the files. Knowing this, your torrent client can request parts of a folder or file from anyone in a 'swarm' and independently verify that the parts you received are correct.
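To make the data flow concrete, here is a minimal sketch of how such a share is typically set up (the hostnames and paths are made up for illustration). The folder lives on the NFS server, and each client mounts it over the network instead of keeping its own copy:

# On the NFS server: export the shared folder to the clients' subnet
echo '/srv/shared 192.168.1.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# On machine 1 and machine 2: mount the export; 'part1' and 'part2' stay on the server
mount -t nfs nfs-server:/srv/shared /mnt/shared
cat /mnt/shared/part2   # this read goes over the network to the NFS server

Every read of /mnt/shared/part2 on machine 1 is served by the NFS server (subject to client-side caching), never by machine 2.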

Is there a way to upload data from FTP server to Amazon S3?

Is there a way to connect to an Amazon S3 bucket with FTP or SFTP rather than the built-in Amazon file transfer interface in the AWS console? Seems odd that this isn't a readily available option.
There are three options.
You can use a native Amazon Managed SFTP service (aka AWS Transfer for SFTP), which is easier to set up.
Or you can mount the bucket to a file system on a Linux server and access the files over SFTP like any other files on the server (which gives you greater control).
Or you can just use a (GUI) client that natively supports the S3 protocol (which is free).
Managed SFTP Service
In your Amazon AWS Console, go to AWS Transfer for SFTP and create a new server.
On the SFTP server page, add a new SFTP user (or users).
Permissions of users are governed by an associated AWS role in IAM service (for a quick start, you can use AmazonS3FullAccess policy).
The role must have a trust relationship to transfer.amazonaws.com.
For details, see my guide Setting up an SFTP access to Amazon S3.
Mounting Bucket to Linux Server
Just mount the bucket using s3fs file system (or similar) to a Linux server (e.g. Amazon EC2) and use the server's built-in SFTP server to access the bucket.
Install s3fs.
Add your security credentials in the form access-key-id:secret-access-key to /etc/passwd-s3fs.
Add a bucket mounting entry to fstab:
<bucket> /mnt/<bucket> fuse.s3fs rw,nosuid,nodev,allow_other 0 0
For details, see my guide Setting up an SFTP access to Amazon S3.
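A minimal sketch of those steps on a Debian/Ubuntu server (the bucket name and credentials below are placeholders):

# Install s3fs (the package name may differ between distributions)
sudo apt-get install -y s3fs

# Store credentials as access-key-id:secret-access-key, readable by root only
echo 'AKIAXXXXXXXXXXXX:SECRETKEYXXXXXXXXXXXXXXXX' | sudo tee /etc/passwd-s3fs
sudo chmod 600 /etc/passwd-s3fs

# Mount the bucket (or add the fstab entry above and run: sudo mount -a)
sudo mkdir -p /mnt/my-bucket
sudo s3fs my-bucket /mnt/my-bucket -o allow_other

After that, SFTP users whose home directories live under /mnt/my-bucket see the bucket's objects as ordinary files.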
Use S3 Client
Or use any free "FTP/SFTP client" that is also an "S3 client", so you do not have to set up anything on the server side. For example, my WinSCP or Cyberduck.
WinSCP even has scripting and a .NET/PowerShell interface, if you need to automate the transfers.
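For example, a scripted upload with WinSCP's command-line interface might look roughly like this (the access keys, bucket and paths are placeholders; check the WinSCP scripting documentation for the exact session URL syntax):

winscp.com /command "open s3://ACCESSKEYID:SECRETACCESSKEY@s3.amazonaws.com/" "put C:\data\report.csv /my-bucket/reports/" "exit"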
Update
AWS now offers a fully managed SFTP gateway service for S3 (AWS Transfer for SFTP) that integrates with IAM and can be administered using aws-cli.
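A rough sketch of administering it with aws-cli (the server ID, role ARN and user name are placeholders; verify the exact flags against the current aws transfer documentation):

# Create a service-managed SFTP endpoint
aws transfer create-server --identity-provider-type SERVICE_MANAGED

# Add a user whose permissions come from an IAM role trusted by transfer.amazonaws.com
aws transfer create-user \
  --server-id s-1234567890abcdef0 \
  --user-name backup-user \
  --role arn:aws:iam::123456789012:role/sftp-s3-access \
  --home-directory /my-bucket/backup-user

# Register the user's SSH public key
aws transfer import-ssh-public-key \
  --server-id s-1234567890abcdef0 \
  --user-name backup-user \
  --ssh-public-key-body "$(cat ~/.ssh/id_rsa.pub)"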
There are theoretical and practical reasons why this isn't a perfect solution, but it does work...
You can install an FTP/SFTP service (such as proftpd) on a linux server, either in EC2 or in your own data center... then mount a bucket into the filesystem where the ftp server is configured to chroot, using s3fs.
I have a client that serves content out of S3, and the content is provided to them by a 3rd party who only supports ftp pushes. So, with some hesitation (due to the impedance mismatch between S3 and an actual filesystem) but lacking the time to write a proper FTP/S3 gateway server software package (which I still intend to do one of these days), I proposed and deployed this solution for them several months ago, and they have not reported any problems with the system.
As a bonus, since proftpd can chroot each user into their own home directory and "pretend" (as far as the user can tell) that files owned by the proftpd user are actually owned by the logged in user, this segregates each ftp user into a "subdirectory" of the bucket, and makes the other users' files inaccessible.
There is a problem with the default configuration, however.
Once you have a few tens or hundreds of files, the problem will manifest itself when you pull a directory listing, because ProFTPd will attempt to read the .ftpaccess files over, and over, and over again: for each file in the directory, .ftpaccess is checked to see whether the user should be allowed to view it.
You can disable this behavior in ProFTPd, but I would suggest that the most correct configuration is to configure additional options -o enable_noobj_cache -o stat_cache_expire=30 in s3fs:
-o stat_cache_expire (default is no expire)
specify expire time(seconds) for entries in the stat cache
Without this option, you'll make fewer requests to S3, but you also will not always reliably discover changes made to objects if external processes or other instances of s3fs are also modifying the objects in the bucket. The value "30" in my system was selected somewhat arbitrarily.
-o enable_noobj_cache (default is disable)
enable cache entries for the object which does not exist. s3fs always has to check whether file(or sub directory) exists under object(path) when s3fs does some command, since s3fs has recognized a directory which does not exist and has files or subdirectories under itself. It increases ListBucket request and makes performance bad. You can specify this option for performance, s3fs memorizes in stat cache that the object (file or directory) does not exist.
This option allows s3fs to remember that .ftpaccess wasn't there.
Unrelated to the performance issues that can arise with ProFTPd, which are resolved by the above changes, you also need to enable -o enable_content_md5 in s3fs.
-o enable_content_md5 (default is disable)
verifying uploaded data without multipart by content-md5 header. Enable to send "Content-MD5" header when uploading a object without multipart posting. If this option is enabled, it has some influences on a performance of s3fs when uploading small object. Because s3fs always checks MD5 when uploading large object, this option does not affect on large object.
This is an option which never should have been an option -- it should always be enabled, because not doing this bypasses a critical integrity check for only a negligible performance benefit. When an object is uploaded to S3 with a Content-MD5: header, S3 will validate the checksum and reject the object if it's corrupted in transit. However unlikely that might be, it seems short-sighted to disable this safety check.
Quotes are from the man page of s3fs. Grammatical errors are in the original text.
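Putting it together, the s3fs mount behind ProFTPd might look something like this (the bucket name and mount point are placeholders; this is a sketch of the options discussed above, not a drop-in configuration):

# Mount the bucket where proftpd chroots its users, with the caching and
# integrity options discussed above
s3fs my-bucket /var/ftp/s3 \
  -o allow_other \
  -o enable_noobj_cache \
  -o stat_cache_expire=30 \
  -o enable_content_md5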
Answer from 2014 for the people who are down-voting me:
Well, S3 isn't FTP. There are lots and lots of clients that support S3, however.
Pretty much every notable FTP client on OS X has support, including Transmit and Cyberduck.
If you're on Windows, take a look at Cyberduck or CloudBerry.
Updated answer for 2019:
AWS has recently released the AWS Transfer for SFTP service, which may do what you're looking for.
Or spin up a Linux instance running SFTP Gateway in your AWS infrastructure that saves uploaded files to your Amazon S3 bucket.
Supported by Thorntech
Amazon has released SFTP services for S3, but they only do SFTP (not FTP or FTPES) and they can be cost prohibitive depending on your circumstances.
I'm the Founder of DocEvent.io, and we provide FTP/S Gateways for your S3 bucket without having to spin up servers or worry about infrastructure.
There are also other companies that provide a standalone FTP server that you pay for by the month and that can connect to an S3 bucket through its configuration, for example brickftp.com.
Lastly there are also some AWS Marketplace apps that can help, here is a search link. Many of these spin up instances in your own infrastructure - this means you'll have to manage and upgrade the instances yourself which can be difficult to maintain and configure over time.
WinSCP now supports the S3 protocol
First, make sure your AWS user with S3 access permissions has an “Access key ID” created. You also have to know the “Secret access key”. Access keys are created and managed on the Users page of the IAM Management Console.
Make sure the New site node is selected.
On the New site node, select the Amazon S3 protocol.
Enter your AWS user Access key ID and Secret access key.
Save your site settings using the Save button.
Login using the Login button.
Filezilla just released a Pro version of their FTP client. It connects to S3 buckets in a streamlined, FTP-like experience. I use it myself (no affiliation whatsoever) and it works great.
As other posters have pointed out, there are some limitations with the AWS Transfer for SFTP service. You need to align it closely with your requirements. For example, there are no quotas, whitelists/blacklists, or file type limits, and non-key-based access requires external services. There is also a certain overhead relating to user management and IAM, which can become a pain at scale.
We have been running an SFTP S3 proxy gateway for about 5 years now for our customers. The core solution is wrapped in a collection of Docker services and deployed in whatever context is needed, even on-premise or local development servers. The use case for us is a little different, as our solution is focused on data processing and pipelines rather than a file share. In a Salesforce example, a customer will use SFTP as the transport method, sending email, purchase...data to an SFTP/S3 endpoint. This is mapped to an object key on S3. Upon arrival, the data is picked up, processed, routed and loaded into a warehouse. We also have fairly significant auditing requirements for each transfer, something the CloudWatch logs for AWS do not directly provide.
As others have mentioned, rolling your own is an option too. Using AWS Lightsail you can set up a cluster of, say, 4 $10 2GB instances using either Route 53 or an ELB.
In general, it is great to see AWS offer this service and I expect it to mature over time. However, depending on your use case, alternative solutions may be a better fit.

Two master instances on same database

I want to use PostgreSQL on Windows Server 2012 R2 for one of our projects, which needs 24/7 uptime.
I would like to ask the community whether I can have 2 master instances on 2 different servers, A and B, that 'work' on the same DB located on shared file storage on the LAN. Only the master instance on server A would normally be online; when it goes offline for some reason, a PowerShell script would (I suppose) recognize that the PostgreSQL service has stopped and start the service on server B. The same script would continuously check that only one service on servers A and B is running, to avoid conflicts.
I'd like to ask whether this is possible, or whether there is a better approach for my configuration.
(I can't use replication because when server A shuts down, server B is in read-only mode, which I don't want.)
If you manage to start two instances of PostgreSQL on the same data directory, serious data corruption will happen.
Normally there is a postmaster.pid file that prevents that, but a PostgreSQL server process on a different machine that accesses the same file system will happily unlink that after spewing some log messages, thinking it was left behind from a crash.
So you are really walking on thin ice with a solution like that.
One other issue that you didn't think of is the script that is supposed to check whether the server is still running. What if that script fails, for example because the network connection between the two servers is down, but the server is still up and running happily? Such a “split brain” scenario will cause data corruption with your setup.
Another word of caution: since you seem to be using Windows (Powershell?), you probably envision a CIFS file system when you are talking of shared storage. A Windows “network share” is not a reliable file system — last time I checked, it did not honor _commit.
Creating a reliable failover cluster is harder than you think, and I'd recommend that you check existing solutions before you try to roll your own.

Implementing a distributed grep

I'm trying to implement a distributed grep. How can I access the log files from different systems? I know I need to use the network, but I don't know whether to use ssh, telnet, or something else. What information do I need to know about the machines I am going to connect to from my machine? I want to be able to connect to different Linux machines, read their log files, and pipe the results back to my machine.
Your system contains a number of Linux machines which produce log data (SERVERs), and one machine which you operate (CLIENT). Right?
Issue 1) file to be accessed.
In general, a log file is locked by the software which produces the log data, because the software has to be able to write data into the log file at any time.
To access the log file from other software, you need to prepare an unlocked log data file.
This may require some modification of the software's setup and/or the software (program) itself.
Issue 2) program to serve log files.
To get log data from the SERVERs, each SERVER has to run some server program.
For remote shell access, rshd (remote shell daemon) is needed (ssh is a combination of rsh and secure communication).
For FTP access, ftpd (file transfer protocol daemon) is needed.
The software needed depends on how the CLIENT accesses the SERVERs.
Issue 3) distribued grep.
You use the words 'distributed grep'. What do you mean by them?
What is distributed in your 'distributed grep'?
Many scenarios come to mind.
a) Log files are distributed across the SERVERs. All log data is collected on the CLIENT, and the grep program works on the collected log data at the CLIENT.
b) Log files are distributed across the SERVERs. A grep function is also implemented on each SERVER. The CLIENT asks each SERVER for the result of grep applied to its log data, and the results are collected on the CLIENT (a sketch of this approach follows below).
etc.
What is your plan?
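For scenario b), a minimal sketch over ssh could look like this (the host names, log path and pattern are placeholders; it assumes key-based ssh access from the CLIENT to each SERVER):

# Run grep on each SERVER and collect the results on the CLIENT,
# prefixing each matching line with the host it came from
pattern='ERROR'
for host in server1 server2 server3; do
    ssh "$host" "grep -H '$pattern' /var/log/myapp/*.log" | sed "s/^/$host: /"
done > grep_results.txt

Scenario a) would instead copy the log files to the CLIENT first (for example with rsync or scp) and run grep locally.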
Issue 4) access to SERVERs.
The necessity of secure communication depends on the locations of the machines and the networks between them.
If all machines are in one room/house, and the networks between the machines are not connected to the Internet, secure communication is not necessary.
If the log data is top secret, you may need to encrypt the data before sending it over the network.
How important is your log data?
At a very early stage of development, you should determine the things described above.
This is my advice.

Vagrant: what is the difference between "shared" and "synced" directories?

Vagrant uses the words "share" and "sync" seemingly interchangeably. Is there a difference? If so, what is the difference?
IMO, "sync" implies that the data is duplicated in two places, and Vagrant does some magic to ensure that changes to one are also made to the other. These are slightly different semantics from "sharing". Which is Vagrant doing, or can it do both?
EDIT: for example, say, I want a VM running MySQL server, but storing the database files on the host. Is this kind of setup the kind of thing that shared/syncd directories are appropriate for? E.g., do I have a guarantee of atomicity/transactionality? Sharing semantics would guarantee this, but syncing semantics possibly wouldn't.
(To make things worse, there's also Vagrant Share, which is unrelated to syncing or sharing.)
shared folder (v1 terminology) VS synced folder (renamed in v2)
In short: Shared Folders are more VirtualBox-specific (vboxsf) and have known performance issues as the number of files grows.
Vagrant v2 (Vagrant 1.1.x, 1.2.x+) docs use a more generic name, Synced Folder, which now includes many options: the default vboxsf, rsync, Samba/CIFS, NFS.
By default, Vagrant syncs the project directory (where the Vagrantfile resides) with /vagrant within the guest. This can be disabled by explicitly disabling it in the Vagrantfile and doing a vagrant reload.
e.g. config.vm.synced_folder ".", "/vagrant", disabled: true
To see a long story, see this answer: https://stackoverflow.com/a/18529697/1801697
Let's talk about sync
For vboxsf and nfs, host and guest folders (I mean synced folders) are always in sync (changes made on either side are synced to the other).
NOTE: SMB/CIFS should be the same but I've never used it.
In Vagrant 1.5, the rsync type was added, which makes manual sync possible; by default it syncs from host to guest upon the first vagrant up. I personally prefer rsync if real-time sync between host and guest is NOT needed.
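For example, assuming a Vagrantfile that declares an rsync-type synced folder, the day-to-day workflow from the host looks roughly like this:

vagrant up           # initial one-way sync from host to guest
vagrant rsync        # push host changes to the guest on demand
vagrant rsync-auto   # watch the host folder and keep pushing changes as they happen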
BTW: Vagrant Share is something different; it shares SSH access or other services via a cloud gateway.

Need an opinion on a method for pulling data from a file with Perl

I am having a conflict of ideas with a script I am working on. The conflict is that I have to read a bunch of lines from a VMware file. As of now I just use SSH to probe every file for each virtual machine while the file stays on the server. The reason I now think this is a problem is that I have 10 virtual machines and about 4 files that I probe for file paths and such. This opens a new SSH channel every time I refer to the ssh object I have created using Net::OpenSSH, so when all is said and done I have probably opened about 16-20 ssh channels. Would it just be easier in a lot of ways to SCP the files over to the machine that needs to process them and then have most of the work done on the local side? The script I am making is a backup script for ESXi, and it will end up storing the files I need to read from anyway.
Any opinion would be most helpful.
If the VMs do the work locally, it's probably better in the long run.
In the short term roughly the same amount of resources will be used, but if you were to migrate these instances to other hardware, then of course you'd see gains from distributing the processing.
Also from a maintenance perspective, it's probably more convenient for each VM to host the local process, since I'd imagine that if you need to tweak it for a specific box, it would make more sense to keep it there.
Aside from the scalability benefits, there aren't really any other pros or cons.
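For what it's worth, a minimal sketch of the 'copy once, parse locally' alternative from the question (the host names and datastore paths below are made up; ESXi layouts vary):

# Pull the .vmx files from each host with one scp per host, then parse them locally
for host in esxi01 esxi02; do
    mkdir -p "backup/$host"
    scp "root@$host:/vmfs/volumes/datastore1/*/*.vmx" "backup/$host/"
done

# Example: extract disk file paths without holding any ssh session open
grep -H 'fileName' backup/*/*.vmx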