I'm writing an application that monitors a directory for new input files by polling the directory every few seconds. New files may often be several megabytes, and so take some time to fully arrive in the input directory (eg: on copy from a remote share).
Is there a simple way to detect whether a file is currently in the process of being copied? Ideally any method would be platform and filesystem agnostic, but failing that, specific strategies might be required for different platforms.
I've already considered taking two directory listings separated by a few seconds and comparing file sizes, but this introduces a time/reliability trade-off that my superiors aren't happy with unless there is no alternative.
For background, the application is being written as a set of Matlab M-files, so no JRE/CLR tricks I'm afraid...
Edit: files are arriving in the input directory by a straight move/copy operation, either from a network drive or from another location on a local filesystem. This copy operation will probably be initiated by a human user rather than another application.
As a result, it's pretty difficult to place any responsibility on the file provider to add control files or use an intermediate staging area...
Conclusion: it seems like there's no easy way to do this, so I've settled for a belt-and-braces approach - a file is ready for processing if:
its size doesn't change in a certain period of time, and
it's possible to open the file in read-only mode (some copying processes place a lock on the file).
Thanks to everyone for their responses!
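The two-part check from the conclusion above could be sketched roughly as follows. The original application is Matlab; this is Python purely for illustration, and the function name `is_ready` and the `settle_seconds` parameter are made up for the example. It assumes a copy in progress either changes the file size or holds a lock that makes a read-only open fail, which is not guaranteed on every platform.

```python
import os
import time


def is_ready(path, settle_seconds=5.0):
    """Return True if the size is unchanged over settle_seconds and the file opens for reading."""
    try:
        size_before = os.path.getsize(path)
        time.sleep(settle_seconds)
        size_after = os.path.getsize(path)
    except OSError:
        # The file vanished or is not accessible yet.
        return False
    if size_before != size_after:
        # Still growing (or shrinking), so treat it as in flight.
        return False
    try:
        # Some copy tools hold a lock that makes this open fail.
        with open(path, "rb"):
            pass
    except OSError:
        return False
    return True
```

A longer `settle_seconds` lowers the chance of picking up a stalled copy, at the cost of latency, which is the trade-off mentioned in the question.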
The safest method is to have the application(s) that put files in the directory first put them in a different, temporary directory, and then move them to the real one (which should be an atomic operation even when using FTP or file shares). You could also use naming conventions to achieve the same result within one directory.
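On the writing side, the staging idea might look something like this in Python. It is only a sketch: `deliver_atomically` and its parameters are hypothetical names, and it assumes the staging directory and the final directory sit on the same filesystem, since `os.replace` is only a single-step rename in that case.

```python
import os
import tempfile


def deliver_atomically(data, final_path, staging_dir):
    """Write data to a temporary file in staging_dir, then move it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        # The watcher never sees a partially written file at final_path.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The watcher then only ever looks at the final directory and can treat anything it finds there as complete.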
Edit:
It really depends on the filesystem, on whether its copy functionality even has the concept of a "completed file". I don't know the SMB protocol well, but if it has that concept, you could write an app that exposes an SMB interface (or patch Samba) and an API to get notified for completed file copies. Probably a lot of work though.
This is a middleware problem as old as the hills, and the short answer is: no.
The two 'solutions' put the onus on the file-uploader: (1) upload the file in a staging directory and then move it into the destination directory (2) upload the file, and then create/upload a 'ready' file that indicates the state of the content file.
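For option (2), the receiving side could be sketched like this. The `.ready` suffix is an assumed naming convention invented for the example, as is the function name; the uploader would have to agree to create the marker only after the content file is complete.

```python
import os

READY_SUFFIX = ".ready"  # hypothetical convention agreed with the uploader


def files_marked_ready(directory):
    """Return content files that have a matching '<name>.ready' marker."""
    names = set(os.listdir(directory))
    ready = []
    for name in sorted(names):
        if name.endswith(READY_SUFFIX):
            continue  # skip the markers themselves
        if name + READY_SUFFIX in names:
            ready.append(name)
    return ready
```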
The first is the better of the two, but both are inelegant. The truth is that better communication media exist than the filesystem. Consider using some IPC that involves only a push or a pull (and not both, as the filesystem does), such as an HTTP POST, a JMS or MSMQ queue, etc. Furthermore, this can also be synchronous, allowing the process receiving the file to acknowledge the content, even check it for worthiness, and hand the client a receipt - this is the righteous road to non-repudiation. Follow this, and you will never suffer arguments over whether a file was or was not delivered to your server for processing.
M.
One simple possibility would be to poll at a fairly large interval (2 to 5 minutes) and only acknowledge the new file the second time you see it.
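A rough Python sketch of the "second time you see it" idea, with the extra assumption that the size must also be unchanged between the two polls. The function name and the snapshot dictionary are invented for illustration.

```python
import os


def stable_since_last_poll(directory, previous):
    """Compare the current listing with the snapshot from the previous poll.

    Returns (ready, current): ready lists files whose size is unchanged since
    the previous poll; current is the snapshot to pass to the next call.
    """
    current = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            current[entry.name] = entry.stat().st_size
    # A file seen for the first time is never ready, because it has no previous size.
    ready = sorted(name for name, size in current.items() if previous.get(name) == size)
    return ready, current
```

The caller would start with an empty dictionary, call this once per polling interval, and remove or record files once they have been processed, since an untouched file is reported as ready on every later poll.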
I don't know of a way in any OS to determine whether a file is still being copied, other than maybe checking if the file is locked.
How are the files getting there? Can you set an attribute on them as they are written and then change the attribute when write is complete? This would need to be done by the thing doing the writing ... which sounds like it isn't an option.
Otherwise, caching the listing and treating a file as new if it has the same file size for two consecutive listings is the best way I can think of.
Alternatively, you could use the modified time on the file - the file has to be new and have a modified time that is at least x in the past. But I think this will be about equivalent to caching the listing.
If you are polling the folder every few seconds, it's not much of a time penalty, is it? And it's platform agnostic.
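The modified-time variant mentioned above might be sketched like this in Python. The name `quiet_files` and the `min_age_seconds` threshold (the "x" above) are placeholders, and the `now` parameter exists only to make the function easy to test.

```python
import os
import time


def quiet_files(directory, min_age_seconds=30.0, now=None):
    """Return files whose last modification is at least min_age_seconds old."""
    if now is None:
        now = time.time()
    result = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime >= min_age_seconds:
            result.append(entry.name)
    return sorted(result)
```

One thing to watch: some copy tools preserve the source file's modification time, in which case a file can look old while it is still arriving, so this is probably best combined with a size check.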
Also, linux only: http://www.linux.com/feature/144666
Like cron but for files. Not sure how it deals with your specific problem - but may be of use?
What is your OS? On Unix you can use the "lsof" utility to determine whether a user has the file open for writing. Apparently the same functionality exists somewhere in Process Explorer for MS Windows.
Alternatively, you could just try an exclusive open on the file and bail out if this fails. But this can be a little unreliable, and it's easy to tread on your own toes.
Related
I'm new to Apama. I see that a com.apama.file lib exists, but I am unsure how to actually use it to read a file. I want to send each line as an event to be parsed and then depending on the contents sent as a different event from there, but googling suggests that I'd need a transport (not sure what that is either) to do so, but my project lead is under the impression that this can all be done using Apama EPL. How true is this and if it has some validity, how can I go about achieving that?
Yes, this is certainly possible. To help you do it, though, please can you provide a little more information about your setup? For example, what is the file type and is the file local to where the correlator will be running? Will there only be one file to process at a time? How large is the file, and are there any specific performance requirements?
You may find this helpful:
https://github.com/SoftwareAG/apama-streaming-analytics-connectivity-FileTransport
You don't say quite what you are trying to achieve, but if you are new to Apama then I will say that this is not something that is done frequently, especially in simpler solutions when you are just starting.
Depending on what you are trying to achieve, are you aware of the "engine_send" tool and the ability to use it to send in a text file of Apama events (normally a .evt file), with batch tags if you want to spread them over time?
http://www.apamacommunity.com/documents/10.5.3.0/apama_10.5.3.0_webhelp/apama-webhelp/apama-webhelp/re-DepAndManApaApp_sending_events_to_correlators.html
http://www.apamacommunity.com/documents/10.5.3.0/apama_10.5.3.0_webhelp/apama-webhelp/apama-webhelp/co-DepAndManApaApp_event_file_format.html
I am building an internal iOS application (so - it won't ever be in the app store), and I need to keep a directory of content synchronized between a server and each of the instances of the iOS application. This would be easy enough if I just wanted to delete and re-download this content each time, but I would rather use something similar to rsync to only download the elements that have changed.
I haven't found any good way to use rsync. I considered Objective-Git as a possibility, but at a quick glance it looks like a lot of the functionality for remote repositories isn't supported yet.
As a final note, while this won't be in the app store, I will not be jailbreaking these devices and I would prefer not to rely on any private APIs (although if there was an elegant solution that used private APIs I might consider it).
Thoughts?
ADDITIONAL NOTE: This needs to be an isolated solution. I won't be relying on outside services (like Dropbox, Box.net, etc...). This needs to work solely between the device and the server (which is on a local network with the device).
Use HTTP to list the contents of each folder on the server.
Compare last modification time of each file with those on the device, and identify added/removed files.
Get added and modified files, remove deleted files.
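The comparison step above can be sketched as a small pure function. It is shown in Python for brevity rather than in the app's own language, and it assumes each side can produce a mapping from relative path to modification time, with the device storing the server's timestamp for every file it downloads. `plan_sync` is a made-up name.

```python
def plan_sync(server, device):
    """Decide what to fetch and what to remove.

    server, device: dicts mapping relative path -> modification time.
    Returns (download, delete) as sorted lists of paths.
    """
    # Fetch anything missing on the device or stamped differently from the server copy.
    download = sorted(path for path, mtime in server.items() if device.get(path) != mtime)
    # Remove anything the server no longer has.
    delete = sorted(path for path in device if path not in server)
    return download, delete
```

Comparing with "not equal" rather than "older than" also picks up files that were rolled back to an earlier version on the server.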
It sounds like you're maybe asking for a library that already does this, but if you don't find one it's obviously moderately easy to write this from the ground up using stat(2) on the server and the same or a higher-level equivalent on the iOS devices. Have the iPhone send a tree of files with their modification date to the server and get back a list of insert/delete/update operations to do with the url (or whatever) for each one so you can do them incrementally on a background thread. Have the information from the server for new/updated files include the mod date that the server has so you can set it to be the same on the iOS device and send that when asking the server for the status of each file (kind of hack using the file system to store that, but it works).
Why not just set up a RESTful interface and do it across HTTP; that way you could query the modification times easily enough to determine whether client or server files need to be updated. You might also want to keep track of what files on the client have been synced, so you can easily know which files to add or delete. This can be done with a simple .sync file or using a plist / sqlite / etc.
If you'll consider FTP, there are some pretty advanced client libraries available.
For example, the iOS Chilkat bundle includes an FTP client library that supports synchronization in both directions. It's not free, but it's pretty cheap -- and you get a ton of other stuff that will likely prove useful someday. Here's an example of iOS pulling down all additions and changes (mode 2):
http://www.example-code.com/ios/ftp_syncLocalTree.asp
One caveat -- judging solely from the example, it doesn't appear to synchronize deletions. If this is a requirement, you could do it yourself without too much effort immediately following a sync.
acrosync (see https://acrosync.com/library.html) seems like a good fit given the initial question, however I haven't used it myself yet.
I have files that are updated every 2 hours. I have to detect the files automatically and insert the extracted information from them into a database.
Our DBMS is Postgresql and programming language is Python. How would you suggest I do that?
I want to make use of DAL (Database Abstraction Layer) to make connection between the files and database and use postgresql LISTEN/NOTIFY techniques to detect the new files. If you agree with me please tell me how I can use LISTEN/NOTIFY functions to detect the files.
Thank you
What you need is to write a script that stays running as a dæmon, using a file system notify API to run a callback function when the files change. When the script is notified that the files change it should connect to PostgreSQL and do the required work, then go back to sleep waiting for the next change.
The only truly cross platform way to watch a directory for changes is to use a delay loop to poll os.listdir and os.stat to check for new files and updated modification times. This is a waste of power and disk I/O; it also gets slow for big sets of files. If your OS reliably changes the directory modification time whenever files within the directory change you can just os.stat the directory in a delay-loop, which helps.
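A minimal sketch of the polling approach described above, using only `os.listdir` and `os.stat`. The function names, the `on_change` callback and the 60-second default interval are assumptions for illustration; the database work would go in the callback.

```python
import os
import time


def snapshot(directory):
    """Map each regular file name in directory to its modification time."""
    result = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            result[name] = os.stat(path).st_mtime
    return result


def changed_files(old, new):
    """Names that are new, or whose mtime differs, between two snapshots."""
    return sorted(name for name, mtime in new.items() if old.get(name) != mtime)


def watch(directory, on_change, interval_seconds=60.0):
    """Poll forever, calling on_change(path) for each new or updated file."""
    previous = snapshot(directory)
    while True:
        time.sleep(interval_seconds)
        current = snapshot(directory)
        for name in changed_files(previous, current):
            on_change(os.path.join(directory, name))
        previous = current
```

With files that change every two hours, a polling interval of a minute or more keeps the overhead small.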
It's much better to use an operating system specific notification API. Were you using Java I'd tell you to use the NIO2 watch service, which handles all the platform specifics for you. It looks like Watchdog may offer something similar for Python, but I haven't needed to do directory change notification in my Python coding so I haven't tested it. If it doesn't work out you can use platform-specific techniques like inotify/dnotify for Linux, and the various watcher APIs for Windows.
See also:
How do I watch a file for changes?
Python daemon to watch a folder and update a database
You can't use LISTEN/NOTIFY because that can only send messages from within the database and your files obviously aren't in there.
You'll want to have your python script scan the directory the files are in and check their modification time (mtime). If they are updated, you'll need to read in the files, parse the data and insert it to the db. Without knowing the format of the files, there's no way to be more specific.
I'm testing out Microsoft Sync Framework to try and see if it'll be suitable for a task that I'm working on. One of the things I'd like to be able to do is to have the option to not just send changed files, but instead to send all of the files (for example, if I'm syncing to a client machine for the first time, and so want to send all files).
I can't seem to find an example of this in the documentation, so any advice would be welcome.
If you're syncing for the first time, then there is nothing special to configure, as it will sync everything.
If you've already synced and want to re-send all files regardless of whether they've changed or not, just delete the metadata file and that should remove all knowledge of what has been synced.
My server maintains its content in a file system (I mean a folder structure). The same folder structure is also maintained in my iPhone client application bundle.
Now if there is a change in my server file system (add, delete or update of a file in some folder in the hierarchy), I need to update the file system accordingly on the client. This means that I need a protocol to be followed between the server and the client.
Can anyone suggest how this can be done?
--
Thanks and Regards,
U'suf
From what I can tell, there is no easy way. I was looking for an rsync equivalent, but I haven't found one.
In my case, I'm manually walking the tree asking the server for differences after a certain date and I remember the last successful sync date.
Not pretty. Could spend lots of time coming up with something sophisticated.