akka event sourcing with distributed and/or large files - scala

I am using the book "Mastering Akka" by Christian Baxter.
Now I am trying to build a new project with Akka as an event sourced system.
I have an object like Folder. This Folder can contain a number of Files, real files (java.io.File). For a local system this is no problem. But I am trying to build a distributed system. User A sets up a database and gets access to it. But where are the files? User A is not sitting at his desktop PC (where he saved the Folder); he is sitting at a notebook in his home office. Now he needs the files inside that Folder.
At first I thought of saving the files as Array[Byte]. But what about the situation where a file is 150 MB? And maybe there are 20 files inside the folder, all larger than 150 MB? I don't think my RAM would survive loading all of that without a crash.
There is no HTTP server. Maybe I could ask the server to deliver the files as a stream? But that needs extra setup. Is that the best way?
What is the best practice for handling multiple distributed and/or large files in an event sourced actor?
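A common pattern (this is a hedged sketch, not something from the book): persist only file *metadata* in the journal, keep the bytes in a shared file or blob store, and stream them on demand so nothing near 150 MB is ever held in memory at once. All names below (FolderActor, FileAdded, the storage URI) are made up for illustration:

```scala
import java.nio.file.Paths
import akka.persistence.PersistentActor
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

// Hypothetical protocol: commands and the persisted event. Note that
// the event carries only a *reference* to the file, never the bytes.
case class AddFile(fileName: String, storageUri: String)
case class GetFile(fileName: String)
case class FileAdded(fileName: String, storageUri: String)
case class FileContent(fileName: String, data: Source[ByteString, Any])

class FolderActor(folderId: String) extends PersistentActor {
  override def persistenceId: String = s"folder-$folderId"

  private var files = Map.empty[String, String] // fileName -> storageUri

  override def receiveRecover: Receive = {
    case FileAdded(name, uri) => files += name -> uri
  }

  override def receiveCommand: Receive = {
    case AddFile(name, uri) =>
      persist(FileAdded(name, uri)) { evt =>
        files += evt.fileName -> evt.storageUri
      }
    case GetFile(name) =>
      // Stream from the shared store in chunks: memory use stays flat
      // no matter how large the file is.
      files.get(name).foreach { uri =>
        sender() ! FileContent(name, FileIO.fromPath(Paths.get(uri)))
      }
  }
}
```

If the client is on another machine, the Source itself can't just be sent in a message; something like Akka's StreamRefs or a plain HTTP endpoint would carry the chunked stream across the network, which is presumably what "deliver the files as stream" would mean in practice.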

Related

Delta for file upload

I would like to synchronize uploads from our own server to our clients' Dropboxes, to which we have full access. Syncing changes on Dropbox is easy because I can use the delta call, but I need a more efficient way to identify and upload changes made locally to Dropbox.
The Sync API would be amazing for this, but I'm not trying to make a mobile app, so the languages with the API are not easily accessible (AFAIK). Is there an equivalent to the Sync API for Python running on a Linux server?
Possible solution:
So far, I was thinking of using anydbm to store string-to-string dictionaries that would hold folder names as the key and the hash generated from the server's metadata call as the value. Then I could query Dropbox, and every time I run into a folder, I would check it against the metadata stored in the anydbm:
if there is a difference, compare the file dates/sizes in the folder and, if there are any subfolders, recurse the function into them;
if it is the same, skip the folder.
This should save a substantial amount of time compared to the current verification of each and every file, but if there are better solutions, please do let me know.
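For illustration, here is a rough sketch of that folder-hash bookkeeping (written in Scala, matching the first question on this page; the anydbm role is played by an in-memory map, and FolderMeta is a made-up stand-in for the metadata call's result):

```scala
import java.security.MessageDigest
import scala.collection.mutable

// Hypothetical stand-in for one folder's metadata response:
// (file name, size, modified time) for each entry.
case class FolderMeta(name: String, entries: Seq[(String, Long, Long)])

object FolderSync {
  // folder name -> hash of its metadata; in the real version this
  // would live in anydbm so it survives between runs.
  private val knownHashes = mutable.Map.empty[String, String]

  private def hashOf(meta: FolderMeta): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val bytes  = meta.entries.sorted.mkString("|").getBytes("UTF-8")
    digest.digest(bytes).map("%02x".format(_)).mkString
  }

  // True if the folder looks different from last time, i.e. it needs
  // the file-by-file comparison and recursion described above.
  def needsDeepCheck(meta: FolderMeta): Boolean = {
    val h       = hashOf(meta)
    val changed = !knownHashes.get(meta.name).contains(h)
    if (changed) knownHashes(meta.name) = h
    changed
  }
}
```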

importing updated files into a database

I have files that are updated every 2 hours. I have to detect the files automatically and insert the extracted information from them into a database.
Our DBMS is PostgreSQL and the programming language is Python. How would you suggest I do that?
I want to make use of a DAL (Database Abstraction Layer) to make the connection between the files and the database, and use PostgreSQL LISTEN/NOTIFY techniques to detect the new files. If you agree with this approach, please tell me how I can use the LISTEN/NOTIFY functions to detect the files.
Thank you
What you need is to write a script that stays running as a dæmon, using a file system notification API to run a callback function when the files change. When the script is notified that the files have changed, it should connect to PostgreSQL and do the required work, then go back to sleep waiting for the next change.
The only truly cross-platform way to watch a directory for changes is to use a delay loop that polls os.listdir and os.stat to check for new files and updated modification times. This wastes power and disk I/O, and it gets slow for big sets of files. If your OS reliably changes the directory modification time whenever files within the directory change, you can just os.stat the directory in a delay loop, which helps.
It's much better to use an operating-system-specific notification API. Were you using Java, I'd tell you to use the NIO2 watch service, which handles all the platform specifics for you. It looks like Watchdog may offer something similar for Python, but I haven't needed to do directory change notification in my Python coding, so I haven't tested it. If it doesn't work out, you can use platform-specific techniques like inotify/dnotify on Linux and the various watcher APIs on Windows.
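For what it's worth, the NIO2 watch service mentioned above looks roughly like this (sketched in Scala, the language of the top question; the directory path is invented):

```scala
import java.nio.file.{FileSystems, Paths, StandardWatchEventKinds => K}
import scala.jdk.CollectionConverters._

val watcher = FileSystems.getDefault.newWatchService()
val dir     = Paths.get("/var/incoming") // hypothetical directory
dir.register(watcher, K.ENTRY_CREATE, K.ENTRY_MODIFY)

while (true) {
  val key = watcher.take() // blocks until the OS reports a change
  key.pollEvents().asScala.foreach { event =>
    println(s"${event.kind()}: ${event.context()}")
    // ...connect to PostgreSQL and do the import work here...
  }
  if (!key.reset()) sys.error("watched directory is no longer accessible")
}
```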
See also:
How do I watch a file for changes?
Python daemon to watch a folder and update a database
You can't use LISTEN/NOTIFY because that can only send messages from within the database and your files obviously aren't in there.
You'll want to have your Python script scan the directory the files are in and check their modification times (mtime). If a file has been updated, you'll need to read it in, parse the data, and insert it into the db. Without knowing the format of the files, there's no way to be more specific.
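A minimal sketch of that mtime scan (again in Scala for consistency with the rest of this page; drive it from a loop or a cron-style scheduler):

```scala
import java.io.File
import scala.collection.mutable

// Remember the last-seen modification time per file and report
// anything newer on each scan.
val lastSeen = mutable.Map.empty[String, Long]

def changedFiles(dir: File): Seq[File] =
  Option(dir.listFiles()).getOrElse(Array.empty[File]).toSeq.filter { f =>
    val prev    = lastSeen.getOrElse(f.getName, 0L)
    val changed = f.lastModified() > prev
    if (changed) lastSeen(f.getName) = f.lastModified()
    changed
  }
// For each changed file: read it, parse it, INSERT into PostgreSQL.
```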

Guaranteeing consistency while accessing files on a web server

I'm in the process of building a simple update server for an application. The parts of the application being updated are configuration files; the most up-to-date copies of these files exist on the update server and these files can be edited by the individual managing the application (the "application manager") at any time. However, I don't want the application to be able to download one of these files while the file is being edited by the application manager; this would obviously cause consistency issues. How can I prevent these files from being accessed in an inconsistent state? Alternatively, would a solution be to provide a checksum along with the file that the application could use to determine if the file was received in a consistent state?
EDIT: I've seen this post concerning access restrictions using .htaccess and think it could be of use. However, I want the application manager to do as little thinking as possible; having them forget to re-allow connections could be problematic. That being said, they're going to have to do some work at some point; maybe this is the way I should go?
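On the checksum idea from the question: the server publishes a digest next to each file, and the client re-downloads until the digest of what it received matches. A rough sketch, where the two fetch functions are assumed HTTP calls to the update server:

```scala
import java.security.MessageDigest

def sha256Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

// If the manager saved mid-download, the digests won't match and we
// simply try again, up to maxRetries times.
def downloadConsistent(fetchFile: () => Array[Byte],
                       fetchChecksum: () => String,
                       maxRetries: Int = 3): Option[Array[Byte]] =
  (1 to maxRetries).iterator
    .map(_ => fetchFile())
    .find(bytes => sha256Hex(bytes) == fetchChecksum())
```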

file system sync between the server and the iphone client

I have my server maintaining the content with a file system (I mean a folder structure). The same folder structure is also maintained in my iPhone client application bundle.
Now if there is a change in my server's file system (add, delete, or update of a file somewhere in the hierarchy), I need to update the file system accordingly at the client. This means I need a protocol to be followed between the server and the client.
Can anyone suggest how this can be done?
From what I can tell, there is no easy way. I was looking for an rsync equivalent, but I haven't found one.
In my case, I'm manually walking the tree, asking the server for differences after a certain date, and remembering the last successful sync date.
Not pretty. You could spend lots of time coming up with something sophisticated.
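That "differences after a certain date" approach boils down to something like this (a Scala sketch; SyncServer and its endpoint are assumptions, not a real API):

```scala
// One changed entry as reported by the server.
case class Entry(path: String, mtime: Long, deleted: Boolean)

// Assumed server endpoint: everything that changed since a timestamp.
trait SyncServer {
  def changesSince(since: Long): Seq[Entry]
}

// Client: apply each change locally (add/update/delete), then keep the
// new watermark only if everything succeeded.
def sync(server: SyncServer, lastSync: Long, apply: Entry => Unit): Long = {
  val started = System.currentTimeMillis()
  server.changesSince(lastSync).foreach(apply)
  started // store this as the next lastSync
}
```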

Detect a file in transit?

I'm writing an application that monitors a directory for new input files by polling the directory every few seconds. New files may often be several megabytes, and so take some time to fully arrive in the input directory (eg: on copy from a remote share).
Is there a simple way to detect whether a file is currently in the process of being copied? Ideally any method would be platform and filesystem agnostic, but failing that specific strategies might be required for different platforms.
I've already considered taking two directory listings separated by a few seconds and comparing file sizes, but this introduces a time/reliability trade-off that my superiors aren't happy with unless there is no alternative.
For background, the application is being written as a set of Matlab M-files, so no JRE/CLR tricks I'm afraid...
Edit: files are arriving in the input directory by a straight move/copy operation, either from a network drive or from another location on a local filesystem. This copy operation will probably be initiated by a human user rather than another application.
As a result, it's pretty difficult to place any responsibility on the file provider to add control files or use an intermediate staging area...
Conclusion: it seems like there's no easy way to do this, so I've settled for a belt-and-braces approach (sketched after this list) - a file is ready for processing if:
its size doesn't change in a certain period of time, and
it's possible to open the file in read-only mode (some copying processes place a lock on the file).
Thanks to everyone for their responses!
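That two-part readiness check could look roughly like this (sketched in Scala rather than Matlab, purely for illustration):

```scala
import java.io.{File, FileInputStream}

// Ready = the size seen on the previous poll is unchanged AND a
// read-only open succeeds (some copiers hold a lock until finished).
def isReady(f: File, sizeLastPoll: Option[Long]): Boolean = {
  val sizeStable = sizeLastPoll.contains(f.length())
  val openable =
    try { new FileInputStream(f).close(); true }
    catch { case _: Exception => false }
  sizeStable && openable
}
```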
The safest method is to have the application(s) that put files in the directory first put them in a different, temporary directory, and then move them to the real one (which should be an atomic operation even when using FTP or file shares). You could also use naming conventions to achieve the same result within one directory.
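In code, that stage-then-move convention is a couple of lines (the paths and payload are hypothetical; note that ATOMIC_MOVE requires source and target to be on the same filesystem):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

val staging = Paths.get("/data/incoming/.staging/report.csv") // hypothetical
val target  = Paths.get("/data/incoming/report.csv")
val payload = "a,b,c\n".getBytes("UTF-8")

// Write (or upload) completely into the staging area first...
Files.write(staging, payload)
// ...then publish it with a single atomic rename.
Files.move(staging, target, StandardCopyOption.ATOMIC_MOVE)
```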
Edit:
It really depends on the filesystem, on whether its copy functionality even has the concept of a "completed file". I don't know the SMB protocol well, but if it has that concept, you could write an app that exposes an SMB interface (or patch Samba) and an API to get notified for completed file copies. Probably a lot of work though.
This is a middleware problem as old as the hills, and the short answer is: no.
The two 'solutions' put the onus on the file uploader: (1) upload the file to a staging directory and then move it into the destination directory; (2) upload the file, and then create/upload a 'ready' file that indicates the state of the content file.
The first one is better, but both are inelegant. The truth is that better communication media exist than the filesystem. Consider using some IPC that involves only a push or a pull (and not both, as the filesystem does), such as an HTTP POST or a JMS or MSMQ queue. Furthermore, this can also be synchronous, allowing the process receiving the file to acknowledge the content, even check it for worthiness, and hand the client a receipt; this is the righteous road to non-repudiation. Follow this, and you will never suffer arguments over whether a file was or was not delivered to your server for processing.
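Option (2), the 'ready' file, is equally small in code (hypothetical names again): the uploader drops a marker only after the content file is complete, and the consumer ignores content files without one:

```scala
import java.nio.file.{Files, Paths}

val content = Paths.get("/data/in/batch-42.dat")       // hypothetical
val marker  = Paths.get("/data/in/batch-42.dat.ready")

// Uploader: create the marker strictly after the content is complete.
Files.createFile(marker)

// Consumer: process batch-42.dat only once its marker exists.
if (Files.exists(marker)) { /* safe to read `content` now */ }
```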
One simple possibility would be to poll at a fairly large interval (2 to 5 minutes) and only acknowledge the new file the second time you see it.
I don't know of a way in any OS to determine whether a file is still being copied, other than maybe checking if the file is locked.
How are the files getting there? Can you set an attribute on them as they are written and then change the attribute when write is complete? This would need to be done by the thing doing the writing ... which sounds like it isn't an option.
Otherwise, caching the listing and treating a file as new if it has the same file size for two consecutive listings is the best way I can think of.
Alternatively, you could use the modified time on the file: the file has to be new and have a modified time that is at least x in the past. But I think this will be about equivalent to caching the listing.
If you are polling the folder every few seconds, it's not much of a time penalty, is it? And it's platform agnostic.
Also, linux only: http://www.linux.com/feature/144666
Like cron, but for files. Not sure how it deals with your specific problem, but it may be of use.
What is your OS? On Unix you can use the "lsof" utility to determine whether a user has the file open for writing. Apparently the same functionality exists somewhere in the MS Windows Process Explorer.
Alternatively, you could just try an exclusive open on the file and bail out if this fails. But this can be a little unreliable, and it's easy to tread on your own toes.
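For completeness, that "try an exclusive open" probe might look like this on the JVM (a sketch; as noted above, it is not fully reliable across platforms):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Returns true if we could briefly take an exclusive lock, suggesting
// no one else (e.g. the copier) still has the file open for writing.
def canLockExclusively(path: String): Boolean =
  try {
    val ch = FileChannel.open(Paths.get(path), StandardOpenOption.WRITE)
    try {
      val lock = ch.tryLock()
      if (lock != null) { lock.release(); true } else false
    } finally ch.close()
  } catch { case _: Exception => false }
```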