Verify .mat file exists and is not corrupt - Matlab

I have two independent Matlab workers: FIRST gets and saves data, and SECOND reads it (and does some calculations, etc.).
FIRST saves the data as a .mat file on the hard disk, and SECOND reads it from there. It takes ~20 seconds to save this data as a .mat file and about 8 ms to delete it. Before saving new data, FIRST deletes the old file and then saves a newer version.
How can SECOND verify that the data exists and is not corrupt? I can use exist, but that doesn't tell me whether the data is corrupt. For example, if SECOND tries to read the data exactly when FIRST is saving it, exist passes but load throws an error saying the data is corrupt.
Thanks.

You can't, without some synchronization mechanism - by the time SECOND completes its check and starts to read the file, FIRST might have started writing it again. You need some sort of lock or mutex.
Two options for base Matlab.
If this is on a local filesystem, you could use a separate lock file sitting next to the data file to manage concurrent access to the data file. Use Java's NIO FileChannel and FileLock objects from within Matlab to lock the first byte of the lock file and use that as a semaphore to control access to the data file, so the reader waits until the writer is finished and vice versa. (If this is on a network filesystem, don't try this - file locking may seem to work but usually is not officially supported and in my experience is unreliable.)
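For example, a minimal reader-side sketch (the lock-file name data.mat.lock is my own assumption, error handling is omitted, and the writer would take the same lock around its delete-and-save):
raf = java.io.RandomAccessFile('data.mat.lock', 'rw');
fc  = raf.getChannel();
lck = fc.lock(0, 1, false);   % exclusive lock on byte 0; blocks until the writer releases it
S   = load('data.mat');       % safe to read while we hold the lock
lck.release();                % let the writer carry on
fc.close();
raf.close();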
Or you could just put a try/catch around your load() call and have it pause a few seconds and retry if you get a corrupt file error. The .mat file format is such that you won't get a partial read if the writer is still writing it; you'll get that corrupt file error. So you could use this as a lazy sort of collision detection and backoff. This is what I usually do.
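Something along these lines, for example (the file name, retry count, and pause length are placeholders):
maxTries = 10;
for k = 1:maxTries
    try
        S = load('data.mat');    % errors out if FIRST is still mid-write
        break;                   % got a good copy
    catch err
        if k == maxTries
            rethrow(err);        % give up after maxTries attempts
        end
        pause(3);                % back off and let FIRST finish
    end
end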
To reduce the window of contention, consider having FIRST write to a temporary file in the same directory, and then use a rename to move it to its final destination. That way the file is only unavailable during a quick filesystem move operation, not the 20 seconds of data writing. If you have multiple writers, stick the PID and hostname in the temp file name to avoid collisions.
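A rough writer-side sketch of that (the variable name results and the file names are placeholders; feature('getpid') is undocumented but a common way to get the PID, and COMPUTERNAME assumes Windows):
tmpName = sprintf('data_%s_%d.tmp', getenv('COMPUTERNAME'), feature('getpid'));
save(tmpName, 'results');             % the slow ~20 second write goes to the temp file
movefile(tmpName, 'data.mat', 'f');   % quick move into place; readers only ever see old or new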

Sounds like a classic reader-writer resource-sharing problem between two processes.
In short, you should find a method of safe inter-worker communication. Check this out.
Also, try to type
showdemo('paralleldemo_communic_prof')
in Matlab

Related

Append data to .csv file while it is read at that moment

I would like to force-write log data to a CSV file, and it might well be that another user is reading that file at that moment.
What would allow me to ignore this kind of locking and write to the file anyway? Of course, the appended data should be displayed once that user closes and reopens the file.
Maybe a useful piece of information:
Writing happens many times, each for a very short duration.

spark save simple string to text file

I have a spark job that needs to store the last time it ran to a text file.
This has to work both on HDFS but also on local fs (for testing).
However, it seems this is not as straightforward as it looks.
I have tried deleting the directory first and got "can't delete" error messages.
I have also tried storing a simple string value in a DataFrame, writing it to Parquet, and reading it back.
This is all so convoluted that it made me take a step back.
What's the best way to just store a string (timestamp of last execution in my case) to a file by overwriting it?
EDIT:
The nasty way I use it now is as follows:
sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)
and
sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)
This sounds like storing simple application/execution metadata. As such, saving a text file shouldn't need to be done by "Spark" (i.e., it shouldn't be done in distributed Spark jobs by workers).
The ideal place for you to put it is in your driver code, typically after constructing your RDDs. That being said, you wouldn't be using the Spark API to do this, you'd rather be doing something as trivial as using a writer or a file output stream. The only catch here is how you'll read it back. Assuming that your driver program runs on the same computer, there shouldn't be a problem.
If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition so that you don't end up with multiple files containing this trivial value. That cannot be guaranteed for local storage (the file gets stored on the machine where the worker executing the task happens to run), so managing it that way would simply be going overboard.
The best option would be to use the driver program and create the file on the machine running the driver (assuming it is the same machine that will be used next time), or, even better, to put the value in a database. If the value is needed in later jobs, the driver can simply pass it through.
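For example, a driver-side sketch using plain Java IO from Scala (the path is a placeholder; put it wherever your driver runs):
import java.io.PrintWriter
import scala.io.Source

val tsFile = "/tmp/last_execution.txt"   // hypothetical location on the driver machine

// overwrite the value once the job has finished
new PrintWriter(tsFile) { write(System.currentTimeMillis.toString); close() }

// read it back at the start of the next run
val lastExecution =
  if (new java.io.File(tsFile).exists) Source.fromFile(tsFile).mkString.trim else ""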

Perl share hashmap through file

Currently I have a script that collects data of the specified server. The data is stored inside a hash which I store into a file for persistence.
If the script is being called with another server it should load the hash from the file and extend the hash with the data from the second server. Then save it back.
I use the storable module.
use Storable;
$recordedpkgs = retrieve($MONPKGS_DATA_FILE) if ( -e $MONPKGS_DATA_FILE);
store $recordedpkgs, $MONPKGS_DATA_FILE;
Obviously there is an access issue if one instance writes while another has already read the file; some data will then be lost.
What would be an ideal solution to that? Basic file locking? Are there better ways to achieve this?
It depends. What you're talking about is inter-process communication, and Perl has a whole documentation section on the subject: perlipc.
But to answer your question directly: yes, file locking is the way to go. It's exactly the tool for the job you describe.
Unfortunately, it's often OS dependent; Windows and Linux locking semantics are different. Take a look at flock - that's the basic starting point on Unix-based systems - and at http://www.perlmonks.org/?node_id=7058
It's an advisory lock, where you can request a shared (read) or exclusive (write) lock. And either block (until released), or fail and return if you cannot acquire that lock.
Storable does implement some locking semantics of its own (lock_store and lock_retrieve): http://perldoc.perl.org/Storable.html#ADVISORY-LOCKING
But you might find you want to use a lock file if you're doing a read-modify-write cycle on the saved content.
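For example, a sketch of that read-modify-write cycle guarded by a separate lock file (the .lock suffix is my own choice):
use Storable qw(retrieve store);
use Fcntl qw(:flock);

open my $lock, '>', "$MONPKGS_DATA_FILE.lock" or die "Cannot open lock file: $!";
flock($lock, LOCK_EX) or die "Cannot acquire lock: $!";

my $recordedpkgs = -e $MONPKGS_DATA_FILE ? retrieve($MONPKGS_DATA_FILE) : {};
# ... merge the data collected from the current server into %$recordedpkgs ...
store $recordedpkgs, $MONPKGS_DATA_FILE;

close $lock;    # closing the handle releases the lock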
I would just use a basic lock file that is checked before operations are performed on the data file. If the lock file is in place, have your other process either wait and re-check (forever, or a set number of times before giving up), or simply exit with an error.

Sharing a file among several processes [Perl]

I have an application that updates a single CSV file. The CSV is updated at random times from several processes, and I guess that if two processes try to update it (add a row, etc.) at the same time, some data will be lost or overwritten(?).
What is the best way to avoid this?
Thanks.
Use Perl's DBI with the DBD::CSV driver to access your data; that'll take care of the flocking for you. (Unless you're using Windows 95 or the old Mac OS.) If you decide to switch to an RDBMS later on, you'll be well prepared.
Simple flocking as suggested by @Fluff should also be fine, of course.
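A minimal flock-based append sketch (the file name and row contents are placeholders):
use Fcntl qw(:flock SEEK_END);

my $new_row = "2016-01-01 12:00:00,some value";   # hypothetical row

open my $fh, '>>', 'data.csv' or die "Cannot open data.csv: $!";
flock($fh, LOCK_EX)           or die "Cannot lock data.csv: $!";
seek($fh, 0, SEEK_END);       # another writer may have appended while we waited for the lock
print {$fh} "$new_row\n";
close $fh;                    # closing releases the lock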
If you want a simple, manual way to take care of file locking (sketched in the code below):
1) As soon as a process opens the CSV, it creates a lock. (The lock can be in the form of a dummy file; the process has to delete that file, i.e. the lock, as soon as it is done reading/updating the CSV.)
2) Have each process check for the lock file before trying to update the CSV. (If the dummy file is present, some process is accessing the CSV; otherwise it can update the CSV.)
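A rough Perl sketch of that dummy-lock-file scheme (file names are placeholders):
use Fcntl qw(O_CREAT O_EXCL O_WRONLY);

my $lockfile = 'data.csv.lock';

# step 1: keep retrying until we are the process that creates the lock file
my $lk;
until (sysopen $lk, $lockfile, O_CREAT | O_EXCL | O_WRONLY) {
    sleep 1;    # some other process holds the lock
}

# ... read/update data.csv here ...

close $lk;
unlink $lockfile;    # step 2 counterpart: release the lock for the next process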

Sharing a file on an overloaded machine

I have a computer that is running Windows XP that I am using to process a great deal of data, update monitors, and bank data. Generally it is pretty loaded with work.
One particular file that has real time data is useful to a number of users. We have two programs that need this file, one that displays the numerical data and one that plots the numerical data. Any user can run an instance of either program on their machine. These programs search for the real time data file which is updated every second. They are both written in Perl and I was asked not to change either program.
Because of the large load on the computer, I am currently running the program that does calculations and creates the real time data file on a separate computer. This program simply writes the real time file onto the overloaded computer. Because Windows doesn't have an atomic write, I created a method that writes to a different extension, deletes the old real time file, and then moves the new one to the correct name. Unfortunately, as the user load on the computer increases, the writes take longer (which isn't ideal but is live-able) but more annoyingly, the time between deleting the old real time file and moving the new file to the correct name increases a great deal, causing errors with the Perl programs. Both programs check to see if the file modify time has changed (neither check for file locks). If the file goes missing they get angry and output error messages.
I imagine a first course of action would be to move this whole process away from the overloaded computer. My other thought was to create a number of copies of the files on different machines and have different users read the file from different places (this would be a real hack though).
I am new to the world of networking and file sharing, but I know there is a better way to do this. Frankly, this whole method is a little hacked, but that's how it was when I came here.
Lastly, it's worth mentioning that this same process runs on a UNIX machine and has none of these problems. For this reason I feel the blame falls on the lack of an atomic write on Windows. I have been searching the internet for any workaround to this problem and have tried a number of different methods (e.g., my current extension-switching method).
Can anyone point me in the right direction so I can solve this problem?
My code is written in Python.
os.rename() says:
os.rename(src, dst)
Rename the file or directory src to dst. If dst is a directory,
OSError will be raised. On Unix, if dst exists and is a file,
it will be replaced silently if the user has permission. The
operation may fail on some Unix flavors if src and dst are on
different filesystems. If successful, the renaming will be an
atomic operation (this is a POSIX requirement). On Windows, if
dst already exists, OSError will be raised even if it is a file;
there may be no way to implement an atomic rename when dst names
an existing file.
Given that on Windows you are forced to delete the old file before renaming the new one to it, and you are prohibited from modifying the reading scripts to tolerate the missing file for a configurable timeout (the correct solution) or do proper resource locking with the producer (another correct solution), your only workaround may be to play with the process scheduler to make the {delete, rename} operation appear atomic. Write a C program that does nothing but look for the new file, delete the old, and rename the new. Run that "pseudo-atomic rename" process at high priority and pray that it doesn't get task-switched between the delete and the rename.
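For illustration, here is that swap step in Python rather than C (the file names are placeholders; the C version would make the same two calls, just with less chance of being preempted in between):
import os

NEW = "realtime.dat.new"   # file the producer has just finished writing
DST = "realtime.dat"       # name the Perl readers look for

if os.path.exists(NEW):
    if os.path.exists(DST):
        os.remove(DST)     # readers can see a missing file from here...
    os.rename(NEW, DST)    # ...until this rename completes, so keep the gap tiny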