Is it more efficient to assign data from a structure once, or to extract it multiple times? - matlab

I have to run a certain function many times; this function takes a certain structure sc as input. Within the function, certain values from the structure (say sc.a and sc.b) are used multiple times.
I have two options:
(1) Assign a=sc.a once and use a every time it is needed within the function;
(2) Extract sc.a every time I need it within the function.
Which of these is more efficient? In (1) I am using extra memory to assign a, while in (2) I am extracting sc.a multiple times.

Arrays would be quite a bit faster if you have plenty of operations.
This is almost language agnostic. Array elements are easier to access because they sit next to each other in memory, while structs break that memory pattern, which hurts caching and makes memory reads slower. On top of that, MATLAB's OpenMP/multi-threaded operations work well on arrays, while they don't on structs.

Related

What's the fastest way to create a C-compatible unbounded string in Ada?

I'm creating an Ada program for Windows that needs to be able to pass strings to some functions written in C. Until now I have been manipulating the strings in Ada using the Unbounded_String type, and then converting the data to an Interfaces.C.char_array before passing it to the C functions.
This works fine, only performance is a bit of an issue on slower, older computers. The C function is sometimes called repeatedly on a slightly modified version of a string, and requires the Unbounded_String to be converted to a similar char_array every time. The strings aren't modified by the C functions, so they only ever have to be converted to char_array.
I have thought of storing the strings in char_array, and converting from an Ada type each time the string is manipulated. The data is passed to C more often than it is changed, so it would improve performance. The problem with this approach is that often the length of the string will change, sometimes by a lot, and there is no way of knowing the maximum length beforehand.
The ideal solution would be to have something similar to an Unbounded_String only storing the string as a char_array. By this I mean something that is dynamically sized, allocating a new array when the old one isn't big enough and it should allow Ada Characters/Strings to be inserted (and also removed) into the array, converting only those characters to C chars.
Is there any (relatively) easy, fast way of doing this without having to implement it myself? Or is there any other quick way of manipulating C-compatible strings in Ada? Thanks in advance for any suggestions.
You don't mention how many objects you expect to have of your type, but I will assume that we are not talking about so many that you will be anywhere near exhausting your available address space.
Just encapsulate a sufficiently large char_array (say 10 times the largest expected size) in a private record, and create the needed operations to manipulate it.
If you're very unlucky, you may need to tell your compiler/run-time environment that you need an unusually large stack, but save that worry for when you actually experience it.

Write performance scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these resultsets into a scala.collection.immutable.Seq and have discovered it seems to take an incredibly long time to build the collection, whereas if I change to a Vector or List the write into memory takes fractions of a second.
My question is therefore: why is Seq so slow in this case? And in what cases would using Seq be more appropriate than Vector?
Thanks
It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to using immutable.List directly, you've been attaching to the beginning using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you attach to the beginning of the sequence -- the cons operator :: only creates one node and links it to the rest of the list, which is as fast as it can get when it comes to immutable data structures. Otherwise, if you add to the end, and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
If you would like a validation of these claims, see this link where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure should be either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically, and if you build them using += you will get reasonable performance. This is assuming that you are not storing primitives.
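The difference between the build strategies discussed above can be sketched as follows. This is only an illustration, not the asker's code: the object name BuildDemo and its method names are made up for this example, and Int elements are used purely to keep it short.

```scala
import scala.collection.mutable.ArrayBuffer

object BuildDemo {
  // Appending with :+ to a List copies the existing list on every step,
  // so building n elements this way costs O(n^2) overall.
  def buildByAppend(n: Int): List[Int] =
    (0 until n).foldLeft(List.empty[Int])((acc, i) => acc :+ i)

  // Prepending with :: allocates one node per element (O(n) overall);
  // a single reverse at the end restores insertion order.
  def buildByPrepend(n: Int): List[Int] =
    (0 until n).foldLeft(List.empty[Int])((acc, i) => i :: acc).reverse

  // Fill a mutable ArrayBuffer with += and convert once at the end.
  def buildWithBuffer(n: Int): Vector[Int] = {
    val buf = ArrayBuffer.empty[Int]
    (0 until n).foreach(i => buf += i)
    buf.toVector
  }

  // Vector.newBuilder returns a builder for Vector; += adds one element.
  def buildWithBuilder(n: Int): Vector[Int] = {
    val builder = Vector.newBuilder[Int]
    (0 until n).foreach(i => builder += i)
    builder.result()
  }
}
```

All four methods produce the same elements in the same order; only the cost of getting there differs. When loading a result set, the buffer or builder variants are the ones to reach for, with a conversion to an immutable collection once loading is finished.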

what is the efficiency of an assign statement in progress-4gl

Why is an ASSIGN statement more efficient than not using ASSIGN?
Co-workers say that:
assign
a=3
v=7
w=8.
is more efficient than:
a=3.
v=7.
w=8.
why?
You could always test it yourself and see... but, yes, it is slightly more efficient. Or it was the last time I tested it. The reason is that the compiler combines the statements and the resulting r-code is a bit smaller.
But efficiency is almost always a poor reason to do it. Saving a micro-second here and there pales next to avoiding disk IO or picking a more efficient algorithm. Good reasons:
Back in the dark ages there was a limit of 63k of r-code per program. Combining statements with ASSIGN was a way to reduce the size of r-code and stay under that limit (ok, that might not be a "good" reason). One additional way this helps is that you could also often avoid a DO ... END pair and further reduce r-code size.
When creating or updating a record the fields that are part of an index will be written back to the database as they are assigned (not at the end of the transaction) -- grouping all assignments into a single statement helps to avoid inconsistent dirty reads. Grouping the indexed fields into a single ASSIGN avoids writing the index entries multiple times. (This is probably the best reason to use ASSIGN.)
Readability -- you can argue that grouping consecutive assignments more clearly shows your intent and is thus more readable. (I like this reason but not everyone agrees.)
basically doing:
a=3.
v=7.
w=8.
is the same as:
assign a=3.
assign v=7.
assign w=8.
which is 3 separate statements so a little more overhead. Therefore less efficient.
Progress executes an ASSIGN as one statement whether one or more variables are being assigned. If you do not write ASSIGN, it is implied for each assignment, so you end up with 3 statements instead of 1. There is a 20% - 40% reduction in r-code and a 15% - 20% performance improvement when using one ASSIGN statement. I can only speculate about the cause, as I cannot find any source that explains it. For database fields, and especially key/index fields, it makes perfect sense. For variables I can only assume it has to do with how Progress manages its buffers and copies data to and from them.
ASSIGN will combine multiple statements into one. If a, v and w are fields in your db, that means it will do something like INSERT INTO (a,v,w)...
rather than
INSERT INTO (a)...
INSERT INTO (v)
etc.

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after) as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but it seems like such a nasty hack (and won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R).
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
@SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square 1). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the objects are less than 1GB, it may be easier to just copy the entire set to a RAM disk and then access through that. Just remember to copy from the RAM disk if anything is saved (or wrap save to save objects in two places).
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Two serialization programs from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
Try storing them as blobs in a database.
I would also try the multiple folders method - it might perform better than you think. It might also help with organization of the files if that's something you need.
The solution I have come up with is to save object arrays of around 100 of the objects each. These files tend to be 5-6 meg so loading is not prohibitive and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows for fast access of single objects and avoids any extra database or serialization overhead.

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
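As a rough sketch of the sizeHint suggestion, assuming both sets are scala.collection.mutable.HashSet instances: the hint is given before the bulk insert so the table can grow once, up front, rather than repeatedly during the merge. The names MergeDemo and mergeInto are hypothetical, and the sum of the two sizes is only an upper bound because of the overlap. useSizeMap is left out because it may not exist in every Scala version.

```scala
import scala.collection.mutable

object MergeDemo {
  // Merges `other` into `target` in place and returns `target`.
  def mergeInto[A](target: mutable.HashSet[A],
                   other: mutable.HashSet[A]): mutable.HashSet[A] = {
    // Upper bound on the merged size; the real size is smaller when the sets overlap.
    target.sizeHint(target.size + other.size)
    target ++= other
    target
  }
}
```

Whether this helps noticeably depends on the Scala version and on how expensive hashCode and equals are for the elements, so it is worth timing on the real data.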
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
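The first strategy (walking two sorted inputs in step) can be sketched like this. It is a minimal single-threaded version, not the answerer's code: the name SortedMerge.union is made up, and it assumes the data are Ints already stored in ascending, duplicate-free arrays.

```scala
import scala.collection.mutable.ArrayBuffer

object SortedMerge {
  // Union of two ascending, duplicate-free arrays.
  // The result is ascending and contains each value once.
  def union(a: Array[Int], b: Array[Int]): Array[Int] = {
    val out = new ArrayBuffer[Int](a.length + b.length)
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      if (a(i) < b(j)) { out += a(i); i += 1 }
      else if (a(i) > b(j)) { out += b(j); j += 1 }
      else { out += a(i); i += 1; j += 1 } // same value in both inputs: emit it once
    }
    // At most one of the inputs still has elements left; copy the remainder.
    while (i < a.length) { out += a(i); i += 1 }
    while (j < b.length) { out += b(j); j += 1 }
    out.toArray
  }
}
```

The merge itself is linear in the combined input size and does no hashing. Sorting the inputs first, if they are not already sorted, adds an O(n log n) step. The parallel variant described above would split one array into chunks and use binary search to find the matching boundaries in the other.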