Stream in production code - scala

Do people really use Scala's Stream class in production code, or is it primarily of academic interest?

There's no problem with Stream, except when people use it to replace Iterator -- as opposed to replacing List, which is the collection most similar to it. In that particular case, one has to be careful in its use. On the other hand, one has to be careful using Iterator as well, since each element can only be iterated through once.
So, since both have their own problems, why single out Stream's? I daresay it's simply that people are used to Iterator from Java, whereas Stream is a functional thing.
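To illustrate the difference with a minimal sketch:

// Stream memoizes its elements, so, like List, it can be traversed repeatedly.
val s = Stream.from(1).take(3)
s.toList  // List(1, 2, 3)
s.toList  // List(1, 2, 3) again; the cells were memoized

// Iterator is single-pass: once consumed, it's exhausted.
val it = Iterator.from(1).take(3)
it.toList // List(1, 2, 3)
it.toList // List() -- already consumed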

Even though I wrote that Iterator is what I want to use nearly all the time, I do use Stream in production code. I just don't automatically assume that the cells are garbage collected.
Sometimes Stream fits the problem perfectly. I think the API docs give some good examples where recursion is involved...
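The classic example in that spirit is the recursively defined Fibonacci stream:

val fibs: Stream[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { case (a, b) => a + b }

fibs.take(8).toList // List(0, 1, 1, 2, 3, 5, 8, 13)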

Look here: this blog post describes how to use Scala Streams (along with a memory-mapped file) to read large files (1-2 GB) efficiently.
I have not tried it yet, but the solution looks reasonable. Stream provides a nice abstraction on top of the low-level ByteBuffer Java API for handling a memory-mapped file as a sequence of records.
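I haven't run it either, but the general shape of such a solution looks roughly like this sketch (the fixed record size, file handling, and names are my assumptions, not necessarily the blog post's format):

import java.io.RandomAccessFile
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

val RecordSize = 128 // hypothetical fixed-size records

def records(path: String): Stream[Array[Byte]] = {
  val channel = new RandomAccessFile(path, "r").getChannel
  // Map the whole file (a single mapping must be smaller than 2 GB)
  val buffer: MappedByteBuffer =
    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size)

  def next(): Stream[Array[Byte]] =
    if (buffer.remaining < RecordSize) Stream.empty
    else {
      val rec = new Array[Byte](RecordSize)
      buffer.get(rec)    // advances the buffer's position
      rec #:: next()     // the tail is only evaluated on demand
    }

  next()
}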

Yes, I use it, although it tends to be for something like this:
(as.toStream collect expensiveConversionToB) match {
  case b #:: _ => // found my expensive b
  case _       =>
}
Of course, I might use a non-strict view and a find for this example.
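For instance, something along these lines (expensiveConversionToB here is just a made-up partial function):

// A hypothetical stand-in for the expensive partial conversion
val expensiveConversionToB: PartialFunction[Int, String] = {
  case n if n % 7 == 0 => "B(" + n + ")"
}
val as = 1 to 100

// Non-strict view plus find: elements are converted lazily, and the
// traversal stops at the first value the partial function is defined for.
as.view.map(expensiveConversionToB.lift).find(_.isDefined).flatten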

Since the only reason not to use Streams is that it can be tricky to ensure the JVM isn't keeping references to early conses around, one approach I've used that's fairly nice is to build up a Stream and immediately convert it to an Iterator for actual use. It loses a little of Stream's nice properties on the use side especially with respect to backtracking, but if you're only going to make one pass over the result it's frequently easier to build a structure this way than to contort into the hasNext/next() model of Iterator directly.
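A rough sketch of that pattern (the per-element work is just a placeholder):

def expensiveStep(x: Int): Int = x * x // placeholder for the real work

// Build the result lazily as a Stream, but expose only an Iterator,
// so callers can't accidentally hold on to the head of the Stream.
def results(xs: List[Int]): Iterator[Int] = {
  def loop(rest: List[Int]): Stream[Int] = rest match {
    case Nil       => Stream.empty
    case x :: tail => expensiveStep(x) #:: loop(tail)
  }
  loop(xs).iterator
}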

What's the point of using "map()" for two elements in perl?

I've seen code where there are just two rather static elements to be mapped such as time intervals with start and end dates, yet map() is being used rather than explicit code for mapping, e.g.
{ map { ... } qw(start end) } # vs.
{ start => ..., end => ... }
Which way is preferable, and why?
The map form may be less concise but looks more functional (as in functional programming), so I guess that's why it may be preferred over explicit code and is perhaps more DRY.
However, it looks less legible to me because there is more logic going on behind the scenes, and mapping should also be less efficient because it involves function calls and consists of more atomic operations.
EDIT
There is a conflicting goal in programming: KISS (keep it { pick 2 from: small, simple, stupid }). Using map slightly complicates code.
Assuming you're not just setting both items to the same constant or something similarly trivial, I would expect the map version to be more concise.
IMO, the main point in favor of the map version is that you know the same process will be used to produce both values. Not only for the sake of DRY, but also because it eliminates any concern that one might have a subtle change which the other doesn't.
As for the performance concern... If your use case is sufficiently performance-sensitive for any potential difference to matter, then you shouldn't be using Perl in the first place. Switching to well-written C (not C#, not C++, not Objective C - just plain C) will have a far greater performance impact than micro-optimizing whether you assign two values individually vs. using a loop to set them. But the odds of your use case being that sensitive are approximately zero anyhow.
There is a principle of coding known as DRY. Don't Repeat Yourself.
It asserts that:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
And that can be interpreted as condensing duplicate typing with (things like) map/for.
I use idioms like the one you've quoted when I'm trying to expand some text - for example:
my @defs = map { "DEF:$_=$source_file:$_:MAX" } qw( read write );
This generates me some DEF lines for rrdtool.
I'm doing it this way because, in some cases, I've got considerably longer lists of 'things I want to define' and want to be consistent. (Sometimes I have, say, 10 similar lines that differ only by a single word.)
But also because:
my @defs = ( "DEF:read=$source_file:read:MAX",
             "DEF:write=$source_file:write:MAX" );
There's not much in it for two elements, and I'd suggest it's as much a matter of style as anything. However, if you've got more than that, it quickly becomes very beneficial because you can change the single line - say you've got a different file location? Want to swap MAX for AVERAGE?
It's also quite shockingly easy to go 'punctuation blind' when looking at a long sequence of similar statements, where someone's typo-ed and added a ',' where it should be a '.' or similar.
And... you probably don't lose a great deal in terms of readability. But I will acknowledge that's something of a style point, because whilst map is pretty amazing, it can make for some rather hard-to-read code if you're not careful.
Also to specifically address:
mapping should also be less efficient because it involves function calls and consists of more atomic operations.
A wise man once said:
premature optimization is the root of all evil
Don't think about the efficiency of a statement - look at the legibility/readability. Compilers are pretty clever. Most "obvious" optimisations, they already make for you. Processors are also pretty fast. Your limiting factor in most code isn't the amount of CPU cycles you need, it's IO throughput and memory footprint. So don't worry about it - write clear code.
And if there's a performance-critical demand on your code, you should be using a code profiler to look at where you gain the most efficiency for your refactoring effort. You may end up with less clear code in doing so (sometimes), but then that's a clearer tradeoff.

Disk-persisted-lazy-cacheable-List ™ in Scala

I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).
All update operations are cons (head appends).
All read accesses start in the head, and orderly traverses the list until it finds a pre-determined pair.
A cache would be great, since most read accesses will keep the same data over and over.
So, this is basically a "disk-persisted-lazy-cacheable-List" ™
Any ideas on how to get one before I start to roll out my own?
Addendum: yes... mongodb, or any other non-embeddable resource, is overkill. If you are interested in a specific use case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.
The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.
You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files.
Here's one example of a Traversable such as I described, written by Josh Suereth:
class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      def loop(): Unit = in.readLine match {
        case null => ()
        case line => f(line); loop()
      }
      loop()
    } finally {
      in.close()
    }
  }
}
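Usage is then straightforward (the file name is made up); every traversal opens and closes the file for you:

val lines = new FileLinesTraversable(new java.io.File("application.log"))
lines.find(_.startsWith("ERROR")) // stops traversing at the first match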
You write:
mongodb, or any other non-embeddable resource, is overkill
Do you know that there are embeddable database engines, including some really small ones? If you do, I'm not sure what your exact requirement is and why you would not use them.
Are you sure that Hibernate plus an embeddable DB (say, SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried it, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type class" MySerializable[T] (like, for instance, Ordering[T] in the Scala standard library).
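A minimal sketch of that type-class approach (MySerializable and DiskBackedList are illustrative names, not an existing API):

// Evidence that T can be turned into bytes and back
trait MySerializable[T] {
  def toBytes(value: T): Array[Byte]
  def fromBytes(bytes: Array[Byte]): T
}

// The persistent list only needs the evidence, not T <: Serializable
class DiskBackedList[T](file: java.io.File)(implicit ser: MySerializable[T]) {
  def cons(value: T): Unit = appendToFile(ser.toBytes(value))

  private def appendToFile(bytes: Array[Byte]): Unit = {
    val out = new java.io.FileOutputStream(file, true) // append mode
    try out.write(bytes) finally out.close()
  }
}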
However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement because it suggests using Java serialization for this, but I guess you can relax that to "anything serializable with my framework of choice". Standard serialization is extremely inefficient in time and space, and it is not designed to serialize a single object: instead it gives you back a file complete with special headers. I would suggest using a different serialization framework - have a look here for a comparison.
Additional reasons not to go down the road of a custom implementation
In addition, it sounds like you would be reading the file essentially backward, and that's a quite bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.
Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.
If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?
A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently)
You can use Java's MappedByteBuffer directly from Scala. You get to address the file as if it were memory, without actually loading the whole file into memory.
You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OS's file cache until the OS saw fit to flush them (or you flushed them manually) back to disk.
You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem, since it will still be in memory. (Though if you're reading data that you've written over hours/days, across pages, then it might be a problem.)
A file addressed in this way is limited in size to 2GB even on a 64 bit JVM, but you can use multiple files to overcome this limitation.
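A bare-bones sketch of that idea, assuming fixed-size pairs of longs (no bounds checking; the file name and sizes are made up):

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

val channel = new RandomAccessFile("timeline.dat", "rw").getChannel
val buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30) // 1 GB mapping

// Push an (x, y) pair of longs onto the top of the stack
def push(x: Long, y: Long): Unit = { buf.putLong(x); buf.putLong(y) }

// Walk backward from the top, most recently pushed pairs first
def readBack(n: Int): Seq[(Long, Long)] =
  (1 to n).map { i =>
    val pos = buf.position - i * 16
    (buf.getLong(pos), buf.getLong(pos + 8))
  }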
These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.
github.com/OpenHFT/Chronicle-Queue
github.com/OpenHFT/Chronicle-Map

When should you use XS?

I am writing up a talk on XS and I need to know when the community thinks it is proper to reach for XS.
I can think of at least three reasons to use XS:
You have a C library you want to access in Perl 5
You have a block of code that is provably slowing down your program and it would be faster if written in C
You need access to something only available in XS
Reason 1 is obvious and should need no explanation.
When you really need reason 2 is less obvious. Often you are better off looking at how the code is structured. You should only invoke reason 2 if you have profiled your code and have a benchmark and test suite to prove that the XS code is faster and correct.
Reason 3 is a dangerous reason. It is rare that you actually need to look into Perl's guts to do something, but there is at least one valid case.
In a few cases, better memory management is another reason for using XS. For example, if you have a very large block of objects of some similar type, this can be managed more efficiently through XS. KinoSearch uses this for tokens, for example, where start and end offsets in a large string can be managed more effectively through XS than as a huge pool of scalars. PDL also has a memory management aspect to it, as well as speed.
There are proposals to integrate some of this approach into core Perl in the long term, initially because it offers a chance to make sharing data in threading better: see: http://openparallel.com/2011/07/05/a-new-hope-for-efficient-safe-data-sharing-between-threads-in-perl/.

Scala: What is the right way to build HashMap variant without linked lists?

How much of the Scala standard library can be reused to create a variant of HashMap that does not handle collisions at all?
In the HashMap implementation in Scala I can see that the traits HashEntry, DefaultEntry and LinkedEntry are related, but I'm not sure whether I have any control over them.
You could do this by extending HashMap (read the source code of HashMap to see what needs to be modified); basically you'd override put and += to not call findEntry, and you'd override addEntry (from HashTable) to simply compute the hash code and drop the entry in place. Then it wouldn't handle collisions at all.
But this isn't a wise thing to do since the HashEntry structure is specifically designed to handle collisions--the next pointer becomes entirely superfluous at that point. So if you are doing this for performance reasons, it's a bad choice; you have overhead because you wrap everything in an Entry. If you want no collision checking, you're better off just storing the (key,value) tuples in a flat array, or using separate key and value arrays.
Keep in mind that you will now suffer from collisions in hash value, not just in key. And, normally, HashMap starts small and then expands, so you will initially destructively collide things which would have survived had it not started small. You could override initialSize also if you knew how much you would add so that you'd never need to resize.
But, basically, if you want to write a special-purpose high-speed unsafe hash map, you're better off writing it from scratch or using some other library. If you modify the generic library version, you'll get all the unsafety without all of the speed. If it's worth fiddling with, it's worth redoing entirely. (For example, you should implement filters and such that map f: (Key,Value) => Boolean instead of mapping the (K,V) tuple--that way you don't have to wrap and unwrap tuples.)
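For what it's worth, a from-scratch sketch of the "flat arrays, no collision handling" idea might look like this (a new key that hashes to an occupied slot simply overwrites it; the class name and default capacity are made up):

class OverwritingHashMap[K, V](capacity: Int = 1 << 16) {
  private val keys   = new Array[Any](capacity)
  private val values = new Array[Any](capacity)

  // Map a key's hash code onto a slot index; collisions are ignored
  private def slot(key: K): Int = (key.## & 0x7fffffff) % capacity

  def update(key: K, value: V): Unit = {
    val i = slot(key)
    keys(i) = key        // overwrites whatever hashed here before
    values(i) = value
  }

  def get(key: K): Option[V] = {
    val i = slot(key)
    if (keys(i) == key) Some(values(i).asInstanceOf[V]) else None
  }
}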
I guess it depends what you mean by "does not handle collisions at all". Would a thin layer over a MultiMap be sufficient for your needs?

What's the best way to make a deep copy of a data structure in Perl?

Given a data structure (e.g. a hash of hashes), what's the clean/recommended way to make a deep copy for immediate use? Assume reasonable cases, where the data's not particularly large, no complicated cycles exist, and readability/maintainability/etc. are more important than speed at all costs.
I know that I can use Storable, Clone, Clone::More, Clone::Fast, Data::Dumper, etc. What's the current best practice?
Clone is much faster than Storable::dclone, but the latter supports more data types.
Clone::Fast and Clone::More are pretty much equivalent if memory serves me right, but less feature complete than even Clone, and Scalar::Util::Clone supports even less but IIRC is the fastest of them all for some structures.
With respect to readability these should all work the same, they are virtually interchangeable.
If you have no specific performance needs I would just use Storable's dclone.
I wouldn't use Data::Dumper for this simply because it's so cumbersome and roundabout. It's probably going to be very slow too.
For what it's worth, if you ever want customizable cloning then Data::Visitor provides hooking capabilities and fairly feature complete deep cloning is the default behavior.
My impression is that Storable::dclone() is somewhat canonical.
Clone is probably what you want for that. At least, that's what all the code I've seen uses.
Try fclone from Panda::Lib, which seems to be the fastest one (it's written in XS).
Quick and dirty hack if you're already dealing with JSONs and using the JSON module in your code: convert the structure to a JSON and then convert the JSON back to a structure:
use JSON;
my %hash = (
    obj => {},
    arr => []
);
my $hash_ref_to_hash_copy = from_json(to_json(\%hash));
The only negative possibly being having to deal with a hash reference instead of a pure hash, but still, this has come in handy a few times for me.