Scala: What is the right way to build a HashMap variant without linked lists?

How much of the Scala standard library can be reused to create a variant of HashMap that does not handle collisions at all?
In the Scala implementation of HashMap I can see that the traits HashEntry, DefaultEntry and LinkedEntry are involved, but I'm not sure whether I have any control over them.

You could do this by extending HashMap (read the source code of HashMap to see what needs to be modified); basically, you'd override put and += so that they don't call findEntry, and you'd override addEntry (from HashTable) to simply compute the hash code and drop the entry in place. Then it wouldn't handle collisions at all.
But this isn't a wise thing to do, since the HashEntry structure is specifically designed to handle collisions; its next pointer becomes entirely superfluous at that point. So if you are doing this for performance reasons, it's a bad choice: you pay the overhead of wrapping everything in an Entry. If you want no collision checking, you're better off just storing the (key, value) tuples in a flat array, or using separate key and value arrays.
Keep in mind that you will now suffer from collisions in hash value, not just in key. And, normally, HashMap starts small and then expands, so you will initially destructively collide things which would have survived had it not started small. You could also override initialSize if you knew how much you would add, so that you'd never need to resize.
But, basically, if you want to write a special-purpose high-speed unsafe hash map, you're better off writing it from scratch or using some other library. If you modify the generic library version, you'll get all the unsafety without all of the speed. If it's worth fiddling with, it's worth redoing entirely. (For example, you should implement filters and such that take f: (Key, Value) => Boolean instead of mapping over (K, V) tuples; that way you don't have to wrap and unwrap tuples.)
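For illustration, here is a minimal sketch of such a from-scratch structure, using separate key and value arrays as suggested above (the class name, the fixed capacity, and the Option-returning get are assumptions made for the example):
class NoCollisionMap[K, V](capacity: Int) {
  private val keys   = new Array[AnyRef](capacity)
  private val values = new Array[AnyRef](capacity)

  // Strip the sign bit so the index is always non-negative.
  private def slot(key: K): Int = (key.hashCode & 0x7fffffff) % capacity

  // "Drop the entry in place": a colliding hash silently overwrites the old entry.
  def put(key: K, value: V): Unit = {
    val i = slot(key)
    keys(i) = key.asInstanceOf[AnyRef]
    values(i) = value.asInstanceOf[AnyRef]
  }

  def get(key: K): Option[V] = {
    val i = slot(key)
    if (key == keys(i)) Some(values(i).asInstanceOf[V]) else None
  }
}
There is no Entry wrapper and no resizing, which is exactly why it is both fast and unsafe.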

I guess it depends what you mean by "does not handle collisions at all". Would a thin layer over a MultiMap be sufficient for your needs?

Scala - Does pattern matching break the Open-Closed principle?

If I add a new case class, does that mean I need to search through all of the pattern matching code and find out where the new class needs to be handled? I've been learning the language recently, and as I read about some of the arguments for and against pattern matching, I've been confused about where it should be used. See the following:
Pro:
Odersky1 and
Odersky2
Con:
Beust
The comments are pretty good in each case, too. So is pattern matching something to be excited about or something I should avoid using? Actually, I imagine the answer is "it depends on when you use it," but what are some positive use cases for it and what are some negative ones?
Jeff, I think you have the right intuition: it depends.
Object-oriented class hierarchies with virtual method dispatch are good when you have a relatively fixed set of methods that need to be implemented, but many potential subclasses that might inherit from the root of the hierarchy and implement those methods. In such a setup, it's relatively easy to add new subclasses (just implement all the methods), but relatively difficult to add new methods (you have to modify all the subclasses to make sure they properly implement the new method).
Data types with functionality based on pattern matching are good when you have a relatively fixed set of classes that belong to a data type, but many potential functions that operate on that data type. In such a setup, it's relatively easy to add new functionality for a data type (just pattern match on all its classes), but relatively difficult to add new classes that are part of the data type (you have to modify all the functions that match on the data type to make sure they properly support the new class).
The canonical example for the OO approach is GUI programming. GUI elements need to support very little functionality (drawing themselves on the screen is the bare minimum), but new GUI elements are added all the time (buttons, tables, charts, sliders, etc). The canonical example for the pattern matching approach is a compiler. Programming languages usually have a relatively fixed syntax, so the elements of the syntax tree will change rarely (if ever), but new operations on syntax trees are constantly being added (faster optimizations, more thorough type analysis, etc).
Fortunately, Scala lets you combine both approaches. Case classes can both be pattern matched and support virtual method dispatch. Regular classes support virtual method dispatch and can be pattern matched by defining an extractor in the corresponding companion object. It's up to the programmer to decide when each approach is appropriate, but I think both are useful.
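For instance, a toy sketch of that combination (the names are invented for the example): the sealed trait gets a virtual method, while new operations are added externally by pattern matching on the case classes.
sealed trait Expr {
  def show: String // virtual dispatch: adding a new subclass is easy
}
case class Num(n: Int) extends Expr {
  def show = n.toString
}
case class Add(l: Expr, r: Expr) extends Expr {
  def show = "(" + l.show + " + " + r.show + ")"
}

// Pattern matching: adding a new operation is easy, and the sealed trait
// lets the compiler warn about unhandled cases.
def eval(e: Expr): Int = e match {
  case Num(n)    => n
  case Add(l, r) => eval(l) + eval(r)
}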
While I respect Cedric, he's completely wrong on this issue. Scala's pattern matching can be fully encapsulated from class changes when desired. While it is true that a change to a case class would require changing any corresponding pattern matching instances, this is only the case when using such classes in a naive fashion.
Scala's pattern matching always delegates to the deconstructor (the unapply method) in a class's companion object. With a case class, this deconstructor is automatically generated (along with a factory method in the companion object), though it is still possible to override this auto-generated version. At all times, you can assert complete control over the pattern matching process, insulating any patterns from potential changes in the class itself. Thus, pattern matching is simply another way of accessing class data through the safe filter of encapsulation, just like any other method.
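As a sketch of that insulation (the class and its fields are invented for the example), here is a plain class whose companion defines unapply by hand, so existing patterns keep working even if the internal representation changes:
class Account(initialId: String) {
  // The internal representation is free to change...
  private val rawId: String = initialId.trim
  def id: String = rawId.toLowerCase
}

object Account {
  def apply(id: String): Account = new Account(id)
  // ...because patterns only ever see what unapply chooses to expose.
  def unapply(a: Account): Option[String] = Some(a.id)
}

Account("  ABC-123 ") match {
  case Account(id) => println("matched account " + id)
}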
So, Dr. Odersky's opinion would be the one to trust here, particularly given the sheer volume of research he has performed in the area of object-oriented programming and design.
As for where it should be used, that is entirely according to taste. If it makes your code more concise and maintainable, use it! Otherwise, don't. For most object-oriented programs, pattern matching is unnecessary. However, once you begin to integrate more functional idioms (Option, List, etc.) I think you'll find that pattern matching will significantly reduce syntactic overhead as well as improve the safety offered by the type system. In general, any time you want to extract data while simultaneously testing some condition (e.g. extracting a value from Some), pattern matching will likely be of use.
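For example (a common idiom rather than anything from the original posts, and the property name is made up), extracting a value from Some while testing that it is present:
val maybePort: Option[Int] =
  Option(System.getProperty("server.port")).map(_.toInt)

maybePort match {
  case Some(port) => println("binding to port " + port)
  case None       => println("no port configured, using the default")
}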
Pattern matching is definitely good if you are doing functional programming. In the case of OO, there are some cases where it is good. In Cedric's example itself, it depends on how you view the print() method conceptually. Is it a behavior of each Term object? Or is it something outside it? I would say it is outside, so it makes sense to do pattern matching. On the other hand, if you have an Employee class with various subclasses, it is a poor design choice to pattern match on an attribute of it (say name) in the base class.
Pattern matching also offers an elegant way of unpacking the members of a class.

Basic principle of auto-complete

How do they perform auto-completion of code in Eclipse or other IDEs? What is the basic principle behind it?
You know how you have to explicitly attach source code to non-standard libraries you import in Eclipse? When you do that, a text-search index is built over that source, and that is how the IDE knows what to offer in its auto-complete feature. Roughly, I suppose it is something like an associative array where the key is the prefix of the method you typed and the value is a description of that method.
Now, what is important is for this functionality to be implemented efficiently, regarding both time and memory consumption. It would be very inefficient to store the same entry for every possible prefix of some method. (Or even to store every prefix!)
One of the interesting structures that could be suitable for this problem is the trie, which is inherently optimized for prefix search while keeping acceptable memory usage.
Take a look here for a simple example:
http://www.sarathlakshman.com/2011/03/03/implementing-autocomplete-with-trie-data-structure/
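A minimal trie sketch in Scala (the structure and names are illustrative, not how Eclipse actually implements it):
import scala.collection.mutable

class Trie {
  private val children = mutable.Map.empty[Char, Trie]
  private var isWord = false

  def insert(word: String): Unit = {
    var node = this
    for (c <- word) node = node.children.getOrElseUpdate(c, new Trie)
    node.isWord = true
  }

  // Walk down to the node for the prefix, then gather every word below it.
  def complete(prefix: String): List[String] = {
    val end = prefix.foldLeft(Option(this)) { (node, c) =>
      node.flatMap(_.children.get(c))
    }
    end.map(_.collect(prefix)).getOrElse(Nil)
  }

  private def collect(acc: String): List[String] = {
    val here = if (isWord) List(acc) else Nil
    here ::: children.toList.flatMap { case (c, n) => n.collect(acc + c) }
  }
}
After inserting "read", "readLine" and "reduce", complete("rea") would return "read" and "readLine".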
Besides tries, which are used when you have already typed the beginning of the name of a method or variable, I think the IDE also uses some sort of type comparison/analysis for the case when you try to invoke a method and it suggests a local/global variable to pass as a parameter to that method call.

Disk-persisted-lazy-cacheable-List ™ in Scala

I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).
All update operations are cons (head appends).
All read accesses start at the head and traverse the list in order until they find a pre-determined pair.
A cache would be great, since most read accesses will keep the same data over and over.
So, this is basically a "disk-persisted-lazy-cacheable-List" ™
Any ideas on how to get one before I start to roll out my own?
Addendum: yes... mongodb, or any other non-embeddable resource, is overkill. If you are interested in a specific use-case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.
The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.
You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files.
Here's one example of a Traversable such as I described, written by Josh Suereth:
class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      // Recurse until readLine signals end-of-file with null.
      def loop(): Unit = in.readLine match {
        case null => ()
        case line => f(line); loop()
      }
      loop()
    } finally {
      in.close() // the traversal itself controls the file handle
    }
  }
}
You write:
mongodb, or any other non-embeddable resource, is an overkill
Do you know that there are embeddable database engines, including some really small ones? If you do, I'm not sure about your exact requirements or why you would not use one of them.
Are you sure that Hibernate + an embeddable DB (say SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried it, and I don't know which would be faster. Serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type-class" MySerializable[T] (like, for instance, Ordering[T] in the Scala standard library).
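A sketch of that type-class approach (MySerializable and the instance below are invented for illustration, mirroring how Ordering[T] works):
import java.nio.ByteBuffer

trait MySerializable[T] {
  def toBytes(value: T): Array[Byte]
  def fromBytes(bytes: Array[Byte]): T
}

object MySerializable {
  // An instance for the (Int, Int) pairs from the question.
  implicit val intPair: MySerializable[(Int, Int)] = new MySerializable[(Int, Int)] {
    def toBytes(p: (Int, Int)): Array[Byte] =
      ByteBuffer.allocate(8).putInt(p._1).putInt(p._2).array()
    def fromBytes(bytes: Array[Byte]): (Int, Int) = {
      val buf = ByteBuffer.wrap(bytes)
      (buf.getInt(), buf.getInt())
    }
  }
}

// The context bound [T: MySerializable] pulls the instance in implicitly.
def store[T: MySerializable](value: T): Array[Byte] =
  implicitly[MySerializable[T]].toBytes(value)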
However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement, because it suggests the use of Java serialization, but I guess you can relax that to "anything serializable with my framework of choice". Java serialization is extremely inefficient in time and space, and it is not designed to serialize a single object; instead, it gives you back a file complete with special headers. I would suggest using a different serialization framework; have a look here for a comparison.
Additional reasons not to go down the road of a custom implementation:
In addition, it sounds like you would be reading the file essentially backward, and that's quite a bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.
Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.
If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?
A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently)
You can use Java's MappedByteBuffer directly from Scala. You get to address the file as if it were memory, without actually loading the file into memory.
You'd get some caching for free from the OS this way, since the mapped file functions like virtual memory. Recently written/accessed pages stay in the OS's file cache until the OS sees fit to flush them (or you flush them manually) back to disk.
You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem, since it will still be in memory. (Though if you're reading data that you've written over hours/days, across pages, then it might be a problem.)
A mapping created this way is limited to 2 GB even on a 64-bit JVM, but you can use multiple files (or multiple mappings over one file) to overcome this limitation.
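A rough sketch of that stack idea (the file name, the record format, and the fixed mapping size are assumptions made for the example):
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

val file = new RandomAccessFile("timeline.dat", "rw")
// A single mapping must stay under 2 GB; use several for more data.
val buf = file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20)

// Push a fixed-size (Int, Int) record; the OS page cache does the caching.
def push(x: Int, y: Int): Unit = { buf.putInt(x); buf.putInt(y) }

// Absolute reads walk backward over the most recently pushed records.
def readBack(count: Int): Seq[(Int, Int)] =
  for (i <- 1 to count; pos = buf.position - 8 * i; if pos >= 0)
    yield (buf.getInt(pos), buf.getInt(pos + 4))

push(1, 2)
push(3, 4)
readBack(2) // Seq((3, 4), (1, 2))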
These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.
github.com/OpenHFT/Chronicle-Queue
github.com/OpenHFT/Chronicle-Map

Stream in production code

Do people really use Scala's Stream class in production code, or is it primarily of academic interest?
There's no problem with Stream, except when people use it to replace Iterator, as opposed to replacing List, which is the collection most similar to it. In that particular case, one has to be careful in its use. On the other hand, one has to be careful using Iterator as well, since each element can only be iterated through once.
So, since both have their own problems, why single out Stream's? I daresay it's simply that people are used to Iterator from Java, whereas Stream is a functional thing.
Even though I wrote that Iterator is what I want to use nearly all the time, I do use Stream in production code. I just don't automatically assume that the cells are garbage collected.
Sometimes Stream fits the problem perfectly. I think the API docs give some good examples where recursion is involved...
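For instance, the well-known self-referential Fibonacci stream from the Stream documentation:
// Lazily evaluated: each cell is computed only when first demanded.
lazy val fibs: Stream[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { n => n._1 + n._2 }

fibs.take(8).toList // List(0, 1, 1, 2, 3, 5, 8, 13)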
Look here. This blog post describes how to use Scala Streams (along with a memory-mapped file) to read large files (1-2 GB) efficiently.
I have not tried it yet, but the solution looks reasonable. Stream provides a nice abstraction on top of the low-level ByteBuffer Java API for handling a memory-mapped file as a sequence of records.
Yes, I use it, although it tends to be for something like this:
(as.toStream collect expensiveConversionToB) match {
  case b #:: _ => // found my expensive b
  case _ =>
}
Of course, I might use a non-strict view and a find for this example.
Since the only reason not to use Streams is that it can be tricky to ensure the JVM isn't keeping references to early conses around, one approach I've used that's fairly nice is to build up a Stream and immediately convert it to an Iterator for actual use. It loses a little of Stream's nice properties on the use side, especially with respect to backtracking, but if you're only going to make one pass over the result, it's frequently easier to build a structure this way than to contort it into the hasNext/next() model of Iterator directly.
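A small sketch of that pattern (the helper below is invented for illustration): build the Stream recursively, but expose only an Iterator, so earlier conses can be collected after the single pass.
def lines(in: java.io.BufferedReader): Iterator[String] = {
  def loop: Stream[String] = in.readLine() match {
    case null => Stream.empty
    case line => line #:: loop
  }
  loop.iterator
}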

How to make Scala's immutable collections hold immutable objects

I'm evaluating Scala and am having a problem with its immutable collections.
I want to make immutable collections, which are completely immutable, right down through all the contained objects, the objects they reference, ad infinitum.
Is there a simple way to do this?
The code on http://www.finalcog.com/immutable-containers-scala illustrates what I'm trying to achieve, and a nasty workaround (ImmutablePoint).
The problem with the workaround is that every time I want to change an object I have to manually make a new copy. I understand that the runtime will have to implement copy-on-write, but can this be made transparent to the developer?
I suppose I'm looking to make Immutable Objects, where methods change the current object state, but all other 'val' (and all immutable container) references to the object retain the 'old' state.
This is not possible out of the box in Scala via some specific language construct, unless you have followed the idiom that all of your objects are immutable, in which case this behaviour comes for free!
With 2.8, named parameters have made "copy constructors" quite nice to use from a readability perspective. But you are correct: this works as copy-on-write. The behaviour you are asking for, where the "current" object is the only one mutated, goes completely against the way the JVM works, unfortunately (for you)!
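For example, with a case class the compiler generates the copy method that those named parameters make pleasant:
case class Point(x: Int, y: Int)

val p1 = Point(1, 2)
val p2 = p1.copy(y = 5) // a new immutable object; p1 itself is untouched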
Actually, the phrase "the current object" makes no sense; really you mean "the current reference"! All other references (outside the current lexical scope) which point to the same object, erm, point to the same object! There is only one object!
Hence it's just not possible for this object to appear mutable from the perspective of the current lexical scope but immutable to others.
If you're interested in some more general theory on how to handle updates to immutable data structures efficiently,
http://en.wikipedia.org/wiki/Zipper_%28data_structure%29
might prove interesting.