I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).
All update operations are cons (head appends).
All read accesses start at the head and traverse the list in order until they find a pre-determined pair.
A cache would be great, since most read accesses will keep the same data over and over.
So, this is basically a "disk-persisted-lazy-cacheable-List" ™
Any ideas on how to get one before I start to roll out my own?
Addendum: yes, mongodb, or any other non-embeddable resource, is overkill. If you are interested in a specific use case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.
The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.
You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files.
Here's one example of a Traversable such as I described, written by Josh Suereth:
class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    // Open the file on each traversal and make sure it is closed afterwards.
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      def loop(): Unit = in.readLine match {
        case null => () // end of file
        case line => f(line); loop()
      }
      loop()
    } finally {
      in.close()
    }
  }
}
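For illustration, a small usage sketch (the file name and line format here are made up, not from the answer):

  // Hypothetical usage: scan the file until a line matching some predicate is found.
  val lines = new FileLinesTraversable(new java.io.File("timeline.txt"))
  val hit: Option[String] = lines.find(_.startsWith("42,"))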
You write:
mongodb, or any other non-embeddable resource, is overkill
Are you aware that there are embeddable database engines, including some really small ones? If you are, I'm not sure what your exact requirement is and why you would not use one of them.
You sure that Hibernate + an embeddable DB (say SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried it, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type-class" MySerializable[T] (like, for instance, Ordering[T] in the Scala standard library).
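For illustration, a minimal sketch of such a type-class; the names and the byte-level encoding below are made up, and a real instance would delegate to whatever serialization framework you pick:

  // Hypothetical type-class: evidence that T can be turned into bytes and back.
  trait MySerializable[T] {
    def toBytes(value: T): Array[Byte]
    def fromBytes(bytes: Array[Byte]): T
  }

  object MySerializable {
    // Example instance for String; real instances would come from your framework of choice.
    implicit val stringSerializable: MySerializable[String] = new MySerializable[String] {
      def toBytes(value: String): Array[Byte] = value.getBytes("UTF-8")
      def fromBytes(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
    }
  }

  // A writer that is completely generic in T via a context bound.
  def append[T: MySerializable](out: java.io.OutputStream, value: T): Unit =
    out.write(implicitly[MySerializable[T]].toBytes(value))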
However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement, because it suggests using Java serialization for this, but I guess you can relax that to "anything serializable with my framework of choice". Java serialization is extremely inefficient in time and space, and it is not designed to serialize a single object: it gives you back output complete with special headers. I would suggest using a different serialization framework - have a look here for a comparison.
Additional reasons not to go on the road of a custom implementation
In addition, it sounds like you would be reading the file essentially backwards, and that's quite a bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.
Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.
If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?
A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently)
You can use Java's MappedByteBuffer directly from Scala. You get to address the file as if it were memory, without actually loading the file into memory.
You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OS's file cache until the OS saw fit to flush them (or you flushed them manually) back to disk.
You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem, since it will still be in memory. (Though if you're reading data that you've written over hours/days, across pages, then it might be a problem.)
A single mapping addressed this way is limited in size to 2 GB, even on a 64-bit JVM, but you can use multiple files (or multiple mappings) to overcome this limitation.
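A minimal sketch of the idea, assuming fixed-size records for simplicity; the file name, record size, and helper names are made up, and a real implementation would need to handle variable-sized entries and growth beyond a single mapping:

  import java.io.RandomAccessFile
  import java.nio.channels.FileChannel

  // Map a window of the file into memory; the OS page cache does the caching.
  val file = new RandomAccessFile("timeline.dat", "rw")
  val windowSize = 1L << 20 // 1 MB window; each mapping must stay under 2 GB
  val buf = file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, windowSize)

  // "Push": append a fixed-size (16-byte) record at the current position.
  def push(x: Long, y: Long): Unit = { buf.putLong(x); buf.putLong(y) }

  // Iterate backwards over the n most recently pushed records
  // (assumes at least n records have been pushed into this window).
  def readBack(n: Int): Seq[(Long, Long)] =
    (1 to n).map { i =>
      val pos = buf.position() - i * 16
      (buf.getLong(pos), buf.getLong(pos + 8))
    }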
These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.
github.com/OpenHFT/Chronicle-Queue
github.com/OpenHFT/Chronicle-Map
Related
I'm reading and writing some text files in Scala. As a complete beginner in the language, I wanted to make sure to find the right way to do it, e.g. get the encoding right.
So most of the stuff I found (also on SO) recommends I use io.Source.fromFile. However, after trying it out like so, reading a UTF-8 file:
val user_list = Source.fromFile("usernames.txt").getLines.toList
val user_list = Source.fromFile("usernames.txt", enc="UTF8").getLines.toList
I looked at the docs but was left with some questions.
Get the encoding right:
the docs show that I can set an encoding in Source.fromFile, as I tried above. Looking at the documentation for Codec and the types listed there, I was wondering if those are all my codec options - is there e.g. no UTF-16, big-endian vs little-endian, etc.?
I am slightly obsessed with this since it used to trip me up in Python a lot. Is this less of a concern with Scala for some reason?
Get the reading in right:
All the examples I looked at used the getLines method and post-processed it with mkString or toList, etc. Is there any advantage to that over just reading in the entire file (my files are small) in one go?
Get the writing out right:
Every source I could find tells me that Scala has no file writing function and to use the Java FileWriter. I was surprised by this - is this still accurate?
Looking at it I feel the question might be a little broad for SO, so I'd be happy to take it back if it does not meet the requirements. At this point, I'm not struggling with specific examples but rather trying to set things up in a way I don't get in trouble later.
Thanks!
Scala only has a basic IO API in the standard library. For the most part you just use the Java APIs. The fact that a decent API exists in Java is probably why the Scala team has not prioritized building a robust and fully featured IO API of its own.
There are also third-party Scala libraries you could use. I've never used Better Files, but I've heard good things about it as a Scala file API. There's also fs2, which provides functional, streaming IO. I'm sure there are others out there as well.
For encoding, there are many possible encodings available. It's just that only a couple of the most common ones are available as static fields; the rest you typically access through Codec("Encoding Name"). Most APIs will also let you just pass a String directly instead of needing to get a Codec instance first. The Codec is really just a wrapper over java.nio.charset.Charset. You can run java.nio.charset.Charset.availableCharsets() to see all of the encodings available on your system.
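For example (a small sketch; the file name is made up), any charset the JVM knows about can be wrapped by name:

  import java.nio.charset.Charset
  import scala.io.{Codec, Source}

  // UTF-16 is not a static field on Codec, but it is available by name.
  val utf16: Codec = Codec("UTF-16")
  val users = Source.fromFile("usernames.txt")(utf16).getLines().toList

  // List every encoding this JVM supports.
  Charset.availableCharsets().keySet().forEach(name => println(name))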
As far as reading goes, if the files are small you can load them fully into memory if you prefer. The only reason not to do so is to avoid the extra memory use of loading the entire file at once when reading through it line by line is enough. You may also want to use Vector instead of List for efficiency reasons (Vector is better in many cases and should probably be preferred as a default collection, but tradition and old habits die hard and most people/guides seem to default to List - but this is a whole other topic).
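A small sketch of both approaches (the file name is made up; in real code you'd also want to close the Source):

  import scala.io.{Codec, Source}

  // Line by line, collected into a Vector rather than a List.
  val lines: Vector[String] =
    Source.fromFile("usernames.txt")(Codec.UTF8).getLines().toVector

  // Or slurp the whole (small) file into a single String.
  val whole: String = Source.fromFile("usernames.txt")(Codec.UTF8).mkString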
I have a bit of “legacy” Scala code (Java-like), which does a bit of data access. There’s a decorator which tracks usage of the DAO methods (collecting metrics), like this:
class TrackingDao(tracker: Tracker) extends Dao {
  def fetchById(id: UUID, source: String): Option[String] = {
    tracker.track("fetchById", source) {
      actualFetchLogic(...)
    }
  }
  ...
}
I'm trying to model this as a Free monad. I've defined the following algebra for the DAO operations:
sealed trait DBOp[A]
case class FetchById(id: UUID) extends DBOp[Option[String]]
...
I see two options:
a) I can make two interpreters that take DBOp (one performs the actual data access, the other does the tracking) and compose them together, OR
b) I make tracking an explicit algebra, and use a Coproduct to combine them both in the same for-comprehension, OR
c) Something completely different!
The first option looks more like a "decorator" approach, which is tied to DBOp; the second is a more generic solution, but it would require calling the 'tracking' algebra explicitly.
In addition, notice the source parameter on the original fetchById call: it's only used for tracking. I'd much rather remove it from the API.
Here's the actual question: how do I model the tracking?
It's not totally clear from your question, but if tracking is a sort of ambient effect that should "happen" when you perform DB access, and source is just an argument for tracking purposes, you may not have to mention it in your Free language at all. You can use the ADT you have now and interpret into (Tracker, Source, OtherStuff) => IO[A], for instance, so what you get back is a function that will produce a program to do DB access once you give it a Tracker, a source, and whatever else you need (a DB connection, for instance); the tracking implementation is entirely private to the interpreter. This lets you write your database program without thinking about tracking at all.
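A minimal sketch of that idea, assuming cats-free and cats-effect are available; the Env type, the stubbed Tracker, and the stand-in fetch result below are made up for illustration:

  import java.util.UUID
  import cats.~>
  import cats.data.Kleisli
  import cats.effect.IO
  import cats.free.Free

  trait Tracker { def track[A](op: String, source: String)(body: => A): A }

  sealed trait DBOp[A]
  case class FetchById(id: UUID) extends DBOp[Option[String]]

  type DB[A] = Free[DBOp, A]
  def fetchById(id: UUID): DB[Option[String]] = Free.liftF(FetchById(id))

  // Everything the interpreter needs but the business logic never mentions.
  case class Env(tracker: Tracker, source: String)
  type Eff[A] = Kleisli[IO, Env, A]

  // Tracking lives entirely inside the interpreter.
  val interpreter: DBOp ~> Eff = new (DBOp ~> Eff) {
    def apply[A](op: DBOp[A]): Eff[A] = op match {
      case FetchById(id) =>
        Kleisli { env =>
          IO(env.tracker.track("fetchById", env.source) {
            None: Option[String] // stand-in for the real fetch logic
          })
        }
    }
  }

  // Usage: fetchById(id).foldMap(interpreter).run(Env(myTracker, "web")) yields an IO[Option[String]]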
If, on the other hand, you do need to talk about tracking in your business logic, then we probably need more information about what it would mean to have multiple Trackers and sources and how they're introduced and used. A coproduct or extended language or nested language might be necessary to deal with what you need to express.
As with everything in our industry, the straight answer is "it depends" :). Since "tracking" is a vague concept here (I don't know the details of the domain), I would say that you have two possible scenarios (or at least I see two):
a) "tracking" is an element of your business vocabulary
If tracking is a separate concern that is part of the vocabulary used by your business, then I would go with a separate algebra representing that concern. Something similar to this would be "authentication & authorization": even though it is a "low-level" concern, it is still part of the business language ("As an admin I want to..."), so I would go with a separate algebra here.
b) "tracking" is mechanism to some 'debugging', 'logging'
If tracking is not part of the language, but an element of the machinery that you keep for maintenance, then I would keep it where it belongs - in the machinery. I would go with an interpreter that side-effects with 'tracking' (logging, debugging) on those different calls.
In other words, if right now you don't have a single test that checks "if I do this business thingy, then this should be tracked", then most definitely I would go with option b) here.
I am a Scala newcomer, but have some Java background.
When writing Scala code it is common to deal with an Option value in this style:
val text = Option("Text")
val length = text.map(s => s.size)
but, as far as I know, each s => s.size creates a new Function1[A, B] class. And if I do, for example, 8 such conversions, it will create 8 additional classes. When binding forms I use such snippets very heavily, so the question is:
Should I use it less and maybe substitute it with if-expressions, or is such a flood of classes not critical for the JVM, or does the Scala compiler do some kind of magic?
Update: a perhaps more concrete example is:
case class Form(name: Option[String], surname: Option[String])
val bindedForm = Form(Option("John"), Option("Smith"))
val person = new Person
bindedForm.name.foreach(a => person.setName(a))
bindedForm.surname.foreach(a => person.setSurname(a))
will it produce two different Function1 classes? What if there are hundreds of such conversions?
If you are developing for Android, then you're probably going to run the code using Dalvik, which has an annoying 64k method limitation (though there are ways around it). Since each class requires a couple of methods (constructor and apply), this can be a problem.
Otherwise, classes on the Sun/Oracle JVM go into PermGen space, which you can adjust when launching the JVM if you really need to. It really doesn't matter. Yes, you'll have lots of classes, maybe tens of thousands, but the JVM can handle it fine (at least if you're willing to give it a heads-up about what to expect). Unless you know you're very likely to run into some unusual constraint, this is not something you should be worrying much about.
Far more often one might be worried that creating all those functions has a performance penalty, but if you're not actually running into that penalty now, don't worry about that either; this is something that the Scala compiler can in principle fix, and it's getting cleverer all the time. So just write code the idiomatic way, and unless it's a big performance problem now, hope that the compiler will come save you. There's a decent chance it will, and an even better chance that you'll find it easier to write it the "right" way and then refactor for performance where needed than to adopt a policy of using a more awkward construct just in case there might be a problem. Of course, there are some places that you may know in advance are certain to be a bottleneck, like wrapping each byte you read from a huge file in an Option, but aside from blatant stuff like that, you're better off reacting to known problems than proactively avoiding closures.
I am in the process of writing a toy compiler in scala. The target language itself looks like scala but is an open field for experiment.
After several large refactorings I can't find a good way to model my abstract syntax tree. I would like to use the facilities of Scala's pattern matching; the problem is that the tree carries changing information (like types and symbols) along the compilation process.
I can see a couple of solutions, none of which I like :
case classes with mutable fields (I believe the Scala compiler does this): the problem is that those fields are not present at each stage of the compilation and thus have to be nulled (or Option'd), and it becomes really heavy to debug/write code. Moreover, if, for example, I find a node with a null type after the typing phase, I have a really hard time finding the cause of the bug.
huge trait/case class hierarchy: something like Node, NodeWithSymbol, NodeWithType, ... Seems like a pain to write AND to work with
something completely hand-crafted with extractors
I'm also not sure if it is good practice to go with a fully immutable AST, especially in Scala, where there is no implicit sharing (because the compiler is not aware of immutability), and it could hurt performance to copy the tree all the time.
Can you think of an elegant pattern to model my tree using Scala's powerful type system?
TL;DR I prefer to keep the AST immutable and carry things like type information in a separate structure, e.g. a Map, that can be referred to by IDs stored in the AST. But there is no perfect answer.
You're by no means the first to struggle with this question. Let me list some options:
1) Mutable structures that get updated at each phase. All the up and downsides you mention.
2) Traits/cake pattern. Feasible, but expensive (there's no sharing) and kinda ugly.
3) A new tree type at each phase. In some ways this is the theoretically cleanest. Each phase can deal only with a structure produced for it by the previous phase. Plus the same approach carries all the way from front end to back end. For instance, you may "desugar" at some point and having a new tree type means that downstream phase(s) don't have to even consider the possibility of node types that are eliminated by desugaring. Also, low level optimizations usually need IRs that are significantly lower level than the original AST. But this is also a lot of code since almost everything has to be recreated at each step. This approach can also be slow since there can be almost no data sharing between phases.
4) Label every node in the AST with an ID and use that ID to reference information in other data structures (maps and vectors and such) that hold information computed for each phase. In many ways this is my favorite. It retains immutability, maximizes sharing and minimizes the "excess" code you have to write. But you still have to deal with the potential for "missing" information that can be tricky to debug. It's also not as fast as the mutable option, though faster than any option that requires producing a new tree at each phase.
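A minimal sketch of option 4; all the names here (NodeId, Expr, TypedProgram, etc.) are illustrative, not taken from any particular compiler:

  // Immutable AST: each node carries only a stable ID.
  final case class NodeId(value: Int)

  sealed trait Expr { def id: NodeId }
  final case class Lit(id: NodeId, value: Int) extends Expr
  final case class Add(id: NodeId, lhs: Expr, rhs: Expr) extends Expr

  // Per-phase information lives in side tables keyed by NodeId.
  sealed trait Type
  case object IntType extends Type

  // The typing phase returns the untouched tree plus a table beside it;
  // later phases look information up instead of mutating nodes.
  final case class TypedProgram(tree: Expr, types: Map[NodeId, Type])

  def typeOf(types: Map[NodeId, Type])(e: Expr): Option[Type] = types.get(e.id)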
I recently started writing a toy verifier for a small language, and I am using the Kiama library for the parser, resolver and type checker phases.
Kiama is a Scala library for language processing. It enables convenient analysis and transformation of structured data. The programming styles supported by the library are based on well-known formal language processing paradigms, including attribute grammars, tree rewriting, abstract state machines, and pretty printing.
I'll try to summarise my (fairly limited) experience:
[+] Kiama comes with several examples, and the main contributor usually responds quickly to questions asked on the mailing list
[+] The attribute grammar paradigm allows for a nice separation into "immutable components" of the nodes, e.g., names and subnodes, and "mutable components", e.g., type information
[+] The library comes with a versatile rewriting system which - so far - covered all my use cases
[+] Parts of the library, e.g., the pretty printer, make nice examples of DSLs and of various functional patterns/approaches/ideas
[-] The learning curve is definitely steep, even with examples and the mailing list at hand
[-] Implementing the resolving phase in a "purely functional" style (cf. my question) seems tricky, but a hybrid approach (which I haven't tried yet) seems to be possible
[-] The attribute grammar paradigm and the resulting separation of concerns doesn't make it obvious how to document the properties that nodes have in the end (cf. my question)
[-] Rumour has it that the attribute grammar paradigm does not yield the fastest implementations
Summarising my summary, I enjoy using Kiama a lot and I strongly recommend that you give it a try, or at least have a look at the examples.
(PS. I am not affiliated with Kiama)
Do people really use Scala's Stream class in production code, or is it primarily of academic interest?
There's no problem with Stream, except when people use it to replace Iterator, as opposed to replacing List, which is the collection most similar to it. In that particular case, one has to be careful in its use. On the other hand, one has to be careful using Iterator as well, since each element can only be iterated through once.
So, since both have their own problems, why single out Stream's? I daresay it's simply that people are used to Iterator from Java, whereas Stream is a functional thing.
Even though I wrote that Iterator is what I want to use nearly all the time, I do use Stream in production code. I just don't automatically assume that its cells are garbage collected.
Sometimes Stream fits the problem perfectly. I think the API docs give some good examples where recursion is involved...
Look here. This blog post describes how to use Scala Streams (along with a memory-mapped file) to read large files (1-2 GB) efficiently.
I have not tried it yet, but the solution looks reasonable. Stream provides a nice abstraction on top of the low-level ByteBuffer Java API for handling a memory-mapped file as a sequence of records.
Yes, I use it, although it tends to be for something like this:
(as.toStream collect expensiveConversionToB) match {
  case b #:: _ => // found my expensive b
  case _ =>
}
Of course, I might use a non-strict view and a find for this example.
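For reference, a sketch of that view-based alternative (using headOption in place of find):

  // Non-strict view: expensiveConversionToB is only applied until the first hit.
  (as.view collect expensiveConversionToB).headOption match {
    case Some(b) => // found my expensive b
    case None    =>
  }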
Since the only reason not to use Streams is that it can be tricky to ensure the JVM isn't keeping references to early conses around, one approach I've used that's fairly nice is to build up a Stream and immediately convert it to an Iterator for actual use. It loses a little of Stream's nice properties on the use side, especially with respect to backtracking, but if you're only going to make one pass over the result it's frequently easier to build a structure this way than to contort into the hasNext/next() model of Iterator directly.
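A small sketch of that pattern; the generator below is a made-up stand-in for whatever actually builds the Stream:

  // Build lazily as a Stream, but consume it through an Iterator so that no
  // long-lived reference pins the head of the Stream in memory.
  def results: Stream[Int] = {
    def loop(i: Int): Stream[Int] = i #:: loop(i + 1) // stand-in for real work
    loop(0)
  }

  val it: Iterator[Int] = results.iterator
  it.take(5).foreach(println) // one forward pass; early conses can be collected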