I'm looking for something in Scala similar to what PyTables provides. PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data.
Any suggestions?
I had a quick look at PyTables, and I don't think there's anything remotely like it in Scalaland (or indeed Javaland), but we have a few of the ingredients necessary to make it a possibility if you want to invest the time:
scala.Dynamic to do idiomatic selection on data-driven structures (see the sketch after this list)
A bunch of graph databases to provide the underlying navigational persistence substrate (I've had acceptable results from OrientDB, which has a better license than most)
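To illustrate the scala.Dynamic ingredient, here is a minimal sketch of PyTables-style dotted navigation over a data-driven tree; the Node class and its contents are made up for illustration:

```scala
import scala.language.dynamics

// A node in a hierarchical dataset whose children are looked up by name at
// runtime, enabling PyTables-style navigation like file.root.group.table.
class Node(val name: String, children: Map[String, Node] = Map.empty)
    extends Dynamic {
  def selectDynamic(child: String): Node =
    children.getOrElse(child, sys.error(s"no child named '$child'"))
}

val root = new Node("root",
  Map("group1" -> new Node("group1", Map("table1" -> new Node("table1")))))

root.group1.table1.name  // "table1" -- each dot compiles to a selectDynamic call
```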
PyTables is a Python package built on top of HDF5, with some added niceties that let you work on the data in a Pythonic way and get good indexing support. I'm not sure if there's a package implemented in a similar way in Scala, but you can use the same HDF5-based hierarchical data storage via the HDF5 implementation in Java: HDF Java
Related
There are ways to replace SQL databases in Haskell and Clojure:
http://www.datomic.com/ (Clojure)
https://github.com/dmbarbour/haskell-vcache
https://hackage.haskell.org/package/acid-state
However, I cannot find a library for doing this in Scala, e.g. one built on akka-persistence.
I wonder why?
I heard that https://www.querki.net/ is doing something similar (https://github.com/jducoeur/Querki), but it is not a copyleft library (unlike acid-state for Haskell).
I wonder if I am looking at this from the wrong angle: why do other languages have these solutions while Scala does not seem to? Maybe there is a fundamental reason for that? Am I missing something?
The libraries you mention do quite different things:
akka-persistence stores the state of an actor, which is relevant if you have an actor that keeps internal state. This is quite specialized (see the sketch after this list).
acid-state serializes Haskell data to disk.
Datomic is a system for recording temporal data in a way that does not destroy the original data.
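To make the akka-persistence point concrete, here is a minimal sketch using the classic akka-persistence API: what gets stored is the event stream, and the actor's internal state is rebuilt by replaying those events on recovery.

```scala
import akka.persistence.PersistentActor

// A counter whose internal state survives restarts: commands are turned
// into persisted events, and recovery replays those events to rebuild state.
class Counter extends PersistentActor {
  override def persistenceId: String = "counter-1"

  private var count = 0

  override def receiveCommand: Receive = {
    case "increment" =>
      persist("incremented") { _ => count += 1 } // state changes only after the event is stored
    case "get" =>
      sender() ! count
  }

  override def receiveRecover: Receive = {
    case "incremented" => count += 1 // replayed on actor restart
  }
}
```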
Object stores work well with dynamic languages like Clojure and Python, since those languages work with dynamic data that can be serialized to disk.
I found it much nicer to work with MongoDB in Python than in Scala.
When the NoSQL movement started there was initial excitement, but after using these systems some people realized that you give up good properties that relational databases have.
Datomic is an interesting project with new ideas. There is a Scala clone of it, though I'm not sure how stable it is:
https://github.com/dwhjames/datomisca
I see that you can use datastore to hold key/value pairs, process data in chunks, and pass it to mapreduce. Does this mean that the datastore object in MATLAB is like a NoSQL database? If not, how does it differ?
In case of any ambiguity about what characterises a NoSQL database, I am considering as a starting point these characteristics obtained from dba.stackexchange: https://dba.stackexchange.com/a/25/35729
You'll find that NoSQL databases have a few common characteristics. They can be roughly divided into a few categories:
key/value stores
Bigtable inspired databases (based on the Google Bigtable paper)
Dynamo inspired databases
distributed databases
document databases
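To make the first category concrete, here is a minimal in-memory sketch of a key/value store's data model (real stores add persistence, replication and distribution; the class and key names are made up for illustration):

```scala
import scala.collection.mutable

// The whole "schema" of a key/value store: opaque values addressed by key.
class KVStore[K, V] {
  private val data = mutable.Map.empty[K, V]

  def put(key: K, value: V): Unit = data.update(key, value)
  def get(key: K): Option[V]      = data.get(key)
  def delete(key: K): Unit        = data.remove(key)
}

val store = new KVStore[String, String]
store.put("user:1", """{"name": "Ada"}""")
store.get("user:1") // Some({"name": "Ada"})
```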
In MATLAB you can always import Java classes and use any Java library (with the one difference that there is no multithreading). For this reason you typically won't find many libraries written in MATLAB that do the same thing as a Java library. In general I would also say it's harder to write a library in MATLAB, which may be a factor in the lack of libraries as well.
I think your only option is to use a Java library, which IMHO is a much better choice anyway: Java is so much more popular with programmers working with databases that it will always have better-maintained libraries. The one drawback is that you can't implement Java interfaces in MATLAB (correct me if I'm wrong), which can become a massive pain.
So not really. Here is a MongoDB example on GitHub: https://github.com/HanOostdijk/matlab_mongodb
I've just finished watching Week 6 of Martin Odersky's lectures about Scala on Coursera. In Lecture 5 he says that
"...the translation of for is not limited to lists or
sequences, or even collections;
It is based solely on the presence of the methods map, flatMap and
withFilter.
This lets you use the for syntax for your own types as well – you
must only define map, flatMap and withFilter for these types."
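To make that concrete, here is a minimal sketch of an Option-like type that works with for-syntax purely because it defines those three methods (Maybe, Just and Empty are made-up names, not standard library types):

```scala
sealed trait Maybe[+A] {
  def map[B](f: A => B): Maybe[B] = flatMap(a => Just(f(a)))
  def flatMap[B](f: A => Maybe[B]): Maybe[B] = this match {
    case Just(a) => f(a)
    case Empty   => Empty
  }
  def withFilter(p: A => Boolean): Maybe[A] = this match {
    case Just(a) if p(a) => this
    case _               => Empty
  }
}
case class Just[+A](value: A) extends Maybe[A]
case object Empty extends Maybe[Nothing]

// The compiler rewrites this into map/flatMap/withFilter calls on Maybe:
val result = for {
  x <- Just(21)
  y <- Just(2)
  if y != 0
} yield x * y // Just(42)
```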
The problem I'm trying to solve is that we have a batch process that loads data from a couple of databases, combines the data and exports the results in some way. The data is small enough to fit in memory (a few hundred thousand records from each source system), but large enough that it is important to think about performance.
I could use a traditional in-memory database (like H2) and access it via ScalaQuery or something similar, but what I really need is just a way to search and join data from the different source systems efficiently, the equivalent of SQL's indexes and JOINs. It feels really awkward to use a full-blown relational database + a Scala ORM for something that could be solved easily and efficiently by some data structure that is native to Scala.
My first naive approach would be a Vector data structure (for fast direct access) combined with one or more "indexes" (which could be implemented as B-Trees just like in database systems). The map, flatMap, withFilter methods of this combined data structure could be intelligent enough to use an index if they have one for the queried field(s) - or they could have a "hint" to use one.
I was just wondering if such data structures already exist and are available, or do I need to implement them myself? Is there a library or a collection framework for Scala that solves this problem?
Not in the standard library (except for Vector, of course), and I don't know of any non-standard library providing them.
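For what it's worth, here is a minimal sketch of the idea from the question, using a hypothetical record type: a Vector for fast direct access plus a hash "index" on one field, so equi-lookups (and hence hash joins) avoid a full scan.

```scala
// Hypothetical record type, just for illustration.
case class Person(id: Int, city: String)

// A Vector of rows plus one index: a map from a field's value to the
// positions of the rows carrying that value.
class Indexed[K, A](val rows: Vector[A], key: A => K) {
  private val index: Map[K, Vector[Int]] =
    rows.indices.toVector.groupBy(i => key(rows(i)))

  // O(1) average lookup instead of an O(n) scan.
  def lookup(k: K): Vector[A] =
    index.getOrElse(k, Vector.empty).map(rows)
}

val people = new Indexed(
  Vector(Person(1, "Oslo"), Person(2, "Bergen"), Person(3, "Oslo")),
  (p: Person) => p.city)

people.lookup("Oslo") // Vector(Person(1,Oslo), Person(3,Oslo))
```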
As a personal project, I'm looking to build a rudimentary DBMS. I've read the relevant sections in Elmasri & Navathe (5th ed.), but could use a more focused text: something a bit more practical and detail-oriented, with real-world recommendations, as E&N only went so deep.
The rub is that I want to play with novel non-relational data models. While a lot of E&N was great (indexing implementation details in particular), its more advanced DBMS implementation material targeted only the relational model.
I'd like to defer staring at DBMS source for a while if I can until I've got a better foundation. Any ideas?
First of all, you have to understand the properties of each system. I suggest you read this post; it's the first step to understanding NoSQL, or "Not Only SQL". Secondly, you can check this blog post to understand all of this visually.
Finally, glance at open-source projects such as MongoDB, CouchDB, etc.; to see the list, you can go here.
Actually, the first step would be to understand the hierarchical, network, navigational, and object models, which are alternatives to the relational model. I'm not sure where XML fits in, i.e. what model it is. As far as structure goes, research B-tree (not binary tree) implementations.
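Since B-trees come up constantly in DBMS implementation, a minimal sketch of the node structure and search may help make the suggestion concrete; insertion and node splitting, the interesting parts, are omitted.

```scala
// An internal node with keys k(0)..k(n-1) has n+1 children: keys below k(0)
// live under children(0), keys in [k(i-1), k(i)) under children(i), and so on.
sealed trait BTree[K, V]
case class Leaf[K, V](entries: Vector[(K, V)]) extends BTree[K, V]
case class Internal[K, V](keys: Vector[K], children: Vector[BTree[K, V]])
    extends BTree[K, V]

def search[K, V](tree: BTree[K, V], k: K)(implicit ord: Ordering[K]): Option[V] =
  tree match {
    case Leaf(entries) =>
      entries.collectFirst { case (key, v) if ord.equiv(key, k) => v }
    case Internal(keys, children) =>
      // Descend into the child whose key range contains k.
      val i = keys.indexWhere(key => ord.lt(k, key)) match {
        case -1 => children.length - 1
        case j  => j
      }
      search(children(i), k)
  }
```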
I'd like to find a good and robust MapReduce framework that can be used from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011):
There is also the Scoobi framework, which is remarkably expressive.
http://hadoop.apache.org/ is language-agnostic.
Personally, I've become a big fan of Spark
http://spark-project.org/
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive mapreduce operations.
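For example, a word count in Spark keeps the intermediate datasets in memory rather than writing to disk between phases. A minimal sketch using the later Apache Spark API (the input path is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("hdfs:///input/corpus.txt") // hypothetical path
      .flatMap(_.split("\\s+"))   // map phase: split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // reduce phase: sum counts per word

    counts.collect().foreach(println)
    sc.stop()
  }
}
```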
You may be interested in scouchdb, a Scala interface to CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining a key as a String with a key as a Long, lead to run-time failures.
To further jshen's point: Hadoop Streaming simply uses Unix streams, so your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
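For instance, a streaming word-count mapper in Scala is just such a stdin-to-stdout filter (a sketch; the object name is made up, and you would pass the compiled program via the streaming jar's -mapper option):

```scala
// Reads raw lines from stdin and emits "word<TAB>1" records on stdout;
// Hadoop Streaming handles the shuffle between mapper and reducer.
object StreamingMapper {
  def main(args: Array[String]): Unit =
    for {
      line <- scala.io.Source.stdin.getLines()
      word <- line.split("\\s+") if word.nonEmpty
    } println(s"$word\t1")
}
```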
I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.