Polyglot persistence for a startup app - Scala

I'm in the process of developing my next app, and I'm really interested in using polyglot persistence. I like the idea of being able to query different data structures for different services. Essentially, I want to keep MongoDB, Neo4j/Titan, SQL, and maybe Cassandra/HBase in sync.
Currently, I'm wrapping everything in a try/catch block and rolling them all back if one fails. However, this is taxing my write times. I've also looked into AMQP systems like Kafka or ZeroMQ, but these seem more big-data centric, whereas my app is still small and I want to keep it efficient.
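For reference, the current write path is roughly the following (a simplified sketch, not my real code; Store here is just a stand-in for each database client):

    trait Store {
      def write(event: String): Unit      // throws on failure
      def rollback(event: String): Unit   // best-effort compensation
    }

    // Synchronous fan-out: every store must acknowledge before the request returns,
    // so the total write time is roughly the sum of the individual store latencies.
    def writeAll(stores: Seq[Store], event: String): Unit = {
      val done = scala.collection.mutable.ListBuffer.empty[Store]
      try {
        stores.foreach { s => s.write(event); done += s }
      } catch {
        case e: Exception =>
          done.foreach(_.rollback(event))  // undo the stores that already succeeded
          throw e
      }
    }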
Has anyone had experience with this? Is an MQ a good idea for a small app, or am I prematurely optimizing?
Thanks

I know quite a bit about ZeroMQ, but not a lot about the database servers you mention.
First, you're a bit confused about ZeroMQ. Although it is derived from experience with AMQP, it uses the ZMTP wire protocol, which was custom-designed during ZeroMQ's development (though other applications now use it as well).
ZeroMQ is a small and very fast MQ library that is symmetrical for all nodes; it is very good for small apps. The problem here is that you need something on the other systems that talks ZMTP, whether it's ZeroMQ or a bridge. If you intend to write plugins or the like for the other systems, then fine.
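To give a feel for the size of the thing, a bare-bones PUB/SUB pair looks like this (a sketch using the JeroMQ Java binding from Scala; the endpoint and topic are made up, and jzmq works the same way):

    import org.zeromq.ZMQ

    object WritePublisher {
      def main(args: Array[String]): Unit = {
        val ctx = ZMQ.context(1)
        val pub = ctx.socket(ZMQ.PUB)
        pub.bind("tcp://*:5556")
        Thread.sleep(100)                 // crude: give subscribers a moment to connect
        pub.send("writes {\"op\":\"insert\",\"collection\":\"users\"}".getBytes, 0)
        pub.close(); ctx.term()
      }
    }

    object WriteSubscriber {
      def main(args: Array[String]): Unit = {
        val ctx = ZMQ.context(1)
        val sub = ctx.socket(ZMQ.SUB)
        sub.connect("tcp://localhost:5556")
        sub.subscribe("writes".getBytes)  // topic filter is a byte-prefix match
        println(new String(sub.recv(0)))  // blocks until a matching message arrives
        sub.close(); ctx.term()
      }
    }

Both ends are just library calls in your own process; there is no broker to install, which is what makes it attractive for a small app.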
I presume, though, that you are using JMS to talk to the other systems and don't intend to develop add-ons for them, in which case you're probably stuck with JMS. Kafka is a new one that I haven't caught up with, but RabbitMQ is a good, fast, and small broker, FWIW. There are a great many broker comparisons out there for you to find. Many are dodgy in the sense that one small tweak of a setting can affect performance greatly, so they are not necessarily comparing apples with apples. If you want to compare broker performance in your environment, there isn't much of a shortcut to doing it yourself.
One thing that is confusing me is how you expect a broker to help your rollback performance. You'll still need to do the rollback in essentially the same manner, albeit asynchronously via the broker.

I work at CloudBoost.io (https://www.cloudboost.io), where we build a layer that sits on top of databases and gives you the power of polyglot persistence. We integrate MongoDB, ElasticSearch, Redis, Cassandra, and Neo4j and give you one single API where you can query / store your data. We automatically shard your data into various databases based on your query / storage request patterns.
Let me know if this helps. :)

For syncing data from MongoDB to Neo4j there is now the Neo4j Doc Manager project. It works by monitoring MongoDB for operations, converting each document operation into a property graph model, and immediately writing it to Neo4j.

Related

NoSQL (e.g. MongoDB) or RDBMS (e.g. PostgreSQL) for a new Scala project?

I'm developing a brand new project in Scala. It's just an application for a bunch of CRUD operations; however, because of some eccentric requirements, neither Play2 nor Lift fits the bill, so I'm going to develop the application from the ground up. This means that Anorm or ScalaQuery become less obvious choices for database integration, and leaves me with the question: is it time to try something new?
My past technology stacks mostly included Java and PostgreSQL, and I have experience with both ORM and plain SQL. Are NoSQL database management systems like MongoDB a good replacement for a typical RDBMS, or are they special-case application data stores? Also, how does the choice of database affect the greater Scala system design (if at all)? For example, the fact that you are using a JSON-like interface to talk to the database, and JSON between the web and a REST service, does not mean that much if everything in the middle becomes Scala objects, or does it?
I'm basically asking for someone's experience of moving from relational to object/document-type databases, using Scala in particular. I know that good RDBMS integration is promised in the upcoming release of SLICK. So, if a company like Typesafe decides to make RDBMS integration part of the Typesafe stack, will I be swimming upstream by integrating with MongoDB using Casbah, for example?
Apologies if this question appears a bit vague. I do hope that someone with the right insights or experience will be able to help though.
Update:
Apologies for not adding links to SLICK (it being fairly new). Here goes:
Quick overview
Project home
Update 2:
My personal first win for a technology is usually developer productivity; this translates to lightweight and simple: quick to learn, easy to maintain, no magic.
I am currently in a similar situation, and since I have some experience with web development and SQL databases, I took it as an opportunity to work with MongoDB, Casbah (and Scalatra). My experience is still very limited, and the project and the amount of data I am working with are pretty small, but here are a few observations I've made.
For the few sets of data I have, performance does not seem to favour either SQL or NoSQL. However, performance in the presence of huge amounts of data is often listed as a reason for using NoSQL, e.g., by Wikipedia.
My documents (entries in the database) arise from benchmarking test suites and mainly have a static structure, so I am optimistic that I could store them in a fixed-schema SQL database. However, a few substructures are not static, e.g., new test cases are added, new statistics are tracked, others are removed. This was my main motivation for trying a schema-free NoSQL database, and also because I had the feeling that the document approach of MongoDB makes it much more obvious which data belongs together (i.e., to a document), in contrast to entries in a relational database, where the data would be distributed over various tables and rows, and where a full "document" would need to be reconstructed by joins.
Tools such as Lift-Json or Rogue allow you to work with regular Scala objects in a type-safe way, even though the data is regularly (de)serialised to (from) JSON. However, this naturally works best if the structure of your data is mainly static; otherwise you are left with using strings to access your data (e.g., when working with the results of a query using Casbah).
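As an illustration, here is roughly how that split looks with lift-json (the field names and values are invented): the static part goes into a case class, the variable part is reached by key.

    import net.liftweb.json._

    case class Benchmark(suite: String, testCase: String, millis: Long)

    object JsonAccess {
      implicit val formats = DefaultFormats

      def main(args: Array[String]): Unit = {
        val json = parse("""{"suite":"parser","testCase":"large-input","millis":421,
                            "extra":{"gcCount":7}}""")
        val b  = json.extract[Benchmark]                       // type-safe: the static structure
        val gc = (json \ "extra" \ "gcCount").extractOpt[Int]  // stringly-typed: the variable part
        println(b.suite + "/" + b.testCase + ": " + b.millis + " ms, gc = " + gc)
      }
    }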
If you are mainly concerned about a coherent representation of data on the server and client side, languages such as Opa or Haxe might be of interest, since they compile to code that can be executed on both sides. See this page for "multitarget" or "tierless" languages.
This got too long for a comment. I was just trying to relate my short experience with Scala (about 6 months now, since about when Play2 came out; it has quickly become my go-to language).
I've enjoyed using Salat/Casbah with MongoDB in my last few projects; most have been in Play2, but the latest was without a webapp framework. It definitely hasn't felt like swimming upstream.
I would say that there are particular use cases for which I wouldn't use mongo, but it works nicely as a general purpose object data store, especially if you expect to query by id or index and don't need transactions (and will need minimal ad-hoc aggregation type stuff).
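To show what I mean by "general purpose object data store" (the database, collection, and field names here are invented), Salat maps a case class straight to a Mongo document and gives you a DAO for the by-id and indexed lookups:

    import com.mongodb.casbah.Imports._
    import com.novus.salat._
    import com.novus.salat.global._
    import com.novus.salat.dao.SalatDAO

    case class Player(_id: ObjectId = new ObjectId, name: String, rating: Int)

    // MongoClient() may be MongoConnection() on older Casbah versions.
    object PlayerDAO extends SalatDAO[Player, ObjectId](
      collection = MongoClient()("myapp")("players"))

    object PlayerDemo {
      def main(args: Array[String]): Unit = {
        val id = PlayerDAO.insert(Player(name = "alice", rating = 1200))
        println(PlayerDAO.findOneById(id.get))   // primary-key lookup
        println(PlayerDAO.find(MongoDBObject("rating" -> MongoDBObject("$gte" -> 1000))).toList)
      }
    }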
Expect to require a separate set of servers dedicated to mongodb (or to use a service dedicated to mongodb), but I guess that's normal for most serious database apps.
I've also used Play2/Anorm, which was surprisingly enjoyable to use for some ad-hoc query dashboard-style report pages. I started trying to go the Squeryl route, but Anorm seemed easier to use for one-off aggregation queries. Haven't looked at SLICK, but it sounds interesting.
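For the ad-hoc reporting case, Anorm is really just SQL strings plus a row parser. A sketch (table and column names are invented, and I'm using a plain JDBC connection here instead of Play's DB.withConnection):

    import java.sql.DriverManager
    import anorm._
    import anorm.SqlParser._

    object SalesReport {
      def main(args: Array[String]): Unit = {
        implicit val conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost/shop", "app", "secret")

        // One-off aggregation straight in SQL, no mapping layer to fight with.
        val ordersPerDay: List[(String, Long)] =
          SQL("""select to_char(created_at, 'YYYY-MM-DD') as day, count(*) as total
                 from orders group by 1 order by 1""")
            .as((str("day") ~ long("total")).map(flatten).*)

        ordersPerDay.foreach(println)
        conn.close()
      }
    }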
It's really hard to say without knowing what problems you would like the app to solve.
I've personally found my productivity increased using NoSQL DBs via REST/JSON. Bear in mind, though, that most NoSQL DBs offer REST interfaces, which removes the need for much middleware, Scala or otherwise, unless you intend to write a webapp with a UI.
If this is a learning exercise, I recommend you try multiple things out, as each NoSQL DB has something different to offer to your toolkit. I have personally found that CouchDB, Riak, Neo4j, and MongoDB all have various pluses and drawbacks and are good for different purposes.
Hope this helps, good luck.

Best suited NoSQL database for Content Recommender

I am currently working on a project which includes migrating a content recommender from MySQL to a NoSQL database for performance reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase, and Cassandra. The idea is to choose a database that is capable of running on a single server or in a cluster.
So far we have discarded HBase due to its dependency on a distributed environment. Even though the idea is to scale horizontally, we need to run the DB on a single server for a little while in production. MongoDB was also discarded because it does not support map/reduce features.
We still have two alternatives, and we have no solid background to decide. Any guidance or help is appreciated.
NOTE: I do not intend to start a religion-like discussion based on unfounded arguments. It is a strictly technical question to be discussed in the problem's context.
Graph databases are usually considered best suited for recommendation engines, since a lot of recommendation algorithms are actually graph-based. I recommend looking into Neo4j - it can handle billions of nodes/edges on a single machine, and it supports a so-called high-availability mode, which is a master-slave setup with automatic master selection.
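As a concrete illustration (not your data model: the labels, relationship types, and the embedded 2.x-era Java API call are all assumptions), a typical "people who liked what you liked also liked" query is a one-line pattern match in Cypher rather than a pile of self-joins:

    import org.neo4j.graphdb.factory.GraphDatabaseFactory
    import scala.collection.JavaConverters._

    object Recommend {
      def main(args: Array[String]): Unit = {
        val db = new GraphDatabaseFactory().newEmbeddedDatabase("data/recommender.db")
        // Collaborative-filtering shape: items liked by peers who share my likes,
        // excluding items I already liked, ranked by how many peers agree.
        val cypher =
          """MATCH (me:User {id: {userId}})-[:LIKED]->(:Item)<-[:LIKED]-(peer)-[:LIKED]->(rec:Item)
            |WHERE NOT (me)-[:LIKED]->(rec)
            |RETURN rec.title AS title, count(*) AS score
            |ORDER BY score DESC LIMIT 10""".stripMargin

        val tx = db.beginTx()
        try {
          val result = db.execute(cypher, Map[String, AnyRef]("userId" -> Long.box(42)).asJava)
          result.asScala.foreach(row => println(row.get("title") + " -> " + row.get("score")))
          tx.success()
        } finally tx.close()
        db.shutdown()
      }
    }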

Real-time auction updates - Comet? Tornado? ActiveMQ?

I'm in the process of deciding how to write an online auction application. I would like to provide real-time updates to the site users. My background is with LAMP (although, in my case, the 'P' would be more for Perl than PHP). I've considered ActiveMQ, but I'm wondering if there are better options.
My primary concerns are scalability and speed. There could be several simultaneous auctions taking place, with [hopefully] many users participating in each auction. Whatever solution I decide on would have to accommodate such a scenario. Of course, this is all in theory, so I have no idea how many concurrent users I might have, but I'd like to have the means to support tens of thousands of users.
Another concern is ease of implementation. I've spent the past few days reading docs and tutorials and, so far, nothing has come across as anything less than a bit of a pain in the rear to deal with, which is actually what has led me here to seek some advice.
I was hoping to use a web framework, such as CodeIgniter (PHP) or Catalyst (Perl), because I intend to pay a contractor or two to help with some of the bulk of the coding, and I like the idea of having a framework to somewhat enforce a design pattern. However, the more I look into this, the less I see an obvious solution that lets me 1) use a framework, and 2) provide real-time auction updates (other than Tornado, I guess - maybe I'm answering my own question. ;)).
So, with all that said, short of using polling (which I'm not really interested in doing), is there a way that I can accomplish these real-time updates using a language like Perl or PHP for my server-side code? I know that ActiveMQ supports STOMP, and I actually have this working on my local machine (using Jetty since it requires a servlet to publish/consume messages from client-side javascript), but is there a better option here?
I'm sorry that I don't have a more direct question, but after several days of looking at docs and tutorials, I'm more lost than ever!
Part of your problem is that you're mixing a variety of concepts together. If I read things correctly, you have a problem statement of:
I'm building an online auction site and would like to insure that my visitors have real-time updates of prices on the items they are viewing.
Now, between the browser and the server you'll probably use a Comet-style request pattern to handle communications; you could also look at socket.io as a backup pattern. This long-polling will require a server that is able to handle lots of simultaneous open connections, for which Tornado is a good candidate (there are others, but given that you asked in relation to Tornado, it's a good fit).
Now that we've gone from thousands of browsers down to a handful of Tornado servers, you need a way to communicate between them. In the land of publish/subscribe messaging patterns you have a few choices:
RabbitMQ (AMQP)
ZeroMQ
Redis Pub/Sub
All three are good choices, each with their own pros and cons. Personally I've used Redis and Rabbit on different projects and have just toyed with ZeroMQ. The message broker is a whole decision tree that is going to be based on what you have available.
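For a sense of scale, the Redis option is only a few lines. This sketch uses the Java/Scala Jedis client just to keep one language in this write-up (channel name and payload are made up); the Perl and PHP Redis clients expose the same PUBLISH/SUBSCRIBE commands, and each Tornado-like process would hold one subscriber connection:

    import redis.clients.jedis.{Jedis, JedisPubSub}

    object AuctionFeedSubscriber {
      def main(args: Array[String]): Unit = {
        val listener = new JedisPubSub {
          override def onMessage(channel: String, message: String): Unit =
            println("push to browsers watching " + channel + ": " + message)
          // the remaining callbacks are not needed here
          override def onPMessage(p: String, c: String, m: String): Unit = ()
          override def onSubscribe(c: String, n: Int): Unit = ()
          override def onUnsubscribe(c: String, n: Int): Unit = ()
          override def onPSubscribe(p: String, n: Int): Unit = ()
          override def onPUnsubscribe(p: String, n: Int): Unit = ()
        }
        new Jedis("localhost").subscribe(listener, "auction:1234")  // blocks this thread
      }
    }

    object AuctionFeedPublisher {
      def main(args: Array[String]): Unit = {
        new Jedis("localhost").publish("auction:1234", """{"bid":105.50,"user":"bob"}""")
      }
    }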

How should I architect my DB & API server for a turn based multiplayer iPhone board game? (thinking about nodejs, mongo, couch, etc)

I'm working on a turn based board game for iPhone and eventually Android. I'm using Appcelerator Titanium to develop it. My multiplayer design is similar to Words With Friends. Users take turns when ready, and then the opponent's game board is updated accordingly.
One of my needs is to have a messaging API which enables the two players' devices to update one another on the status of the game board after a move. I'm thinking of doing this with JSON, keeping a JSON object on the device which contains the location of all game board pieces at any given time. This is the object that will need to update on the local device and then send a change to the opponent's device after a move is made.
I've done APIs in the past for mobile platforms, and to do so I've used PHP with MySQL and sent JSON back and forth between the API server and the mobile device. That works just dandy for low numbers of concurrent users and generally non-massive apps. Here's hoping this one will get massive ;)
So now, instead of a general httpd server and the like, I'm starting to think about persistent sockets and if they're needed or not for my new game. I'm also thinking that it might be smart to forgo the big LAMP stack, and for scalability and maybe ease of development, to lean more towards a data flow of something like Mongo/Couch -> node.js -> iPhone. I'll be honest, it would be my first foray into a non-sql db and node.js as well.
Interested to hear others' takes and experiences on this, more options/thoughts, and whether I am thinking about it the right way, or just creating headaches for myself.
First of all, Node.js is awesome for writing reverse TCP proxies to NoSQL databases. You could let all the standard commands pass through but alter/extend their APIs with your own magic, e.g. making MongoDB speak HTTP or CouchDB speak a binary protocol over sockets.
When it comes to choosing a NoSQL solution for storing board game pieces and monitoring for player moves, I think either Redis or CouchDB is the best candidate.
CouchDB. It's fast, reliable, and can handle a lot of concurrent HTTP connections. It's probably the best option because, unlike Redis, it can broadcast a message when a document changes. The continuous changes API makes it super simple for you to have each player's app monitor for changes to their board. The request might look like:

    curl "$HOST/dbname/_changes?filter=app/gameboard&feed=continuous&gameid=38934&heartbeat=1000"

Each client will receive a JSON object per line in the response any time a pertinent document is changed. (And a blank newline every 1000 ms as a sort of keepalive.)
Redis. It uses a simple line-based protocol (like MemcacheD++) to talk over a socket and allows you to store Lists, Sets, Hashes with arbitrary--even binary--values. It's very fast because everything happens in memory but is persisted to disk asynchronously. But most of all you should evaluate it because it already has PubSub notifications baked in. Note that you'll have to explicitly publish move notifications over a channel the players share because Redis won't automatically publish when a key/value changes.
Since MongoDB does not have a mechanism for observing changes as they happen or doing pub/sub, I don't consider it a good option, though with extra effort you could make it work.
So to conclude, you may be able to replace "the big LAMP stack" with CouchDB alone, Redis alone, or either one placed behind a node app for filtering/extending the APIs they already provide into something that fits your game.
Best of luck!
I've just started learning Mongo, and it isn't hard to learn. Things like indexes and explain are there and work the same. When it comes to architecture, you want to think the opposite of SQL: instead of needing a good reason to de-normalize, you need to come up with a good reason to normalize. The guys at 10gen (who make Mongo) will say that thinking hierarchically is a more natural way of thinking about things, which I would agree with (tentatively). Finders feel sort of SQL-ish as well, although you will still use map-reduce for aggregation queries.
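To make the "good reason to normalize" point concrete, a board-game document in Mongo (via Casbah; collection and field names are invented) would simply embed the moves, and appending a move is one update on one document:

    import com.mongodb.casbah.Imports._

    object GameStore {
      def main(args: Array[String]): Unit = {
        val games = MongoClient()("boardgame")("games")

        // Denormalised: moves live inside the game document, not in a joined table.
        games.insert(MongoDBObject(
          "_id"     -> 38934,
          "players" -> MongoDBList("alice", "bob"),
          "moves"   -> MongoDBList(
            MongoDBObject("player" -> "alice", "piece" -> "Q", "to" -> "d4"))))

        // Appending the next move touches exactly one document, no joins involved.
        games.update(MongoDBObject("_id" -> 38934),
          $push("moves" -> MongoDBObject("player" -> "bob", "piece" -> "N", "to" -> "f6")))

        println(games.findOneByID(38934))
      }
    }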
From what I understand about couch, the big difference is there is a strong focus on the distributed replication thing. Mongo focuses more on performance over massive amounts of data (although they have autosharding and a great scaling story too). I would go mongo, unless you are actually going to use the distributed aspects of couch.
Node has got to be the coolest thing ever, and I think this would be a great application for it. I have zero experience with it, but from what I have read, it is great for loads of small requests, and scales up wonderfully. Idiomatic javascript lends itself quite well to the whole eventing model, and with v8 it runs just obscenely fast.

How best to integrate several systems?

OK, where I work we have a fairly substantial number of systems, written over the last couple of decades, that we maintain.
The systems are diverse in that multiple operating systems (Linux, Solaris, Windows), multiple databases (several versions of Oracle, Sybase, and MySQL), and even multiple languages (C, C++, JSP, PHP, and a host of others) are used.
Each system is fairly autonomous, even at the cost of entering the same data into multiple systems.
Management recently decided that we should investigate what it will take to get all the systems happily talking to each other and sharing data.
Keep in mind that while we can make software changes to any of the individual systems, a complete rewrite of any one system (or more) is not something management is likely to entertain.
The first thought of several of the developers here was the straightforward one: if system A needs data from system B, it should just connect to system B's database and get it. Likewise, if it needs to give B data, it should just insert it into B's database.
Due to the mess of databases (and versions) used, other developers were of the opinion that we should have one new database, combining the tables from all the other systems to avoid having to juggle multiple connections. By doing this they hope that we might be able to consolidate some tables and get rid of the redundant data entry.
This is about the time I was brought in for my opinion on the whole mess.
The whole idea of using the database as a means of system communication smells funny to me. Business logic will have to be placed into multiple systems (if System A wants to add data to System B, it had better understand B's rules concerning the data before doing the insert), several systems will most likely have to do some form of database polling to find changes to their data, and continuing maintenance will be a headache, as any change to a database schema now propagates to several systems.
My first thought was to take the time and write APIs/Services for the different systems, which once written could be easily used to pass/retrieve data back and forth. A lot of the other developers feel that is excessive and far more work than just using the database.
So what would be the best way to go about getting these systems to talk to each other?
Integrating disparate systems is my day job.
If I were you, I would go to great effort to avoid accessing System A's data from directly within System B. Updating System A's database from System B is extremely unwise. It is exactly the opposite of good practice to make your business logic so diffuse. You will end up regretting it.
The idea of the central database isn't necessarily bad ... but the amount of effort involved is probably within an order of magnitude of rewriting the systems from scratch. It is certainly not something I would attempt, at least in the form you describe. It can succeed, but it is much, much harder and it takes a lot more discipline than the point-to-point integration approach. It's funny to hear it suggested in the same breath as the 'cowboy' approach of just shoving data directly into other systems.
Overall your instincts seem pretty good. There are a couple of approaches. You mention one: implementing services. That's not a bad way to go, especially if you need updates in real time. The other is a separate integration application that is responsible for shuffling the data around. That's the approach I usually take, but usually because I can't change the systems I'm integrating to ask for the data it needs; I have to push the data in. In your case the services approach isn't a bad one.
One thing I would like to say that might not be obvious to someone coming to system integration for the first time is that every piece of data in your system should have a single, authoritative point of truth. If the data is duplicated (and it is duplicated), and the copies disagree with each other, the copy in the point of truth for that data must be taken to be correct. There is just no other way to integrate systems without having the complexity scream skyward at an exponential rate. Spaghetti integration is like spaghetti code, and it should be avoided at all costs.
Good luck.
EDIT:
Middleware addresses the problem of transport, but that is not the central problem in integration. If the systems are close enough together that one app can shove data directly into another, they're probably close enough that a service offered by one can be called directly by another. I wouldn't recommend middleware in your case. You might get some benefit from it, but that would be outweighed by the increased complexity. You need to solve one problem at a time.
Sounds like you may want to investigate Message Queuing and message-oriented middleware.
MSMQ and Java Message Service being examples.
It seems you are looking for opinions, so I will provide mine.
I agree with the other developers that writing an API for all the different systems is excessive. You would likely get it done faster and have much more control over it if you just take the other suggestion of creating a single database.
One of the challenges that you will have is to align the data in each of the different systems so that it can be integrated in the first place. It may be that each of the systems you want to integrate holds entirely different sets of data, but more likely the data overlaps. Before diving into writing APIs (which is the route I would take as well, given your description), I would recommend that you try to come up with a logical data model for the data that needs to be integrated. This data model will then help you leverage the data that you have in the different systems and make it more useful to the other systems.
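To make that concrete, the "logical data model" can be as simple as a small set of canonical types that every system is mapped to and from. A sketch (the names are entirely invented):

    // Canonical model: the integration layer only ever sees this shape,
    // regardless of which legacy schema the data came from.
    case class Customer(id: String, name: String, email: Option[String])

    trait SystemAdapter {
      def exportCustomers(): Seq[Customer]    // read from the legacy schema
      def importCustomer(c: Customer): Unit   // write into the legacy schema
    }

    // One adapter per system (billing, CRM, ...); with N systems you write N mappings
    // to the canonical model instead of N*(N-1) pairwise ones.
    class BillingAdapter extends SystemAdapter {
      def exportCustomers(): Seq[Customer] = Seq.empty   // stub: real code would query the DB
      def importCustomer(c: Customer): Unit = ()         // stub: real code would insert/update
    }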
I would also highly recommend an iterative approach to the integration. With legacy systems there is so much uncertainty that trying to design and implement it all in one go is too risky. Start small and work your way to a reasonably integrated system. "Fully integrated" is hardly ever worth aiming for.
Directly interfacing by pushing/poking databases exposes a lot of internal detail of one system to another. There are obvious disadvantages: upgrading one system can break the other. Moreover, there can be technical limitations in how one system can access the database of the other (consider how an application written in C on Unix will interact with a SQL Server 2005 database running on Windows 2003 Server).
The first thing you have to decide is the platform where the "master database" will reside, and the same for the middleware providing the much-needed glue. Instead of going towards API-level middleware integration (such as CORBA), I would suggest you consider message-oriented middleware. MS BizTalk, Sun's eGate, and Oracle's Fusion are some of the options.
Your idea of a new database is a step in the right direction. You might like to read a little bit on Enterprise Entity Aggregation pattern.
A combination of "data integration" with a middleware is the way to go.
If you are going towards Middleware + Single Central Database strategy, you might want to consider achieving this in multiple phases. Here's a logical stepped process which can be considered:
Implementation of services/APIs for different systems which expose the functionality for each system
Implementation of Middleware which accesses these APIs and provides an interface to all the systems to access the data/services from other systems (accesses data from central source if available, else gets it from another system)
Implementation of Central Database only, without data
Implementation of caching/data-storage services at the middleware level, which can store/cache data in the central database whenever that data is accessed from any of the systems; e.g., if System A's records 1-5 are fetched by System B through the middleware, the middleware's data-caching services can store these records in the centralized database, and the next time they will be fetched from the central database (a sketch of this read-through caching appears after this list)
Data cleansing can happen in parallel
You can also create an import mechanism to push data from multiple systems to the central database on a daily basis (automated or manual)
This way, the effort is distributed across multiple milestones and data is gradually stored in the central database on first-accessed-first-stored basis.
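The caching step mentioned above boils down to a read-through lookup: serve from the central database when the record is already there, otherwise fetch it from the owning system and store it centrally for the next caller. A sketch (all names are hypothetical):

    trait RecordSource {
      def fetch(id: String): Option[String]
    }

    class CentralDb extends RecordSource {
      private val rows = scala.collection.mutable.Map.empty[String, String]
      def fetch(id: String): Option[String] = rows.get(id)
      def store(id: String, value: String): Unit = rows += id -> value
    }

    // First-accessed-first-stored: a miss on the central DB falls through to the
    // owning system, and the result is written back so the next read is local.
    class CachingMiddleware(central: CentralDb, owningSystem: RecordSource) {
      def fetch(id: String): Option[String] =
        central.fetch(id).orElse {
          val fromSource = owningSystem.fetch(id)
          fromSource.foreach(central.store(id, _))
          fromSource
        }
    }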