Caching EAV data - XML or NoSQL / MongoDB?

I'm building a web app that relies heavily on the EAV pattern for storing data. This basically means that each attribute of an object has its own row in a massive database table. I'm using MySQL to store everything. This is a very simplified example of what I'm storing...
OBJECTS                     ATTRIBUTES
objId | type                objId | attribute | value
=============               =========================
1     | fruit               1     | color     | green
2     | fruit               1     | shape     | round
3     | book                2     | color     | red
I know some people hate EAV, but I need to be able to add new object attributes arbitrarily without modifying the database schema, and it's working very well for me so far.
As I think anyone building a system on an EAV data structure finds, the weakness of this approach is retrieving multiple objects together with each object's attributes. At the moment my app only displays 10 objects at a time, so I just query my EAV table 10 times (once for each object) and it's still very fast. However, I'd like to remove this limitation and allow hundreds of objects to be fetched in one go. I also want to be able to query objects in a more flexible way than I'm doing currently.
Doing this with SQL joins would be hideous, so I'm considering caching the data. On average the database gets about 300 reads for every 1 write, so I think it's a good candidate for caching.
So far these are the options I've come up with...
XML database column: Every time a write is performed, update an XML text column in the objects table containing all the object's attributes. This would work for reading the data quickly, but querying XML data hidden in a database table is messy.
XML file: Every time a write is performed, write an XML file to disk which contains each object and its attributes. This has the benefit that I can then use XQuery to query the objects.
NoSQL (eg. MongoDB): Perhaps I should have built the system on a schemaless database like MongoDB. Re-writing the entire app to use MongoDB would be quite time consuming, but it struck me that I could use it as a cache. So for example, every time data is written to the EAV store, the equivalent object would be updated in MongoDB which would then be used for reads and queries.
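As a rough illustration of that write-through idea (only a sketch: it assumes Python with pymysql and pymongo, and the table/collection names are hypothetical, not my real schema):

    import pymysql
    from pymongo import MongoClient

    mysql = pymysql.connect(host="localhost", user="app", password="secret", db="app")
    objects = MongoClient()["app_cache"]["objects"]

    def write_attribute(obj_id, attribute, value):
        # 1) Write to the authoritative EAV store in MySQL
        #    (assumes a unique key on (objId, attribute)).
        with mysql.cursor() as cur:
            cur.execute(
                "REPLACE INTO attributes (objId, attribute, value) VALUES (%s, %s, %s)",
                (obj_id, attribute, value),
            )
        mysql.commit()
        # 2) Mirror the change into the MongoDB document used for reads/queries.
        objects.update_one({"_id": obj_id}, {"$set": {attribute: value}}, upsert=True)

    def fetch_objects(query, limit=500):
        # Hundreds of objects come back in one round trip, no joins required,
        # e.g. fetch_objects({"color": "green"}).
        return list(objects.find(query).limit(limit))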
Originally I thought an XML file would be the best approach, but I can see the file getting really big and unmanageable. At the moment I'm leaning towards using MongoDB. I know it seems crazy running two database servers for one app, but I think it could work in my case.
I'd love to hear your thoughts on this.

I see only two ways, both of which were mentioned in the comments.
First, you can actually migrate to a document-oriented DB like Mongo - it is a natural alternative to EAV. Since there are no JOINs or other relational logic, it will be very fast and scale easily. (So you may be able to avoid using a cache at all.)
Second, you can use a dedicated caching tool like Redis, Mongo, or Memcached to save every query result for some period of time.
But I want to turn our minds to the future of this system. What load and scaling do you plan for?
If you want to reduce system load, I think the best way is to migrate to a document-oriented DB.
Or, if you want results immediately (cached data for reading), that can be achieved with a caching tool, even [if possible] at the network level (for example, nginx supports memcached out of the box).
So, as usual, you should find a balance between one-time and continuous costs.
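For the second option, a minimal sketch of caching assembled objects in Redis with a TTL (assuming Python/redis-py; the key scheme and the load_object_from_eav helper are hypothetical):

    import json
    import redis

    r = redis.Redis()
    CACHE_TTL = 60  # seconds; with ~300 reads per write, even a short TTL helps a lot

    def get_object(obj_id, load_object_from_eav):
        key = f"obj:{obj_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        obj = load_object_from_eav(obj_id)        # the existing per-object EAV SELECT
        r.setex(key, CACHE_TTL, json.dumps(obj))  # cache the assembled object
        return obj

    def invalidate(obj_id):
        # Call from the write path so readers don't see stale data for long.
        r.delete(f"obj:{obj_id}")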

Related

PostgreSQL - store real-time data of multiple subjects

Scenario:
I'm trying to build a real-time monitoring webpage for ship operations
I have 1,000 - 10,000 ships operating
All ships are sending real-time data to the DB, 24 hours a day, for 30 days
Each new data point inserted has a dimension of 1 row × 100 columns
When loading the webpage, all historic data of a chosen ship will be fetched and visualized
The last row of the ship's real-time data table will be queried and fetched by the webpage to update the real-time screen
Each ship has its own non-real-time data, such as ship dimensions, cargo, attendants, etc...
So far I've been thinking about creating a new schema for each ship. Something like this:
public_schema
ship1_schema
ship2_schema
ship3_schema
    |--- realtime_table
    |--- cargo_table
    |--- dimensions_table
    |--- attendants_table
ship4_schema
ship5_schema
Is this a good way to store each individual ship's real-time data and fetch it from a web server? What other ways would you recommend?
For the time-series side, I'm already using a PostgreSQL extension called TimescaleDB. My question is rather about how to store the time-series data when I have many ships. Is it a good idea to separate each ship's RT data by constructing a new schema?
++ I'm pretty new to PostgreSQL, and some of the advice I got from other people was too advanced for me... I would greatly appreciate it if you could suggest a method and briefly explain what it is.
Personally, this seems like the wrong way to go about it.
In this case I would have all the ship data in one table, and from there I would include a shipid in:
realtime_table
cargo_table
dimensions_table
attendants_table
From there, if you believe that your data will reach a large volume, you have the following choices.
Create indexes on the fields that are important to query; the Postgres query planner is very good at making use of them.
Recent Postgres versions implement declarative table partitioning based on criteria you provide, without having to use table inheritance.
Since you will be needing live data on the web page, you can use the LISTEN command in Postgres for when data is received from the ship (unless you have another way of sending this data to the web server, like WebSockets).
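A minimal sketch of that layout, assuming Python with psycopg2; the table, column and channel names here are assumptions, not from the question, and the real ingest path would fire a NOTIFY after each insert:

    import select
    import psycopg2

    conn = psycopg2.connect("dbname=ships user=app")
    conn.autocommit = True
    cur = conn.cursor()

    # Single shared table keyed by ship_id, plus an index for per-ship queries.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS realtime_table (
            ship_id integer     NOT NULL,
            time    timestamptz NOT NULL,
            payload jsonb       NOT NULL  -- or ~100 real columns
        );
        CREATE INDEX IF NOT EXISTS realtime_ship_time_idx
            ON realtime_table (ship_id, time DESC);
    """)

    # The ingest path would run e.g.  NOTIFY ship_updates, '42';  after inserting
    # a row for ship 42, and this loop pushes the fresh row to the web page.
    cur.execute("LISTEN ship_updates;")
    while True:
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timed out, poll again
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print("new data for ship", note.payload)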
Adding a bit of color here - if you are already using the TimescaleDB extension, you won't need to use table partitioning, since TimescaleDB will handle that for you automatically.
The approach of storing all ship data in a single table with a metadata table outside of the time series table is a common practice. As long as you build the correct indexes, as others have suggested, you should be fine. An important thing to note is that if you (for example) build an index on time, you want to make sure to include time in your queries to benefit from constraint exclusion.
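For concreteness, a hedged sketch of how the table above could be promoted to a TimescaleDB hypertable (continuing the psycopg2 example; this assumes the TimescaleDB extension is already installed and preloaded on the server):

    import psycopg2

    with psycopg2.connect("dbname=ships user=app") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
        # Chunks realtime_table by the time column automatically.
        cur.execute(
            "SELECT create_hypertable('realtime_table', 'time', if_not_exists => TRUE);"
        )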

Why would using a NoSQL / document DB (MongoDB) as a relational database be inferior?

I have recently been introduced to MongoDB and I've come to like it a lot (compared to the MySQL I used for all my projects).
However, in certain situations, storing my data with documents "linking" to each other with simple IDs makes more sense (to reduce duplicated data).
For example, I may have Country and User documents, where a user's location is actually an ID to a Country (since a Country document includes more data, hence duplicating Country data in each user makes no sense).
What I am curious about is.. why would MongoDB be inferior compared to using a proper relationship database?
Is it because I can save round trips by doing joins (as opposed to doing two separate queries with MongoDB)?
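To make the two-query pattern concrete, here is a hypothetical pymongo sketch (collection and field names are assumptions); a relational database would resolve the same relationship with a single JOIN:

    from pymongo import MongoClient

    db = MongoClient()["app"]

    # Country is stored once; users only hold a reference to it.
    country_id = db.countries.insert_one({"name": "Iceland", "iso": "IS"}).inserted_id
    db.users.insert_one({"name": "alice", "country_id": country_id})

    user = db.users.find_one({"name": "alice"})
    country = db.countries.find_one({"_id": user["country_id"]})  # second round trip
    print(user["name"], "lives in", country["name"])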
That's a good question!
I would say there is definitely nothing wrong with using a NoSQL DB for the type of data you have described. For simple use cases it will work perfectly well.
The only point is that relational databases were designed long ago to serve the purpose of storing and querying well-structured data with proper relations defined. Hence, for large amounts of well-structured data, the performance and the features provided will be far greater than those of a NoSQL database. They are more mature; it's their ball game.
On the other hand, NoSQL databases were designed to handle very large amounts of unstructured data and have out-of-the-box support for scaling in a distributed environment. So it's a completely different ball game.
They basically treat data differently and hence have different strategies and execution plans to fetch a given piece of data.
MongoDB was designed from the ground up to be scalable over multiple servers. When a MongoDB database gets too slow or too big for a single server, you can add additional servers by making the larger collections "sharded". That means that the collection is divided between different servers and each one is responsible for managing a different part of the collection.
The reason why MongoDB doesn't do JOINs is that it is impossible to have JOINs perform well when one or both collections are sharded over multiple nodes. A JOIN requires comparing each entry of table/collection A with each entry of table/collection B. There are shortcuts for this when all the data is on one server. But when the data is distributed over multiple servers, large amounts of data need to be compared and synchronized between them. This would require a lot of network traffic and make the operation very slow and expensive.
Is it correct that you have only two tables, country and user? If so, it seems to me the only data duplicated is a foreign key, which is not a big deal. If more than that is duplicated, then I would question the DB design itself.
In concept, you can do it in NOSQL but why? Just because NOSQL is new? OK, then do it to learn but remember, "if it ain't broke, don't fix it." Apparently the application is already running on relational. If the data is stored in separate documents in MongoDB and you want to interrelate them, you will need to use a link, which will be more work than a join and be slower. You will have to store a link, which would be no better than storing the foreign key. Alternatively, you can embed one document in another in MongoDB, which might even increase duplication.
If it is currently running on MySQL then it is not running on distributed servers, so Mongo's use of distributed servers is irrelevant. You would have to add servers to take advantage of that. If the tables are properly indexed in relational, it will not have to search through large amounts of data.
However, this is not a complex application and you can use either. If the data is stored in an MPP environment with relational, it will run very well and will not need to scan large amounts of data at all. There are two requirements, however, in choosing a partitioning key in MPP: 1. pick one that will achieve an even distribution of data; and 2. pick a key that allows collocation of data. I recommend you use the same key as the partitioning key (shard key) in both files.
As much as I love MongoDB, I don't see the value in moving your app.

Redis GET vs. SQL SELECT

I am pretty new to NoSQL, but I always liked the idea of it. I took a look at Redis, and it raised a few questions about the best ways of storing and retrieving multiple hashes.
Assuming the following scenario:
Store a list of objects (redis 'Hashes') and select them by their timestamp.
To achieve this in SQL, it would require one table and two simple queries (INSERT & SELECT).
Trying to do this in Redis, I ended up creating the following structure:
Key object:$id (hash) containing the object
Key index:timestamp:$id (sorted set)
score equals timestamp and value includes id
While I can live with the additional maintenance work of two keys instead of one table (SQL), I am curious about the process of selecting multiple objects:
ZRANGEBYSCORE index:timestamp:$id timestampStart timestampEnd
This returns an array of all IDs which got created between timestampStart and timestampEnd. To get the object itself I am requesting every single one by:
GET object:$id
Is this the right way of doing it?
In comparison with an SQL database: is it still appreciably faster, or might it even become slower because of the high number of GETs?
A ZRANGEBYSCORE costs O(log(N) + M) where N = |items in your set| and M = |items you're selecting|. So, doing the ZRANGEBYSCORE and then M GET operations is just O(log(N) + M + M) = O(log(N) + M) and would at most be twice as slow. The network back and forth could have been a major slowdown, but since each of your GETs is an independent operation, you can just pipeline them. You can also put the whole thing in a Lua script and have just one round trip, which would be optimal. I'd say with 99% certainty this would be faster than doing the same thing in SQL.
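As a rough illustration of the pipelined version in redis-py (a single shared sorted set index:timestamp is assumed here, with the object ids as members; since object:$id is declared as a hash, HGETALL stands in for GET):

    import redis

    r = redis.Redis()

    def objects_between(ts_start, ts_end):
        ids = r.zrangebyscore("index:timestamp", ts_start, ts_end)
        pipe = r.pipeline()
        for obj_id in ids:
            pipe.hgetall(b"object:" + obj_id)  # queued locally, not sent yet
        return pipe.execute()                  # all fetches in one round trip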
Also, if this is a very frequent operation for you, you can get even more of a speedup by storing the entire object in your sorted set instead of just the id. You'd have member = the object encoded as JSON, score = timestamp. This would save you O(M) on your operation by not needing to do any GETs.
Whether or not this is a good way of doing things really depends on your use case. How much speed do you really need, and how important are other features of a traditional database to you? Remember, Redis is much more a set of data structures accessible by clients than a traditional database, and it must store everything in RAM. To know whether it's the right thing for you, we'd need more information.

Creating a JSON Store For iPhone

We have loads of apps where we fetch data from remote web services as JSON and then use a parser to translate that into a Core-Data model.
For one of our apps, I'm thinking we should do something different.
This app has read-only data, which is volatile and therefore not cached locally for very long. The JSON is deeply hierarchical with tons of nested "objects". Documents usually contain no more than 20 top level items, but could be up to 100K.
I don't think I want to create a Core Data model with 100's of entities, and then use a mapper to import the JSON into it. It seems like such a song and dance. I think I just want to persist the JSON somewhere easy, and have the ability to query it. MongoDB would be fine, if it ran on iPhone.
Is there a JSON document store on the iPhone that supports querying?
Or, can I use some JSON parser to convert the data to some kind of persistent NSDictionary and query that using predicates?
Or perhaps use SQLite as a BLOB store with manually created indexes on the JSON structures?
Or, should I stop whining, and use Core Data? :)
Help appreciated.
When deciding what persistence to use, it's important to remember that Core Data is first and foremost an object graph management system. Its true function is to create the runtime model layer of Model-View-Controller patterned apps. Persistence is actually a secondary and even optional function of Core Data.
The major modeling/persistence concerns are the size of the data and the complexity of the data. So, the relative strengths and weaknesses of each type of persistence would break down like this:
        _______________________________
       |               |               |
   2   |      SQL      |   Core Data   |   4
   s   |               |               |
   i   |_______________|_______________|
   z   |               |               |
   e   |  Collection   |   Core Data   |   3
   1   |  plist/xml    |               |
       |_______________|_______________|
                 Complexity--->
To which we could add a third, lesser dimension: volatility, i.e. how often the data changes.
(1) If the size, complexity and volatility of the data are low, then using a collection, e.g. an NSArray, NSDictionary or NSSet of serialized custom objects, would be the best option. Collections must be read entirely into memory, so that limits their effective persistence size. They have no complexity management, and all changes require rewriting the entire persistence file.
(2) If the size is very large but the complexity is low, then SQL or another database API can give superior performance. E.g. an old-fashioned library index card system: each card is identical, the cards have no relationships between themselves, and the cards have no behaviors. SQL or other procedural DBs are very good at processing large amounts of low-complexity information. If the data is simple, then SQL can handle even highly volatile data efficiently. If the UI is equally simple, then there is little overhead in integrating the UI into the object-oriented design of an iOS/MacOS app.
(3) As the data grows more complex, Core Data quickly becomes superior. The "managed" part of "managed objects" manages complexity in relationships and behaviors. With collections or SQL, you have to manually manage complexity and can find yourself quickly swamped. In fact, I have seen people trying to manage complex data with SQL who end up writing their own miniature Core Data stack. Needless to say, when you combine complexity with volatility Core Data is even better, because it handles the side effects of insertions and deletions automatically.
(Complexity of the interface is also a concern. SQL can handle a large, static, singular table, but when you add in hierarchies of tables which can change on the fly, SQL becomes a nightmare. Core Data, NSFetchedResultsController and UITableViewController/delegates make it trivial.)
(4) With high complexity and high size, Core Data is clearly the superior choice. Core Data is highly optimized, so increases in graph size don't bog things down as much as they do with SQL. You also get highly intelligent caching.
Also, don't confuse "I understand SQL thoroughly but not Core Data" with "Core Data has a high overhead." It really doesn't. Even when Core Data isn't the cheapest way to get data in and out of persistence, its integration with the rest of the API usually produces superior results when you factor in speed of development and reliability.
In this particular case, I can't tell from the description whether you are in case (2) or case (4). It depends on the internal complexity of the data AND the complexity of the UI. You say:
I don't think I want to create a Core Data model with 100's of entities, and then use a mapper to import the JSON into it.
Do you mean actual abstract entities here, or just managed objects? Remember, entities are to managed objects what classes are to instances. If the former, then yes, Core Data will be a lot of work up front; if the latter, it won't be. You can build up very large, complex graphs with just two or three related entities.
Remember also that you can use configuration to put different entities into different stores even if they all share a single context at runtime. This can let you put temporary info into one store, use it like more persistent data and then delete the store when you are done with it.
Core Data gives you more options than might be apparent at first glance.
I use SBJson to parse JSON to NSDictionaries, then save them as .plist files using [dict writeToFile:saveFilePath atomically:YES]. Loading is just as simple: NSDictionary *dict = [NSDictionary dictionaryWithContentsOfFile:saveFilePath]. It's fast, efficient and easy. No need for a database.
JSON Framework is one. It'll turn your JSON into native NSDictionary and NSArray objects. I don't know anything about its performance on a large document like that, but lots of people use it and like it. It's not the only JSON library for iOS, but it's a popular one.

Too much data duplication in mongodb?

I'm new to this whole NoSQL stuff and have recently been intrigued with MongoDB. I'm creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense, i.e. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so MongoDB has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document, Users attend Events, Events have User attendees. I decided to embed a list of Events with limited data into the User objects. I embedded a list of Users also into the Event objects as their "attendees". The problem here is that now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NoSQL way to do things. Retrieval is fast, but the downside is that when I update the main User document, I also need to go into the Event objects, possibly find all references to that user, and update those as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where it's a performance hit that you should break normalization and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes which can make normalizing easier, like a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid instead of a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick, and it will help you normalize your data better.
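A hypothetical pymongo sketch of such a relation collection (the names are assumptions), which behaves much like a join table:

    from pymongo import MongoClient

    db = MongoClient()["site"]
    db.user_events.create_index("user_id")
    db.user_events.create_index("event_id")

    def attend(user_id, event_id):
        # One small relation document per (user, event) pair; nothing to keep in sync.
        db.user_events.update_one(
            {"user_id": user_id, "event_id": event_id},
            {"$setOnInsert": {"user_id": user_id, "event_id": event_id}},
            upsert=True,
        )

    def attendees(event_id):
        user_ids = [rel["user_id"] for rel in db.user_events.find({"event_id": event_id})]
        return list(db.users.find({"_id": {"$in": user_ids}}))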
I like to weigh the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. reading out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically, what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when the data is normalized or flattened (or maybe some partial combination of the two... there are lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if the savings of keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor a solution where you flatten over normalizing and take the hit on the update.
If you require updates to be quick and accessible immediately, then favor normalization.
If you require fast lookups but don't require perfectly up-to-date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, updates are rare, and your updates do not necessarily need to be accessible immediately or need transaction-level locking that guarantees they went through 100% of the time (i.e. were written to disk), you can consider writing your updates to a queue and processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are a lot of other ideas that you can employ. There are a lot of great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front of my data layer. When I query Mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to take the time it takes to invalidate data, and to track data in the cache that is getting updated, into consideration in your scaling factors.) Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."
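A rough sketch of that read-through cache with invalidation, assuming Python with redis-py and pymongo (the key scheme and field names are made up for illustration):

    import json
    import redis
    from bson import ObjectId
    from pymongo import MongoClient

    r = redis.Redis()
    db = MongoClient()["site"]

    def get_event_flat(event_id):
        key = f"event:flat:{event_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)                   # cache hit: no Mongo reads
        event = db.events.find_one({"_id": ObjectId(event_id)})
        users = db.users.find({"_id": {"$in": event.get("attendee_ids", [])}}, {"name": 1})
        flat = {"title": event["title"], "attendees": [u["name"] for u in users]}
        r.set(key, json.dumps(flat))                    # store the flattened view
        return flat

    def on_event_update(event_id):
        r.delete(f"event:flat:{event_id}")              # invalidate on write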
Try adding an IList of UserEvent as a property on your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.