Should I use Core Data or SQLite to represent a graph structure with weighted edges? - iphone

I am building an iPhone application that will be using a weighted graph (probably around 1000 nodes and 500-700 relationships; ballpark estimate, could be quite wrong).
Core Data would be wonderful, since it's easy to work with and is optimized for the iPhone. At the same time, the graph's edges need to be weighted. I could add a level of indirection (edit/clarification: I would represent the weighted edges as relationships) in Core Data, but I feel that might defeat the point of the performance gain.
SQLite would let me do it without any extra indirection; the problem then becomes performance.
Another consideration is to use Core Data to store the objects themselves and keep only the weights in SQLite tables, building custom Core Data objects that, when fetched or created, create the necessary links in SQLite. I would then use SQLite alone for the calculation (since the calculation would only be traversing edges and looking at weights, I don't need anything but the final result/ID of the object), and afterwards look up the actual object in Core Data.
Also, there will be a server that takes all of the users' graphs into consideration in a much larger graph. So it might be easier to use SQLite straight up, since replicating the information on the beastly server would be easier :-)
Thanks for the help.

At first glance, Core Data is wonderful for your use case because it's an object graph management framework (with persistence). Its lazy-loading and caching features allow you to load a partial graph into memory, which is usually desirable when you have a large graph.
Core Data doesn't seem to support weighted relationships out of the box (I'm not a Core Data expert at all). If "a level of indirection" means Core Data objects representing relationships, I agree it will be a performance hit.
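If you did reify the edges in Core Data, the model might look roughly like this. This is only a sketch; the Node/Edge entity names and their properties are made up for illustration:

```objc
#import <CoreData/CoreData.h>

@class Node;

// Hypothetical model: an Edge entity reifies the weighted connection.
@interface Edge : NSManagedObject
@property (nonatomic, strong) NSNumber *weight;  // the edge weight
@property (nonatomic, strong) Node *source;      // to-one relationship
@property (nonatomic, strong) Node *target;      // to-one relationship
@end

@interface Node : NSManagedObject
@property (nonatomic, strong) NSString *name;
@property (nonatomic, strong) NSSet *outgoing;   // to-many, inverse of Edge.source
@property (nonatomic, strong) NSSet *incoming;   // to-many, inverse of Edge.target
@end
```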
For performance and portability reasons, straight SQLite may be the better choice, because you have complete control over the schema and indices.
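For instance, a minimal sketch of a weighted-edge table through the SQLite C API (table and column names are made up):

```objc
#import <sqlite3.h>
#import <Foundation/Foundation.h>

// Create a weighted-edge table plus an index for reverse lookups.
// The composite primary key already indexes lookups by source node.
static void createEdgeTable(sqlite3 *db) {
    const char *sql =
        "CREATE TABLE IF NOT EXISTS edge ("
        "  source INTEGER NOT NULL,"
        "  target INTEGER NOT NULL,"
        "  weight REAL    NOT NULL,"
        "  PRIMARY KEY (source, target));"
        "CREATE INDEX IF NOT EXISTS edge_target ON edge (target);";
    char *errmsg = NULL;
    if (sqlite3_exec(db, sql, NULL, NULL, &errmsg) != SQLITE_OK) {
        NSLog(@"SQLite error: %s", errmsg);
        sqlite3_free(errmsg);
    }
}
```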
I vote for your idea of storing only the weights in SQLite. As I understand it, this lets you take advantage of Core Data's features while keeping the performance-critical work efficient.
The author of NetNewsWire has written a blog article explaining why he switched away from Core Data (mostly for performance and flexibility reasons), which may be helpful for you: http://inessential.com/2010/02/26/on_switching_away_from_core_data

Related

What is a graph database?

While looking at options for an embedded NoSQL database written in Java, graph databases came up. What is a graph database (especially in contrast to a key-value store and a document-oriented database), and when would I use one (and when not)?
I learned a little bit about them in school (a long time ago). Relational hadn't quite taken over the world yet, but it was close, so graph databases got a cursory mention. IIRC, they were pretty much dead at the time. I'm not sure how informative this will really be, but I'll put it out there in case it helps somebody.
Basically, if what I recall is true, a graph database is essentially a graph. You retrieve data (nodes) from the graph, and then to find related information, you would traverse links (edges) to related data in the graph structure.
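As a toy in-memory illustration (all names made up), the "query" is following references rather than performing a self-join:

```objc
#import <Foundation/Foundation.h>

// A node holds its data and direct references to its neighbours.
@interface GraphNode : NSObject
@property (nonatomic, copy) NSString *label;
@property (nonatomic, strong) NSMutableArray *neighbours;
@end

@implementation GraphNode
@end

// "Query": find everything one hop away -- a traversal, not a self-join.
NSArray *relatedTo(GraphNode *node) {
    return node.neighbours;
}
```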
Other than the obvious case where your data is graph-like, and where a graph database might be faster or more natural to use, I can't recall any advantages. I don't recall any of the disadvantages either, but I would suspect it might do poorly at the sorts of things relational databases do well (i.e. cranking through large sets of tuples).
Several sources are available to answer the "what" question, including this one:
http://www.infinitegraph.com/what-is-a-graph-database.html
When should you use a graph database?
If your data contains a lot of many-to-many relationships, if recursive self-joins are too costly or limiting for your application and scaling needs, and/or your primary objective is quickly finding connections, patterns and relationships between the objects in your data.
Graph databases are useful in scenarios where the information has an inherently graph-like nature, such as social networks, bibliographic databases like Wikipedia, fraud detection, media analysis, recommendation, and biological network analysis. In these scenarios, the desired result is typically not just a flat list of items but the set of entities that satisfy a given structural constraint.
Graph databases are useful because:
Relationships between entities are implicit in the model
They are more flexible for managing unknown or dynamic schemas
They favor structural and navigational queries
They are more efficient at solving network operations

Too much data duplication in mongodb?

I'm new to this whole NoSQL stuff and have recently been intrigued by MongoDB. I'm creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I've been reading up a lot on how to properly design a document-model database, and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site, and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense: you stick embedded objects into your document so it's fast to read, with no joins; but of course you can't always embed, so MongoDB has the concept of a DbReference, which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document. Users attend events, and events have user attendees. I decided to embed a list of events with limited data into the User objects, and I also embedded a list of users into the Event objects as their "attendees". The problem is that now I have to keep the users in sync with the list of users that is also embedded in the Event object. As I read it, this seems to be the preferred approach and the NoSQL way to do things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, possibly find all references to that user, and update those as well.
So the question I have is: is this a pretty common problem people need to deal with? How often does this problem have to crop up before you start saying "maybe the NoSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage, because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store data in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where normalization costs you performance that you should break it and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes, which can make normalizing easier, as in a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events, analogous to a join table in a tabular data store. Index the event and user fields, and it should be pretty quick, and it will help you normalize your data better.
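Purely to illustrate the two shapes being discussed (the embedding from the question vs. the relation collection suggested here); all names are made up, and the documents are written as Objective-C literals only to keep this page in one language, since in MongoDB they would be BSON/JSON:

```objc
#import <Foundation/Foundation.h>

void documentShapes(void) {
    // Embedded shape from the question: fast to read, but every embedded
    // copy of a user must be rewritten when that user changes.
    NSDictionary *event = @{
        @"_id":   @"evt42",
        @"title": @"Launch party",
        @"attendees": @[ @{ @"userId": @"u1", @"name": @"Alice" } ]
    };

    // Normalized alternative: one small document per (user, event) pair
    // in a relation collection, indexed on both fields -- the analogue
    // of a join table in an RDBMS.
    NSDictionary *attendance = @{
        @"userId":  @"u1",
        @"eventId": @"evt42"
    };

    NSLog(@"%@ %@", event, attendance);
}
```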
I like to weigh the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. read out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data, and get a good gut feeling about how much work is required.
Basically, what I do is first try to predict how many updates a record will get vs. how often it's read. Then I try to predict the cost of an update vs. a read, both normalized and flattened (or maybe some partial combination of the two; there are lots of optimization options). I can then judge the savings of keeping it flat against the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I keep it flat.
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on updates.
If you require updates to be quick and visible immediately, then favor normalization.
If you require fast lookups but don't require perfectly up-to-date data, consider building your flattened views from the normalized data in batch jobs (possibly using map/reduce).
If your queries need to be fast, updates are rare, and updates do not need to be visible immediately or guaranteed written to disk 100% of the time with transaction-level locking, you can consider writing your updates to a queue and processing them in the background (there's a sketch of this pattern after these tips). In this model, you will probably have to deal with conflict resolution and reconciliation later.
Profile different models. Build a data-query abstraction layer in your code (like an ORM, in a way) so you can refactor your data store structure later.
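The queued-writes tip above might look something like this. The store protocol and its method are hypothetical stand-ins for whatever rewrites the embedded copies of a user:

```objc
#import <Foundation/Foundation.h>

@protocol UserStore <NSObject>
- (void)updateUserName:(NSString *)name forUserId:(NSString *)userId;
@end

@interface WriteBehindUpdater : NSObject
@property (nonatomic, strong) id<UserStore> store;
@end

@implementation WriteBehindUpdater {
    dispatch_queue_t _writeQueue; // serial, so updates apply in order
}

- (instancetype)init {
    if ((self = [super init])) {
        _writeQueue = dispatch_queue_create("com.example.writes",
                                            DISPATCH_QUEUE_SERIAL);
    }
    return self;
}

- (void)userDidRename:(NSString *)userId to:(NSString *)newName {
    dispatch_async(_writeQueue, ^{
        // Runs in the background; readers may briefly see stale embedded
        // copies until every reference has been rewritten.
        [self.store updateUserName:newName forUserId:userId];
    });
}
@end
```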
There are lots of other ideas you can employ. There are a lot of great blogs online that go into this, like highscalability.com, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcached. I would put one of those products in front of my data layer: when I query Mongo (which stores everything normalized), I use the data to construct a flattened representation and store it in the cache; when I update the data, I invalidate any cache entries that reference what I'm updating. (Although you have to factor the time it takes to invalidate data, and to track what's being updated in the cache, into your scaling considerations.) Someone once said, "The two hardest things in Computer Science are naming things and cache invalidation."
Try adding an IList<UserEvent> property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.

What's the best way to store static data in an iOS app?

My app has a considerable amount of data that it needs to access, but that will never be changed by the app. Currently I'm using this data in other applications as JSON files and SQL databases, but neither seems very straightforward to use on iOS.
I don't want to use Core Data, which provides tons of unnecessary functionality and complexity.
Would it be a good idea to store the data in a property list file and build an accessor class? Are there any simple ways to use SQLite without going the Core Data route?
You can only use a plist if the amount of data is relatively small. Plists are loaded entirely into memory, so you can really only use them if you can keep all the objects the plist creates in memory at once for as long as you need them.
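For example, reading a bundled plist (file name assumed) is a single call that materializes the whole object tree:

```objc
// This one call creates every object in the plist in memory at once.
NSString *path = [[NSBundle mainBundle] pathForResource:@"StaticData"
                                                 ofType:@"plist"];
NSDictionary *data = [NSDictionary dictionaryWithContentsOfFile:path];
```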
Core Data has a learning curve, but in use it is usually less complex than SQL. In most cases the "simpler" SQL leads to more coding, because you end up duplicating much of the functionality of Core Data to shoehorn procedural SQL into an object-oriented API. You have to manually manage the memory used by all the data by tracking retention, and you have to write a lot of SQL code every time you want data. I've updated several apps from SQL to Core Data, and in every case the Core Data implementation was smaller and cleaner than the SQL one.
Neither is the memory or processor "overhead" any larger. Core Data is highly optimized; in most cases, off-the-shelf Core Data is more efficient than hand-tuned SQL. One minor missed optimization in your SQL usually destroys any theoretical advantage it might have.
Of course, if you're already highly skilled at managing SQL in C, then you personally might get the app to market more quickly by using SQL. However, if you're wondering what you should plan to use in general on Apple platforms, Core Data is almost always the answer, and you should take the time to learn it.
You can just use SQLite directly without the overhead of Core Data using the SQLite C API.
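A minimal sketch of that approach (remember to link against libsqlite3; the database path, table, and column names here are assumptions):

```objc
#import <sqlite3.h>

sqlite3 *db = NULL;
if (sqlite3_open([dbPath UTF8String], &db) == SQLITE_OK) {
    sqlite3_stmt *stmt = NULL;
    // Prepare once, step through the rows, then clean up.
    if (sqlite3_prepare_v2(db, "SELECT id, name FROM items;", -1,
                           &stmt, NULL) == SQLITE_OK) {
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            int itemId = sqlite3_column_int(stmt, 0);
            const unsigned char *name = sqlite3_column_text(stmt, 1);
            NSLog(@"%d: %s", itemId, name);
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
}
```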
Here is a tutorial I found on your use-case - simply loading some data from an SQLite database. Hope this helps.
Depending on the type of your data, its size, and how often it changes, you may want to keep things simple and use a property list. Otherwise, using SQLite (as covered in Jergason's answer) would be where I'd go. That said, if you have a relatively small set (fewer than a couple hundred items) of basic types (arrays, dictionaries, numbers, strings) that doesn't change frequently, then a property list is the better choice in my opinion.
As an example, in one of my games I create the levels from a single property list per difficulty. Since there are only so many levels per difficulty (99), and each has a small set of parameters (number of elements in play, their initial positions, mass, etc.), it makes sense, and I avoid having to deal with SQLite directly or, worse yet, setting up and maintaining Core Data.
What do you mean by "best"? What kind of data?
If it's a bunch of objects, then JSON or (binary) plist aren't terrible formats, since you'll want the whole thing loaded in memory to walk the object graph. Compare space efficiency and loading performance to pick which one to use.
If it's a bunch of binary blobs, then store the blobs in a big file, memory-map the file (NSDataReadingMapped a.k.a. NSMappedRead), and use indexes into the blobs. iOS frameworks use a mixture of these (e.g. there are a lot of .pngs, but also "other.artwork" which just contains raw image data).
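Roughly like this; the offset and length for each blob would come from the index you store alongside (variable names assumed):

```objc
// Memory-map the blob file: pages are only read from flash as touched.
NSError *error = nil;
NSData *blobFile = [NSData dataWithContentsOfFile:blobPath
                                          options:NSDataReadingMappedIfSafe
                                            error:&error];
// Slice one blob out by its stored offset and length.
NSData *blob = [blobFile subdataWithRange:NSMakeRange(offset, length)];
```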
You can also use NSKeyedArchiver and friends if your classes implement the NSCoding protocol, but there's some object graph management overhead and the plist format it produces isn't exactly nice to work with.
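A minimal sketch of the NSCoding route, assuming a hypothetical Item class (items and path assumed):

```objc
#import <Foundation/Foundation.h>

@interface Item : NSObject <NSCoding>
@property (nonatomic, copy) NSString *name;
@end

@implementation Item
- (void)encodeWithCoder:(NSCoder *)coder {
    [coder encodeObject:self.name forKey:@"name"];
}
- (instancetype)initWithCoder:(NSCoder *)coder {
    if ((self = [super init])) {
        _name = [[coder decodeObjectForKey:@"name"] copy];
    }
    return self;
}
@end

// Archive to disk and read back.
[NSKeyedArchiver archiveRootObject:items toFile:path];
NSArray *restored = [NSKeyedUnarchiver unarchiveObjectWithFile:path];
```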

Cocoa Touch Data Persistence

I'm experimenting with Core Data, plist files, flat files and sqlite.
I know the surface-level differences (i.e. the APIs), but I can't seem to differentiate them in terms of efficiency for small data sets.
What I'm trying to get a feel for is which persistence model is best for which situation.
For small data sets, if you need read-write capability, you should go with NSUserDefaults: it gives you key-value storage and retrieval without too much hassle.
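For instance (the key name is made up):

```objc
NSUserDefaults *defaults = [NSUserDefaults standardUserDefaults];
[defaults setInteger:42 forKey:@"highScore"];            // write
NSInteger score = [defaults integerForKey:@"highScore"]; // read
```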
If you need read-only access, plist files are a viable option, as they keep the key-value abstraction and offer an accessible API to work with.
Flat files would be recommended if you need a persistence model other than key-value; otherwise they would just mean reinventing the wheel.
SQLite fits the case where your data is organized in a strongly relational manner and, instead of key-value access, you'd rather have the power of SQL to work directly with your data.
If, for your data set, however small it may be, managing low-level storage and retrieval would be an unnecessary inconvenience, then you could choose Core Data. With Core Data, code can retrieve and manipulate data purely at the object level without worrying about the details of storage and retrieval, so you stay focused on your domain logic rather than fitting it to the storage and data-manipulation logic.

Optimal way to persist an object graph to flash on the iPhone

I have an object graph in Objective-C on the iPhone platform that I wish to persist to flash when closing the app. The graph has about 100k-200k objects and contains many loops (by design). I need to be able to read/write this graph as quickly as possible.
So far I have tried using NSCoder. This not only struggles with the loops but also takes an age, and a significant amount of memory, to persist the graph (possibly because an XML document is used under the covers). I have also used an SQLite database, but stepping through that many rows also takes a significant amount of time.
I have considered using Core Data, but fear I will hit the same issues as with SQLite or NSCoder, as I believe Core Data's backing stores work in the same way.
So is there any other way I can handle the persistence of this object graph in a lightweight way? Ideally I'd like something like Java's serialization. I've been thinking of trying Tokyo Cabinet, or writing the memory occupied by a bunch of C structs out to disk, but that's going to be a lot of rewrite work.
I would recommend rewriting it as C structs. I know it will be a pain, but not only will it be quick to write to disk, it should also perform much better.
Before anyone gets upset: I am not saying people should always use structs, but there are some situations where they really are better for performance, especially if you preallocate your memory in, say, 20k contiguous blocks at a time (with pointers into the block), rather than creating and allocating lots of little chunks within a repeated loop.
That is, if your loop continually allocates objects, that is going to slow it down. If you have preallocated 1000 structs and just have an array of pointers (or a single pointer), access is an order of magnitude faster.
(I have had situations where even my desktop Mac was too slow, and didn't have enough memory, to cope with millions of objects being created in a row.)
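A rough sketch of the idea (field names and block size are illustrative); using indices rather than raw pointers inside the block also makes it trivially serializable:

```objc
#import <Foundation/Foundation.h>

typedef struct {
    int32_t nodeId;
    float   weight;
    int32_t nextIndex;  // index into the same block instead of a pointer,
                        // so the record survives being written to disk
} EdgeRecord;

enum { kBlockSize = 20000 };

void persistBlock(NSString *path) {
    // One contiguous allocation instead of 20k small ones.
    EdgeRecord *block = calloc(kBlockSize, sizeof(EdgeRecord));
    // ... fill in records ...

    // Persisting is then a single contiguous write.
    NSData *bytes = [NSData dataWithBytesNoCopy:block
                                         length:kBlockSize * sizeof(EdgeRecord)
                                   freeWhenDone:YES];
    [bytes writeToFile:path atomically:YES];
}
```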
Rather than rolling your own, I'd highly recommend taking another look at Core Data. Core Data was designed from the ground up for persisting object graphs. An NSCoder-based archive, like the one you describe, requires you to have the entire object graph in memory, and all writes are atomic. Core Data brings objects in and out of memory as needed, and can write only the part of your graph that has changed to disk (via SQLite).
If you read the Core Data Programming Guide or their tutorial guide, you can see that they've put a lot of thought into performance optimizations. If you follow Apple's recommendations (which can seem counterintuitive, like their suggestion to denormalize your data structures at some points), you can squeeze a lot more performance out of your data model than you'd expect. I've seen benchmarks where Core Data handily beat hand-tuned SQLite for data access within databases of the size you're looking at.
On the iPhone, you also gain some memory advantages by controlling the batch size of fetches, and there's a very nice helper class in NSFetchedResultsController.
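For example (entity name assumed):

```objc
// Rows are faulted in batch-sized chunks rather than all at once.
NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Node"];
request.fetchBatchSize = 50;
```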
It shouldn't take that long to build up a proof-of-principle Core Data implementation of your graph to compare it to your existing data storage methods.