Scalar values vs arrays in MongoDB

Consider the following two collections and the facts below them. Which one do you think is more appropriate?
// #1
{x:'a'}
{x:'b'}
{x:'c'}
{x:['d','e']}
{x:'f'}
//#2
{x:['a']}
{x:['b']}
{x:['c']}
{x:['d','e']}
{x:['f']}
Some facts:
Field x usually has only one value (95%) and sometimes more (5%).
MongoDB treats {x:['a']} like {x:'a'} when querying.
MongoVUE shows the scalar values of #1 directly, but shows Array[0] for #2.
Using #1, when you want to append a new value you first have to convert the scalar into an array (see the sketch below).
#1 may be a little faster in some CRUD operations (?)
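To illustrate the second and fourth facts, here is a quick mongo shell sketch (the collection name items is made up):
// Both styles answer the same equality query:
db.items.insertOne({x: 'a'})      // #1 style, scalar
db.items.insertOne({x: ['b']})    // #2 style, single-element array
db.items.find({x: 'a'})           // matches the scalar document
db.items.find({x: 'b'})           // matches the array document with the same syntax

// Appending with #2 is a plain $push:
db.items.updateOne({x: 'b'}, {$push: {x: 'c'}})

// With #1, $push fails because the field is not an array;
// the document has to be rewritten with the field as an array first:
db.items.updateOne({x: 'a'}, {$set: {x: ['a', 'c']}})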

To amplify @ZaidMasud's point, I recommend sticking with either scalars or arrays and not mixing the two. If you have unavoidable reasons for having both (legacy data, say), then I recommend that you get very familiar with how MongoDB queries work with arrays; it is not intuitive at first glance. See for example this puzzler.
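One classic example of the non-intuitive part (collection name and values are made up for illustration): a range query against an array field is satisfied if each operator matches some element, not necessarily the same one.
db.items.insertOne({x: [1, 10]})
// Matches, even though no single element lies between 4 and 6:
// 10 satisfies $gt: 4 and 1 satisfies $lt: 6.
db.items.find({x: {$gt: 4, $lt: 6}})
// $elemMatch requires one element to satisfy both conditions, so this finds nothing:
db.items.find({x: {$elemMatch: {$gt: 4, $lt: 6}}})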

From a schema design perspective, even though MongoDB allows you to store different data types for a key/value pair, it's not necessarily a good idea to do so. If there is no compelling reason to use different data types, it's often best to use the same data type for a given key/value pair.
So given that reasoning, I would prefer #2. Application code will generally be simpler in this case. Additionally, if you ever need to use the Aggregation Framework, you will find it useful to have uniform types.

Related

Polymorphic datastructure in GraphQL with MongoDB

I need to create a data structure in GraphQL (with MongoDB as the database) which represents the following schema:
A bill can have multiple articles
One article has an article-group-number and numerous attributes that depend 100% on the article-group-number and may vary heavily between these groups.
So one article may look like this:
{
article-group-number: 100,
length: 5.5
}
Another article may look like this:
{
article-group-number: 101,
width: 2
}
There may even be articles that have a reference to another article (but no circular structure).
I need to be able to create and maintain different validation rules (functions) for each article-group-number.
How should I do this?
I was thinking about union types in GraphQL, but those can only be read, not written (by "mutation"), at this point in time.
Am I supposed to save those articles as pure JSON data and do all validation by hand? That way I would lose the logical structure, since this JSON field would just be some sort of "messy data blob" in my eyes, because all of the logic is lost by passing through GraphQL.
Isn't there a cleaner way?
Two options:
You could take a cue from GraphQL itself -- the __Type type that is used in schemas has a "kind" field, and a lot of attributes that depend on the kind of the type. An object type, for example, can have fields, but other kinds of types cannot. GraphQL just puts all of the possible attributes in the __Type type and lets them take null values when they aren't applicable to the type's kind. You could do a similar thing, and just let your articles have all possible attributes.
Or you could use a union of article types, maybe all implementing a common article interface. This is not really a problem for mutations. Since mutations are in a completely different namespace from queries, they can do whatever you want -- the type you mutate doesn't have to be the same as the type you query, and in fact usually isn't.
Both ways work. The second way is more complicated, though, so if you're going to do that then make sure you're going to get some kind of benefit from it.
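As a rough sketch of the second option, assuming only the two article groups mentioned in the question (all type and field names here are invented for illustration), the schema could look something like this, written as an SDL string:
const typeDefs = `
  interface Article {
    articleGroupNumber: Int!
  }

  type LengthArticle implements Article {
    articleGroupNumber: Int!
    length: Float!
  }

  type WidthArticle implements Article {
    articleGroupNumber: Int!
    width: Float!
  }

  # Mutations live in their own namespace, so each group can get its own
  # input type and mutation field; per-group validation goes in the resolver.
  input LengthArticleInput {
    length: Float!
  }

  type Mutation {
    createLengthArticle(input: LengthArticleInput!): LengthArticle
  }
`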

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep related data in a single document, which improves read speed and data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits, or choose a completely different concurrency approach such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMS transactions; it exposes a completely different kind of concurrency to a higher level of the application, if not the user.
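A heavily condensed mongo shell sketch of the two-phase commit idea, using the classic account-transfer example (collection names, field names, and state values are illustrative, and error handling is omitted); the point is that every step is a single-document write and leaves a recoverable state behind:
// 1. Record the intent in a transaction document.
db.transactions.insertOne({_id: 't1', state: 'initial', source: 'A', dest: 'B', amount: 100})
db.transactions.updateOne({_id: 't1', state: 'initial'}, {$set: {state: 'pending'}})

// 2. Apply the change to both sides, marking each document with the pending transaction.
db.accounts.updateOne({_id: 'A', pendingTransactions: {$ne: 't1'}},
                      {$inc: {balance: -100}, $push: {pendingTransactions: 't1'}})
db.accounts.updateOne({_id: 'B', pendingTransactions: {$ne: 't1'}},
                      {$inc: {balance: 100}, $push: {pendingTransactions: 't1'}})

// 3. Mark the transaction applied, then clean up the markers.
db.transactions.updateOne({_id: 't1', state: 'pending'}, {$set: {state: 'applied'}})
db.accounts.updateOne({_id: 'A'}, {$pull: {pendingTransactions: 't1'}})
db.accounts.updateOne({_id: 'B'}, {$pull: {pendingTransactions: 't1'}})
db.transactions.updateOne({_id: 't1', state: 'applied'}, {$set: {state: 'done'}})

// A recovery job can scan db.transactions for documents stuck in 'pending' or
// 'applied' and either re-apply the remaining steps or roll them back.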
Thinking about this myself, I want to identify some categories of effects:
Your operation has only one database save (saving data into one document)
Your operation has two database saves (updates, inserts, or deletions), A and B
They are independent
B is required for A to be valid
They are interdependent (A is required for B to be valid, and B is required for A to be valid)
Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and the engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at all points: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
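A mongo shell sketch of that reordering (the document shapes and the carId/engineId variables are hypothetical); after each single-document write the data is still valid:
// 1. Detach the engine: a car with gas but no engine is valid.
db.cars.updateOne({_id: carId}, {$unset: {engineId: ''}})
// 2. Change the gas type: still valid, there is no attached engine to contradict it.
db.cars.updateOne({_id: carId}, {$set: {gasType: 'diesel'}})
// 3. Convert the detached engine document.
db.engines.updateOne({_id: engineId}, {$set: {type: 'diesel'}})
// 4. Reattach the engine: final, fully consistent state.
db.cars.updateOne({_id: carId}, {$set: {engineId: engineId}})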
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or to set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
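A sketch of that check in the mongo shell (the cars collection name is assumed):
db.cars.find({towing: {$exists: true}}).forEach(function (car) {
  var other = db.cars.findOne({_id: car.towing})
  if (!other || String(other.towedBy) !== String(car._id)) {
    print('inconsistent towing pair for car ' + car._id)
    // Repair option 1: drop the dangling reference
    //   db.cars.updateOne({_id: car._id}, {$unset: {towing: ''}})
    // Repair option 2: set the missing back-reference
    //   db.cars.updateOne({_id: car.towing}, {$set: {towedBy: car._id}})
  }
})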
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Why do Eclipse APIs use Arrays instead of Collections?

In the Eclipse APIs, the return and argument types are mostly arrays instead of collections. An example is the members method on IContainer, which returns IResource[].
I am interested in why this is the case. Maybe it is one of the following:
The APIs were designed before generics were available, so IResource[] was better than just Collection or List
Memory concerns, e.g. ArrayList internally holds an array which has more space than is needed (to offer an efficient implementation of add), whereas an array is always constructed for just the needed target size
It's not possible to add/remove elements on an array, so it is safe for iterating (but defensive copying is still necessary, because one can still change elements, e.g. set them to null)
Does anyone have any insights or other ideas why the API was developed that way?
Posting this as an answer, so it can be accepted.
Eclipse predates generics, and they are really serious about API stability. Also, at the low level of SWT, passing arrays seems to be used to reflect the operating system APIs that are being wrapped. Once you have a bunch of tooling using arrays, I guess it makes sense to keep things consistent. Also note that arrays aren't subject to all of the type erasure issues that come up when using reflection.
Yeah, I hear you as far as the Collections API being generally much easier to work with for dynamic lists of items.

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq, toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well just treat the list as an iterator, and thus may as well just use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeq, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).

What's more efficient: a Core Data fetch or manipulating/creating arrays?

I have a Core Data application and I would like to get results from the database based on certain parameters. For example, I might want to grab only the events that occurred in the last week, or the events that occurred in the last month. Is it better to do a fetch for the whole entity and then work with that result array to build arrays for each situation, or is it better to use predicates and make multiple fetches?
The answer depends on a lot of factors. I'd recommend perusing the documentation's description of the various store types. If you use the SQLite store type, for example, it's far more efficient to make proper use of date range predicates and fetch only those in the given range.
Conversely, say you are searching against a non-standard attribute, like looking for a substring in an encrypted string: you'll have to pull everything in, decrypt the strings, do your search, and note the matches.
On the far end of the spectrum, you have the binary store type, which means the whole thing will always be pulled into memory regardless of what kind of fetches you might do.
You'll need to describe your managed object model and the types of fetches you plan to do in order to get a more specific answer.