Storing and accessing large amounts of data - mongodb

My application creates pieces of data that, in xml, would look like this:
<resource url="someurl">
  <term>
    <name>somename</name>
    <frequency>somenumber</frequency>
  </term>
  ...
  ...
  ...
</resource>
This is how I'm storing these "resources" right now: one resource per XML file, with as many "term" elements per "resource" as needed.
The problem is, I'll need to generate about 2 million of these resources.
I've generated almost 500,000 and my Mac isn't very happy about it.
So my question is: how should I store this data?
A database? That would be hard, because the structure of the data isn't fixed...
Maybe merge some resources into larger XML files?
...?
I don't need to change the data once it's created.
Right now I'm accessing a specific resource by the name of that resource's file.
Any suggestions are greatly appreciated!

Not all databases are relational. Have a look at MongoDB, for example. It stores your data as JSON-like documents, similar to your resources.
An example using the shell:
$ mongo
> db.resources.save({url: "someurl",
                     terms: [{name: "name1", frequency: 17.0},
                             {name: "name2", frequency: 42.0}]})
> db.resources.find()
{ "_id" : ObjectId("4b00884b3a77b8b2fa3a8f77"),
  "url" : "someurl",
  "terms" : [ { "name" : "name1", "frequency" : 17 },
              { "name" : "name2", "frequency" : 42 } ] }

If you can't predict how your data is going to be organized, maybe http://couchdb.apache.org/ is interesting for you. It is a schema-less database.
Anyway, XML is maybe not the best choice for large amounts of data.
Maybe JSON or YAML would work out better? They need less space and are easier to parse (I have no experience using those formats at a larger scale, however, so I may be wrong).

You should definitely have several resources per XML file, but only if you expect to need all the resources together at the same time. If you only ever need to send a handful of resources to anybody, then keep making the individual XML files.
Even in that situation, you could keep the large XML file and generate the smaller ones on demand from the original dataset.
Using a database like SQLite3 would allow you to have faster seek times and easier manipulation of the data, using SQL syntax.

Related

Couchbase many to many relationship modeling

I'm trying to figure out how to model my data in a many-to-many relationship in Couchbase (I'm using N1QL as well).
I have two entities: Clients and Projects.
Client - each client can create many projects - approximately 2000 projects per year.
Project - each project can belong to many clients (maximum 50 clients).
I thought about maybe creating a new document for each site/project combination, but according to the Couchbase documentation on data modeling:
This typically isn’t a good approach in Couchbase Server as
referencing and embedding provides a great deal of flexibility to
avoid creating this redundant document.
How should I store the data?
Any suggestion/advice would be helpful.
Thanks.
Please refer to the following URL to resolve the above issue:
https://developer.couchbase.com/documentation/server/current/data-modeling/modeling-relationships.html
That quote is referencing "relationship documents". In your case, that would mean you'd have a client document, a project document, and some sort of client-project mapping document. I would agree that a document only for a relationship would not be a useful approach, unless you intend to store a lot of information about that relationship.
Based on the information you've given, I'd recommend storing Client documents and Project documents. Based on the numbers, I'd say the projects should contain a list of Client document IDs.
Something like:
key client::001
{
  "name" : "Clienty McClientface",
  "address" : "123 main st",
  "foo" : "bar",
  "type" : "client"
}
key project::001
{
  "name" : "Alan Parsons Project",
  "startDate" : "2012-09-27",
  "clients" : [
    "client::001",
    "client::007",
    "client::123",
    // ... etc ...
  ],
  "type" : "project"
}
But in general, it depends on what your use cases are for reads, writes, queries. No data model will fit every use case.

Mongodb Indexing Issue

I have a collection containing data like the following:
"sel_att" : {
"Technical Specifications" : {
"In Sales Package" : "Charger, Handset, User Manual, Extra Ear Buds, USB Cable, Headset",
"Warranty" : "1 year manufacturer warranty for Phone and 6 months warranty for in the box accessories"
},
"General Features" : {
"Brand" : "Sony",
"Model" : "Xperia Z",
"Form" : "Bar",
"SIM Size" : "Micro SIM",
"SIM Type" : "Single Sim, GSM",
"Touch Screen" : "Yes, Capacitive",
"Business Features" : "Document Viewer, Pushmail (Mail for Exchange, ActiveSync)",
"Call Features" : "Conference Call, Hands Free, Loudspeaker, Call Divert",
"Product Color" : "Black"
},
"Platform/Software" : {
"Operating Frequency" : "GSM - 850, 900, 1800, 1900; UMTS - 2100",
"Operating System" : "Android v4.1 (Jelly Bean), Upgradable to v4.4 (KitKat)",
"Processor" : "1.5 GHz Qualcomm Snapdragon S4 Pro, Quad Core",
"Graphics" : "Adreno 320"
}
}
The data above is huge and the fields are all dynamically inserted; how can I index such fields to get faster results?
It seems to me that you have not fully understood the power of document-based databases such as MongoDB.
Below are just a few thoughts:
you have 1 million records
you have 1 million index values for that collection
you need enough RAM to hold 1 million index values in memory, otherwise the benefits of indexing won't really show up
yes, you can shard, but you need lots of hardware to accommodate even basic needs
What you really need is something that can dynamically link arbitrary text to useful indexes and lets you search vast amounts of text very fast. For that you should use a tool like Elasticsearch.
Note that you can and should still store your content in a NoSQL database, and yes, MongoDB is a viable option. For the indexing part, Elasticsearch has plugins available to enhance the communication between the two.
P.S. If I recall correctly the plugin is called MongoDB River
EDIT:
I've also added a more comprehensive definition for ElasticSearch. I won't take credit for it since I've grabbed it from Wikipedia:
Elasticsearch is a search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a
RESTful web interface and schema-free JSON documents
EDIT 2:
I've scaled down a bit on the numbers since it might be far-fetched for most projects. But the main idea remains the same. Indexes are not recommended for the use-case described in the question.
Based on what you want to query, you will end up indexing those fields. You can also have secondary indexes in MongoDB. But beware: creating too many indexes may improve your query performance, but it will consume additional disk space and make inserts slower because the indexes have to be maintained.
MongoDB indexes
Short answer: you can't. Use Elastic Search.
Here is a good tutorial to setup MongoDB River on Elastic Search
The reason is simple, MongoDB does not work like that. It helps you store complex schemaless sets of documents. But you cannot index dozens of different fields and hope to get good performance. Generally a max of 5-6 indices are recommended per collection.
Elastic Search is commonly used in the fashion described above in many other use-cases, so it is an established pattern. For example, Titan Graph DB has the built-in option to use ES for this purpose. If I were you, I would just use that and would not try to make MongoDB do something it is not built to do.
If you have the time and your data structure lends itself to it (I think it might, judging from the JSON above), you could also use an RDBMS to break these pieces down and store them on the fly with an EAV-like pattern. Elasticsearch would be easier to start with and probably easier to achieve performance with quickly.
Well, there are lots of problems with having many indexes, as has been discussed here. But if you really need to add indexes for dynamic fields, you can create the index from your MongoDB driver.
So, let's say you are using the MongoDB Java driver; then you could create an index like below: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/#creating-an-index
coll.createIndex(new BasicDBObject("i", 1)); // create index on "i", ascending
PYTHON
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.create_index
So, when you are populating data using any of the drivers and a new field comes through, you can fire the index creation from the driver itself rather than doing it manually.
P.S.: I have not tried this and it might not be suitable or advisable.
Hope this helps!
Indexing of dynamic fields is tricky. There is no such thing as wildcard-indexes. Your options would be:
Option A: Whenever you insert a new document, do an ensureIndex with the option sparse:true for each of its fields. This does nothing when the index already exists and creates a new one when it's a new field. The drawback is that you will end up with a very large number of indexes and that inserts can get slow because of all the indexes which need to be created or updated.
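For illustration, a sketch of what Option A could look like in the mongo shell for one of the dynamic fields from the sample document above (the collection name is an assumption; newer shells call this createIndex):
> db.products.ensureIndex({ "sel_att.General Features.Brand" : 1 }, { sparse : true })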
Option B: Forget about the field-names and refactor your documents to an array of key/value pairs. So
"General Features" : {
"Brand" : "Sony",
"Form" : "Bar"
},
"Platform/Software" : {,
"Processor" : "1.5 GHz Qualcomm",
"Graphics" : "Adreno 320"
}
becomes
properties: [
    { category: "General Features", key: "Brand", value: "Sony" },
    { category: "General Features", key: "Form", value: "Bar" },
    { category: "Platform/Software", key: "Processor", value: "1.5 GHz Qualcomm" },
    { category: "Platform/Software", key: "Graphics", value: "Adreno 320" }
]
This allows you to create a single compound index on properties.category and properties.key to cover all the array entries.
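A sketch of that compound index and a query it can serve, in the mongo shell (the collection name products is an assumption):
> db.products.createIndex({ "properties.category" : 1, "properties.key" : 1 })
> db.products.find({ properties : { $elemMatch : { category : "General Features",
                                                   key : "Brand",
                                                   value : "Sony" } } })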

Server side paging and grouping of large dataset

I'll try to explain the issue as best I can. I need to implement a grid with server-side paging. On a request for N entities, the DB should return a set of data which is grouped, or better said transformed, in such a way that when the transformation phase is done it results in those N entities.
The best way, as far as I can see, is something like this:
Query_all_data() => Result; (10000000 documents)
Transform(Result) => Transformed (100 groups)
Transformed.Skip(N).Take(N)
Transformation phase should be something like this:
Result = [d0, d1, d2..., dN]
Transformed = [
    { info: "foo", docs: [d0, d2, d21, d67, d100042] },
    { info: "bar", docs: [d3, d28, d121, d6271, d100042] },
    { info: "baz", docs: [d41, d26, d221, d567, d100043] },
    { info: "waz", docs: [d22, d24, d241, d167, d1000324] }
]
Every object in Transformed is an entity in grid.
I'm not sure if it's important, but the DB in question is MongoDB and all documents are stored in one collection. Now, the huge pitfall of this approach is that it's way too slow on a large dataset, which will most certainly be the case.
Is there a better approach? Maybe a different DB design?
@dakt, you can store your data in a couple of different ways, based on how you are going to use it. In the process it may also be useful to store the data in de-normalized form, where some duplication of data may occur.
Store data as individual documents as mentioned in your problem statement
Store the data in the transformed format from your problem statement. It looks like you have a consistent way of mapping the docs to some tag. If so, why not maintain documents such that they are always embedded under those tags? This certainly has a limitation on the number of docs you can embed, based on the 16MB document limit.
I would suggest looking at the MongoDB use-cases - http://docs.mongodb.org/ecosystem/use-cases/ and see if any of those are similar to what you are trying to achieve.
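If the grouping key is known up front, it may also be worth measuring MongoDB's aggregation framework, which pushes the group/skip/take to the server. A rough sketch (the field name info and the page numbers are assumptions, and the $group stage still has to scan the whole collection):
> db.collection.aggregate([
      { $group : { _id : "$info", docs : { $push : "$_id" } } },  // one group per grid entity
      { $sort  : { _id : 1 } },                                   // stable order for paging
      { $skip  : 100 },                                           // (pageNumber - 1) * pageSize
      { $limit : 100 }                                            // pageSize
  ])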

API pagination best practices

I'd love some help handling a strange edge case with a paginated API I'm building.
Like many APIs, this one paginates large results. If you query /foos, you'll get 100 results (i.e. foo #1-100), and a link to /foos?page=2 which should return foo #101-200.
Unfortunately, if foo #10 is deleted from the data set before the API consumer makes the next query, /foos?page=2 will offset by 100 and return foos #102-201.
This is a problem for API consumers who are trying to pull all foos - they will not receive foo #101.
What's the best practice to handle this? We'd like to make it as lightweight as possible (i.e. avoiding handling sessions for API requests). Examples from other APIs would be greatly appreciated!
I'm not completely sure how your data is handled, so this may or may not work, but have you considered paginating with a timestamp field?
When you query /foos you get 100 results. Your API should then return something like this (assuming JSON, but if it needs XML the same principles can be followed):
{
    "data" : [
        { data item 1 with all relevant fields },
        { data item 2 },
        ...
        { data item 100 }
    ],
    "paging": {
        "previous": "http://api.example.com/foo?since=TIMESTAMP1",
        "next": "http://api.example.com/foo?since=TIMESTAMP2"
    }
}
Just a note, only using one timestamp relies on an implicit 'limit' in your results. You may want to add an explicit limit or also use an until property.
The timestamp can be dynamically determined using the last data item in the list. This seems to be more or less how Facebook paginates in its Graph API (scroll down to the bottom to see the pagination links in the format I gave above).
One problem may be if you add a data item, but based on your description it sounds like they would be added to the end (if not, let me know and I'll see if I can improve on this).
If you've got pagination you also sort the data by some key. Why not let API clients include the key of the last element of the previously returned collection in the URL and add a WHERE clause to your SQL query (or something equivalent, if you're not using SQL) so that it returns only those elements for which the key is greater than this value?
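In mongo shell terms (just a sketch; the collection name and the choice of _id as the sort key are assumptions), that idea looks like this:
// lastSeenId: the key of the last element returned on the previous page
> db.foos.find({ _id : { $gt : lastSeenId } }).sort({ _id : 1 }).limit(100)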
You have several problems.
First, you have the example that you cited.
You also have a similar problem if rows are inserted, but in this case the user gets duplicate data (arguably easier to manage than missing data, but still an issue).
If you are not snapshotting the original data set, then this is just a fact of life.
You can have the user make an explicit snapshot:
POST /createquery
filter.firstName=Bob&filter.lastName=Eubanks
Which results:
HTTP/1.1 301 Here's your query
Location: http://www.example.org/query/12345
Then you can page that all day long, since it's now static. This can be reasonably lightweight, since you can just capture the actual document keys rather than the entire rows.
If the use case is simply that your users want (and need) all of the data, then you can simply give it to them:
GET /query/12345?all=true
and just send the whole kit.
There may be two approaches depending on your server side logic.
Approach 1: When the server is not smart enough to handle object states.
You could send all cached records' unique IDs to the server, for example ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"], and a boolean parameter to indicate whether you are requesting new records (pull to refresh) or old records (load more).
Your server should be responsible for returning the new records (load more records or new records via pull to refresh) as well as the IDs of any records deleted from ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"].
Example:
If you are requesting "load more", then your request should look something like this:
{
    "isRefresh" : false,
    "cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"]
}
Now suppose you are requesting old records (load more), the record "id2" has been updated by someone, and the records "id5" and "id8" have been deleted from the server; then your server response should look something like this:
{
    "records" : [
        {"id" :"id2","more_key":"updated_value"},
        {"id" :"id11","more_key":"more_value"},
        {"id" :"id12","more_key":"more_value"},
        {"id" :"id13","more_key":"more_value"},
        {"id" :"id14","more_key":"more_value"},
        {"id" :"id15","more_key":"more_value"},
        {"id" :"id16","more_key":"more_value"},
        {"id" :"id17","more_key":"more_value"},
        {"id" :"id18","more_key":"more_value"},
        {"id" :"id19","more_key":"more_value"},
        {"id" :"id20","more_key":"more_value"}],
    "deleted" : ["id5","id8"]
}
But in this case, if you have a lot of locally cached records, say 500, then your request string gets too long, like this:
{
    "isRefresh" : false,
    "cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10",………,"id500"] // too long a request
}
Approach 2: When the server is smart enough to handle object states according to date.
You could send the ID of the first record, the ID of the last record, and the previous request's epoch time. This way your request is always small, even if you have a large number of cached records.
Example:
If you are requesting "load more", then your request should look something like this:
{
    "isRefresh" : false,
    "firstId" : "id1",
    "lastId" : "id10",
    "last_request_time" : 1421748005
}
Your server is responsible for returning the IDs of records deleted after last_request_time, as well as the updated records after last_request_time between "id1" and "id10".
{
    "records" : [
        {"id" :"id2","more_key":"updated_value"},
        {"id" :"id11","more_key":"more_value"},
        {"id" :"id12","more_key":"more_value"},
        {"id" :"id13","more_key":"more_value"},
        {"id" :"id14","more_key":"more_value"},
        {"id" :"id15","more_key":"more_value"},
        {"id" :"id16","more_key":"more_value"},
        {"id" :"id17","more_key":"more_value"},
        {"id" :"id18","more_key":"more_value"},
        {"id" :"id19","more_key":"more_value"},
        {"id" :"id20","more_key":"more_value"}],
    "deleted" : ["id5","id8"]
}
Pull To Refresh:
Load More:
It may be tough to find best practices since most systems with APIs don't accommodate this scenario; either it is an extreme edge case, or they don't typically delete records (Facebook, Twitter). Facebook actually says each "page" may not have the number of results requested due to filtering done after pagination.
https://developers.facebook.com/blog/post/478/
If you really need to accommodate this edge case, you need to "remember" where you left off. jandjorgensen's suggestion is just about spot on, but I would use a field guaranteed to be unique, like the primary key. You may need to use more than one field.
Following Facebook's flow, you can (and should) cache the pages already requested and just return those with deleted rows filtered if they request a page they had already requested.
Option A: Keyset Pagination with a Timestamp
In order to avoid the drawbacks of offset pagination you have mentioned, you can use keyset based pagination. Usually, the entities have a timestamp that states their creation or modification time. This timestamp can be used for pagination: Just pass the timestamp of the last element as the query parameter for the next request. The server, in turn, uses the timestamp as a filter criterion (e.g. WHERE modificationDate >= receivedTimestampParameter)
{
    "elements": [
        {"data": "data", "modificationDate": 1512757070},
        {"data": "data", "modificationDate": 1512757071},
        {"data": "data", "modificationDate": 1512757072}
    ],
    "pagination": {
        "lastModificationDate": 1512757072,
        "nextPage": "https://domain.de/api/elements?modifiedSince=1512757072"
    }
}
This way, you won't miss any element. This approach should be good enough for many use cases. However, keep the following in mind:
You may run into endless loops when all elements of a single page have the same timestamp.
You may deliver many elements multiple times to the client when elements with the same timestamp are overlapping two pages.
You can make those drawbacks less likely by increasing the page size and using timestamps with millisecond precision.
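Translated to a MongoDB-style query, since that is what the other questions here use (the collection name and page size are assumptions), Option A boils down to:
> db.elements.find({ modificationDate : { $gte : lastModificationDate } })
             .sort({ modificationDate : 1 })
             .limit(100)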
Option B: Extended Keyset Pagination with a Continuation Token
To handle the mentioned drawbacks of the normal keyset pagination, you can add an offset to the timestamp and use a so-called "Continuation Token" or "Cursor". The offset is the position of the element relative to the first element with the same timestamp. Usually, the token has a format like Timestamp_Offset. It's passed to the client in the response and can be submitted back to the server in order to retrieve the next page.
{
    "elements": [
        {"data": "data", "modificationDate": 1512757070},
        {"data": "data", "modificationDate": 1512757072},
        {"data": "data", "modificationDate": 1512757072}
    ],
    "pagination": {
        "continuationToken": "1512757072_2",
        "nextPage": "https://domain.de/api/elements?continuationToken=1512757072_2"
    }
}
The token "1512757072_2" points to the last element of the page and states "the client already got the second element with the timestamp 1512757072". This way, the server knows where to continue.
Please mind that you have to handle cases where the elements got changed between two requests. This is usually done by adding a checksum to the token. This checksum is calculated over the IDs of all elements with this timestamp. So we end up with a token format like this: Timestamp_Offset_Checksum.
For more information about this approach check out the blog post "Web API Pagination with Continuation Tokens". A drawback of this approach is the tricky implementation as there are many corner cases that have to be taken into account. That's why libraries like continuation-token can be handy (if you are using Java/a JVM language). Disclaimer: I'm the author of the post and a co-author of the library.
Pagination is generally a "user" operation, and to prevent overload on both computers and the human brain, you generally return a subset. However, rather than thinking that we don't get the whole list, it may be better to ask: does it matter?
If an accurate live scrolling view is needed, REST APIs which are request/response in nature are not well suited for this purpose. For this you should consider WebSockets or HTML5 Server-Sent Events to let your front end know when dealing with changes.
Now if there's a need to get a snapshot of the data, I would just provide an API call that provides all the data in one request with no pagination. Mind you, you would need something that would do streaming of the output without temporarily loading it in memory if you have a large data set.
For my case I implicitly designate some API calls to allow getting the whole information (primarily reference table data). You can also secure these APIs so it won't harm your system.
Just to add to this answer by Kamilk: https://www.stackoverflow.com/a/13905589
It depends a lot on how large a dataset you are working with. Small datasets work effectively with offset pagination, but large real-time datasets do require cursor pagination.
I found a wonderful article on how Slack evolved its API's pagination as their datasets increased, explaining the positives and negatives at every stage: https://slack.engineering/evolving-api-pagination-at-slack-1c1f644f8e12
I think your API is currently responding the way it should: the first 100 records on the page, in the overall order of the objects you are maintaining. Your explanation suggests that you are using some kind of ordering IDs to define the order of your objects for pagination.
Now, if you want page 2 to always start at 101 and end at 200, then you must make the number of entries on the page variable, since they are subject to deletion.
You should do something like the below pseudocode:
page_max = 100
def get_page_results(page_no):
    start = (page_no - 1) * page_max + 1
    end = page_no * page_max
    return fetch_results_by_id_between(start, end)
Another option for pagination in RESTful APIs is to use the Link header introduced here. For example, GitHub uses it as follows:
Link: <https://api.github.com/user/repos?page=3&per_page=100>; rel="next",
<https://api.github.com/user/repos?page=50&per_page=100>; rel="last"
The possible values for rel are: first, last, next, previous. By using the Link header, however, it may not be possible to specify total_count (the total number of elements).
I've thought long and hard about this and finally ended up with the solution I'll describe below. It's a pretty big step up in complexity but if you do make this step, you'll end up with what you are really after, which is deterministic results for future requests.
Your example of an item being deleted is only the tip of the iceberg. What if you are filtering by color=blue but someone changes item colors in between requests? Fetching all items in a paged manner reliably is impossible... unless... we implement revision history.
I've implemented it and it's actually less difficult than I expected. Here's what I did:
I created a single table changelogs with an auto-increment ID column
My entities have an id field, but this is not the primary key
The entities have a changeId field which is both the primary key as well as a foreign key to changelogs.
Whenever a user creates, updates or deletes a record, the system inserts a new record in changelogs, grabs the id and assigns it to a new version of the entity, which it then inserts in the DB
My queries select the maximum changeId (grouped by id) and self-join that to get the most recent versions of all records.
Filters are applied to the most recent records
A state field keeps track of whether an item is deleted
The max changeId is returned to the client and added as a query parameter in subsequent requests
Because only new changes are created, every single changeId represents a unique snapshot of the underlying data at the moment the change was created.
This means that you can cache the results of requests that have the parameter changeId in them forever. The results will never expire because they will never change.
This also opens up exciting features such as rollback/revert, syncing client caches, etc. Any feature that benefits from change history.
Referring to API Pagination Design, we can design a pagination API using a cursor.
They have this concept, called cursor — it’s a pointer to a row. So you can say to a database “return me 100 rows after that one”. And it’s much easier for a database to do since there is a good chance that you’ll identify the row by a field with an index. And suddenly you don’t need to fetch and skip those rows, you’ll go directly past them.
An example:
GET /api/products
{"items": [...100 products],
"cursor": "qWe"}
The API returns an (opaque) string, which you can then use to retrieve the next page:
GET /api/products?cursor=qWe
{"items": [...100 products],
"cursor": "qWr"}
Implementation-wise there are many options. Generally, you have some ordering criterion, for example the product id. In this case, you'll encode your product id with some reversible algorithm (let's say hashids). On receiving a request with the cursor, you decode it and generate a query like WHERE id > :cursor LIMIT 100.
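As a rough illustration of the opaque-cursor idea (the article suggests hashids; plain base64 in Node.js is used here only to keep the sketch short, and the function names are made up):
// Opaque cursor: base64-encode the last product id, decode it on the next request
const encodeCursor = (id) => Buffer.from(String(id)).toString("base64");
const decodeCursor = (cursor) => Number(Buffer.from(cursor, "base64").toString());
// On GET /api/products?cursor=..., the server decodes the cursor and runs the
// equivalent of: WHERE id > decodedId ORDER BY id LIMIT 100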
Advantages:
The query performance of the DB can be improved through the cursor
It handles new content being inserted into the DB while querying well
Disadvantages:
It's impossible to generate a previous-page link with a stateless API

How can I efficiently use MongoDB to create real-time analytics with pivots?

So I'm getting a ton of data continuously that's getting put into a processedData collection. The data looks like:
{
    date: "2011-12-4",
    time: 2243,
    gender: {
        males: 1231,
        females: 322
    },
    age: 32
}
So I'll get lots and lots of data objects like this, continually. I want to be able to see all "males" that are above 40 years old. This does not seem to be an efficient query because of the sheer size of the data.
Any tips?
Generally speaking, you can't.
However, there may be some shortcuts, depending on actual requirements. Do you want to count 'males above 40' across all dataset, or just one day?
1 day: split your data into daily collections (processedData-20111121, ...); this will help your queries. You can also cache the results of such a query.
whole dataset: pre-aggregate the data. That is, upon insertion of a new data entry, do something like this:
db.preaggregated.update({_id : 'male_40'},
{$set : {gender : 'm', age : 40}, $inc : {count : 1231}},
true);
Similarly, if you know all your queries beforehand, you can just precalculate them (and not keep raw data).
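Reading such a pre-aggregated counter back is then a cheap point lookup on the document created in the snippet above:
> db.preaggregated.find({ _id : 'male_40' })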
It also depends on how you define "real-time" and how big a query load you will have. In some cases it is ok to just fire ad-hoc map-reduces.
My guess is that your target GUI is a website? In that case you are looking for something called Comet. You should make a layer which processes all the data and broadcasts new mutations to your client or event bus (more on that below). Mongo doesn't enable real-time data as it doesn't emit anything on a mutation, so you can use any data store which suits you.
Depending on the language you'll use you have different options (for comet):
Socket.io (nodejs) - Javascript
Cometd - Java
SignalR - C#
Libwebsocket - C++
Most of the time you'll need an event bus or message queue to put the mutation events on. Take a look at JMS, Redis, or NServiceBus (depending on what you'll use).