I'm wondering what the best way would be to store opening hours and to determine whether a certain place is open right now (or at a specific time). For humans, "Mo-Fr 9am-5pm, Sa 10am-2pm" is fine, but how can I get a computer to understand that and query it in a NoSQL / document-based database like Elasticsearch?
FWIW: David Smiley (one of the Solr / Lucene gurus) and I have come up with a working solution (on paper, never implemented, at least not by me) in Solr. The solution could be somewhat simplified if you only require one slot per day of the week, which may be what you want.
http://lucene.472066.n3.nabble.com/Modeling-openinghours-using-multipoints-td4025336.html
Problem is that this is based on fairly new spatial stuff in Solr 4 (which stuff -> read the post), which I'm pretty sure hasn't made its way into ES yet, although I might be mistaken.
No guarantees, no docs :)
A straightforward alternative, if indeed you only have one slot per day of the week, is to have 14 dynamic fields, representing 7 opening and 7 closing hours, and do a simple boolean query on the correct fields.
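As a rough sketch of that 14-field idea (not tested; the index name, field names, and the minutes-since-midnight encoding are all my own assumptions, and the query syntax assumes a reasonably recent Elasticsearch with the official elasticsearch-py client):

from datetime import datetime
from elasticsearch import Elasticsearch  # official elasticsearch-py client

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def open_now_query(now=None):
    # Times are stored as minutes since midnight in 14 fields:
    # mon_open, mon_close, ..., sun_open, sun_close (names made up here).
    now = now or datetime.now()
    day = DAYS[now.weekday()]              # weekday(): 0 = Monday
    minutes = now.hour * 60 + now.minute
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {day + "_open": {"lte": minutes}}},
                    {"range": {day + "_close": {"gt": minutes}}},
                ]
            }
        }
    }

# Places open right now:
hits = Elasticsearch().search(index="places", body=open_now_query())

Hours that wrap past midnight would need extra handling, e.g. splitting them into two slots.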
I'm trying to figure out what I can use for a future project. We plan to store about 500k records per month in the first year, and maybe more in the following years. This is a vertical application, so there's no need to use a relational database for it; that's the reason why I decided to choose a NoSQL data store.
The first option that came to my mind was MongoDB, since it is a very mature product with a lot of support from the community. On the other hand, we have a brand new product that offers a managed service with top performance. I'll develop this application, but there's no maintenance plan (at least for now), so I think that will be a huge advantage, since Amazon provides an elastic way to scale.
My major concern is the query structure. I haven't looked at DynamoDB's query capabilities yet, but since it is a key/value store I feel that it could be more limited than MongoDB.
If anyone has experience moving a project from MongoDB to DynamoDB, any advice would be greatly appreciated.
I know this is old, but it still comes up when you search for the comparison. We were using Mongo and have moved almost entirely to Dynamo, which is our first choice now. Not because it has more features; it doesn't. Mongo has a better query language, you can index within a structure, and there are lots of little things. The superiority of Dynamo is in what the OP stated in his comment: it's easy. You don't have to take care of any servers. When you start to set up a Mongo sharded solution, it gets complicated. You can go to one of the hosting companies, but that's not cheap either. With Dynamo, if you need more throughput, you just click a button. You can write scripts to scale automatically. When it's time to upgrade Dynamo, it's done for you. That is all a lot of precious stress and time not spent. If you don't have dedicated ops people, Dynamo is excellent.
So we now go with Dynamo by default. Mongo maybe, if the data structure is complicated enough to warrant it, but then we'd probably go back to a SQL database. Dynamo is obtuse; you really need to think about how you're going to build it, and you'll likely use Redis in ElastiCache to make it work for complex stuff. But it sure is nice not to have to take care of it. You code. That's it.
With 500k documents, there is no reason to scale whatsoever. A typical laptop with an SSD and 8GB of RAM can easily handle tens of millions of records, so if you are choosing because of scaling, your choice doesn't really matter. I would suggest you pick what you like the most, and perhaps what you can find the most online support for.
For quick overview comparisons, I really like this website, which has many comparison pages, e.g. AWS DynamoDB vs MongoDB: http://db-engines.com/en/system/Amazon+DynamoDB%3BMongoDB
Short answer: Start with SQL and add NoSQL only when/if needed. (unless you don't need anything beyond very simple queries)
My personal experience: I haven't used MongoDB for queries, but as of April 2015 DynamoDB is still very crippled when it comes to anything beyond the most basic key/value queries. I love it for the basic stuff, but if you want a query language then look to a real SQL database solution.
In DynamoDB you can query on a hash key or on a hash and range key, and you can have multiple global secondary indexes. I'm doing queries on a single table with 4 possible filter parameters and sorting the results; this is supported (barely) through the use of global secondary indexes with filter expressions. The problem comes in when you try to get the total results matching the filter: you can't just ask for the first 10 items matching the filter. Instead, it checks 10 items and you may get 0 valid results, forcing you to keep re-scanning from the continuation key; a pain in the neck, and it consumes too much of your table's read quota for a simple scenario.
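To make that re-scan pain concrete, here is a rough boto3 sketch (table, index, and attribute names are invented). The point is that Limit caps the items examined, not the items that pass the FilterExpression, so you may have to keep paging with LastEvaluatedKey and pay read units for every page:

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("orders")

def filtered_query(want=10):
    items, start_key = [], None
    while len(items) < want:
        kwargs = dict(
            IndexName="customer-index",
            KeyConditionExpression=Key("customer_id").eq("c-123"),
            FilterExpression=Attr("status").eq("shipped"),
            Limit=want,
        )
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])            # may be empty even mid-table
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:                      # reached the end of the index
            break
    return items[:want]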
To be specific about the limit problem with filters in the query, this is from the docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit):
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, the operation returns the first six items in the table that match the request parameters. If you also supply a FilterExpression, the operation returns the items within the first six items in the table that match the filter requirements.
My conclusion is that queries involving FilterExpressions are only usable on very rare occasions and are not scalable, because each query can easily read most or all of your table, which consumes far too many DynamoDB read units. Once you use too many read units you'll get throttled and see poor performance.
Expert opinion: At the AWS Summit on Apr 9, 2015, Brett Hollman (Manager, Solutions Architecture, AWS), in his talk on scaling to your first 10 million users, advocates starting with a SQL database and then using NoSQL only when and if it makes sense, because sooner or later you'll probably need a SQL server somewhere in your stack. His slides are here: http://www.slideshare.net/AmazonWebServices/deep-dive-scaling-up-to-your-first-10-million-users
See slide 28.
We chose a combination of Mongo/Dynamo for a healthcare product. Basically Mongo allows better searching, but the hosted Dynamo is great because it's HIPAA compliant without any extra work. So we host the Mongo portion, with no personal data, on a standard setup and let Amazon deal with the HIPAA portion in terms of infrastructure. We can query certain items from Mongo which bring up documents with pointers (IDs) to the related Dynamo documents.
There were two main reasons we chose to do this with Mongo instead of hosting the entire application on Dynamo. First, we needed to perform location-based searches, which Mongo is great at; at the time Dynamo was not, but it does have an option for that now.
Second, some documents were unstructured and we did not know ahead of time what the data would be. For example, let's say user A inputs a document in the "form" collection like this: {"username": "user1", "email": "me@me.com"}, and another user puts this in the same collection: {"phone": "813-555-3333", "location": [28.1234, -83.2342]}. With Mongo we can search any of these dynamic and unknown fields at any time; with Dynamo you could do this, but you would have to create an index every time a new field you wanted searchable was added. So if you had never had a phone field in your Dynamo documents before and then, all of a sudden, someone adds one, it's completely unsearchable.
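A minimal pymongo sketch of that point (collection and field names just mirror the example documents above; nothing here is specific to our product): the new field is queryable immediately, and an index is optional and can be added later.

from pymongo import ASCENDING, MongoClient

forms = MongoClient()["app"]["form"]

# Any field can be queried as soon as some document contains it:
docs_with_phone = list(forms.find({"phone": {"$exists": True}}))

# Index it later, only if/when that field becomes worth indexing:
forms.create_index([("phone", ASCENDING)], sparse=True)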
Now this brings up another point, which you have mentioned. Sometimes choosing the right solution for the job does not mean choosing the best product for the job. For example, you may have a client who needs, and will use, the system you created for 10+ years. Going with a SaaS/IaaS solution that is good enough to get the job done may be a better option, as you can rely on Amazon to have kept up and maintained their systems over the long haul.
I have worked with both and I'm kind of a fan of both.
But you need to understand when to use what and for what purpose.
I don't think it's a great idea to move your whole database to DynamoDB. The reasons: querying is difficult except on primary and secondary keys, indexing is limited, and scanning in DynamoDB is painful.
I would go for a hybrid sort of setup, where the data you need to query extensively lives in MongoDB; with all its features you would never feel constrained when providing enhancements or modifications.
DynamoDB is lightning fast (faster than MongoDB), so it is often used as an alternative to sessions in scalable applications. DynamoDB best practices also suggest that if there is plenty of data that is rarely used, you move it to another table.
So suppose you have articles or feeds. People are more likely to look for last week's or this month's stuff, and the chances are really slim that people will visit two-year-old data. For these purposes DynamoDB best practice is to store data by month or year in different tables.
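A hedged sketch of that table-per-period idea with boto3 (the table naming scheme and item shape are invented for illustration): writes and reads are routed to a table named after the item's month, so old months can have their throughput dialed down or be archived.

from datetime import datetime

import boto3

dynamodb = boto3.resource("dynamodb")

def table_for(ts):
    # e.g. "feed_items_2015_04"; assumes the monthly tables already exist
    return dynamodb.Table("feed_items_{:%Y_%m}".format(ts))

def put_feed_item(item, ts=None):
    ts = ts or datetime.utcnow()
    item = dict(item, created_at=ts.isoformat())
    table_for(ts).put_item(Item=item)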
DynamoDB is seamlessly scalable, something you would have to do manually in MongoDB. However, you would lose out on DynamoDB's performance if you don't understand throughput partitioning and how scaling works behind the scenes.
DynamoDB should be used where speed is critical; MongoDB, on the other hand, has many more features and capabilities, something DynamoDB lacks.
For example, you can set up a MongoDB replica set in such a way that one of the replicas holds a data instance from 8 (or however many) hours ago. Really useful if you messed up something big time in your DB and want to get the data back as it was before.
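For reference, a sketch of that delayed-replica setup with pymongo (host names are invented; the config field is slaveDelay on the MongoDB versions current at the time, secondaryDelaySecs on 5.0+):

from pymongo import MongoClient

admin = MongoClient("mongodb://primary-host:27017").admin

cfg = admin.command("replSetGetConfig")["config"]
member = next(m for m in cfg["members"] if m["host"] == "delayed-host:27017")

# Hidden, non-electable member that lags the primary by 8 hours:
member.update({"priority": 0, "hidden": True, "slaveDelay": 8 * 3600})

cfg["version"] += 1
admin.command("replSetReconfig", cfg)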
That's my opinion though.
Bear in mind, I've only experimented with MongoDB...
From what I've read, DynamoDB has come a long way in terms of features. It used to be a super-basic key-value store with extremely limited storage and querying capabilities. It has since grown, now supporting bigger document sizes + JSON support and global secondary indices. The gap between what DynamoDB and MongoDB offers in terms of features grows smaller with every month. The new features of DynamoDB are expanded on here.
Many of the MongoDB vs. DynamoDB comparisons are out of date due to the recent addition of DynamoDB features. However, this post offers some other convincing points to choose DynamoDB, namely that it's simple, low maintenance, and often low cost. Another discussion here of database choices was interesting to read, though slightly old.
My takeaway: if you're doing serious database queries or working in languages not supported by DynamoDB, use MongoDB. Otherwise, stick with DynamoDB.
My Data
It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database can actually help me address my needs? It can be anything, from my list or not. I'm just trying to understand which platform actually has code, not just usage patterns, that supports my super specific, well-understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As far as your concerns about the community, I'm not sure what is giving you that impression, but there is quite a large community (see IRC, the mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly, you could solve this in multiple ways. You could start writing a new row when the version changes. Alternatively, you could use composite columns to store the version along with the timestamp/value pair.
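For example, here is one possible layout for the "new row per version" approach, sketched in CQL with the DataStax Python driver (keyspace, table, and column names are made up; the original discussion pre-dates CQL3 and talks in terms of composite columns, for which clustering columns are the modern equivalent):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")

session.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        appliance_id text,
        metric       text,
        calc_version int,
        ts           timestamp,
        value        double,
        PRIMARY KEY ((appliance_id, metric), calc_version, ts)
    ) WITH CLUSTERING ORDER BY (calc_version DESC, ts DESC)
""")

# Only the rows produced by the current version of the calculation,
# newest first:
rows = session.execute(
    "SELECT ts, value FROM samples "
    "WHERE appliance_id = %s AND metric = %s AND calc_version = %s",
    ("boiler-7", "outlet_temp", 3),
)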
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo: every server in the cluster is the same, and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers, for example HDFS, ZooKeeper, and the HBase/Accumulo-specific server types.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience in Cassandra and MongoDB but my experience might add something.
So you're basically doing time-based metrics?
OK, if I understand right, you use the timestamp as a versioning mechanism so that you can query for a certain timestamp; say, to get the latest calculation used, you go by the metric ID (or whatever), sort by ts DESC, and take the first row?
It sounds like a versioned key value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical, too based around how you query, to the point where you can only make one pivot of graph data from the column family (I presume you would want to graph these metrics), which is crazy; hence why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.
MongoDB: well, I love MongoDB and I'm an active member of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is already set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good with a tech you don't like, so stay away from it.
Though I would picture this happening in Mongo much like:
{
    _id: ObjectID(),
    metricId: 'AvailableMessagesInQueue',
    formula: '4+5/10.01',
    result: NaN,
    ts: ISODate()
}
And you query for the latest version of your calculation by:
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
var latest = results.next();
That would output the doc structure you see above. Without knowing more about exactly how you wish to query, and the general server and app scenario, that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
It might be of interest; it seems to support the argument that HBase is a good time-based key-value store.
I have not personally used HBase so do not take anything I say about it seriously....
I hope I have added something, if not you could try narrowing your criteria so we can answer more dedicated questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM Redbooks, but we're trying. The API, configuration, and administration are documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground up, so aggregation, filtering, and non-parametric ARIMA and HW forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in the SE and EE editions. Versions keep track of status, change time, and source changes for the same timestamp, for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. The ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on the CE edition or you need to extend the default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.
Big data = 1TB increasing by 10% every year.
The model is simple: one table with 25 columns.
No joins with other tables.
I'm looking to do simple query filtering on a subset of the 25 columns.
I'd guess a traditional SQL store with indexes on the filtered columns is what's necessary. Hadoop is overkill and won't make sense, as this is for a real-time service. Mongo? A BI engine like Pentaho?
Any recommendations?
It seems that a traditional solution indeed sounds fine, as long as there won't be any significant changes to the really simple model you've described.
NoSQL doesn't sound like the best choice for BI / reporting.
Get good hardware. Spend time on performance tests and build all the required indexes. Implement a proper strategy for uploading new data. Implement table-level partitioning in PostgreSQL according to your needs and performance tests.
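A minimal sketch of the partitioning piece via psycopg2, assuming PostgreSQL's declarative range partitioning (10+); on the versions current at the time, the same idea was done with table inheritance and CHECK constraints. Table and column names are invented:

import psycopg2

conn = psycopg2.connect("dbname=metrics")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id         bigint,
            created_at timestamptz NOT NULL,
            col_a      text,
            col_b      numeric
        ) PARTITION BY RANGE (created_at)
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_2015 PARTITION OF events
        FOR VALUES FROM ('2015-01-01') TO ('2016-01-01')
    """)
    # Index only the columns you actually filter on:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS events_2015_col_a_idx "
        "ON events_2015 (col_a)"
    )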
P.S. If I could have a chance now to switch from ORACLE/DB2, I would definitely go for PostgreSQL.
I'd suggest investigating Infobright here; it's column-based and compressing, so you won't store the full TB, and it has an open-source version, so you can try it out without being called by a bunch of salespeople (but last time I looked, the OSS version was missing some really useful stuff, so you may end up wanting a license). Last time I tried it, it looked to the outside world like MySQL, so it's not hard to integrate. When I last checked, it was single-server-oriented and claims to work with up to 50TB on a single server. I think Infobright can sit behind Pentaho if you decide to move in that direction.
Something Infobright had going for it was that it was pretty close to no-admin: there's no manual indexing or index maintenance.
Sounds like a column store would help. It depends on how you're handling inserts, and whether you ever have to do updates. But as well as Infobright, if you're going commercial, check out Vectorwise; it's quicker and similarly priced.
If you want free/open source, then check out LucidDB. There aren't many docs, but it is very good at what it does!
If you want unbelievable speed, then check out Vectorwise. I believe it's about the same price as Infobright, but much faster.
I'm relatively new to NoSQL databases and I have to evaluate different NoSQL solutions for a monitoring tool.
The situation is the following:
One datum is only about 100 bytes big, but there are really a lot of them. During a day we get about 15 million records... so I'm currently testing with 900 million records (about 15GB as a SQL insert script).
My question is: does CouchDB fit my needs? I need to do range queries (on the date the records were created) and sum up some of the columns according to groups defined by "secondary indexes" stored in the datum.
I know that MapReduce is probably the best solution to calculate that, but is CouchDB's JavaScript able to do this in an acceptable time?
I already tried MongoDB, but its really poor MapReduce did a crappy job... I also read about HBase and Cassandra. But maybe CouchDB is also a good possibility.
I hope I gave you all the needed information... Thank you for your help!
andy
Frankly, at this time, unless you have very good hardware, Apache CouchDB may run into problems. Map/reduce will probably be fine. CouchDB's incremental map/reduce is ideal for your requirements.
As a developer, you will love it! Unfortunately as a sysadmin, you may notice more disk usage and i/o than expected.
I suggest trying it. Being HTTP and JavaScript, it's easy to do a feasibility test. Just remember, the initial view build will take a long time (let's assume, for the sake of argument, that it takes longer than every other competing database). But that time will never be spent again: map/reduce runs only once per document (actually, once per document update).
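For that feasibility test, a rough sketch of such a view using the couchdb-python package (database, field, and view names are invented; the map function is ordinary CouchDB JavaScript stored as a string): the built-in _sum reducer is maintained incrementally, and group_level gives daily roll-ups over a date range.

import couchdb

db = couchdb.Server("http://localhost:5984/")["monitoring"]

design = {
    "_id": "_design/stats",
    "views": {
        "sum_by_day": {
            "map": """
                function (doc) {
                  var d = new Date(doc.created_at);
                  emit([doc.group, d.getFullYear(), d.getMonth() + 1,
                        d.getDate()], doc.value);
                }
            """,
            "reduce": "_sum",
        }
    },
}
db.save(design)

# Daily sums for one group over January 2011:
rows = db.view("stats/sum_by_day",
               startkey=["group-a", 2011, 1, 1],
               endkey=["group-a", 2011, 1, 31, {}],
               group_level=4)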
We have a national application, and the users would like to have accurate business statistics regarding some tables.
We are using Tomcat, Spring WS, and Hibernate on top of that.
We have thought of several solutions:
A plain old query for each user request. The problem is that those tables contain millions of records; every query takes many seconds at least. Solution never used.
The actual solution used: triggers. But they are painful to create and difficult to maintain (no OO, no nice IDE, no real debugging). The only helpful part is the possibility to write JUnit tests at a higher level to verify the expected result. And for each different statistic on a table we have to create another trigger for that table.
Using the Quartz framework to consolidate data every X minutes.
I have learned that databases are not designed for these heavy and complicated queries.
A separate data warehouse optimized for read-only queries would be better (OLAP?).
But I don't have any clue where to start with PostgreSQL. (Is Pentaho the solution, or just a part of it?)
How could we extract data from the production database? Using some extractor?
And when? Every night?
If it is done periodically, how will we manage to maintain near real-time statistics if the data is only dumped into our data warehouse once per day?
"I have learn that databases are NOT DESIGNED for these heavy and complicated queries."
Well you need to unlearn that. A database was designed for just these type of queries. I would blame bad design of the software you are using before I would blame the core technology.
It seems I have been misunderstood.
For those who think that a classic database is designed even for processing real-time statistics with queries over billions of rows: you might need to read articles on the origin of OLAP, and why some people bothered to design products around it, if the answer to performance were just a design question.
"I would blame bad design of the software you are using before I would blame the core technology."
By the way, I'm not using any software (or does pgAdmin count?). I have two basic tables, you can't make it simpler, and the problem comes when you have billions of rows to retrieve for statistics.
For those who think it is just a design problem, I'm glad to hear their clever answer (no triggers, I know that one) to a simple problem:
Imagine you have 2 tables: employees and phones. An employee may have 0 to N phones.
Now let's say that you have 10,000,000 employees and 30,000,000 phones.
Your end users want to know, in real time:
1. the average number of phones per user
2. the average age of users who have more than 3 phones
3. the average number of phones for employees who have been in the company for more than 10 years
You potentially have 100 users who want those real-time statistics at any time.
Of course, no query should take more than 1/4 of a second.
Incrementally summarize the data..?
The frequency depends on your requirements, and in extreme cases you may need more hardware, but this is very unlikely. Roughly (a code sketch follows the steps):
Bulk load new data
Calculate new status [delta] using new data and existing status
Merge/update status
Insert new data into permanent table (if necessary)
NOTIFY wegotsnewdata
Commit
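A hedged sketch of those steps for the employees/phones example, using psycopg2 (table, column, and staging names are invented): the summary table carries running totals, so the "average phones per employee" read becomes a constant-time lookup.

import psycopg2

conn = psycopg2.connect("dbname=hr")
with conn, conn.cursor() as cur:
    # 1. Bulk load new data into phones_staging (COPY, \copy, ...)
    # 2. Calculate the delta from the staged rows
    cur.execute("SELECT count(*) FROM phones_staging")
    new_phones = cur.fetchone()[0]
    # 3. Merge the delta into the summary table
    cur.execute(
        "UPDATE phone_stats SET total_phones = total_phones + %s",
        (new_phones,),
    )
    # 4. Move the staged rows into the permanent table
    cur.execute("INSERT INTO phones SELECT * FROM phones_staging")
    cur.execute("TRUNCATE phones_staging")
    # 5. Tell listeners there is fresh data to pick up
    cur.execute("NOTIFY wegotsnewdata")
# 6. Leaving the "with conn" block commits everything atomically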
StarShip3000 is correct, btw.