I wants to store the visits log of 10 websites, receiving a total of 10M visits a month in DynamoDB. After that I want to create a back-end to monitor and detect fraud operations.
Can DynamoDB handle complex queries like:
- List of visitors with a X bounce rate in a specified Interval
- Popular Destination URI between date/time & date/time
- Ordering & Grouping By
List of visitors with a X bounce rate in a specified Interval
DynamoDb can handle a lot of queries, but requires a bit of planning in your access patterns. You must query by hash key and filter by either the range key or a local secondary index. The query must contain a single comparison operator using the familiar >=, BETWEEN, IN, etc and will sort result as well. If you require a query like SELECT col1 FROM table1 WHERE condition1 > x AND condition2 > y AND condition3 > z, you're not necessarily stuck but you need to plan. These queries can be made, but might require querying sequentially across multiple tables or embedding part of the query logic in the hash key (e.g. hashkey = BOUNCE_RATE:1 or BOUNCE_RATE:2, ...) where :N would be some sort of meaningful tier for bounce rates. In my own experience this is not unusual at all. A caveat in this example is that you will possibly get poor distribution of the hash key across nodes (i.e. you could get hot keys that degrade performance which would possibly defeat the scalability advantage of DynamoDB), but my explanation should simply serve to give you ways to thinking about access patterns.
Assuming bounce rate is some precise decimal value, you could put it in either the range key or in a local secondary index which would then require including the time interval in the hash key. This would require that the time intervals are pre-determined (as the first example would require the bounce rate "tiers" to be pre-determined). You'll need to consider these types of trade-offs.
Finally, you can create multiple tables, each table holding the data of either a single tier of bounce rate or time interval. There are also other basic approaches, but food for thought...
Related
I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don t do any cross sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
1 seems to intuitively be faster for querying. I am using mongoDb 3.4 which has no limit for the number of collections in a db.
2 seems cleaner but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurement,startDate,nextDate. It reduces the number of documents and thus the index size but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date,sensor) pair, that is why I thought 1 might speed up the reads. I currently have about 20,000 collections in my DB and when I query the list of all collections, it takes ages which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collection across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you
Whilst MongoDB has no limit on collections I tried a similar approach to 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes, I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Various different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I m only interested in getting data from a
given sensor on a given day. I don t do any cross sensor queries.
But that's what exactly what Cassandra is good for!
See this article and this one.
Really, in one of our my projects we were stuck with legacy MongoDB and the scenario, similar to yours, with the except of new data amount per day was even lower.
We tried to change data structure, granulate data over multiple MongoDB collections, changed replica set configurations, etc.
But we were still disappointed as data increases, but performance degrades
with the unpredictable load and reading data request affects writing response much.
With Cassandra we had fast writes and data retrieving performance effect was visible with the naked eye. If you need complex data analysis and aggregation, you could always use Spark (Map-reduce) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I guess MongoDB is a wrong choice for this. If required in MongoDB there is no way you can query documents across collections, you will have to write complex mechanism to retrieve data. In my opinion, you should consider elasticsearch. Where you can create indices(Collections) like sensor-data-s1-3-14-2017. Here you could do a wildcard search across indices. (for eg: sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you could get optimal performance and that does not degrade over the period.
Approach #1 is not cool, key to speed up is divide (shard) and rule. What-if number of singal itself reaches 100000.
So place one signal in one collection and shard signals over nodes to speed up read. Multiple collections or signals can be on same node.
How this Will Assist
Usually for signal processing time-span is used like process signal for 3 days, in that case you can parallel read 3 nodes for the signal and do parallel apache spark processing.
Cross-Signal processing: typically most of signal processing algorithms uses same period for 2 or more signals for analysis like cross correlation and as these (2 or more signals) are parallel fetch it'll also be fast and ore-processing of individual signal can be parallelized.
I am developing an app where the user will receive geo based information depending on his position. The server should be able to handle huge number of connected clients >100k.
Now i came up with 4 Approaches on how to handle the users position.
Approach - Without geospatial index:
The app server does just hold a list of connected clients and they're location.
Whenever there is a information available the server does loop over the whole list and checks whether the client is within a given radius.
Doubts: Very expensive
Approach - Handle geospatial index in the app server:
The app server does maintain a R Tree with all connected clients and they're location.
Therefore i was looking at JSI Java Spatial Index
Doubts: It is very expensive to update the geospatial index with JSI
Approach - Let the database "mongoDb" do the geospatial index / calculation:
The app server does only hold a reference to the connected client (connection) and saves the key to that reference together with its location into mondoDb.
When a new information is available the server can query the database to get the keys off all clients nearby.
Pro: I guess mongoDb does have a much better implementation of geospatial indexes than i could ever do in the app server.
Doubts: Clients are traveling around which forces me to update the geospatial index frequently. Can i do that or am i running into a performance problem?
Approach - Own "index" using 2 dimensional array
Today i was thinking about creating a very simple index by using a two dimensional array. While the outer array is for the longitude the inner would be for the latitude.
Lets say 3 longitude / altitude degree would be enough precision.
I could receive a list of users in a given area by
ArrayList userList = data[91][35] //91.2548980712891, 35.60869979858;
// i would also need to get the users in the surrounding arrays 90;35, 92;35 ...
// if i need more precision i could use one more decimal data[912][356]
Pro: I would have fast read and write access without a query to the database
Doubts: Radius is shorter at poles. Ugly hack?
I would be very grateful if someone could point me into "the" right direction.
The index used by MongoDB for geospatial indexing is based on a geohash, which essentially converts a 2 dimensional space into a one-dimensional key, suitable for B-tree indexing. While this is somewhat less efficient than an R-tree index it will be vastly more efficient than your scenario 1. I would also argue that filtering the data at the db level with a spatial query will be more efficient and easier to maintain than creating you own spatial indexing strategy on top.
The main thing to be aware of with MongoDB is that you cannot use a geometry column as a shard key, though you can shard a collection containing a geometry field, using another key. Also, if you wish to do any aggregation queries, (which isn't clear from your question) the geometry field must be the first through the pipeline.
There is also a geohaystack index, which is based on small buckets and optimized for search based on small areas, see, http://docs.mongodb.org/manual/core/geohaystack/, which might be useful in your case.
As far as speed is concerned, insertion and search on a B-Tree index are essentially O(log n), see Wikipedia B-Tree while without an index your search will be O(n), so it will not take very long before the difference in perfomance is enormous between having and not having an index.
If you are concerned about heavy writes slowing things down, you can tune the write concern in MongoDB so that you don't have to wait for a majority of replicas to respond to every write (the default), but at the cost of potentially inconsistent data, if you should lose your master.
Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH then you can perform a query on the RANGE. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the tradeoffs that is at times required for all the performance benefits of a key value store.
Example for this case:
With both keys completely random, then you are out of luck. Rather than set your HASH as a dummy value you could try using a monthly time stamp instead, that way you should always be able to work out pragmatically what the hash should be. You can also then look at setting the range as a combination of both values separated by a hyphen, i.e. timestamp-value, then in the denormalized table, value-timestamp, that way you should be able to use the comparison operators with no performance hit.
I'm building an application that stores lots of data per user (possibly in gigabytes).
Something like a request log, so lets say you have the following fields for every record:
customer_id
date
hostname
environment
pid
ip
user_agent
account_id
user_id
module
action
id
response code
response time (range)
and possibly some more.
The good thing is that the usage will be mostly write only, but when there are reads
I'd like to be able to answer then quickly in near real time.
Another prediction about the usage pattern is that most of the time people will be looking at the most recent data,
and infrequently query for the past, aggregate etc, so my guess is that the working set will be much smaller then
the whole database, i.e. recent data for most users and ranges of history for some users that are doing analytics right now.
for the later case I suppose its ok for first query to be slower until it gets the range into memory.
But the problem is that Im not quite sure how to effectively index the data.
The start of the index is clear, its customer_id and date. but the rest can be
used in any combination and I can't predict the most common ones, at least not with any degree of certainty.
We are currently prototyping this with mongo. Is there a way to do it in mongo (storage/cpu/cost) effectively?
The only thing that comes to mind is to try to predict a couple of frequent queries and index them and just massively shard the data
and ensure that each customer's data is spread evenly over the shards to allow fast table scan over just the 'customer, date' index for the rest
of the queries.
P.S. I'm also open to suggestions about db alternatives.
with this limited number of fields, you could potentially just have an index on each of them, or perhaps in combination with customer_id. MongoDB is clever enough to pick the fastest index for each case then. If you can fit your whole data set in memory (a few GB is not a lot of data!), then this all really doesn't matter.
You're saying you have a GB per user, but that still means you can have an index on the fields as there are only about a dozen. And with that much data, you want sharding anyway at some point soon.
cheers,
Derick
I think, your requirements don't really mix well together. You can't have lots of data and instantaneous ad-hoc queries.
If you use a lot of indexes, then your writes will be slow, and you'll need much more RAM.
May I suggest this:
Keep your index on customer id and date to serve recent data to users and relax your requirements to either real-timeliness or accuracy of aggregate queries.
If you sacrifice accuracy, you will be firing map-reduce jobs every once in a while to precompute queries. Users then may see slightly stale data (or may not, it's historical immutable data, after all).
If you sacrifice speed, then you'll run map-reduce each time (right now it's the only sane way of calculating aggregates in a mongodb cluster).
Hope this helps :)
I have an HBASE table of about 150k rows, each containing 3700 columns.
I need to select multiple rows at a time, and aggregate the results back, something like:
row[1][column1] + row[2][column1] ... + row[n][column1]
row[1][column2] + row[2][column2] ... + row[n][column2]
...
row[1][columnn] + row[2][columnn] ... + row[n][columnn]
Which I can do using a scanner, the issue is, I believe, that the scanner is like a cursor, and is not doing the work distributed over multiple machines at the same time, but rather getting data from one region, then hopping to another region to get the next set of data, and so on where my results span multiple regions.
Is there a way to scan in a distributed fashion (an option, or creating multiple scanners for each region's worth of data [This might be a can of worms in itself]) or is this something that must be done in a map/reduce job. If it's a M/R job, will it be "fast" enough for real time queries? If not, are there some good alternatives to doing these types of aggregations in realtime with a NOSQL type database?
What I would do in such cases is, have another table where I would have the aggregation summaries. That is When row[m] is inserted into table 1 in table 2 against (column 1) (which is the row key of table 2) I would save its summation or other aggregational results, be it average, standard deviation, max, min etc.
Another approach would be to index them into a search tool such as Lucene, Solr, Elastic Search etc. and run aggregational searches there. Here are some examples in Solr.
Finally, Scan spanning across multiple region or M/R jobs is not designed for real time queries (unless the clusters designed in such way, i.e. oversized to data requirements).
Hope it helps.