Selecting top N records per group in DynamoDB - nosql

Is NoSQL in general, and DynamoDB in particular, well suited to performing greatest-n-per-group type queries, as compared to MySQL?

DynamoDB supports only two keys per table and can only be queried efficiently on them:
hash key
range key (optional)
Using DynamoDB to find the biggest values of an arbitrary (non-key) attribute is not a good idea at all: it implies scanning the whole dataset, which will cost you a lot of money.
Nonetheless, if your data is properly modeled, the Query operation may be used to find the biggest range_key for a given hash_key.
Here is how to proceed (see the sketch after these steps):
Set the hash_key
Set no filter for the range_key
Limit the result count to 1
Scan the index backward
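For illustration only, here is a minimal boto3 sketch of that query; the table name, key names, and value are assumptions, not part of the question:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: hash key "user_id", range key "score".
table = boto3.resource("dynamodb").Table("Ratings")

response = table.query(
    KeyConditionExpression=Key("user_id").eq("user-42"),  # hash key only, no range filter
    ScanIndexForward=False,  # walk the range key in descending order
    Limit=1,                 # keep just the first item, i.e. the biggest range key
)
top_item = response["Items"][0] if response["Items"] else None

Because the index is read backward with a limit of 1, only a single item is consumed, so the read cost stays minimal.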

Related

mongodb - Multiple Compound Indexes involving a common field

We have a collection with millions of documents. This data is rendered in the UI for stats purposes, so time to render is of key importance.
The queries to render the data involve the below fields:
field_a and field_t
field_b and field_t
field_c and field_t
As we are querying millions of documents, we want to use compound indexes to speed up the queries.
To do so, we can simply add 3 different compound indexes as below:
db.mycollection.createIndex( { "field_a": 1, "field_t": 1 } )
db.mycollection.createIndex( { "field_b": 1, "field_t": 1 } )
db.mycollection.createIndex( { "field_c": 1, "field_t": 1 } )
The ESR rule is respected while creating the indexes, as field_a, field_b and field_c are equality checks and field_t is a range check.
Please note that field_t is common in all the 3 indexes.
Instead of creating 3 different indexes, is there a better approach to this?
Does MongoDB provide a more efficient way to handle this scenario where the same field is used in multiple compound indexes?
Better or more efficient in what regard?
Having the three indexes that you mentioned is the most efficient approach in terms of query performance. They will allow the database to process only the data that is relevant for each query and nothing else. Any other approach would reduce read efficiency (and speed) which may not be a good tradeoff.
Most databases, MongoDB included, typically use a single index when executing a query. This is mostly a consequence of how indexes work. Indexes typically use a B-tree-like data structure, which is an ordered set of information. When the ESR rule is followed (equality conditions placed before range conditions), all of the information for a specific query is contained within a single bounded subtree of the index, which can be traversed directly. The database loses that ability when the index is not structured this way (for example, when the range key comes first).
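As an illustration (not part of the answer), a query shaped for the { field_a: 1, field_t: 1 } index can be checked with explain(); the database name and values below are made up:

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycollection"]  # "mydb" is a placeholder

# Equality on field_a plus a range on field_t matches { field_a: 1, field_t: 1 }.
plan = coll.find(
    {"field_a": "some-value", "field_t": {"$gte": 100, "$lt": 200}}
).explain()
print(plan["queryPlanner"]["winningPlan"])  # expect an IXSCAN on field_a_1_field_t_1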
Other potential approaches using single field indexes would be things like:
Index intersection - where you create (in this case) 4 single field indexes and have the database use 2 for each query. MongoDB typically does not choose this approach very often as it results in scanning larger portions of the index when compared to the compound index approach above.
Using 1 single field index for each query - the database would end up retrieving documents to filter on the other field which could be quite inefficient depending on the selectivity of the other field.
While these may reduce the overall size of the indexes, they increase the cost (and decrease the efficiency) of executing the queries. Depending on what you are optimizing for, the approach you've outlined would be considered a best practice in terms of query efficiency.

What type of index is most suitable for a low-selectivity column

I have a table with around 60M records, and it will potentially grow to ~500M soon (then it will grow slowly). The table has a column, say category. The total number of categories is around 20K and grows very slowly and only occasionally. Records are not distributed evenly among categories: some categories cover 5% of all records, while others are represented by only a very small proportion of records.
I have a number of queries that work with only one or several categories (using = or IN/ANY conditions), and I want to optimize the performance of these queries.
Taking into account the low selectivity of the data in the column, which type of Postgres index will be more beneficial: HASH or B-TREE?
Are there any other ways to optimize performance of these queries?
I can only give a generalized answer to this broad question.
Use B-tree indexes, not hash indexes.
If you have several conditions that are not very selective, create an index on each of the columns; they can then be combined with a bitmap index scan.
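As a hedged sketch (the table name, connection string, and category IDs are placeholders), a plain B-tree index on the category column and an EXPLAIN of an ANY() query might look like this with psycopg2:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# A plain B-tree index on the low-selectivity column.
cur.execute("CREATE INDEX IF NOT EXISTS idx_mytable_category ON mytable (category)")

# EXPLAIN a query that filters on several categories; the planner can combine
# such conditions with a bitmap index scan when that is cheaper.
cur.execute("EXPLAIN SELECT * FROM mytable WHERE category = ANY(%s)", ([42, 97, 1013],))
for (line,) in cur.fetchall():
    print(line)

conn.commit()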
In general, a column that is not very selective is not a good candidate for an index. Indexes are not free: they need to be maintained, and at query time Postgres will in most cases still have to visit the table for each row the index search matches (the exception is index-only scans on covering indexes).
With that said, I'm not sure about your selectivity analysis. If the highest percentage you'll filter down to in the worst case is 5%, and most categories are far lower than that, then I'd say you have a fairly selective column.
As for which index type to use, b-tree versus hash, I generally go with a b-tree index as my standard unless there is a specific need to deviate.
Hash indexes are faster to query than B-tree indexes, but they can only be used for equality lookups, not range lookups. Hash indexes are not supported by all RDBMSs and, as a result, are less well understood in the community, which can hinder support.

What data structure does Google Firebase Firestore use for its default index

I'm curious if anyone knows, or can guess, the data structure Google's Firestore is using to index arbitrary NoSQL documents by every field. I'm looking to build something similar, making it as efficient as possible.
Some info about how their default index works:
all fields are indexed by default, but this only works for equality searches, not range searches (<, >)
any range searches require extra indexes
Source: https://firebase.google.com/docs/firestore/query-data/indexing
It's unlikely to be a standard B-tree index per field, because range searches would then work without requiring another index. Plus, if you added a new field (easy with document storage), it would take time to build an index on collections with billions of items.
One theory: one big index per database. Index "field_name:value" for every field in every document. The index maps to a sorted list of document IDs which contain that field/value pair. It would be able to do equality searches (by merging the sorted doc IDs for every equality requirement), but not a range search. Basically an inverted index.
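A toy Python sketch of that inverted-index idea, purely illustrative and not Firestore's actual implementation:

from bisect import insort

index = {}  # maps "field_name:value" -> sorted list of doc IDs

def add_document(doc_id, doc):
    for field, value in doc.items():
        insort(index.setdefault(f"{field}:{value}", []), doc_id)

def equality_search(**filters):
    # Intersect the sorted posting lists for every field=value requirement.
    postings = [index.get(f"{f}:{v}", []) for f, v in filters.items()]
    if not postings:
        return []
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return sorted(result)

add_document(1, {"author": "bob", "genre": "scifi"})
add_document(2, {"author": "bob", "genre": "fantasy"})
print(equality_search(author="bob", genre="scifi"))  # [1]

As noted, this handles equality by merging posting lists but offers no efficient way to answer range predicates.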
Any suggestions for a better way of implementing a pattern like this?
Clarification: single-field indexes do support range/inequality queries; composite indexes are about combining filters on multiple fields in a single query. See this page for more on index types:
https://firebase.google.com/docs/firestore/query-data/index-overview
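For example, a single-field range query works against Firestore's automatic single-field index without any extra setup; a minimal sketch with the Python client (the collection and field names are made up):

from google.cloud import firestore

db = firestore.Client()
docs = (
    db.collection("books")                           # hypothetical collection
      .where("rating", ">=", 4)                      # range filter on one field
      .order_by("rating", direction=firestore.Query.DESCENDING)
      .limit(10)
      .stream()
)
for doc in docs:
    print(doc.id, doc.to_dict())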
Each field index is stored in its own key range, with contiguous regions assigned to a server, and with compute and storage scaling independently under the covers. Cloud Firestore handles indexes fairly similarly to Cloud Datastore (but not 100% the same).
You can see a basic overview in my Cloud Next conference session from last year.

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below:
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values through which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested based on a date range, documents matching a list of IDs, the top 10 new documents, and records that are new since the previous request.
Here is what I have done so far:
1) Checked if I can use Redis. It is limited to a few data types and also cannot use multiple where conditions to filter a hash. Indexing is based on date and lots of other fields, and I am unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB. It is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has a limitation in that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will occasionally be slow).
Cassandra also doesn't have the search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database suited to search operations.
I suggest you look at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website (a query sketch follows the list):
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)
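To make that concrete, here is a hedged sketch of the question's query patterns (date range, list of IDs, ten newest documents) using the Python Elasticsearch client (8.x-style call); the index name, field names, and URL are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
resp = es.search(
    index="xml-documents",  # hypothetical index holding the extracted filter fields
    query={
        "bool": {
            "filter": [
                {"terms": {"doc_id": ["a1", "b2", "c3"]}},                      # documents matching a list of IDs
                {"range": {"created_at": {"gte": "now-3d/d", "lte": "now"}}},   # the 3-day retention window
            ]
        }
    },
    sort=[{"created_at": {"order": "desc"}}],  # newest first
    size=10,                                   # top 10 new documents
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["created_at"])

Only the filter fields need to be indexed here; the XML body itself can be stored elsewhere and fetched by ID, mirroring the separate mapping table described in the question.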

Cassandra DB Design

I come from an RDBMS background and am designing an app with Cassandra as the backend, and I am unsure of the validity and scalability of my design.
I am working on some sort of rating/feedback app of books/movies/etc. Since Cassandra has the concept of flexible column families (sparse structure), I thought of using the following schema:
user-id (row key): book-id/movie-id (dynamic column name) - rating (column value)
If I do it this way, I would end up having millions of columns (which would have been rows in an RDBMS), though not all of them necessarily associated with a single row key. For instance:
user1: {book1:Rating-Ok; book1023:good; book982821:good}
user2: {book75:Ok;book1023:good;book44511:Awesome}
Since all column families are stored in a single file, I am not sure if this is a scalable design (or a design at all!). Furthermore there might be queries like "pick all 'good' reviews of 'book125'".
What approach should I use?
This design is perfectly scalable. Cassandra stores data in sparse form, so empty cells don't consume disk space.
The drawback is that Cassandra isn't very good at indexing by value. There are secondary indexes, but they should be used only to index a column or two, not each of millions of columns.
There are two options to address this issue:
Materialised views (described, for example, here: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/). This allows you to build a set of predefined queries, possibly quite complex ones.
Ad-hoc querying is possible with some sort of map/reduce job that effectively iterates over the whole dataset. This might sound scary, but it's still pretty fast: Cassandra stores all data in SSTables, and the iteration can be implemented as a sequential scan of the data files.
Start from a desired set of queries and structure your column families to support those views. Especially with so few fields involved, each CF can act cheaply as its own indexed view of your data. During a fetch, the key will partition the data ultimately to one specific Cassandra node that can rapidly stream a set of wide rows to your app server in a pre-determined order. This plays to one of Cassandra's strengths, as the fragmentation of that read on physical media (when not cached) is extremely low compared to bouncing around the various tracks and sectors on an indexed search of an RDBMS table.
One useful approach when available is to select your key to segment the data such that a full scan of all columns in that segment is a reasonable proposition, and a good rough fit for your query. Then, you filter what you don't need, even if that filtering is performed in your client (app server). All reviews for a movie is a good example. Even if you filter the positive reviews or provide only recent reviews or a summary, you might still reasonably fetch all rows for that key and then toss what you don't need.
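To make the query-first modeling idea concrete, here is a sketch (not the answer author's code) using the DataStax Python driver; the keyspace, table, and column names are invented:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ratings_ks")  # hypothetical keyspace

# One table per query: ratings by user, and reviews by item (and rating).
session.execute("""
    CREATE TABLE IF NOT EXISTS ratings_by_user (
        user_id text, item_id text, rating text,
        PRIMARY KEY (user_id, item_id))""")
session.execute("""
    CREATE TABLE IF NOT EXISTS reviews_by_item (
        item_id text, rating text, user_id text,
        PRIMARY KEY (item_id, rating, user_id))""")

# "pick all 'good' reviews of 'book125'" becomes a single-partition slice:
rows = session.execute(
    "SELECT user_id FROM reviews_by_item WHERE item_id = %s AND rating = %s",
    ("book125", "good"))
for row in rows:
    print(row.user_id)

Writes would go to both tables, trading extra storage for reads that each hit exactly one partition.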
Another option: if you can figure out how to partition the data (by time, by category), playOrm offers a solution for doing S-SQL within a partition, which is very fast. It is very much like an RDBMS EXCEPT that you partition the data to stay scalable and can have as many partitions as you want. Partitions can contain millions of rows (though I would not exceed 10 million rows in a partition).
later,
Dean