How to get a distinct count on DynamoDB for a billion objects? - key-value

What is the most efficient way to get the number of distinct objects stored in my DynamoDB table?
For example, my objects have ten properties and I want a distinct count based on 3 of those properties.

If you need counters, it's better to use AtomicCounters (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithDDItems.html). In your case, DynamoDB doesn't support keys composed of 3 attributes out of the box unless you concatenate them, so one option would be to create a redundant table whose key is the concatenation of those 3 attributes, and each time you add or delete one of those objects (plain updates don't actually matter here), also update the AtomicCounter.
Then you just query the counter, avoiding scans. So you trade space complexity for speed of retrieving the data.
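A minimal boto3 sketch of that idea, assuming a hypothetical "distinct-combos" table keyed by the concatenation of the three attributes and a hypothetical "counters" table holding the atomic counter (all table, key, and property names are placeholders, not part of the original question):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
combos = dynamodb.Table("distinct-combos")   # hypothetical: one item per distinct combination
counters = dynamodb.Table("counters")        # hypothetical: holds the atomic counter item

def register_object(prop_a, prop_b, prop_c):
    """Call this whenever an object is written; bumps the counter only for new combinations."""
    combo_key = f"{prop_a}#{prop_b}#{prop_c}"
    try:
        # Conditional put: succeeds only the first time this combination is seen.
        combos.put_item(
            Item={"combo_key": combo_key},
            ConditionExpression="attribute_not_exists(combo_key)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # combination already counted
        raise
    # New combination: atomically increment the distinct counter.
    counters.update_item(
        Key={"counter_name": "distinct_objects"},
        UpdateExpression="ADD distinct_count :one",
        ExpressionAttributeValues={":one": 1},
    )

def get_distinct_count():
    """Read the counter instead of scanning the main table."""
    item = counters.get_item(Key={"counter_name": "distinct_objects"}).get("Item", {})
    return int(item.get("distinct_count", 0))
```

Deletes would need the mirror-image step (remove the combo item and decrement the counter), as noted above; updates that don't touch the three attributes don't affect the count.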

Perform a Scan with the appropriate ScanFilter (in this case, that the three properties are not_null), and use withCount(true) to return only the number of matching records instead of the records themselves.
See the documentation for some example code.
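For reference, a rough boto3 equivalent of the count-only Scan (the Java SDK's withCount(true) corresponds to Select=COUNT in the low-level API); note that this counts matching items rather than distinct combinations, and it still has to scan the whole table page by page. Table and attribute names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def count_matching_items(table_name="my-table"):
    """Scan with Select=COUNT, paging until the whole table has been examined."""
    total = 0
    kwargs = {
        "TableName": table_name,
        "Select": "COUNT",
        # Only count items where the three properties are present (names are made up).
        "FilterExpression": (
            "attribute_exists(prop_a) AND attribute_exists(prop_b) AND attribute_exists(prop_c)"
        ),
    }
    while True:
        resp = dynamodb.scan(**kwargs)
        total += resp["Count"]
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```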

Related

Redshift Composite Sortkey - how many columns should we use?

I'm building several very large data tables on Amazon Redshift that will hold data covering several frequently queried properties along with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It should definitely be compound - every query will filter first on date and network - but beyond that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sort key that includes all the key fields in the table? What are the downsides of having a long sort key?
VACUUM will also take longer if you have several sort keys.
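As a sketch of the trade-off being discussed, here is one way to create such a table with DISTSTYLE EVEN and a short compound sort key leading with date and network, using psycopg2 (which also works against Redshift); the table and column names are made up for the example:

```python
import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

ddl = """
CREATE TABLE metrics_daily (
    event_date   DATE        NOT NULL,
    network      VARCHAR(64) NOT NULL,
    campaign_id  BIGINT,
    country      VARCHAR(2),
    impressions  BIGINT,
    clicks       BIGINT
)
DISTSTYLE EVEN
COMPOUND SORTKEY (event_date, network, campaign_id, country);
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
```

The idea is to keep the sort key to the handful of columns every query actually filters on; as noted above, VACUUM gets more expensive as the sort key grows.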

How to create a Table-Like Reliable Collection

What is the best way to create a table-like reliable collection? Can we roll our own?
I am looking for something to store simple lists or bags for indexes and to track keys and other simple details since the cost of enumerating multi-partition dictionaries is so high. Preferably sequential rather than random access.
The obvious options:
IDictionary<Guid, List> has concurrency issues and poor performance
Enumerate a queue, but I doubt it will be better than a dictionary
Use an external data store
None of these seem particularly good.
The partitioning is actually there to gain performance. The trick is to shard your data in such a way that cross-partition queries aren't needed. You can also create multiple dictionaries holding different aggregates of the same data (use transactions).
Read more in the chapter 'plan for partitioning' here.
Of course you can roll your own reliable collection. After all, a reliable collection is just an in-memory data structure backed by an Azure Storage object. If you want a reliable list of strings, you can implement IList<string>, and in the various methods (add, remove, getEnumerator, etc.) insert code to track and persist the data structure.
Depending on your content, it can be a table (if you can generate a good partition/row key), or just a blob (and you serialize/deserialize the content each time, or at checkpoints, or... whatever your policy is!)
I did not get why IReliableDictionary<K, V> is not good for you. Do you need to store key-value pairs, and you do not want the keys distributed across partitions, for performance reasons? (Because a "getAll" would span machines?)
Or do you just need a list of keys (like colors, or what you would keep in a HashSet)?
In any case, depending on the size of the data, you can partition it differently, using something like IReliableDictionary with an int key, where the int can be a single value (like "42", so you'll have one partition) or one of a few (hash the key and then mod (%) by the number of partitions you want), and you will get a whole bunch of keys (from every key down to N sections of keys) at once.
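Reliable Collections are a .NET feature, so treat this as a language-agnostic sketch of the "hash and then mod" bucketing described above (the partition count and keys are arbitrary examples):

```python
from zlib import crc32

NUM_PARTITIONS = 8  # set to 1 to collapse everything into a single partition (the "42" case)

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition id by hashing it and taking the modulus.
    crc32 is used because Python's built-in hash() is randomized per process."""
    return crc32(key.encode("utf-8")) % num_partitions

# Group keys by the partition that owns them, so a "get all keys" only has to
# visit the partitions it needs instead of fanning out to every machine.
keys = ["red", "green", "blue", "yellow"]
by_partition = {}
for k in keys:
    by_partition.setdefault(partition_for(k), []).append(k)
print(by_partition)
```

The point is that the owning partition can be computed rather than searched for, which keeps lookups cheap while still letting you choose how widely the keys are spread.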

Sorting Cassandra using individual components of Composite Keys

I want to store a list of users in a Cassandra Column Family (wide rows).
The columns in the CF will have Composite Keys of pattern id:updated_time:name:score
After inserting all the users, I need to query users in a different sorted order each time.
For example, if I specify updated_time, I should be able to fetch the 10 most recently updated users.
And if I specify score, I should be able to fetch the top 10 users by score.
Does Cassandra support this?
Kindly help me in this regard...
"I need to query users in a different sorted order each time... Does Cassandra support this?"
It does not. Unlike an RDBMS, you cannot make arbitrary queries and expect reasonable performance. Instead you must design your data model so that the queries you anticipate will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.
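A minimal sketch of that denormalisation with the Python cassandra-driver, using a single placeholder bucket as the partition key and made-up keyspace/table names; one table per desired ordering, and writes go to both:

```python
from cassandra.cluster import Cluster

# Contact point and keyspace are placeholders.
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")

# One table per query: the clustering columns decide the sort order on disk.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_updated (
        bucket text, updated_time timestamp, id uuid, name text, score int,
        PRIMARY KEY (bucket, updated_time, id)
    ) WITH CLUSTERING ORDER BY (updated_time DESC, id ASC)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_score (
        bucket text, score int, id uuid, name text, updated_time timestamp,
        PRIMARY KEY (bucket, score, id)
    ) WITH CLUSTERING ORDER BY (score DESC, id ASC)
""")

# Reads pick whichever table already stores the rows in the order needed.
recent = session.execute(
    "SELECT id, name FROM users_by_updated WHERE bucket = 'all' LIMIT 10")
top = session.execute(
    "SELECT id, name, score FROM users_by_score WHERE bucket = 'all' LIMIT 10")
```

A single 'all' bucket keeps every user in one partition, which is fine for a sketch but would normally be split into several buckets for a large user base.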

The most performant way to query against three collections, sort them and limit?

I need to query against 3 different collections.
Then I need to sort the results (each based on a different field and order, but always a DateTime value), and finally limit the number of results (10 in my case).
Currently I'm just doing three separate queries, limiting each to 10, manually sorting them by their DateTime values, and then finally limiting to 10 myself.
Is there a better way to do this?
Since MongoDB is not a relational database where you can join multiple collections within one query, no. I'm not even sure you could do that kind of sorting (treating each field as equal in order precedence) in a relational DBMS.
What you're doing already sounds quite good. You could possibly improve the sorting of these 3 results by aborting early when iterating over one or more collections, once no further element can make it into the overall top 10. You could also modify your queries so the other two collections only return documents that could still beat the last (10th) document of the first queried collection. But maybe you did this already...
As for performance, you may consider adding indexes on the datetime fields used in your queries so those fields are kept presorted.
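A small pymongo sketch of that approach, with hypothetical collection and field names, assuming each collection is read newest-first; each collection contributes at most 10 candidates because an 11th can never reach the overall top 10:

```python
from heapq import nlargest
from pymongo import DESCENDING, MongoClient

# Database, collection, and field names are placeholders.
db = MongoClient()["mydb"]

def top_ten_overall():
    candidates = []
    for coll_name, date_field in [("posts", "published_at"),
                                  ("comments", "created_at"),
                                  ("likes", "liked_at")]:
        # Sorted on the collection's own DateTime field, newest first, capped at 10.
        cursor = db[coll_name].find().sort(date_field, DESCENDING).limit(10)
        candidates.extend((doc[date_field], coll_name, doc) for doc in cursor)
    # Merge the (at most 30) candidates and keep the 10 most recent overall.
    return [doc for _, _, doc in nlargest(10, candidates, key=lambda t: t[0])]
```

Indexes on the three date fields (as suggested above) let each of those sorted, limited queries be served directly from the index.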

DynamoDB: Querying of non-key values with comparisons

Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH key before you can perform a query on the RANGE key. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the tradeoffs that is at times required for all the performance benefits of a key-value store.
Example for this case: with both keys completely random, you are out of luck. Rather than setting your HASH to a dummy value, you could try using a monthly timestamp instead; that way you can always work out programmatically what the hash should be. You can then set the range as a combination of both values separated by a hyphen, i.e. timestamp-value, and in the denormalized table, value-timestamp; that way you should be able to use the comparison operators with no performance hit.
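A hedged boto3 sketch of that layout, assuming non-negative values that can be zero-padded so that string comparison matches numeric order; the table, key, and attribute names are all hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
# Hypothetical tables: "measurements" (month HASH, ts_value RANGE)
# and its mirror "measurements_by_value" (month HASH, value_ts RANGE).
by_time = dynamodb.Table("measurements")
by_value = dynamodb.Table("measurements_by_value")

def put_measurement(ts, value):
    """Write to both tables; the range keys are timestamp-value and value-timestamp."""
    month = ts[:7]                 # e.g. "2015-07" from an ISO timestamp
    padded = f"{value:06.1f}"      # e.g. 23.5 -> "0023.5" so strings sort numerically
    by_time.put_item(Item={"month": month, "ts_value": f"{ts}-{padded}", "value": value})
    by_value.put_item(Item={"month": month, "value_ts": f"{padded}-{ts}", "ts": ts})

def timestamps_for_value_range(month, low, high):
    """Query the mirrored table with BETWEEN on the value-first range key."""
    resp = by_value.query(
        KeyConditionExpression=Key("month").eq(month)
        & Key("value_ts").between(f"{low:06.1f}", f"{high:06.1f}~"),
    )
    return [item["ts"] for item in resp["Items"]]
```

The trailing "~" on the upper bound sorts after the hyphen, so every timestamp suffix for the highest value stays inside the BETWEEN range.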