How to "EXPIRE" the "HSET" child key in redis? - hash

I need to expire all keys in redis hash, which are older than 1 month.

This is not possible, for the sake of keeping Redis simple.
Quoth Antirez, creator of Redis:
Hi, it is not possible, either use a different top-level key for that
specific field, or store along with the filed another field with an
expire time, fetch both, and let the application understand if it is
still valid or not based on current time.

Redis does not support a TTL on anything inside a hash; only the top-level key can have one, and that expires the whole hash. If you are using a sharded cluster, there is another approach you could use. It might not be useful in all scenarios, and its performance characteristics might differ from what you expect. Still worth mentioning:
When having a hash, the structure basically looks like:
hash_top_key
- child_key_1 -> some_value
- child_key_2 -> some_value
...
- child_key_n -> some_value
Since we want to add TTL to the child keys, we can move them to top keys. The main point is that the key now should be a combination of hash_top_key and child key:
{hash_top_key}child_key_1 -> some_value
{hash_top_key}child_key_2 -> some_value
...
{hash_top_key}child_key_n -> some_value
We are using the {} notation on purpose. This allows all those keys to fall in the same hash slot. You can read more about it here: https://redis.io/topics/cluster-tutorial
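To illustrate why the {} notation keeps those keys together, here is a minimal Python sketch of how a cluster key slot is derived: CRC16 (XMODEM variant) of the hash tag, modulo 16384. The function names hash_tag and key_slot are my own, not part of any Redis client; on a real cluster, CLUSTER KEYSLOT is the authoritative answer.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that gets hashed, honouring {...} hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag counts; an empty "{}" means the whole key is hashed.
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """CRC16 (polynomial 0x1021, initial value 0) of the hashed part, modulo 16384."""
    return binascii.crc_hqx(hash_tag(key.encode("utf-8")), 0) % SLOT_COUNT


# Both child keys carry the same tag, so they map to the same slot.
same_slot = key_slot("{hash_top_key}child_key_1") == key_slot("{hash_top_key}child_key_2")
```

Only the text between the first { and the next } is hashed, which is why every {hash_top_key}child_key_n lands in the same slot as the plain key hash_top_key would.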
Now if we want to do the same operation of hashes, we could do:
HDEL hash_top_key child_key_1 => DEL {hash_top_key}child_key_1
HGET hash_top_key child_key_1 => GET {hash_top_key}child_key_1
HSET hash_top_key child_key_1 some_value => SET {hash_top_key}child_key_1 some_value [some_TTL]
HGETALL hash_top_key =>
keyslot = CLUSTER KEYSLOT {hash_top_key}
keys = CLUSTER GETKEYSINSLOT keyslot n
MGET keys
The interesting one here is HGETALL. First we get the hash slot for all our children keys. Then we get the keys for that particular hash slot and finally we retrieve the values. We need to be careful here since there could be more than n keys for that hash slot and also there could be keys that we are not interested in but they have the same hash slot. We could actually write a Lua script to do those steps in the server by executing an EVAL or EVALSHA command. Again, you need to take into consideration the performance of this approach for your particular scenario.
Some more references:
https://redis.io/commands/cluster-keyslot
https://redis.io/commands/cluster-getkeysinslot
https://redis.io/commands/eval

This is possible in KeyDB, which is a fork of Redis. Because it's a fork, it's fully compatible with Redis and works as a drop-in replacement.
Just use the EXPIREMEMBER command. It works with sets, hashes, and sorted sets.
EXPIREMEMBER keyname subkey [time]
You can also use TTL and PTTL to see the expiration
TTL keyname subkey
More documentation is available here: https://docs.keydb.dev/docs/commands/#expiremember

You can use Sorted Set in redis to get a TTL container with timestamp as score.
For example, whenever you insert an event string into the set, you can set its score to the event time.
Thus you can get data of any time window by calling
zrangebyscore "your set name" min-time max-time
Moreover, we can expire old events by calling zremrangebyscore "your set name" min-time max-time to remove them.
The only drawback here is that you have to do housekeeping from an outside process to maintain the size of the set.
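A minimal Python sketch of that sorted-set pattern, assuming a redis-py style client object that exposes zadd, zrangebyscore and zremrangebyscore. The helper names (add_event, events_between, purge_older_than) are made up for illustration, and the purge function is what the outside housekeeping process would call periodically.

```python
import time


def add_event(client, set_name, event, event_time=None):
    """Insert an event, using its timestamp as the sorted-set score."""
    score = time.time() if event_time is None else event_time
    client.zadd(set_name, {event: score})


def events_between(client, set_name, min_time, max_time):
    """Return the events whose score lies in [min_time, max_time]."""
    return client.zrangebyscore(set_name, min_time, max_time)


def purge_older_than(client, set_name, max_age_seconds, now=None):
    """Housekeeping step: remove events older than max_age_seconds.

    Returns the number of removed members.
    """
    now = time.time() if now is None else now
    return client.zremrangebyscore(set_name, "-inf", now - max_age_seconds)
```

For the "older than 1 month" case from the question, the housekeeping job would call purge_older_than(client, "your set name", 30 * 24 * 3600) on a schedule.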

Elon Musk will soon send people to the moon and we still cannot expire fields on redis :(
Anyway, the solution I've come up with is this:
Let's say I want to expire data every 3 minutes.
I hold the data in 3 fields: 0, 1 and 2.
Then I take the current time in minutes modulo 3.
If the result is 0, for example, I use only fields 1 and 2, and I delete field 0.
When it changes to 1, I use fields 2 and 0, and I delete field 1.
I'm not using this myself and I haven't tested it; I'm just letting you know it's possible.

There is a Redisson java framework which implements hash Map object with entry TTL support. It uses hmap and zset Redis objects under the hood. Usage example:
RMapCache<Integer, String> map = redisson.getMapCache("map");
map.put(1, "some value", 30, TimeUnit.DAYS); // this entry expires in 30 days
This approach is quite useful.

We had the same problem discussed here.
We have a Redis hash, a key to hash entries (name/value pairs), and we needed to hold individual expiration times on each hash entry.
We implemented this by adding n bytes of prefix data containing encoded expiration information when we write the hash entry values. We also set the key to expire at the time contained in the value being written.
Then, on read, we decode the prefix and check for expiration. This is additional overhead; however, the reads are still O(n), and the entire key will expire when the last hash entry has expired.
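To make the prefix idea concrete, here is a small Python sketch. The layout (an 8-byte big-endian expiry time in milliseconds) and the names encode_value / decode_value are assumptions chosen for illustration; the answer above does not say which encoding was actually used.

```python
import struct
import time

# Assumed prefix layout: 8-byte big-endian expiry, milliseconds since the epoch.
_PREFIX = struct.Struct(">Q")


def encode_value(payload, ttl_seconds, now=None):
    """Prepend the expiry time to the payload before writing it with HSET."""
    now = time.time() if now is None else now
    expires_at_ms = int((now + ttl_seconds) * 1000)
    return _PREFIX.pack(expires_at_ms) + payload


def decode_value(stored, now=None):
    """Strip the prefix after HGET; return None if the entry has expired."""
    now = time.time() if now is None else now
    (expires_at_ms,) = _PREFIX.unpack_from(stored)
    if now * 1000 >= expires_at_ms:
        return None
    return stored[_PREFIX.size:]
```

A caller would write encode_value(b"...", ttl) as the hash field value and treat a None from decode_value as a miss (optionally issuing an HDEL for that field).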

Regarding a NodeJS implementation, I have added a custom expiryTime field in the object I save in the HASH. Then after a specific period time, I clear the expired HASH entries by using the following code:
client.hgetall(HASH_NAME, function(err, reply) {
  if (reply) {
    Object.keys(reply).forEach(key => {
      if (reply[key] && JSON.parse(reply[key]).expiryTime < (new Date).getTime()) {
        client.hdel(HASH_NAME, key);
      }
    })
  }
});

If your use-case is that you're caching values in Redis and are tolerant of stale values but would like to refresh them occasionally so that they don't get too stale, a hacky workaround is to just include a timestamp in the field value and handle expirations in whatever place you're accessing the value.
This allows you to keep using Redis hashes normally without needing to worry about any complications that might arise from the other approaches. The only cost is a bit of extra logic and parsing on the client end. Not a perfect solution, but it's what I typically do as I haven't needed TTL for any other reason and I'm usually needing to do extra parsing on the cached value anyways.
So basically it'll be something like this:
In Redis:
hash_name
- field_1: "2021-01-15;123"
- field_2: "2021-01-20;125"
- field_3: "2021-02-01;127"
Your (pseudo)code:
val = redis.hget(hash_name, field_1)
timestamp = val.substring(0, val.index_of(";"))
if now() > timestamp:
    new_val = get_updated_value()
    new_timestamp = now() + EXPIRY_LENGTH
    redis.hset(hash_name, field_1, new_timestamp + ";" + new_val)
    val = new_val
else:
    // skip past the ";" separator to get the cached value
    val = val.substring(val.index_of(";") + 1)
// proceed to use val
The biggest caveat imo is that you don't ever remove fields so the hash can grow quite large. Not sure there's an elegant solution for that - I usually just delete the hash every once in a while if it feels too big. Maybe you could keep track of everything you've stored somewhere and remove them periodically (though at that point, you might as well just be using that mechanism to expire the fields manually...).

You could store key/values in Redis differently to achieve this, by adding a prefix or namespace to your keys when you store them, e.g. "hset_".
Get a key/value: GET hset_key is equivalent to HGET hset key
Add a key/value: SET hset_key value is equivalent to HSET hset key value
Get all keys: KEYS hset_* is equivalent to HKEYS hset
Get all values (the equivalent of HGETALL hset) takes 2 operations: first get all keys with KEYS hset_*, then get the value for each key
Add a key/value with a TTL or expiry, which is the topic of the question:
SET hset_key value
EXPIRE hset_key seconds
Note:
KEYS looks for matching keys across the whole database, which may affect performance, especially if you have a big database. SCAN 0 MATCH hset_* might be better since it doesn't block the server, but performance is still an issue with a big database.
You may create a new database to store the keys that you want to expire separately, especially if they are a small set of keys.
Thanks to @DanFarrell, who highlighted the performance issue related to KEYS.

You can. Here is an example.
redis 127.0.0.1:6379> hset key f1 1
(integer) 1
redis 127.0.0.1:6379> hset key f2 2
(integer) 1
redis 127.0.0.1:6379> hvals key
1) "1"
2) "1"
3) "2"
redis 127.0.0.1:6379> expire key 10
(integer) 1
redis 127.0.0.1:6379> hvals key
1) "1"
2) "1"
3) "2"
redis 127.0.0.1:6379> hvals key
1) "1"
2) "1"
3) "2"
redis 127.0.0.1:6379> hvals key
Use the EXPIRE or EXPIREAT command.
If you want to expire only specific keys in the hash that are older than 1 month, that is not possible.
The Redis EXPIRE command applies to all keys in the hash.
If you create a separate hash key per day, you can set a time to live on each day's key.
hset key-20140325 f1 1
expire key-20140325 100
hset key-20140325 f1 2

You could use Redis Keyspace Notifications by using psubscribe and "__keyevent@<DB-INDEX>__:expired".
With that, each time a key expires, you will get a message published on your Redis connection.
Regarding your question: basically, you create a temporary "normal" key using SET with an expiration time in s/ms. Its name should match the name of the key that you wish to delete in your set.
When the temporary key expires, its name is published to the connection subscribed to "__keyevent@0__:expired", so you can easily delete the key from your original set, since the message contains the name of the key.
A simple example in practice is on this page: https://medium.com/@micah1powell/using-redis-keyspace-notifications-for-a-reminder-service-with-node-c05047befec3
doc: https://redis.io/topics/notifications (look for the flag xE)

static async setCount(ip: string, count: number) {
const val = await redisClient.hSet(ip, 'ipHashField', count)
await redisClient.expire(ip, this.expireTime)
}
Try expiring the key itself.

Related

Generate C* bucket hash from multipart primary key

I will have C* tables that will be very wide. To prevent them from becoming too wide, I have come across a strategy that could suit me well. It was presented in this video.
Bucket Your Partitions Wisely
The good thing about this strategy is that there is no need for a "look-up table" (so it is fast). The bad part is that one needs to know the maximum number of buckets up front and may eventually end up with no more buckets to use (not scalable). I know my max bucket size, so I will try this.
By calculating a hash from the table's primary key, the hash can be used as a bucket part together with the rest of the primary key.
I have come up with the following method to be sure (I think?) that the hash will always be the same for a specific primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part -> {
        combinedHashString.append(
            String.valueOf(
                Hashing.consistentHash(Hashing.sha512()
                    .hashBytes(part.getBytes()), maxBuckets)
            )
        );
    });
    return combinedHashString.toString();
}
The reason I use sha512 is to be able to have strings with max characters of 256 (512 bit) otherwise the result will never be the same (as it seems according to my tests).
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: Between different JVM executions on different nodes/machines the result should always be the same for a given Cassandra primary key?
Can I rely on the mentioned method to do the job?
Is there a better solution of hashing large strings so they always will produce the same result for a given string?
Do I always need to hash from string or could there be a better way of doing this for a C* primary key and always produce same result?
Please, I don't want to discuss data modeling for a specific table, I just want to have a bucket strategy.
EDIT:
Elaborated further and came up with this so the length of string can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
    List<HashCode> hashCodes = new ArrayList<>();
    for (String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    }
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}
I currently use a similar solution in production. So for your method I would change to:
public static int bucket(List<String> primKeyParts, int maxBuckets) {
    String keyParts = String.join("", primKeyParts);
    return Hashing.consistentHash(
        Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
        maxBuckets);
}
So the differences are:
Send all the PK parts into the hash function at once.
We actually set the max buckets as a code constant, since the consistent hash only gives stable results if the max buckets stay the same.
We use the Murmur3 hash since we want it to be fast, not cryptographically strong.
For your direct questions 1) Yes the method should do the job. 2) I think with the tweaks above you should be set. 3) The assumption is you need the whole PK?
I'm not sure you need to use the whole primary key since the expectation is that your partition part of your primary key is gonna be the same for many things which is why you are bucketing. You could just hash the bits that will provide you with good buckets to use in your partition key. In our case we just hash some of the clustering key parts of the PK to generate the bucket id we use as part of the partition key.
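For readers outside the JVM, here is a rough Python sketch of the same idea: derive a bucket that is stable across processes and machines from the chosen key parts. Note that this swaps in SHA-256 plus a plain modulo from the standard library; it is not Guava's murmur3 or its consistentHash, so it will not produce the same bucket numbers as the Java code above. The function name bucket and its parameters are illustrative only.

```python
import hashlib


def bucket(key_parts, max_buckets):
    """Map a sequence of key parts to a bucket in [0, max_buckets).

    Uses hashlib rather than the built-in hash(), because hash() of a str
    is randomised per process and would differ between runs.
    """
    digest = hashlib.sha256()
    for part in key_parts:
        data = part.encode("utf-8")  # fixed encoding, independent of platform defaults
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return int.from_bytes(digest.digest()[:8], "big") % max_buckets
```

One design note: plainly concatenating the parts (as String.join("", ...) does) lets different part lists collapse to the same input, which is why this sketch length-prefixes each part. As with the Java version, max_buckets has to stay constant for existing keys to keep their buckets.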

store list in key value database

I am searching for the best way to store lists associated with a key in a key-value database (like BerkeleyDB or LevelDB).
For example:
I have users, and orders from user to user.
I want to store a list of order ids for each user, with fast access via range selects (for pagination).
How should I store this structure?
I don't want to store it in serializable format for each user:
user_1_orders = serialize(1,2,3..)
user_2_orders = serialize(1,2,3..)
because the list can be long.
I thought about a separate db file for each user, storing order ids as keys in it, but this does not solve the range select problem. What if I want to get user ids in the range [5000:5050]?
I know about Redis, but I am interested in a key-value implementation like BerkeleyDB or LevelDB.
Let's start with a single list. You can work with a single hashmap:
store in row 0 the count of the user's orders
for each new order, store a new row with the count incremented
So your hashmap looks like the following:
key | value
-------------
0 | 5
1 | tomato
2 | celery
3 | apple
4 | pie
5 | meat
Steady increment of the key makes sure that every key is unique. Given the fact that the db is key ordered and that the pack function translates integers into a set of byte arrays that are correctly ordered you can fetch slices of the list. To fetch orders between 5000 and 5050 you can use bsddb Cursor.set_range or leveldb's createReadStream (js api)
Now let's expand to multiple users' orders. If you can open several hashmaps, you can apply the above with one hashmap per user. You may hit some system limits that way (max number of open fds or max number of files per directory). So instead you can share a single hashmap between several users.
What I explain in the following works for both leveldb and bsddb given the fact that you pack keys correctly using the lexicographic order (byteorder). So I will assume that you have a pack function. In bsddb you have to build a pack function yourself. Have a look at wiredtiger.packing or bytekey for inspiration.
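As a minimal sketch of such a pack function in Python, assuming both key components are non-negative integers that fit in 64 bits: fixed-width big-endian encoding makes the byte order of the keys match the numeric order of the (user, order) pairs. The names pack and unpack mirror the ones used in this answer; the fixed two-integer layout is my assumption.

```python
import struct

# Two unsigned 64-bit integers, big-endian, so bytewise order equals numeric order.
_KEY = struct.Struct(">QQ")


def pack(user_id, order_no):
    """Build a composite key whose lexicographic order follows (user_id, order_no)."""
    return _KEY.pack(user_id, order_no)


def unpack(key):
    """Inverse of pack: return the (user_id, order_no) tuple."""
    return _KEY.unpack(key)
```

Little-endian or variable-width decimal strings would not work here: for example "10" sorts before "2" as text, and little-endian puts the least significant byte first.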
The principle is to namespace the keys using the user's id. It's also called key composition.
Say your database looks like the following:
key      | value
-------------------
1  | 0   | 2       <--- count column for user 1
1  | 1   | tomato
1  | 2   | orange
...      | ...
32 | 0   | 1       <--- count column for user 32
32 | 1   | banana
...      | ...
You create this database with the following (pseudo) code:
db.put(pack(1, make_uid(1)), 'tomato')
db.put(pack(1, make_uid(1)), 'orange')
...
db.put(pack(32, make_uid(32)), 'banana')
make_uid implementation looks like this:
def make_uid(user_uid):
    # retrieve the current count
    counter_key = pack(user_uid, 0)
    value = db.get(counter_key)
    value += 1  # increment
    # save new count
    db.put(counter_key, value)
    return value
Then you have to do the correct range lookup; it's similar to the single-list case, but with a composite key. Using the bsddb API cursor.set_range(key), we retrieve all items between 5000 and 5050 for user 42:
def user_orders_slice(user_id, start, end):
    key, value = cursor.set_range(pack(user_id, start))
    while True:
        key_user_id, order_id = unpack(key)
        # stop at the end of the slice or when we reach another user's keys
        if key_user_id != user_id or order_id > end:
            break
        else:
            # the value is probably packed somehow...
            yield value
        key, value = cursor.next()
No error checks are done. Among other things, slicing with user_orders_slice(42, 5000, 5050) is not guaranteed to return 51 items if you delete items from the list. A correct way to query, say, 50 items is to implement user_orders_query(user_id, start, limit).
I hope you get the idea.
You can use Redis to store list in zset(sorted set), like this:
// this line is called whenever a user place an order
$redis->zadd($user_1_orders, time(), $order_id);
// list orders of the user
$redis->zrange($user_1_orders, 0, -1);
Redis is fast enough. But one thing you should know about Redis is that it stores all data in memory, so if the data eventually exceeds the physical memory, you have to shard the data on your own.
Also you can use SSDB(https://github.com/ideawu/ssdb), which is a wrapper of leveldb, has similar APIs to Redis, but stores most data in disk, memory is only used for caching. That means SSDB's capacity is 100 times of Redis' - up to TBs.
One way you could model this in a key-value store that supports scans, like leveldb, would be to add the order id to the key for each user. So the new keys would be userId_orderId for each order. Now, to get the orders for a particular user, you can do a simple prefix scan: scan(userId*). This makes the userId range query slow; in that case you can maintain another table just for userIds, or use another key convention, Id_userId, for getting userIds between [5000-5050].
Recently I have seen hyperdex adding data types support on top of leveldb : ex: http://hyperdex.org/doc/04.datatypes/#lists , so you could give that a try too.
In BerkeleyDB you can store multiple values per key, either in sorted or unsorted order. This would be the most natural solution. LevelDB has no such feature. You should look into LMDB(http://symas.com/mdb/) though, it also supports sorted multi-value keys, and is smaller, faster, and more reliable than either of the others.

Amazon DynamoDB table design and querying

We are considering DynamoDB for an expectedly large dataset. I come from a strong SQL background so the No-SQL way of thinking is new to me.
I have a problem and design, but ran into what appears to be a dead end.
The documentation says to make sure your Hash keys are widely distributed to aid in performance, okay that makes sense.
I am going to be recording various datapoints/actions for users. It makes sense to me that the hash key should be the user-id, and my range key can be the action(s) performed.
Now, if I want all the actions user #1 performs, I can easily query that.
But, if I want all the USERS who performed action X, I cannot do that without a table scan. From the Query documentation:
A Query operation directly accesses items from a table using the table primary key, or from an index using the index key. You must provide a specific hash key value.
So it would seem I am limited to getting data from a specific user, unless I am willing to do a table scan, which is slower and consumes many capacity units.
My question is, I think, ultimately a design question. Maybe I am missing something when it comes to No-SQL? Should my hash key be something else? Or is it simply that my requirements do not fit in with No-SQL (and more specifically, DynamoDB)?
It is almost as if the hash key is a kind of grouping with DynamoDB. I considered changing the hash key to the actions we are intending to put into place, but then I am not widely distributing my keys...
The DynamoDb way to meet your requirement to allow both types of queries is to store the data in two tables, one with hash key user-id and range key action-id, and one with hash key action-id and range key user-id.
And you should think about whether you need all the data in both tables, or whether one can be a summary table. For example, say you have a limited number of possible actions. Instead of putting the full record of every action in the user-keyed table, you might want a table with only one row for each user: a hash key of user-id, and a second, multi-valued column that lists every action-id that the user has performed at least once.
You must create a Global Secondary Index (GSI). What this does is it creates a second pair of hash and range keys which differ from the original keys. You can then query the same table by also including an index name in your parameters.
Example in JS:
var table = tablename;
var index = 'actionId-username-gsi';
var action = actionId;
var params = {
  TableName : table,
  IndexName : index,
  KeyConditionExpression : 'actionId = :v_actionId',
  ExpressionAttributeValues : {
    // the low-level client expects numbers as strings
    ':v_actionId': { N : String(action) }
  },
  ProjectionExpression : 'actionId, username'
};
ddb.query(params, function(err, data) {
  if (err) {
    // Oh well
  } else {
    // Do something with data.Items
  }
});
This will query the actionId-username-gsi index and look for any actionId hashes with the value provided. Using ProjectionExpression will return only the specified attributes' values for each item, lowering throughput if that ever becomes a concern. I hope this helps answer your question.
I guess the global secondary indexes option is better, as you get a single table.
Creating two tables will create redundancy and additional work to maintain consistency when doing any CUD (Create, Update, Delete) operation on any one table.

MongoDB custom and unique IDs

I'm using MongoDB, and I would like to generate unique and cryptical IDs for blog posts (that will be used in restful URLS) such as s52ruf6wst or xR2ru286zjI.
What do you think is best and the more scalable way to generate these IDs ?
I was thinking of following architecture :
a periodic (daily?) batch running to generate a lot of random and uniques IDs and insert them in a dedicated MongoDB collection with InsertIfNotPresent
and each time I want to generate a new blog post, I take an ID from this collection and mark it as "taken" with UpdateIfCurrent atomic operation
WDYT ?
This is exactly why the developers of MongoDB constructed their ObjectID's (the _id) the way they did ... to scale across nodes, etc.
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big endian unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order.
Here's the schema:
bytes | 0 1 2 3 | 4 5 6   | 7 8 | 9 10 11
field | time    | machine | pid | inc
Traditional databases often use monotonically increasing sequence numbers for primary keys. In MongoDB, the preferred approach is to use Object IDs instead. Object IDs are more synergistic with sharding and distribution.
http://www.mongodb.org/display/DOCS/Object+IDs
So I'd say just use the ObjectID's
They are not that bad when converted to a string (these were inserted right after each other) ...
For example:
4d128b6ea794fc13a8000001
4d128e88a794fc13a8000002
They look at first glance to be "guessable" but they really aren't that easy to guess ...
4d128 b6e a794fc13a8000001
4d128 e88 a794fc13a8000002
And for a blog, I don't think it's that big of a deal ... we use it in production all over the place.
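To see what is inside those two example IDs, here is a small Python sketch that splits an ObjectID according to the 4/3/2/3-byte layout quoted in this answer. The function name parse_object_id is made up, it does not use any MongoDB driver, and the layout is taken from the quote above (other server or driver versions may define the middle bytes differently).

```python
from datetime import datetime, timezone


def parse_object_id(hex_id):
    """Split a 24-character hex ObjectID into the fields of the quoted layout."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectID is exactly 12 bytes (24 hex characters)")
    return {
        # 4-byte big-endian timestamp, seconds since the epoch
        "time": datetime.fromtimestamp(int.from_bytes(raw[0:4], "big"), tz=timezone.utc),
        # machine id and process id are kept as raw hex here
        "machine": raw[4:7].hex(),
        "pid": raw[7:9].hex(),
        # 3-byte big-endian counter
        "counter": int.from_bytes(raw[9:12], "big"),
    }


first = parse_object_id("4d128b6ea794fc13a8000001")
second = parse_object_id("4d128e88a794fc13a8000002")
```

The two IDs share the machine and pid bytes and differ only in the timestamp and the counter, which is exactly the part someone would have to guess.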
What about using UUIDs?
http://www.famkruithof.net/uuid/uuidgen as an example.
Make a web service that returns a globally-unique ID so that you can have many webservers participate and know you won't hit any duplicates?
If your daily batch didn't allocate enough items? Do you run it midday?
I would implement the web-service client as a queue that can be looked at by a local process and refilled as needed (when server is slower) and could keep enough items in queue not to need to run during peak usage. Makes sense?
This is an old question but for anyone who could be searching for another solution.
One way is to use simple and fast substitution cipher. (The code below is based on someone else's code -- I forgot where I took it from so cannot give proper credit.)
class Array
  def shuffle_with_seed!(seed)
    prng = (seed.nil?) ? Random.new() : Random.new(seed)
    size = self.size
    while size > 1
      # random index
      a = prng.rand(size)
      # last index
      b = size - 1
      # switch last element with random element
      self[a], self[b] = self[b], self[a]
      # reduce size and do it again
      size = b;
    end
    self
  end
  def shuffle_with_seed(seed)
    self.dup.shuffle_with_seed!(seed)
  end
end
class SubstitutionCipher
  def initialize(seed)
    normal = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a + [' ']
    shuffled = normal.shuffle_with_seed(seed)
    @map = normal.zip(shuffled).inject(:encrypt => {} , :decrypt => {}) do |hash,(a,b)|
      hash[:encrypt][a] = b
      hash[:decrypt][b] = a
      hash
    end
  end
  def encrypt(str)
    str.split(//).map { |char| @map[:encrypt][char] || char }.join
  end
  def decrypt(str)
    str.split(//).map { |char| @map[:decrypt][char] || char }.join
  end
end
You use it like this:
MY_SECRET_SEED = 3429824
cipher = SubstitutionCipher.new(MY_SECRET_SEED)
id = hash["_id"].to_s
encrypted_id = cipher.encrypt(id)
decrypted_id = cipher.decrypt(encrypted_id)
Note that it'll only encrypt a-z, A-Z, 0-9 and a space leaving other chars intact. It's sufficient for BSON ids.
The "correct" answer, which is not really a great solution IMHO, is to generate a random ID, and then check the DB for a collision. If it is a collision, do it again. Repeat until you've found an unused match. Most of the time the first will work (assuming that your generation process is sufficiently random).
It should be noted that, this process is only necessary if you are concerned about the security implications of a time-based UUID, or a counter-based ID. Either of these will lead to "guessability", which may or may not be an issue in any given situation. I would consider a time-based or counter-based ID to be sufficient for blog posts, though I don't know the details of your situation and reasoning.
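The generate-and-retry loop described above can be sketched in Python like this. The is_taken and reserve callables are placeholders for whatever the database offers (they are not MongoDB API calls), and the alphabet and length are arbitrary choices for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_id(length=10):
    """Return a random ID such as 's52ruf6wst', drawn from a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def allocate_id(is_taken, reserve, length=10, max_attempts=10):
    """Generate random IDs until one is unused, then reserve and return it.

    is_taken(candidate) -> bool and reserve(candidate) are supplied by the caller.
    """
    for _ in range(max_attempts):
        candidate = random_id(length)
        if not is_taken(candidate):
            reserve(candidate)
            return candidate
    raise RuntimeError("could not find a free ID; consider a longer ID")
```

With a real database there is a race between the check and the reservation, so in practice you would probably rely on a unique index and simply retry when the insert reports a duplicate key, rather than checking first.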

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data only the column-keys are important. The data stored is of no consequence neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith as UTF8Type then the keys will be interpreted as UTF8Types. If you set the CompareWith as TimeUUIDType then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page http://wiki.apache.org/cassandra/API This is a good place to start. Also, you might find this article useful http://www.sodeso.nl/?p=80 In the third part or so he talks about slice ranging his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys however, since even the randomization of the default randompartitioner is limited and you could end up cluttering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know :
A slice range is a range of columns in a row with a start value and an end value, this is used mostly for wide rows as columns are ordered. Known column names defined in the CF are indexed however so they can be retrieved specifying names.
A key slice, is a key associated with the sliced column range as returned by Cassandra
The equivalent of a where clause uses secondary indexes, you may use inequality operators there, however there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't keep lexical ordering.
What you want to use is a Column Family based index using a Wide Row :
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key ("shard key") that would split the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem, it's how it is designed, so what you must ask yourself is "what do I need to query" and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.