Which data store to use to store the metadata corresponding to the keys in memcache?

I have a memcache backend and I want to add Redis to store metadata for the memcache keys.
The metadata is as follows:
Miss_count: the number of times the data was not present in memcache.
Hash_value: the hash value of the data corresponding to the key in memcache.
Data in memcache : key1 ::: Data
Meta data (miss count) : key1_miss ::: 10
Meta data (hash value) : key1_hash ::: hash(Data)
Please advise which data store is preferable: when I store the metadata in memcache itself, it is evicted well before its expiry time, because the metadata is small and the slab allocator assigns it to a small memory chunk.

Note: as the metadata grows over time, Redis's compact hash (ziplist) encoding will stop applying, so client-side logic is needed to keep each hash within the hash-max-ziplist limits.

If I understand your use case correctly I suspect Redis might be a good choice. Assuming you'll be periodically updating the meta data miss counts associated with the various hashes over time, you'd probably want to use Redis sorted sets. For example, if you wanted the miss counts stored in a sorted set called "misscounts", the Redis command to add/update those counts would be one and the same:
zadd misscounts misscount key1
... because zadd adds the entry if one doesn't already exist or overwrites an existing entry if it does. If you have a hook into the process that fires each time a miss occurs, you could instead use:
zincrby misscounts 1 key1
Similar to the zadd command's behavior, zincrby will create a new entry (using the increment value as the count) if one doesn't exist, or increment the existing count by the increment value you pass if an entry does exist.
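To read the metadata back later, zscore returns a single key's miss count and zrevrange lists the most-missed keys with their counts (both are standard commands, shown here against the "misscounts" set from above):
zscore misscounts key1
zrevrange misscounts 0 9 WITHSCORES
The hash values from your example don't need any ordering, so a plain Redis hash would do (the "hashes" key name is just an illustration):
hset hashes key1 somehashvalue
hget hashes key1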
Complete documentation of Redis commands can be found here, and descriptions of the different types of storage options in Redis are detailed here.
Oh, and a final note. In my experience, Redis is THE SHIT. Sorry to curse (in caps), but there's simply no other way to do Redis justice. We call our Redis server "honey badger", because when load starts increasing and our other servers start auto-scaling, honey badger just don't give a shit.

Caching in a microservice with multiple replicas in k8s

I have a Golang-based microservice with an in-memory cache, as follows:
Create object -> Put it in cache -> Persist
Update object -> Update the cache -> Persist
Get -> Get it from the cache
Delete -> Delete cache entry -> Remove from data store.
On a service re-start, the cache is populated from the data store.
The cache organizes the data in different ways that matches my access patterns.
Note that one client can create the object, and other clients can update it at a later point in time.
Everything works fine as long as I have one replica, but this pattern breaks when I increase the replica count in my deployment.
If I have to go to the DB for each GET, it defeats the purpose of the cache. My first thought is to move the cache out. But this seems like a fairly common problem when moving to multi-replica microservices, so I'm curious to understand the alternatives.
Thanks for your time.
Much here depends on how you structure your application.
One common solution is to use Redis or another distributed cache. The advantage is that all your service replicas go to the same cache to manage objects, which gives you more consistent data.
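As a rough illustration, a read-through helper against a shared Redis might look like this (a minimal sketch using the go-redis client; the "obj:" key scheme, the TTL, and loadFromDB are assumptions for illustration):
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// loadFromDB stands in for your real data-store call.
func loadFromDB(ctx context.Context, id string) (string, error) {
    return "data-for-" + id, nil
}

// getObject checks the shared cache first, so every replica sees the same
// data, and falls back to the data store on a miss.
func getObject(ctx context.Context, id string) (string, error) {
    val, err := rdb.Get(ctx, "obj:"+id).Result()
    if err == nil {
        return val, nil // cache hit
    }
    if err != redis.Nil {
        return "", err // a real Redis error, not a miss
    }
    val, err = loadFromDB(ctx, id)
    if err != nil {
        return "", err
    }
    // Populate the cache so the next replica's read is served from Redis.
    return val, rdb.Set(ctx, "obj:"+id, val, 10*time.Minute).Err()
}

func main() {
    v, err := getObject(context.Background(), "42")
    fmt.Println(v, err)
}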
Another approach, somewhat more complex, is sharding:
For a Get operation, route the request to a specific instance based on the object's id. That instance will have the object in its cache; if not, it reads it from the DB and puts it in its own cache. Every subsequent access to that object then goes to that same instance. The same applies to Update and Delete operations.
For Create operations:
If you want the DB to generate the id automatically, the object is first created in the DB, which returns the id you then route on; the first access after creation is served from the DB, but after that the object will be in that instance's cache.
If ids can be generated manually, you can prefix the id during creation with something that maps to an instance (see the sketch below).
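A minimal sketch of id-based routing, assuming a fixed list of replica addresses (the addresses and the choice of crc32 are illustrative):
package main

import (
    "fmt"
    "hash/crc32"
)

// Assumed replica addresses; in k8s these could be headless-service pod DNS names.
var replicas = []string{"cache-0:8080", "cache-1:8080", "cache-2:8080"}

// replicaFor hashes the object id so that every Get/Update/Delete
// for the same id lands on the same instance.
func replicaFor(id string) string {
    return replicas[crc32.ChecksumIEEE([]byte(id))%uint32(len(replicas))]
}

func main() {
    fmt.Println(replicaFor("order-42"))
}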
Note: in distributed systems there is no single solution. You always have to decide which approach works for your scenario.

Why memcached set operation is not idempotent?

On page 2 of Facebook's paper "Scaling Memcache at Facebook" they say: "For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent."
Why is update/set not an idempotent operation?
Paper can be found here: https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
If you issue the same delete twice, the second delete has no effect. Update/set behaves differently here: while it won't change the value associated with the key, it will update the key's last-access time, changing the logic of when the key will be evicted. In this sense the delete operation is idempotent while updating a key's value is not.
E.g. in the paper they don't want to keep a key in the cache if no one ever tried to read it (even if there are lots of writes for that key in the database).
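A small sketch of the difference, using the bradfitz/gomemcache client (the server address and key are made up):
package main

import "github.com/bradfitz/gomemcache/memcache"

func main() {
    mc := memcache.New("localhost:11211")

    // Replaying a delete changes nothing: the second call just reports
    // a miss, but the cache state is identical (idempotent).
    _ = mc.Delete("key1")
    _ = mc.Delete("key1") // returns memcache.ErrCacheMiss, state unchanged

    // Replaying a set re-inserts the item and resets its clock each time,
    // so the key can live on even if no reader ever wants it.
    item := &memcache.Item{Key: "key1", Value: []byte("v"), Expiration: 60}
    _ = mc.Set(item)
    _ = mc.Set(item) // same value, but the item's lifetime is refreshed
}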

Is querying MongoDB faster than Redis?

I have some data stored in a database (MongoDB) and in a distributed cache (Redis).
While querying the repository, I use a lazy-loading approach: first look for the data in the cache; if it isn't there, fetch it from the database and update the cache as well, so that the next time it is needed it will be found in the cache.
Sample Model Used:
Person ( id, name, age, address (Reference))
Address (id, place)
PersonCacheModel extends Person with addressId.
I am not storing the parent object together with the child object in the cache. That is why I created PersonCacheModel with an addressId: I store this object in the cache, and when reading it back I convert the PersonCacheModel to a Person and call the address repository (addressCache) to fill in the address details of the Person object.
As far as I understand:
personRepository.findPersonByName(NAME + randomNumber);
Access Data from Cache = network time + cache access time + deserialize time
Access Data from database = network time + database query time + object mapping time
When I ran the above approach for 1000 rows, accessing the data from the database was faster than accessing it from the cache. I believed the cache access time would be smaller than accessing MongoDB.
Please let me know if there's an issue with the approach, or if this is the expected scenario.
To have a valid benchmark we need to consider the hardware side and the data-processing side:
hardware: do both setups have the same configuration, RAM, CPU count, OS, etc.?
process: how is the data transformed (single thread or multi-thread, per object or per request)?
Performing a load test on your data set will give you a good overview of which path is faster in a particular use-case scenario.
It is hard to judge what the result should be as long as the points mentioned above are unknown to us.
The other thing is to run more than one test scenario and stress it for, say, 10 seconds, a minute, an hour... so you get numbers that tell you the truth.
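A minimal timing sketch along those lines, assuming the go-redis client and the official mongo-driver (the connection strings, database/collection names, and key are illustrative; a real test would loop many iterations and discard warm-up runs):
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    mcl, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        panic(err)
    }
    coll := mcl.Database("test").Collection("persons")

    // Cache path: network + cache access (+ deserialize in real code).
    start := time.Now()
    _, _ = rdb.Get(ctx, "person:alice").Result()
    fmt.Println("redis GET:", time.Since(start))

    // DB path: network + query time (+ object mapping in real code).
    start = time.Now()
    var doc bson.M
    _ = coll.FindOne(ctx, bson.M{"name": "alice"}).Decode(&doc)
    fmt.Println("mongo FindOne:", time.Since(start))
}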

How to implement Redis's features with some other datastore?

I found that Redis has very good features for my project (the autocomplete backend of a web app); basically, it is my full-text search engine. Now I am looking for a replacement for Redis, because I can't hold the whole dataset in memory.
I create my Redis store like this (I can't find the link to credit the idea); concrete example commands follow the list:
I split my (weighted) items from the regular database into overlapping 3-char chunks ("words" -> ['wor', 'ord', 'rds'])
every chunk becomes a key holding a sorted set of the ids of items that contained that chunk ( ZADD chunk weight items_id )
every item id is also a key holding a simple JSON document about the item ( SET items_id items_hash_in_json )
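For instance, indexing the item "words" with id 42 and weight 10 might look like this (the id, weight, and JSON payload are made up for illustration):
ZADD wor 10 42
ZADD ord 10 42
ZADD rds 10 42
SET 42 '{"id":42,"title":"words"}'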
Search works like this:
the query string is split into 3-char chunks in the same way
I ask for the intersection of all those chunk sets and get a list of items_ids ( a combination of ZINTERSTORE and ZRANGEBYSCORE )
I return the list of JSON docs for those items_ids (see the example commands below)
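Continuing the made-up example, a search for the query "words" might then run:
ZINTERSTORE tmp 3 wor ord rds
ZREVRANGEBYSCORE tmp +inf -inf    (returns the matching ids, e.g. 42)
MGET 42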
Plain and simple, very effective and fast. There are still some smaller cons in this flow, but mostly I feel I have just the right tools and the right datatypes for my domain.
The main problem is that it requires too much memory. I have about 600K items in the database, and even though I cut them down to 40 chars while indexing, it still takes 2.5 GB of RAM. That is a bit much for the task, and the dataset will grow; not too much and not too fast, but still.
I have now looked at some NoSQL stores and I have not found a similar approach and toolset to what Redis has. Maybe it is because I now see a hammer in every task, but I feel that with other NoSQL stores I would need to implement such functionality myself (sorted lists, intersections of them, simple key-value as binary strings, dead-simple data insertion, a simple protocol/API and simple clients).
I'd like to have a Perl binding too, but given a very simple protocol (like REST for CouchDB) it is not mandatory.
Do you know of tools to implement my solution with another NoSQL product?
With my other eye I am also looking at completely different solutions (like couchdb-lucene), but I'd like to avoid abandoning the system I described above.
HTTP Cache
I have a possible solution for you that I currently use on my site: I cache autocomplete queries as static files using Nginx, which can serve static files very quickly. Here are some sample lines from my config.
http {
    fastcgi_cache_path /var/cache/nginx levels=1:2
                       keys_zone=tt:600m
                       inactive=7d max_size=10g;
    fastcgi_temp_path /var/cache/nginx/tmp;
}
This block describes the path where the cache files will be stored. levels is how many directories deep; 1:2 will suffice. My zone here is called tt; name it whatever you want. The inactive=7d that follows is the expiration: entries not requested for 7 days are removed.
location ~ /tt/(.+)\.php$ {
    try_files $uri /index.php?$args;
    fastcgi_index index.php;
    fastcgi_pass 127.0.0.1:9000;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param SCRIPT_NAME $fastcgi_script_name;
    # Caching parameters
    fastcgi_cache tt;
    fastcgi_cache_key "$scheme$request_method$host$request_uri";
    fastcgi_cache_valid 200 302 304 30m;
    fastcgi_cache_valid 301 1h;
    fastcgi_cache_valid any 5m;
    fastcgi_cache_use_stale error timeout invalid_header updating http_500;
}
The location block contains the caching parameters, so anything with a URI matching /tt/.*.php will be cached. The URI plus the query string becomes the cache key.
If you don't use Nginx, the same concept might work with another webserver. I hope this helps.
Edit
From Comments:
Using the index as plain files seems rather slower than SQL queries. Still, I have not benchmarked them.
A cache hit for Nginx will look something like this:
-> Nginx -> file
Miss:
-> Nginx -> php/python/ruby -> db(redis/mysql/whatever)
The first path might seem slower because you think of disk I/O, but it's not: the OS automatically caches files that are frequently accessed. So once Nginx heats up, just hitting your PHP backend to say "Hello world" will be slower in comparison; I make that claim because serving from the cache is just like serving a static file.
Actual hit/miss rates will depend on the application, data, and configuration. In my experience people use a lot of the same search terms, so you probably won't have 600k files sitting around, and even if you do it doesn't really hurt: Nginx manages them for you. This method isn't very good if your data changes a lot and you want the search to reflect those changes quickly; you would have to set a short expire time, which would result in more misses.
Redis Zip Lists/Hashes
http://redis.io/topics/memory-optimization
If you still need sorted sets, make sure the configuration settings from that link are set high enough for your dataset's needs. If you are able to use hashes, you can save a ton of memory using the algorithm they show lower on that page; I think you can definitely use it when storing the item_id that links to a JSON string.
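Roughly, that trick buckets the item keys into small hashes so they stay in the compact encoding; the bucket size of 1000 and the items: key prefix below are illustrative (keep buckets under the hash-max-ziplist-entries setting). For item id 42317:
HSET items:42 317 '{"id":42317,"title":"words"}'    (bucket = 42317/1000, field = 42317%1000)
HGET items:42 317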
Just a simple idea that could be useful for you; it's not a direct and exact answer to your question.
I suppose that most of your data, or at least a significant part, is located in those JSON documents. In that case I suggest you change your data infrastructure slightly: to keep all the benefits of Redis, use the same first two steps for create and search, but change the third step. Instead of using Redis to store the JSON documents, move them to a simple indexed table in your preferred DB. This way you handle just the chunks and keys with the operations Redis offers, and at step 3 you take the list of item_ids and retrieve the JSON data from your DB. A SELECT ... WHERE item_id IN (...) will probably be enough.

key value stores for extendable objects

http://www.infoq.com/presentations/newport-evolving-key-value-programming-model is a video about KV stores, and its whole premise is that Redis promotes a column-based style of storing the attributes of an object under separate keys rather than serialising an object and storing it under a single key.
(This question is not redis-specific, but more a general style and best practice for KV stores in general.)
Instead of a blob for, say, a 'person', Redis encourages a column-based style where the attributes of an object are stored as separate keys, e.g.
R.set("U:123:firstname","Billy")
R.set("U:123:surname","Newport")
...
I am curious if this is best practice, and if people take different approaches.
E.g. you could 'pickle' an object under a single key. This has the advantage that the whole object can be fetched or set in a single request.
Or a person could be a list, with the first item being a field-name index or some such?
This got me thinking - I'd like a hierarchical key store, e.g.
R.set(["U:123","firstname"],"Billy")
R.set(["U:123","surname"],"Newport")
R.get(["U:123"]) returns [("firstname","Billy"),("surname","Newport")]
And then to add in transactions:
with(R.get(["U:132"]) as user):
user.set("firstname","Paul")
user.set("lastname","Simon")
From a scaling perspective, is the batching of gets and sets going to be important?
Are there key stores that do have support for this or have other applicable approaches?
You can get similar behavior in Redis by using an extra Set to keep track of the individual members of your object.
SET U:123:firstname Billy
SADD U:123:members firstname
SET U:123:surname Cobin
SADD U:123:members surname
GET U:123:firstname => Billy
GET U:123:surname => Cobin
SORT U:123:members ALPHA GET U:123:* -> [Billy, Cobin]
or
SMEMBERS U:123:members -> [firstname, surname]
MGET U:123:firstname U:123:surname
Not a perfect match, but good enough in many situations. There's an interesting article about how hurl uses this pattern with Redis.
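For what it's worth, Redis hashes give you this grouping directly, which comes close to the hierarchical API sketched in the question (same example data as above):
HSET U:123 firstname Billy
HSET U:123 surname Cobin
HGET U:123 firstname => Billy
HGETALL U:123 => [firstname, Billy, surname, Cobin]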