Why does Guava Cache support weakKeys() AND weakValues()?

Guava's CacheBuilder supports both weakKeys() and weakValues().
But if the values are collected, why would we still want to keep the keys in the cache?
So shouldn't weakKeys() alone be enough?

It is not the case that weakKeys means "collect the keys but keep the values," nor is it the case that weakValues means "collect the values but keep the keys."
What weakKeys does is say, "when there are no longer any strong references to the key, collect the entire entry." What weakValues does is say, "when there are no longer any strong references to the value, collect the entire entry." So when you use both, the entire entry is collected when either the key or the value has no strong references pointing to it.
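By way of analogy only, the either-key-or-value semantics can be sketched in C++ with std::weak_ptr. This is a hypothetical toy cache with made-up names; Guava's real implementation relies on the JVM garbage collector and reference queues, not on an explicit sweep:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Each entry holds weak references to BOTH its key and its value; a
// sweep drops the whole entry when EITHER has no strong references left.
struct Entry {
    std::weak_ptr<std::string> key;
    std::weak_ptr<int> value;
};

class WeakCache {
    std::vector<Entry> entries_;
public:
    void put(std::shared_ptr<std::string> k, std::shared_ptr<int> v) {
        entries_.push_back({k, v});
    }
    // Analogue of a GC pass: remove entries whose key OR value expired.
    void sweep() {
        entries_.erase(std::remove_if(entries_.begin(), entries_.end(),
            [](const Entry& e) { return e.key.expired() || e.value.expired(); }),
            entries_.end());
    }
    size_t size() const { return entries_.size(); }
};
```

Dropping the last strong reference to either half of a pair makes the next sweep remove the entire entry, mirroring the "collect the entire entry" behavior described above.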

Related

Kafka Streams WindowStore Fetch Record Ordering

The Kafka Streams 2.2.0 documentation for the WindowStore and ReadOnlyWindowStore method fetch(K key, Instant from, Instant to) states:
For each key, the iterator guarantees ordering of windows, starting
from the oldest/earliest available window to the newest/latest window.
None of the other fetch methods state this (except the deprecated fetch(K key, long from, long to)), but do they offer the same guarantee?
Additionally, is there any guarantee on ordering of records within a given window? Or is that up to the underlying hashing collection (I assume) implementation and handling of possible hash collisions?
I should also note that we built the WindowStore with retainDuplicates() set to true. So a single key would have multiple entries within a window. Unless we're using it wrong; which I guess would be a different question...
The other methods don't have ordering guarantees, because the order depends on the byte-order of the serialized keys. It's hard to reason about this ordering for Kafka Streams, because the serializers are provided by the user.
You are using it wrong :) -- you can store different keys for the same window by default. If you enable retainDuplicates() you can store the same key multiple times for the same window.

How do I model a queue on top of a key-value store efficiently?

Suppose I have a key-value database and I need to build a queue on top of it. How could I achieve this without bad performance?
One idea might be to store the queue inside an array, and simply store the array under a fixed key. This is quite a simple implementation, but very slow, as every read or write access must load/save the complete array.
I could also implement a linked list with random keys, plus one fixed key which acts as the starting point to element 1. Depending on whether I prefer fast reads or fast writes, I could point the fixed key to the first or the last entry in the queue (so I have to traverse it forward/backward).
Or, to take that further, I could also have two fixed pointers: one for the first item, one for the last.
Any other suggestions on how to do this effectively?
A key-value store is fundamentally similar to raw memory, where the physical address acts as the key. So any data structure can surely be modeled on top of key-value storage, including a linked list.
A linked list is a list of nodes, each holding a pointer to the previous or next node. Each node can itself be viewed as a small key-value structure: by adding a prefix to the key, a node's fields can be stored separately in one flat table of key-value pairs.
Taking this further, a special suffix on the key can even eliminate the explicit pointer information. Such a flattened list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is straightforward to imagine. If a daemon thread is available to maintain these keys, pilot-5 could be renamed to pilot-4 in the example above. Even where an extra thread is not allowed, the queue itself still works correctly; there is just some overhead for handling the gap in the sequence.
Which of the two approaches above to apply is a trade-off between storage-space cost and CPU-time overhead.
Thread safety is indeed a problem, but an old and well-understood one. Just like the classes implementing the ConcurrentMap interface in the JDK, atomic operations on key-value data are readily available. Some key-value middleware, such as memcached, offers similar features that let you update keys or values separately and thread-safely. These, however, are algorithmic concerns rather than properties of the key-value structure itself.
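As a concrete sketch of the node-as-sub-key-value idea above, here is a hypothetical linked list laid flat in a single key-value table. An std::unordered_map stands in for the real store, and all key names (head, node:<id>:value, node:<id>:next) are made up for illustration:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Flat key-value table standing in for the real database (hypothetical).
using KV = std::unordered_map<std::string, std::string>;

// Store each node under prefixed keys; the node's "next" pointer is
// itself a key-value pair, so the whole list lives in one flat table.
void push_front(KV& kv, int& next_id, const std::string& value) {
    std::string id = std::to_string(next_id++);
    kv["node:" + id + ":value"] = value;
    kv["node:" + id + ":next"]  = kv.count("head") ? kv["head"] : "";
    kv["head"] = id;  // fixed key acting as the entry point to element 1
}

// Walk the list by following the "next" keys from the fixed head key.
std::vector<std::string> to_vector(const KV& kv) {
    std::vector<std::string> out;
    std::string id = kv.count("head") ? kv.at("head") : "";
    while (!id.empty()) {
        out.push_back(kv.at("node:" + id + ":value"));
        id = kv.at("node:" + id + ":next");
    }
    return out;
}
```

Every node field is one key-value pair, so a read or write touches only the entries involved rather than the whole structure.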
I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will be always some kind of hack involved.
For a simple first-in-first-out queue you could use a couple of key-value entries like the following:
{
oldestIndex:5,
newestIndex:10
}
In this example there would be six items in the queue (5, 6, 7, 8, 9, 10). Items 0 to 4 are already done, and there is no item 11 yet. A producer worker would increment newestIndex and save its item under key 11. A consumer takes the item under key 5 and increments oldestIndex.
Note that this approach can lead to problems if you have multiple consumers/producers, and if the queue is never empty you can't reset the indices.
But the multithreading problem is also true for linked lists etc.
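The two-fixed-indices design from the answer above can be sketched like this. A std::map stands in for the real key-value database, the key names are invented for illustration, and no thread safety is attempted:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// FIFO queue over a key-value store: two fixed keys (oldestIndex /
// newestIndex) bracket the live items, stored under "item:<n>" keys.
struct KvQueue {
    std::map<std::string, std::string> kv{
        {"oldestIndex", "1"}, {"newestIndex", "0"}};  // empty: oldest > newest

    void push(const std::string& item) {              // producer side
        long n = std::stol(kv["newestIndex"]) + 1;
        kv["newestIndex"] = std::to_string(n);
        kv["item:" + std::to_string(n)] = item;
    }

    std::optional<std::string> pop() {                // consumer side
        long o = std::stol(kv["oldestIndex"]);
        long n = std::stol(kv["newestIndex"]);
        if (o > n) return std::nullopt;               // queue is empty
        std::string item = kv["item:" + std::to_string(o)];
        kv.erase("item:" + std::to_string(o));
        kv["oldestIndex"] = std::to_string(o + 1);
        return item;
    }
};
```

In a real store, the two index updates would need an atomic increment (or compare-and-swap) primitive to be safe with concurrent producers and consumers, which is exactly the multithreading caveat the answer raises.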

Use optional keys or a catch-all key in MongoMapper?

Suppose I'm working on a MongoMapper class that looks like this:
class Animal
include MongoMapper::Document
key :type, String, :required => true
key :color, String
key :feet, Integer
end
Now I want to store a bird's wingspan. Would it be better to add this, even though it's irrelevant for many documents and feels a bit untidy:
key :wingspan, Float
Or this, even though it's an indescriptive catch-all that feels like a hack:
key :metadata, Hash
It seems like the :metadata approach (for which there's precedent in the code I'm inheriting) is almost redundant to the Mongo document as a whole: they're both intended to be schemaless buckets of key-value pairs.
However, it also seems like adding animal-specific keys is a slippery slope to a pretty ugly model.
Any alternatives (create a Bird subclass)?
MongoMapper doesn't store keys that are nil, so if you did define key :wingspan only the documents that actually set that key would store it.
If you opt not to define the key, you can still set/get it with my_bird[:wingspan] = 23. (The [] call will actually automatically define a key for you; similarly if a doc comes back from MongoDB with a key that's not explicitly defined a key will be defined for it and all docs of that class--it's kind of a bug to define it for the whole class but since nil keys aren't stored it's not so much of a problem.)
If bird has its own behavior as well (it probably does), then a subclass makes sense. For birds and animals I would take this route, since every bird is an animal. MongoDB is much nicer than ActiveRecord for Single Table/Single Collection Inheritance, because you don't need a billion migrations and your code makes it clear which attributes go with which classes.
It's hard to give a good answer without knowing how you intend to extend the database in the future and how you expect to use the information you store. If you were storing large numbers of birds and wanted to summarize on wingspan, then wingspan would be helpful even if it would be unused for other animals. If you plan to store random arbitrary information for every known animal, there are too many possibilities to try to track in a schema and the metadata approach would be more usable.

Meaning of Open hashing and Closed hashing

Open Hashing (Separate Chaining):
In open hashing, keys are stored in linked lists attached to cells of a hash table.
Closed Hashing (Open Addressing):
In closed hashing, all keys are stored in the hash table itself without the use of linked lists.
I am unable to understand why they are called open, closed, and separate. Can someone explain it?
The use of "closed" vs. "open" reflects whether or not we are locked in to using a certain position or data structure (this is an extremely vague description, but hopefully the rest helps).
For instance, the "open" in "open addressing" tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what's already in the hash table.
The "closed" in "closed hashing" refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table's internal array. Note that this is only possible by using some sort of open addressing strategy. This explains why "closed hashing" and "open addressing" are synonyms.
Contrast this with open hashing - in this strategy, none of the objects are actually stored in the hash table's array; instead once an object is hashed, it is stored in a list which is separate from the hash table's internal array. "open" refers to the freedom we get by leaving the hash table, and using a separate list. By the way, "separate list" hints at why open hashing is also known as "separate chaining".
In short, "closed" always refers to some sort of strict guarantee, like when we guarantee that objects are always stored directly within the hash table (closed hashing). Then, the opposite of "closed" is "open", so if you don't have such guarantees, the strategy is considered "open".
You have an array that is the "hash table".
In Open Hashing each cell in the array points to a list containing the collisions. The hashing has produced the same index for all items in the linked list.
In Closed Hashing you use only one array for everything. You store the collisions in the same array. The trick is to use some smart way to jump from collision to collision until you find what you want. And do this in a reproducible / deterministic way.
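A minimal sketch of closed hashing using linear probing as the "reproducible / deterministic way" to jump from collision to collision (assuming non-negative integer keys and no deletion support; names are hypothetical):

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Closed hashing / open addressing: every key lives in the table's own
// array; on collision we deterministically probe the next slot.
class ClosedHashSet {
    std::vector<std::optional<int>> slots_;  // empty optional = free slot
public:
    explicit ClosedHashSet(size_t capacity) : slots_(capacity) {}

    bool insert(int key) {
        size_t home = static_cast<size_t>(key) % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            size_t j = (home + probes) % slots_.size();  // linear probe
            if (!slots_[j]) { slots_[j] = key; return true; }
            if (*slots_[j] == key) return true;          // already present
        }
        return false;                                    // table is full
    }

    bool contains(int key) const {
        size_t home = static_cast<size_t>(key) % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            size_t j = (home + probes) % slots_.size();
            if (!slots_[j]) return false;   // empty slot ends the probe chain
            if (*slots_[j] == key) return true;
        }
        return false;
    }
};
```

Note that supporting deletion would require tombstone markers, since simply emptying a slot would break probe chains passing through it.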
The name open addressing refers to the fact that the location ("address") of the element is not determined by its hash value. (This method is also called closed hashing).
In separate chaining, each bucket is independent, and has some sort of ADT (list, binary search trees, etc) of entries with the same index.
In a good hash table, each bucket has zero or one entries, because we need operations of order O(1) for insert, search, etc.
Here is an example of separate chaining using C++ with a simple hash function using the mod operator (clearly, a bad hash function).

Are NSManagedObject objectIDs unique across space and time like a CFUUID?

The NSManagedObjectID documentation states:
An NSManagedObjectID object is a compact, universal, identifier for a managed object. This forms the basis for uniquing in the Core Data Framework. A managed object ID uniquely identifies the same managed object both between managed object contexts in a single application, and in multiple applications (as in distributed systems).
Translation in my head: "There is probably no way that any two NSManagedObjectIDs are ever the same across the set of all instances of my application."
The CFUUID documentation states:
UUIDs ... are 128-bit values guaranteed to be unique. A UUID is made unique over both space and time by combining a value unique to the computer on which it was generated—usually the Ethernet hardware address—and a value representing the number of 100-nanosecond intervals since October 15, 1582 at 00:00:00.
Translation in my head: "There is definitely no way that any two CFUUIDs are ever the same across the set of all instances of my application."
The fact that NSManagedObjectIDs are described as a "universal identifier" makes me almost certain that they offer the same uniqueness as a CFUUID, whereas "unique across space and time" leaves absolutely no room for doubt. Can anybody with more Core Data experience than me confirm or deny my thoughts?
Beyond uniqueness, there is one case where the object ID will change, and that's if you query it before persisting the object to disk. After saving, it will have a different ID. Beyond that point, the ID will not change. I just wanted to point this out because it caused me a bit of confusion until I figured out what was happening.
I can't comment on the hashing used to generate the NSManagedObjectID, but it does seem like the odds of it matching another NSManagedObject are vanishingly small, based on looking at the IDs generated.