Concurrent hash-tables

I have been investigating hash-table implementations for multi-threaded environments. Are there any newer articles or research from the 2010-2014 period? Do you know of such scientific papers?
So far I have not found much information about it:
1. Relativistic Causal Ordering: A Memory Model for Scalable Concurrent Data Structures
2. CPHash: A Cache-Partitioned Hash Table

Have you looked into Cuckoo Hashing and Hopscotch Hashing?
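For reference, the core trick of cuckoo hashing is small enough to sketch. Below is a minimal sequential version in Python (not one of the concurrent variants from the literature); the table size and the salted second hash function are illustrative assumptions:

```python
# Cuckoo hashing sketch: two tables, two hash functions, and an insert
# that evicts the current occupant of a full slot into its alternate
# table. Sequential only; concurrent designs need extra machinery.
class CuckooHash:
    def __init__(self, size: int = 11):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slots(self, key):
        # Two independent-ish slot choices; the "salt" trick is just for the sketch.
        return [hash(key) % self.size, hash(str(key) + "salt") % self.size]

    def insert(self, key, max_kicks: int = 32) -> bool:
        table = 0
        for _ in range(max_kicks):
            slot = self._slots(key)[table]
            if self.tables[table][slot] is None:
                self.tables[table][slot] = key
                return True
            # Slot occupied: kick the occupant out and try to re-home it.
            key, self.tables[table][slot] = self.tables[table][slot], key
            table = 1 - table
        return False  # too many kicks; a real table would rehash or grow

    def contains(self, key) -> bool:
        return any(self.tables[t][s] == key for t, s in enumerate(self._slots(key)))

t = CuckooHash()
for k in ["alpha", "beta", "gamma"]:
    t.insert(k)
assert t.contains("beta") and not t.contains("delta")
```

Hopscotch hashing takes a different route: every key is kept within a small, fixed-size neighborhood of its home bucket, which keeps lookups to a handful of adjacent slots.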


How can I make sure that a hash function won't produce the same cypher for 2+ different entries?

Edit: some people flagged this question as a potential duplicate of this other one. While I agree that knowing how the birthday paradox applies to hash functions is relevant, the two questions (and respective answers) address two different, albeit related, subjects.
The other question asks "what are the odds of collision", whereas this question's main focus is "how can I make sure that collisions never happen".
I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.
Due to how the pipeline is built, it is possible for a very inconsiderate user with admin access to produce duplicates in said data lake by manually interacting with the dump files coming from our OLTP database and triggering the ETL script when it's not supposed to run.
I thought that a good way to prevent data duplication was to add a safeguard to my ETL script (sketched below):
1. Produce a hash for each entry.
2. Store said hashes somewhere else (like a DynamoDB table).
3. Whenever new data comes in, hash it as well and compare it with the already existing hashes.
4. If any new hash is already among the existing hashes, reject the associated entry entirely.
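A rough sketch of those steps in Python, assuming each entry is a JSON-serializable dict and using an in-memory set where the real pipeline would use a DynamoDB table (entry_hash and dedupe are made-up names for the example):

```python
import hashlib
import json

def entry_hash(entry: dict) -> str:
    """Hash an entry's canonical JSON form with SHA-256."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(new_entries, seen_hashes: set):
    """Yield only entries whose hash has not been seen before."""
    for entry in new_entries:
        h = entry_hash(entry)
        if h not in seen_hashes:
            seen_hashes.add(h)  # in the real pipeline: a conditional write to DynamoDB
            yield entry

seen = set()
batch = [{"id": 1, "value": "a"}, {"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
assert len(list(dedupe(batch, seen))) == 2  # the duplicate entry is rejected
```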
However, I know very little about hashing, and I was reading that, although it is unlikely, two different inputs can produce the same hash.
I understand it's really hard for it to happen in this situation, but I was wondering if there is a way to be 100% sure about it.
Any idea is much appreciated.
Long answer: what you want to study and explore is called "perfect hashing", i.e. hashing guaranteed not to have collisions: https://en.wikipedia.org/wiki/Perfect_hash_function
Short answer: a cryptographic collision-resistant algorithm like SHA-1 is probably safe to use for all but the largest (PBs a day) datasets, and even then it's probably all right. Git uses SHA-1 internally, code repositories probably deal with the most files on the planet, and they rarely have collisions.
See for details: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:text=Git%20uses%20hashes%20in%20two,computed%20when%20it%20was%20stored.
Medium answer: this is actually a pretty hard problem overall and a frequent area of study in computer science, and a lot depends on your particular use case and the context you're operating in. Cuckoo hashing, collision-resistant algorithms, and hashing in general are all good terms to research. There is also a lot of art and science behind the space (memory) and time (compute) trade-offs involved in picking these methods. A good rule of thumb is that perfect hashing will generally take up more space and time than a collision-resistant cryptographic hash like SHA-1.
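To put rough numbers on "probably safe": the standard birthday-bound approximation for the probability of at least one collision among n random b-bit hashes is roughly n² / 2^(b+1). The entry counts below are made-up examples, not figures from the question:

```python
def collision_probability(n_entries: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= n^2 / 2^(b+1), valid for small p."""
    return n_entries ** 2 / 2 ** (hash_bits + 1)

print(collision_probability(10 ** 9, 160))  # ~3.4e-31 for a billion SHA-1 hashes
print(collision_probability(10 ** 9, 256))  # ~4.3e-60 for a billion SHA-256 hashes
```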

Hash Table With Adaptive Hash Function

The performance of a particular hash table depends heavily on both the keys and the hash function. Obviously one can improve performance greatly by trying different hash functions on the incoming elements and picking the one that results in the fewest collisions. Are there any publications on this subject, exploring methods of selecting such functions dynamically, with or without user guidance?
I doubt there is a formal process to choose the best one. There are too many moving parts, especially when it comes to performance: there is no single "best performance" approach. Is it best latency? Throughput? Memory usage? CPU usage? More reads? More writes? Concurrent access? And so on.
The only sensible way is to run performance tests for your specific code and use cases and choose what works for you.
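As one concrete way to run such a test, here is a rough sketch that counts collisions for a few candidate hash functions over a sample of your actual keys and keeps the least-colliding one; the candidates, the bucket count, and the function names are illustrative assumptions, not a published method:

```python
import hashlib
from collections import Counter

def md5_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def sha1_hash(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

CANDIDATES = {"builtin": hash, "md5": md5_hash, "sha1": sha1_hash}

def least_colliding(keys, buckets: int = 1024):
    """Return the (name, function) pair with the fewest bucket collisions on `keys`."""
    def collisions(fn):
        counts = Counter(fn(k) % buckets for k in keys)
        return sum(c - 1 for c in counts.values() if c > 1)
    return min(CANDIDATES.items(), key=lambda item: collisions(item[1]))

name, fn = least_colliding(["alice", "bob", "carol", "dave", "erin"])
```

This only measures collision counts; as noted above, latency, throughput, and memory would need their own benchmarks.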

Can someone please explain the concept of causality in distributed computing?

I'm reading up on consistency models but can't seem to understand the concept of causality in distributed systems. I've googled quite a lot but don't find a good explanation of the concept. People usually explain why causality is a good thing, but what is the basic concept?
Assuming you are asking about the basic notion of causal relationships among events in distributed systems, the following may help get you on the right track.
In the absence of perfectly synchronised clocks shared by all processes of a distributed system, Leslie Lamport introduced the notion of Logical Clocks. A Logical Clock affords the establishment of a partial order over the events occurring in a distributed system via the so-called happened-before relationship, a causal relationship.
To illustrate a bit further, events on the same machine can be ordered by relying on the local clock. However, this is not generally an option for events that cross process boundaries. For message-passing events we instead use the following insight: send(m) at process p occurs before receive(m) at process q. This is what lets us establish a causal relationship among events on different machines.
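A minimal sketch of a Lamport logical clock in Python may make the rule concrete; the Process class and method names are illustrative, not taken from the paper:

```python
class Process:
    """Toy process carrying a Lamport logical clock."""
    def __init__(self, name: str):
        self.name = name
        self.clock = 0

    def local_event(self) -> int:
        self.clock += 1
        return self.clock

    def send(self, message: str):
        self.clock += 1
        return message, self.clock          # the timestamp travels with m

    def receive(self, message: str, timestamp: int) -> int:
        # Jump past the sender's timestamp, so receive(m) is ordered after send(m).
        self.clock = max(self.clock, timestamp) + 1
        return self.clock

p, q = Process("p"), Process("q")
m, ts = p.send("m")                         # send(m) at process p
assert q.receive(m, ts) > ts                # receive(m) at process q happens after
```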
I am not sure how helpful my explanation is, but, if you have not already done so, Leslie Lamport's original paper Time, Clocks, and the Ordering of Events in a Distributed System should help clear things up for you. Next, you may want to look at Spanner: Google's Globally Distributed Database for a creative way to deal with the issue of time in a distributed system (TrueTime).
Hope this helps.
Taken from https://jepsen.io/consistency/models/causal
Causal consistency captures the notion that causally-related operations should appear in the same order on all processes—though processes may disagree about the order of causally independent operations.
For example, consider a chat between three people, where Attiya asks “shall we have lunch?”, and Barbarella & Cyrus respond with “yes”, and “no”, respectively. Causal consistency allows Attiya to observe “lunch?”, “yes”, “no”; and Barbarella to observe “lunch?”, “no”, “yes”. However, no participant ever observes “yes” or “no” prior to the question “lunch?”.

Ensuring a hash function is well-mixed with slicing

Forgive me if this question is silly, but I'm starting to learn about consistent hashing. After reading Tom White's blog post on it here, and realizing that most default hash functions are NOT well mixed, I had a thought on ensuring that an arbitrary hash function is at least minimally well-mixed.
My thought is best explained using an example like this:
Bucket 1: 11000110
Bucket 2: 11001110
Bucket 3: 11010110
Bucket 4: 11011110
Under a standard hash ring implementation for consistent caching across these buckets, you would get terrible performance, and nearly every entry would be lumped into Bucket 1. However, if we use bits 4 and 5 as the MSBs in each case, then these buckets are suddenly excellently mixed, and assigning a new object to a cache becomes trivial and only requires examining 2 bits.
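Here is a minimal Python sketch of that idea, assuming 8-bit hash values and counting bit positions from the most significant bit, as in the example above; the function name is made up for the illustration:

```python
def bucket_for(key_hash: int) -> int:
    """Pick one of the four buckets using bits 4-5 (MSB-first) of an 8-bit hash."""
    two_bits = (key_hash >> 3) & 0b11   # isolate bits 4 and 5 of the 8-bit value
    return two_bits + 1                 # buckets numbered 1..4 as in the example

# The four bucket identifiers above map to buckets 1 through 4:
assert [bucket_for(b) for b in (0b11000110, 0b11001110, 0b11010110, 0b11011110)] == [1, 2, 3, 4]
```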
In my mind this concept could very easily be extended when building distributed networks across multiple nodes. In my particular case I would be using this to determine which cache to place a given piece of data into. The increased placement speed isn't a real concern, but ensuring that my caches are well-mixed is, and I was considering just choosing a few bits that are optimally mixed for my given caches. Any information indexed later would be indexed on the basis of the same bits.
In my naive mind this is a much simpler solution than introducing virtual nodes or building a better hash function. That said, I can't see any mention of an approach like this and I'm concerned that in my hashing ignorance I'm doing something wrong here and I might be introducing unintended consequences.
Is this approach safe? Should I use it? Has this approach been used before and are there any established algorithms for determining the minimum unique group of bits?

Iterative steps / best practices for transitioning RDBMS to Cassandra

What are the iterative steps / best practices for transitioning from an RDBMS to Cassandra? Is there a benefit to denormalization of the RDBMS schema prior to the move (beyond the improved scalability of the RDBMS itself)?
That's quite a question.
I would start by reading about the data model, especially the "Thinking in Terms of Queries" section.
The goal is to do as few queries per "action" that you need to perform as possible. This frequently requires denormalization, sometimes in more than one way. There are also quite a few tricks that sometimes need to be used to reach that goal; the Twissandra example in the linked documentation demonstrates a couple of common ones.
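As a rough illustration of that query-driven, denormalized style (in the Twissandra spirit), here is a sketch using the DataStax Python driver; the keyspace, table, and columns are assumptions made up for the example, not taken from the linked documentation:

```python
from cassandra.cluster import Cluster

# Connect to a local node and an existing (hypothetical) keyspace.
session = Cluster(["127.0.0.1"]).connect("twissandra_demo")

# Timeline reads are the hot "action", so tweets are denormalized into a
# per-follower table at write time: one partition per user, newest first.
session.execute("""
    CREATE TABLE IF NOT EXISTS timeline (
        username   text,
        tweet_time timeuuid,
        author     text,
        body       text,
        PRIMARY KEY (username, tweet_time)
    ) WITH CLUSTERING ORDER BY (tweet_time DESC)
""")

# Reading a user's timeline is then a single-partition query.
rows = session.execute(
    "SELECT author, body FROM timeline WHERE username = %s LIMIT 20",
    ["alice"],
)
```

The write path pays for this: each tweet gets inserted once per follower, which is exactly the kind of denormalization trade-off the linked section discusses.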
It's easier to give specific suggestions once you have specific requirements.