Kafka: Generating unique IDs for strings across partitions

I'm trying to assess whether Kafka could be used to scale out our current solution.
I can identify partitions easily. Currently, the requirement is for 1500 partitions, each having 1-2 events per second, but in the future it might go as high as 10000 partitions.
But there is one part of our solution that I don't know how to solve in Kafka.
The problem is that each message contains a string and I want to assign a unique ID to each string across the whole topic. So same strings have the same ID while different strings have different IDs. The IDs don't need to be sequential, nor do they need to be always-growing.
The IDs will then be used down-stream as unique keys to identify those strings. The strings can be hundreds of characters long, so I don't think they would make efficient keys.
More advanced usage would be where messages might have different "kinds" of strings, so there would be multiple unique sequences of IDs. And messages will contain only some of those kinds depending on the type of the message.
Another advanced usage would be that the values are not strings but structures, and whether two structures are the same would be decided by a more elaborate rule, such as: if PropA is equal, the structures are equal; if not, the structures are equal if PropB is equal.
To illustrate the problem: Each partition is a computer in a network. Each event is an action on that computer. Events need to be ordered per computer so that events that change the state of the computer (e.g. a user logged in) can affect other types of events, and ordering is critical for that. Examples: the user opened an application, a file was written, a flash drive was inserted, etc. And I need each application, file, flash drive, and many other things to have unique identifiers across all computers. This is then used to calculate statistics downstream. Sometimes an event can have several of those, e.g. an operation on a specific file on a specific flash drive.

There is a very nice post about Kafka and blockchain. It is a collective effort, and I think it could solve your ID scalability issue. For the solution, refer to the "Blockchain: reasons." part. All credit goes to the respective authors.
The idea is simple, yet efficient:
Data is hash based, with a link to the previous block
Data may very well be the same hashes, with links to the respective blocks of each type
A custom blockchain solution means you are in control of data encoding/decoding
Each hash chain is self-contained, and essentially may be your process (hdd/ram/cpu/word/app etc.)
Each hash chain may be a message itself
Bonus: statistics and analytics may very well be stored in the blockchain, with good support for compression and replication. Consumers are pretty cheap in that context (scalability).
Pros:
Unique identifier issue solved
All records are linked and, thanks to Kafka and blockchain, highly ordered
Data is extendable
Kafka properties apply
Cons:
Encryption/Decryption is CPU intensive
Growing level of hash calculation complexity
Problem: without the problem context it's hard to estimate the limitations that need to be addressed further. However, assuming the calculated solution is finite in nature, you should have no issues scaling it in the usual way.
Bottom line:
Without knowing the requirements in terms of speed/cost/quality it's hard to give a better-backed answer with a working example. Extending CPU capacity in the cloud may be comparatively cheap; data storage depends on how long and how much data you want to store, replayability, etc. It's a good chunk of work. Prototype? See the concept in the referenced article.
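One minimal reading of the "data is hash based" idea above, leaving the chain linkage aside, is to derive each ID directly from the string's content. Any producer or consumer on any partition then computes the same ID for the same string without coordination. The sketch below is an illustration, not the method from the referenced post: the function name `string_id`, the `kind` namespace and the 128-bit truncation are assumptions, and it presumes that a negligible (but non-zero) collision probability is acceptable.

```python
import hashlib


def string_id(value: str, kind: str = "default", digest_bytes: int = 16) -> str:
    """Derive a deterministic, fixed-size ID from a string.

    `kind` separates the ID spaces (e.g. "file", "application"), so the
    same text under two kinds gets two different IDs.
    """
    # Length-prefix the kind so ("ab", "c") and ("a", "bc") hash differently.
    payload = f"{len(kind)}:{kind}:{value}".encode("utf-8")
    # Truncating SHA-256 keeps keys short; fewer bytes means more collision risk.
    return hashlib.sha256(payload).hexdigest()[: digest_bytes * 2]


if __name__ == "__main__":
    print(string_id(r"C:\Program Files\Editor\editor.exe", kind="application"))
    print(string_id(r"E:\reports\q3.xlsx", kind="file"))
```

The IDs are neither sequential nor growing, which the question allows. Structures with a custom equality rule would first need to be reduced to a canonical string before hashing.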

Why does DHT hash the filenames?

One of the objectives of a DHT is to partition the keyspace, so each node (or group of them) has a share of it. To do so, it hashes the filename of a file that is to be saved and stores it in the node responsible for this part of the network. But, why does it have to hash the filename? Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
But, why does it have to hash the filename?
It doesn't have to be a filename. It can hash other things too. E.g. file contents. Or metadata. Or cryptographic keys used as identities of users in the network.
Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
Because filenames are not uniformly distributed throughout the possible keyspace (how often do you see filenames starting with some exotic Unicode character?) and their entropy is spread over a variable length, leading to even more clustering at the top level.
If you were to index all existing Unix filesystems in the world you would have massive clustering around the /etc/... prefix, for example.
There are other p2p network overlays that can deal with heavy clustering in the keyspace, often by rearranging the nodes around the hotspots to increase network capacity in the affected regions of the keyspace, e.g. based on Levenshtein distance, but they generally aren't distributed hash tables because they do not employ hashing.
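The clustering argument can be illustrated with a small, hedged experiment. The sample paths, the four-node setup and the alphabet ranges are made up for illustration, and taking the hash modulo the node count is a simplification of how a real DHT assigns keys to nodes.

```python
import hashlib
from collections import Counter

NUM_NODES = 4

# Hypothetical sample: most paths share the /etc prefix, as in the example above.
paths = [f"/etc/conf.d/service{i}.conf" for i in range(800)]
paths += [f"/home/user/doc{i}.txt" for i in range(200)]


def node_by_name(path: str) -> int:
    """Dictionary-style placement: split the alphabet into fixed ranges."""
    first = path.lstrip("/")[0].lower()
    # Up to 'f' -> node 0, up to 'm' -> node 1, up to 's' -> node 2, rest -> node 3.
    for node, upper in enumerate("fms"):
        if first <= upper:
            return node
    return NUM_NODES - 1


def node_by_hash(path: str) -> int:
    """Hash-style placement: the digest spreads similar names apart."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES


by_name = Counter(node_by_name(p) for p in paths)
by_hash = Counter(node_by_hash(p) for p in paths)
print("by name:", sorted(by_name.items()))
print("by hash:", sorted(by_hash.items()))
```

With name ranges, two nodes hold everything and two sit idle; with hashing, each node should end up with roughly a quarter of the keys.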
Because searches are done on numbers.
When you hash a file, you end up with a number, and that number will be stored in the k-buckets of the K peers whose IDs are nearest to it.
Names are irrelevant: you're performing XOR searches on a numeric space, so that every hop roughly halves the space left to search.
Once you find a peer that has the bucket the hash points to, you can communicate with that peer and exchange the related information.
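The XOR search described above can be sketched in a few lines. This is a toy model under stated assumptions, not libtorrent's implementation: the helper names are hypothetical, IDs are plain integers, and a real node would query peers over the network rather than sort a local list.

```python
import hashlib


def key_id(data: bytes) -> int:
    """Turn arbitrary content into a 160-bit number, as a SHA-1 based DHT would."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: the XOR of two IDs, read as an integer."""
    return a ^ b


def closest_peers(target: int, peers: list, k: int = 3) -> list:
    """Return the k known peer IDs nearest to the target under XOR distance."""
    return sorted(peers, key=lambda peer: xor_distance(peer, target))[:k]


def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the highest differing bit, i.e. which k-bucket `other_id` falls in.

    Only meaningful for two different IDs (equal IDs would give -1).
    """
    return (own_id ^ other_id).bit_length() - 1


if __name__ == "__main__":
    peers = [key_id(f"peer-{i}".encode()) for i in range(50)]
    target = key_id(b"contents of some file")
    for peer in closest_peers(target, peers):
        print(f"{peer:040x}")
```

Each differing high bit places a peer in a more distant bucket, which is why resolving one more bit per hop halves the remaining candidates.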
A DHT, like libtorrent's Kademlia implementation, is better seen as a distributed routing data structure. The problem you're solving is how to find a number among billions of numbers, or a peer among millions, in the fewest hops possible. The answer is that every node on the network has to follow a set of simple rules for how to organize the numbers it stores and the peers it knows about.
I recommend you read these notes on how a real DHT actually works.
https://gist.github.com/gubatron/cd9cfa66839e18e49846
Also, storing a number takes a lot less space than storing a word.
If you know the word, you can hash the word and search for the hash.
Yes, it could work like a dictionary. However, it would be missing some desirable (for the typical DHT use case) emergent properties that come from using a hash.
One property that hashing (along with the XOR distance metric) gives you is an even distribution of content amongst all the nodes participating in a DHT. "Even" here is caveated by how the k-bucket data structure works (see the k-bucket overview slides), but in aggregate, nodes distribute data evenly amongst the DHT peers, at least in theory. In practice, you can get hotspots.
Another property of using a hash is if you're looking for a file with specific contents. So, if you use hashes of the file contents as the identifiers, you can be... statistically sure (the guarantee comes from your hash function collision properties) that you're getting the contents you're looking for. Relying on a filename introduces a level of indirection that can serve different contents for the same file. Depending on your use case, that's acceptable or not.
I've considered what you're proposing before as a prefix to a SHA-1 hash. So, something like node1-cd9cf... (the prefix could be anything really, doesn't need to be human readable). This would ensure that all the things with that prefix end up pretty much on a node that identifies itself with an id starting with "node1-". But, you'd have to have a DHT implementation (including k-bucket implementation) that supports variable length ids. In this case, you're guaranteeing a hotspot. It's an equivalent of artificially ensuring that things are "close together" as in the difference between them in the XOR metric is very small. Why would anyone want to do this? For example: com.example.www-cd9cf... combined with some crypto could ensure that while you're participating in a DHT, the data is stored on your servers. I haven't seen this implemented before though.

One big and wide table or many not so big for statistics data

I'm writing a simple analytics system for my company. I have about 100 different event types that should be collected for tens of projects. We are not interested in cross-project analytic queries, but the event types are similar across all projects. I use PostgreSQL as the primary storage for this system. Now I have to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all types of events. It would have about 20 or more columns, many of them nullable. Partitioning might be used to split this table by event type, but the table would still be that wide.
The second architecture is a lot of tables (fairly big in terms of row count but not so wide), with one table per event type.
I'm going to retrieve analytic data from these tables using various join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues with rolling up your data without building either a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQL's "document store" datatypes. You could store this information in arrays, hstores, or json. This will be fairly inefficient if you're running a ton of aggregate functions, as you might be left calculating the aggregates outside of PostgreSQL or, at a minimum, running an inefficient query. You could store the 10 common fields in normal columns and the additional ones as hstore or json.
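As a rough sketch of that last layout, the snippet below keeps a few common attributes as ordinary columns and puts the type-specific ones in a single JSON document per row. It uses Python's built-in SQLite as a stand-in for PostgreSQL, with a TEXT column standing in for json or hstore; the table, column and event names are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        event_type TEXT NOT NULL,              -- common attribute
        user_id    INTEGER NOT NULL,           -- common attribute
        created_at TEXT NOT NULL,              -- common attribute
        extra      TEXT NOT NULL DEFAULT '{}'  -- type-specific attributes as JSON
    )
    """
)

sample = [
    ("purchase", 1, "2024-01-01", {"amount": 30}),
    ("purchase", 2, "2024-01-01", {"amount": 12}),
    ("login", 1, "2024-01-02", {"method": "password"}),
]
conn.executemany(
    "INSERT INTO events (event_type, user_id, created_at, extra) VALUES (?, ?, ?, ?)",
    [(etype, user, day, json.dumps(extra)) for etype, user, day, extra in sample],
)

# Aggregates over common columns stay in plain SQL.
counts = dict(conn.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type"))

# Aggregates over type-specific fields are computed outside the database here,
# which is the inefficiency the answer warns about.
purchases = conn.execute("SELECT extra FROM events WHERE event_type = 'purchase'")
total_amount = sum(json.loads(extra)["amount"] for (extra,) in purchases)

print(counts, total_amount)
```

In PostgreSQL itself, the JSON operators could push the second aggregate back into the query, at the cost the answer describes.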
I didn't ask, but it would be nice to know whether each event within a project has more than 1 data point (i.e. are you logging changes, or just updating data). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.

Storing two way relational data in Redis

Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Storing users and books (titles + authors + maybe more data in the future) is relatively simple, because they can just be stored as hashes with keys user:username and book:uniqueID respectively. Storing which users read which books, and when, is proving to be a bit more of a challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach is that, rather than directly storing book IDs in the set user:username:readbooks, I instead store a value of the form uniqueReadingEventId.bookId. The problem with this is that if I delete a book (rather than a unique reading event) I have to iterate through every user in the set readersof:bookID, then iterate through every value in user:username:readbooks and delete the values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore twofold: is there a simpler way to structure my data in Redis, or is my data better suited to a different NoSQL system? I would really like to continue working with Redis because I like its API; however, because it is a personal project, it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document level information, and neither high-throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic-- and from what you've said, there's really no reason SQL wouldn't best and most simply fit your use case. If you're married to the idea of using NoSQL, one of the more general use-case databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.
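To make the SQL suggestion concrete, here is a minimal sketch using Python's built-in SQLite. The schema, table names and sample data are assumptions for illustration, not something from the question. A separate row per reading event allows re-reads, and a cascading foreign key handles book deletion without manual iteration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for ON DELETE CASCADE
conn.executescript(
    """
    CREATE TABLE users (username TEXT PRIMARY KEY);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT NOT NULL);
    CREATE TABLE readings (
        id       INTEGER PRIMARY KEY,  -- one row per reading event
        username TEXT NOT NULL REFERENCES users(username),
        book_id  INTEGER NOT NULL REFERENCES books(id) ON DELETE CASCADE,
        read_at  TEXT NOT NULL
    );
    """
)
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",), ("carol",)])
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Dune", "Herbert"), (2, "Emma", "Austen"), (3, "Ulysses", "Joyce")],
)
conn.executemany(
    "INSERT INTO readings (username, book_id, read_at) VALUES (?, ?, ?)",
    [
        ("alice", 1, "2023-01-05"),
        ("alice", 1, "2024-02-10"),  # the same book read a second time
        ("alice", 2, "2023-03-01"),
        ("bob", 1, "2023-06-15"),
        ("bob", 2, "2023-07-20"),
        ("carol", 3, "2023-09-09"),
    ],
)

# Readers of one book, with no second structure to keep in sync.
readers = [row[0] for row in conn.execute(
    "SELECT DISTINCT username FROM readings WHERE book_id = 1 ORDER BY username")]

# Pairs of users with two or more books in common.
pairs = conn.execute(
    """
    SELECT a.username, b.username, COUNT(DISTINCT a.book_id)
    FROM readings a
    JOIN readings b ON a.book_id = b.book_id AND a.username < b.username
    GROUP BY a.username, b.username
    HAVING COUNT(DISTINCT a.book_id) >= 2
    """
).fetchall()

# Deleting a book removes its reading events through the cascade.
conn.execute("DELETE FROM books WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(readers, pairs, remaining)
```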

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

Suppose, as an example, that you have a blogging website that uses MongoDB to store its data.
Is it better to have a database per blogger, given that their blogs and comments are completely independent from other bloggers? Or should I just lump everything together? Or doesn't it make much difference?
I'm imagining that the same web app (not independent sites/URLs per blogger) is used by all bloggers. So when someone logs in or accesses a blog, the code would find the right database to use and pull the data out of it.
Does this have any downsides? Is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
Single collection per customer; never, ever do this.
Single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with "smallfiles" option. As stated, you will build the routing system for your environment. Thus, you will want to connect to the proper database for the proper customer.
customer_id key per document + path slug. Good. The trade off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents. Thus customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will recover disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
To recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
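For the single-database case, the "average post count per user" example could look roughly like the aggregation pipeline below. This is a sketch under assumptions: posts are taken to live in one collection with a `customer_id` field (the field name comes from the other answer, the helper names are hypothetical), and the pure-Python function only mirrors what the two stages compute so the logic can be checked without a server.

```python
from collections import Counter


def avg_posts_pipeline(tenant_field: str = "customer_id") -> list:
    """Build a two-stage aggregation pipeline: count posts per user, then average."""
    return [
        # Stage 1: one output document per user, with that user's post count.
        {"$group": {"_id": f"${tenant_field}", "posts": {"$sum": 1}}},
        # Stage 2: collapse all users into a single average.
        {"$group": {"_id": None, "avg_posts": {"$avg": "$posts"}}},
    ]


def avg_posts_in_python(post_docs, tenant_field: str = "customer_id") -> float:
    """Pure-Python mirror of the pipeline above; expects at least one document."""
    per_user = Counter(doc[tenant_field] for doc in post_docs)
    return sum(per_user.values()) / len(per_user)


if __name__ == "__main__":
    docs = [{"customer_id": "a"}, {"customer_id": "a"}, {"customer_id": "b"}]
    print(avg_posts_pipeline())
    print(avg_posts_in_python(docs))
```

With a driver such as PyMongo, the pipeline would be passed to the collection's `aggregate` method once. With one database per blogger, the per-user counts would have to be gathered database by database and averaged in application code.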
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate hundreds of databases if that is all you will see.
Database separation can have many benefits. Databases can be sharded independently, since sharding occurs at the database level. Databases also have the upside of being completely isolated instances (including locks) of the data within them (good example: space allocation occurs at the database level).
This means they can be moved around the network as a user's data is accessed more, and since a single user's data might not be that big, it would be easier than moving all of your users' data to a more powerful node.
However, you must consider the problematic side in the application of managing the connections to each database. There will be overhead, and you will need far more complex code than what is considered standard.
Considering space, you will not see a drastic usage of space. The most problematic part of using separate databases is the journal allocation. Every collection you use in separate databases will also, of course, pre-allocate itself but this is actually considered one of the upsides to using database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogging site, no, and I do not know enough about the complexities of your scenario to say any different. Normal operation would be to lump everything together, since you could get into the region of thousands, maybe millions, of users, and database separation just won't scale very well beyond that.

Memcached best practices - small objects and lots of keys or big objects and few keys?

I use memcached to store the integer result of a complex calculation. I've got hundreds of integer objects that I could cache! Should I cache them under a single key in a more complex object or should I use hundreds of different keys for the objects? (the objects I'm caching do not need to be invalidated more than once a day)
I would say lots of little keys. This way you can get the exact result you want in 1 call with minimal serialization effort.
If you store it in another object (an array for example) you will have to fetch the array from cache and then fetch the item you actually want again from that array, plus you have the overhead of serializing/deserializing the whole complex object again. Depending on your language of choice this might mean manually writing a serialization/deserialization function from scratch.
I wrote a somewhat large analysis at http://dammit.lt/2008/12/25/memcached-for-small-objects/ that outlines how to optimize memcached for small object storage. It may shed some light on the issue.
It depends on your application. While memcached is very fast, it does require some request transmission and memory lookup time per request. Those numbers increase depending on whether or not the server is on the local machine (localhost), on the local network, or across a wide area. The size of your cache generally doesn't affect the lookup speed.
So, if your application is using MANY objects per processing unit (per request, method, or what-have-you), then it's generally better to define your cache in a way which lowers total number of hits to the cache while at the same time trying not to duplicate cache data. Like everything else, it's a balance.
For example, if you have a web request which pulls a list of blog posts, it would be more beneficial to cache the entire object list under one memcached key, rather than (and this is a somewhat bad example, obviously) caching an array of cache keys for that list, each of which points to an individually cached object.
The less processing you have to do of the cached values, the better. So why not just dump them into the cache individually?
I would say you should store values individually and use some kind of helper class to retrieve values with multiget and generate a complex data object for you.
It depends on what those numbers are. If you could, for example, group them in ranges, then you could optimize the storage. Hashing them into a map or hashtable and storing that map serialized in memcached would be good too.
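A helper of that kind might look like the sketch below. It is written against any client object that offers a `get_many(keys)` method returning a dict of the keys that were found, which is how pymemcache's client behaves as far as I know. The class names and the `calc:` key prefix are hypothetical, and the in-memory fake only exists so the example runs without a memcached server.

```python
from typing import Dict, Iterable


class FakeCache:
    """In-memory stand-in for a memcached client, for demonstration only."""

    def __init__(self) -> None:
        self._data: Dict[str, object] = {}

    def set(self, key: str, value: object) -> None:
        self._data[key] = value

    def get_many(self, keys: Iterable[str]) -> Dict[str, object]:
        # Like a multiget: missing keys are simply absent from the result.
        return {key: self._data[key] for key in keys if key in self._data}


class CalcResults:
    """Fetch many small cached integers in one round trip and bundle them."""

    def __init__(self, client, prefix: str = "calc:") -> None:
        self._client = client
        self._prefix = prefix

    def load(self, names: Iterable[str]) -> Dict[str, int]:
        key_to_name = {self._prefix + name: name for name in names}
        found = self._client.get_many(list(key_to_name))
        # Values may come back as bytes or str; int() accepts both.
        return {key_to_name[key]: int(value) for key, value in found.items()}


if __name__ == "__main__":
    cache = FakeCache()
    cache.set("calc:daily_total", 1200)
    cache.set("calc:weekly_total", b"8400")
    print(CalcResults(cache).load(["daily_total", "weekly_total", "monthly_total"]))
```

Names missing from the result are cache misses, which the caller can recompute and store individually.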
Anyway, you can save many little keys, just make sure you configure the slabs to have chunks with small size, so you will not waste memory space.