InfluxDB buckets or tags for storing users? - tags

This is my first time working with InfluxDB, and I'm trying to optimize a project that might have a lot of users. My question is: should the time series data for each user be stored with different tags or in different buckets?

Well, it depends.
If the number of users is bounded and not too large (relative to the number of values your other tags take), and you will often filter or "group by" user, store the user as a tag.
Otherwise, store the user as a field.
Tag values are indexed in InfluxDB and field values aren't. Tags consume a lot of memory, so you want to spend that resource on the most important things. The more tags you have and the more distinct values each tag takes, the higher the cardinality, the higher the memory usage, and the more likely you are to hit an out-of-memory (OOM) error.
See more best practices here.
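To make the cardinality point concrete, here is a minimal sketch in plain Python (no InfluxDB client involved). The measurement name "activity", the "device" tag, and the "steps" field are made up for illustration, and the line protocol formatter is simplified: it does no escaping and writes numbers without a type suffix.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as simplified InfluxDB line protocol (no escaping)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    # String field values are double-quoted; numbers are written as-is.
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def series_cardinality(points):
    """Count distinct (measurement, tag set) combinations, i.e. series."""
    return len({(m, tuple(sorted(tags.items()))) for m, tags, _fields in points})


# Hypothetical data: 1000 users, each reporting from 2 devices.
users = [f"u{i}" for i in range(1000)]
devices = ["phone", "laptop"]

# Option A: user is a tag (indexed, cheap to filter and group, raises cardinality).
as_tag = [("activity", {"user": u, "device": d}, {"steps": 10})
          for u in users for d in devices]

# Option B: user is a field (not indexed, cardinality stays low).
as_field = [("activity", {"device": d}, {"user": u, "steps": 10})
            for u in users for d in devices]

print(series_cardinality(as_tag))    # 2000
print(series_cardinality(as_field))  # 2
print(to_line_protocol(*as_tag[0], 1700000000000000000))
```

With the user as a tag, the series count grows with users times devices; with the user as a field it stays at the number of devices, at the price of unindexed lookups by user.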

Related

What is the best way to store data where one column has values that repeat ranging anywhere from 1-300+ times?

I've used web scraping to grab approximately 10,000 movies and all their associated review pages URLs, and the next step for me is to grab every single one of those reviews so that I can get the overall positive/negative reviews using sentiment analysis.
I'm writing all this in Python and am using the Pandas library as my means of pre-processing and structuring all the data. Already I have around 36,000 rows containing the name of the movie in one column and the URLs in the other, with the movie name being repeated over and over again, and with the average reviews per page being 20 I'm looking at roughly 720,000 rows when all things are said and done.
This is for the final project of the college course I'm taking, and throughout my schooling I've come to fear data redundancy in databases. I will eventually be writing all of this to a PostgreSQL database so users can query any movie to get back the prediction, and I'm having a hard time overlooking the fact that these movie titles are being repeated so often.
I was wondering if there was a better way to go about this (which could also hopefully save me some processing time), any help would be greatly appreciated!
I feel like this is more of a direct question than a code issue, but if necessary I can provide any relevant code.
If the name is all the information you have about each movie, there is no redundancy (in the relational sense), since the name is the unique identifier.
You could save some space by having a separate movie table that contains an artificial numeric ID and the name and reference the ID from the main table, but that will make your queries more complicated and seems unnecessary for a small table like this.
What I would be more concerned about is whether the movie name is a good identifier at all: what if two movies have the same name? In this age of remakes, that is not a rarity.
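For reference, the separate movie table mentioned above could look roughly like the sketch below. It uses Python's built-in sqlite3 only so that it runs without a server; the same layout carries over to PostgreSQL. The table names, the `year` column used to tell remakes apart, and the example.com URLs are all assumptions for illustration, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (
        id    INTEGER PRIMARY KEY,   -- artificial numeric ID
        title TEXT NOT NULL,
        year  INTEGER NOT NULL,      -- hypothetical disambiguator for remakes
        UNIQUE (title, year)
    );
    CREATE TABLE review (
        id       INTEGER PRIMARY KEY,
        movie_id INTEGER NOT NULL REFERENCES movie(id),
        url      TEXT NOT NULL
    );
""")


def movie_id(title, year):
    """Return the ID for (title, year), inserting the movie if it is new."""
    conn.execute("INSERT OR IGNORE INTO movie (title, year) VALUES (?, ?)",
                 (title, year))
    row = conn.execute("SELECT id FROM movie WHERE title = ? AND year = ?",
                       (title, year)).fetchone()
    return row[0]


rows = [
    ("Dune", 1984, "https://example.com/r/1"),
    ("Dune", 2021, "https://example.com/r/2"),
    ("Dune", 2021, "https://example.com/r/3"),
]
for title, year, url in rows:
    conn.execute("INSERT INTO review (movie_id, url) VALUES (?, ?)",
                 (movie_id(title, year), url))

# Every query that needs the title now needs a join.
counts = conn.execute("""
    SELECT m.title, m.year, COUNT(*)
    FROM review r JOIN movie m ON m.id = r.movie_id
    GROUP BY m.id
    ORDER BY m.year
""").fetchall()
print(counts)  # [('Dune', 1984, 1), ('Dune', 2021, 2)]
```

The join in the last query is the extra complexity you pay for; in exchange, two movies that share a name stay distinct.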

Kafka : Generating unique IDs for strings across partitions

I'm trying to assess whether Kafka could be used to scale out our current solution.
I can identify partitions easily. The current requirement is 1500 partitions, each receiving 1-2 events per second, but in the future this might go as high as 10000 partitions.
But there is one part of our solution that I don't know how to solve in Kafka.
The problem is that each message contains a string and I want to assign a unique ID to each string across the whole topic. So same strings have the same ID while different strings have different IDs. The IDs don't need to be sequential, nor do they need to be always-growing.
The IDs will then be used down-stream as unique keys to identify those strings. The strings can be hundreds of characters long, so I don't think they would make efficient keys.
More advanced usage would be where messages might have different "kinds" of strings, so there would be multiple unique sequences of IDs. And messages will contain only some of those kinds depending on the type of the message.
Another advanced usage would be that the values are not strings but structures, and whether two structures are the same is decided by a more elaborate rule, such as: if PropA is equal, the structures are equal; if not, the structures are equal if PropB is equal.
To illustrate the problem: each partition is a computer in a network. Each event is an action on that computer, e.g. the user opened an application, a file was written, a flash drive was inserted, etc. Events need to be ordered per computer so that events that change the state of the computer (e.g. a user logged in) can affect other types of events; ordering is critical for that. I need each application, file, flash drive, and many other things to have unique identifiers across all computers. These are then used to calculate statistics downstream. Sometimes an event can involve several of them, e.g. an operation on a specific file on a specific flash drive.
There is a very nice post about Kafka and blockchain. It is a collective effort, and I think it could solve your ID scalability issue. For the solution, refer to the "Blockchain: reasons." part. All credit goes to the respective authors.
The idea is simple, yet efficient:
Data is hash-based, with a link to the previous block
The data may well be the hashes themselves, with links to the respective blocks for each type
A custom blockchain solution means you are in control of data encoding/decoding
Each hash chain is self-contained and may essentially correspond to one of your processes or resources (HDD/RAM/CPU/Word/app, etc.)
Each hash chain may itself be a message
Bonus: statistics and analytics may well be stored in the blockchain too, with good support for compression and replication. Consumers are pretty cheap in that context (scalability).
Pros:
Unique identifier issue solved
All records are linked and, thanks to Kafka and the blockchain, strictly ordered
Data is extendable
Kafka properties apply
Cons:
Encryption/Decryption is CPU intensive
Growing level of hash calculation complexity
Problem: without the problem context it's hard to estimate the limitations that would need to be addressed further. However, assuming the calculated solution is finite in nature, you should have no issues scaling the solution in the regular way.
Bottom line:
Without knowing the requirements in terms of speed/cost/quality it's hard to give a better, well-backed answer with a working example. Extra cloud CPU may be comparatively cheap; data storage depends on how long you want to keep the data, how much of it there is, replayability, etc. It's a good chunk of work. Prototype? The concept is in the referenced article.
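As a much smaller illustration of the "data is hash-based" point only (this is not the design from the referenced post, and there is no chaining here), the sketch below derives the ID from the string content itself. Every producer or consumer then computes the same ID for the same string without any coordination across partitions. The function name `string_id`, the `kind` parameter, and the truncation to 128 bits are my own assumptions; with a truncated hash, collisions are possible in theory, though extremely unlikely at this size.

```python
import hashlib


def string_id(value: str, kind: str = "default") -> str:
    """Derive a deterministic 128-bit ID (32 hex chars) from a string and its kind."""
    # The kind is length-prefixed so that different (kind, value) pairs
    # can never produce the same byte payload.
    payload = f"{len(kind)}:{kind}:{value}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


# Same string, same kind: same ID on every computer/partition.
a = string_id(r"C:\Program Files\App\app.exe", kind="application")
b = string_id(r"C:\Program Files\App\app.exe", kind="application")

# Same string, different kind: a separate ID space.
c = string_id(r"C:\Program Files\App\app.exe", kind="file")

print(a == b)  # True
print(a == c)  # False
print(len(a))  # 32
```

This covers plain strings and multiple "kinds" of strings. The rule where two structures are equal if either PropA or PropB matches cannot be expressed as a single hash and would need a lookup of some sort.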

MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
It's not necessary that the data has to be exactly up to date on every page hit/load, so that's why I was thinking to maintain the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or similar).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the ratio of reads to writes on the data? If the number of writes is small, it might be OK to keep counters in real time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is trickier and usually requires something like the multi-level approach discussed before. It often doesn't help if old data comes back super fast but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the queries remain the same with a certain chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
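To sketch the multi-level idea (hourly totals that are summed into daily totals instead of re-counting individual documents), here is a minimal in-memory version. The `Counter` only stands in for a pre-aggregated collection; the function names and sample timestamps are made up, and nothing here is MongoDB API.

```python
from collections import Counter
from datetime import datetime

# Stand-in for a pre-aggregated collection: one counter per hour.
hourly_counts = Counter()


def record_document_created(created_at: datetime) -> None:
    """Increment the hourly bucket at write time."""
    bucket = created_at.replace(minute=0, second=0, microsecond=0)
    hourly_counts[bucket] += 1


def daily_totals() -> dict:
    """Roll the hourly buckets up into per-day totals at read time."""
    totals = Counter()
    for bucket, count in hourly_counts.items():
        totals[bucket.date()] += count
    return dict(totals)


for ts in ["2024-01-01T09:15:00", "2024-01-01T09:45:00",
           "2024-01-01T17:05:00", "2024-01-02T08:00:00"]:
    record_document_created(datetime.fromisoformat(ts))

print(hourly_counts[datetime(2024, 1, 1, 9)])  # 2
print(daily_totals())
```

The dashboard reads a handful of small buckets rather than scanning the documents, and coarser levels (week, month) can be derived from the finer ones in the same way.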

Storing two way relational data in Redis

Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Whilst storing users and books (titles + authors + maybe more data in the future) is relatively simple, because they can just be stored as hashes with keys user:username and book:uniqueID respectively, storing which users read which books and when is proving to be a bit more of a challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach is, rather than directly storing book IDs in the set user:username:readbooks, to instead store a value in the form uniqueReadingEventId.bookId. The problem with this is that if I delete a book (rather than the unique reading event) I have to iterate through every user in the set readersof:bookID, iterate through every value in user:username:readbooks, and delete the values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore twofold: is there a simpler way to structure my data in Redis, or is my data better suited to a different NoSQL system? I would really like to continue working with Redis because I like its API; however, because it is a personal project, it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document-level information, and neither high throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic, and from what you've said, there's really no reason SQL wouldn't be the best and simplest fit for your use case. If you're married to the idea of using NoSQL, one of the more general-purpose databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of on much less expensive disk space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.
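To show what "just using SQL" might look like for this data, here is a rough sketch using Python's built-in sqlite3. The table and column names and the sample rows are assumptions of mine. It covers the three pain points from the question: rereading a book, deleting a book, and finding users with two or more books in common.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
db.executescript("""
    CREATE TABLE users (username TEXT PRIMARY KEY);
    CREATE TABLE books (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT NOT NULL
    );
    -- One row per reading event, so rereads are simply extra rows.
    CREATE TABLE reading_events (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL REFERENCES users(username) ON DELETE CASCADE,
        book_id  INTEGER NOT NULL REFERENCES books(id) ON DELETE CASCADE,
        read_at  TEXT NOT NULL
    );
""")

db.executemany("INSERT INTO users VALUES (?)", [("ann",), ("bob",)])
db.executemany("INSERT INTO books (id, title, author) VALUES (?, ?, ?)",
               [(1, "Dune", "Frank Herbert"), (2, "Emma", "Jane Austen")])
db.executemany(
    "INSERT INTO reading_events (username, book_id, read_at) VALUES (?, ?, ?)",
    [("ann", 1, "2020-01-05"), ("ann", 1, "2023-06-01"), ("ann", 2, "2021-03-10"),
     ("bob", 1, "2022-02-02"), ("bob", 2, "2022-04-04")])

# Pairs of users who have read two or more books in common.
common = db.execute("""
    SELECT a.username, b.username, COUNT(DISTINCT a.book_id)
    FROM reading_events a
    JOIN reading_events b
      ON a.book_id = b.book_id AND a.username < b.username
    GROUP BY a.username, b.username
    HAVING COUNT(DISTINCT a.book_id) >= 2
""").fetchall()
print(common)  # [('ann', 'bob', 2)]

# Deleting a book removes its reading events without any manual iteration.
db.execute("DELETE FROM books WHERE id = 1")
remaining = db.execute("SELECT COUNT(*) FROM reading_events").fetchone()[0]
print(remaining)  # 2
```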

Good data store for millions of events?

We have a number of systems that together generate around 5M events daily. Currently we are saving these for around 10 days, totaling around 40-50M events. We're using an RDBMS as the persistence layer with a web GUI slapped onto it, but we are experiencing certain performance problems.
An event consists of 20-30 fields composed of the following:
fields representing the event itself (e.g. OrderReceived)
fields representing the system that generated the event (e.g. ERP system)
fields representing the business context in which the event was generated (e.g. OrderManagement)
fields representing other details that we consider relevant/important
Roughly 5-6 of the fields are identifiers, most of them unique, representing the event itself, the business entity/object, the context and similar. Using these identifiers we can also relate events to each other chaining them together. The time difference in an event chain may be hours or in rare cases even days.
Currently we use the solution for analysis of individual event chains, mostly for error and outlier analysis (where did my order go?). In the future we may also like to gather statistics about events and event chains (how many orders per day? how many orders are handled by system X?). If possible the solution should also be able to grow to at least double the current size (we foresee an increase in the number of events as new systems are enabled). Analysis is currently performed by human beings, so search needs to be tolerable (searching for an event chain should take seconds, not minutes). The datastore should also allow for cleaning of stale events.
As mentioned in the beginning we're using a standard RDBMS for this. We were using a fairly normalized structure which we've now started denormalizing to try to increase performance. I can't help wondering whether some other solution might be better though. I've started looking around at different NoSQL databases (and in my own opinion MongoDB seems promising) but also trying to gather information concerning search engines and similar (Solr and ElasticSearch e.g.).
The question is: what type of data store/solution would be a good fit for these events? Should we head into the NoSQL space, is perhaps a search engine what we want, or are we barking up the wrong tree when what we really need is to find someone who's really good at optimizing RDBMSs?
I would suggest a hybrid solution with a conventional SQL server for the actual storage and a Lucene-based frontend search engine that is populated from the SQL database, triggered by some automatic or timed event. The web layer queries the Lucene layer and writes to the SQL database.
An SQL backend keeps your options open for the future (OLAP, etc.) and also provides a standard, scalable, multi-user way to accept data from the world through the DB connection libraries and UI tools. In short, if your data is stored in SQL you cannot get lost...
The Lucene layer provides extreme query performance, provided the query capabilities it offers are sufficient. (In a nutshell: field value search for numbers, dates, strings, etc.; range search; multiple field value search (the field is actually an array); all combined with logical operators and binary logical expressions; plus sorting and paging. HOWEVER, it cannot do grouping or aggregate functions such as sum, avg, etc.)
UPDATE: several years passed. Solr now has statistical capabilities like sum, avg, etc...
Query performance: in a database of 100M records, selecting a couple of hundred items with a multi-field query predicate takes under 100 ms.
Populating the index takes constant time (it does not increase with index size) because of the internal split-file implementation. It is possible to build up a 5 million line index in minutes, 20 tops, depending mainly on your storage controller. Lucene, however, also supports real-time updates to the index, a feature that we have used extensively and successfully on high-load websites.
Lucene supports splitting an index into subindexes and index hierarchies, so you can create an index per day but search all of them (or a specific subset of them) with a single query (using the multi-index adapter). I tried it with 2000 unique index files and the performance was amazing.
This architecture can be implemented without much effort in Java and .NET; both have great SQL and Lucene support.
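To illustrate the index-per-day idea independently of any particular library, here is a toy in-memory sketch. It is not the Lucene API: `DayIndex`, `search_days`, and `drop_day` are hypothetical names, the "index" is a plain dictionary, and the sample events (including the `order_id` field and the "WMS" system) are invented. It only shows why per-day indexes make both subset searches and the cleanup of stale events cheap.

```python
from collections import defaultdict


class DayIndex:
    """Toy inverted index for one day: (field, value) -> set of event IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, event_id, event):
        for field, value in event.items():
            self.postings[(field, value)].add(event_id)

    def search(self, **criteria):
        """Return IDs of events matching all field=value criteria (logical AND)."""
        sets = [self.postings.get((f, v), set()) for f, v in criteria.items()]
        return set.intersection(*sets) if sets else set()


indexes = defaultdict(DayIndex)  # one index per day


def index_event(day, event_id, event):
    indexes[day].add(event_id, event)


def search_days(days, **criteria):
    """Run the same query against a chosen subset of the daily indexes."""
    hits = set()
    for day in days:
        if day in indexes:
            hits |= indexes[day].search(**criteria)
    return hits


def drop_day(day):
    """Cleaning stale events means discarding a whole daily index."""
    indexes.pop(day, None)


index_event("2024-01-01", "e1", {"type": "OrderReceived", "system": "ERP", "order_id": "42"})
index_event("2024-01-02", "e2", {"type": "OrderShipped", "system": "WMS", "order_id": "42"})
index_event("2024-01-02", "e3", {"type": "OrderReceived", "system": "ERP", "order_id": "43"})

# Follow an event chain across days.
print(sorted(search_days(["2024-01-01", "2024-01-02"], order_id="42")))  # ['e1', 'e2']

drop_day("2024-01-01")
print(sorted(search_days(["2024-01-01", "2024-01-02"], order_id="42")))  # ['e2']
```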