Should I store every data point or only changes in offline store for Feast offline feature retrieval? - feature-engineering

I am implementing a Feature Engineering & Feature store solution with Feast on GCP.
I am using Bigquery for offline storage.
I have a question: say I have a feature on a user entity that does not change frequently (for example, address). I of course intend to use Feast to build training datasets with the point-in-time join functionality. In that case I seem to have 2 options:
Saving the address for all my users in the BQ table at a given frequency (let's say every hour), even if there is no change in the feature value compared to the previous one stored, which creates a lot of duplicates
Saving only changes in the feature, with potentially large gaps and sparsity in the storage.
The second option seems the most appropriate, since we would not store so many duplicate data points. However, I know there is a ttl argument on the Feast FeatureView object which, as I understand it, sets the number of days Feast will look back for feature values when calling get_historical_features. Thus, for data with large sparsity such as user location, I may need to set a very high ttl value, which may have performance and cost impacts according to the Feast documentation.
What is the right way to approach this problem?
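For context, here is a minimal sketch of the kind of FeatureView I mean (recent Feast API; the BigQuery table and all names are placeholders):

```python
# Minimal sketch, assuming a recent Feast release; the project/dataset/table
# and feature names are hypothetical placeholders.
from datetime import timedelta

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import String

user = Entity(name="user_id", join_keys=["user_id"])

# Change-only rows mean the latest stored address for a user can be much older
# than the entity timestamps used in get_historical_features(), so ttl has to
# be large enough to bridge that gap or the join returns NULL.
user_profile_source = BigQuerySource(
    table="my_project.my_dataset.user_profile_changes",
    timestamp_field="event_timestamp",
)

user_profile_fv = FeatureView(
    name="user_profile",
    entities=[user],
    ttl=timedelta(days=365),  # must cover the longest expected gap between changes
    schema=[Field(name="address", dtype=String)],
    source=user_profile_source,
)
```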

Related

Data retention in timescaledb

Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRDtool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5 minute resolution for 90 days, but after that, it's pointless to keep all those data points, and I'd like to reduce it to 30 or 60 minute averages for a couple years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's a Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach for down-sampling your data are Continuous Aggregates (their doc here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates will be affected by the deletions of the drop_chunks, too. Your data is gone.
One workaround would be to create other Hypertables instead. Then, create your own background workers copying the data from the original, hi-res table to these new lo-res Hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
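As a rough illustration of the Continuous Aggregates route mentioned above (TimescaleDB 2.x syntax; the metrics hypertable, its columns, and the refresh intervals are made-up examples), driven from Python with psycopg2:

```python
# A minimal sketch, assuming TimescaleDB 2.x and a hypothetical hypertable
# metrics(time, device_id, value); it creates an hourly down-sampled view plus
# a refresh policy. Continuous aggregates cannot be created inside a
# transaction, hence autocommit.
import psycopg2

conn = psycopg2.connect("dbname=tsdb")  # hypothetical connection string
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_hourly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 hour', time) AS bucket,
               device_id,
               avg(value) AS avg_value
        FROM metrics
        GROUP BY bucket, device_id
        WITH NO DATA;
    """)
    cur.execute("""
        SELECT add_continuous_aggregate_policy('metrics_hourly',
            start_offset      => INTERVAL '1 day',
            end_offset        => INTERVAL '1 hour',
            schedule_interval => INTERVAL '1 hour');
    """)
conn.close()
```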

Storing and managing Forex trading tick data

I'm building a data visualization system for Forex trading and I'm exploring ways of storing the historical Forex trading tick data that I have.
The data are chronological ticks of Ask and Bid prices per currency pair (e.g. USD/CAD). At the end of the day I need my data to be indexed in Elasticsearch, and what I'm searching for is the best way to get them there.
I found a couple of approaches online; they start out simple but then get complicated. I'm wondering if adding that extra complexity is worth it. Some of my options are:
Storing tick data on PostgreSQL and syncing it to Elasticsearch via a plugin (here)
Storing tick data on PostgreSQL, push them to Logstash and then to Elasticsearch
Finally, storing tick data on PostgreSQL, push them to Redis, then to Logstash, and then to Elasticsearch
My intuition says that solution No 2 would be the ideal one, but what is considered best practice?
It's a good idea to store your data in a long-term storage DB, such as PostgreSQL or similar. That way you can decide at any time whether you need to change your mappings, add fields, remove fields, change their types, or what have you, and then you can easily rebuild your ES index/indices without too much trouble from your primary source of truth (i.e. PostgreSQL) and you always have clean data in ES.
I don't know ZomboDB (solution 1) so I can't really speak for it, all I know is that I'm generally not too fond of tying two different technologies together, it makes it hard to upgrade any of them in case you need/must/want to apply patches or benefit from new features in either of them.
Unless you have big and costly transformations to perform on your source data, I feel that solution 3 doesn't bring much: the additional step of storing data in an intermediary Redis doesn't add a lot in my opinion (your mileage may vary here). It's a good idea to use a temporary store, such as Redis or Kafka, when you might lose data along the pipeline, but in this case, since you have your data in PostgreSQL, you don't really run the risk of losing anything. If anything, you can relaunch your pipeline and rebuild a few days of data.
That leaves solution 2, which would be fine given the information at hand. Using the Logstash JDBC input, you can easily retrieve the latest changes and forward them to ES every x minutes.
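For completeness, here is the incremental-sync idea sketched as a small Python script rather than a Logstash pipeline (the ticks table, its columns, and the index name are assumptions); the Logstash JDBC input automates essentially this with a tracking column:

```python
# A rough sketch of incremental PostgreSQL -> Elasticsearch sync, assuming a
# "ticks" table with a monotonically increasing "id" column.
import psycopg2
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=forex")  # hypothetical connection string


def sync_new_ticks(last_id: int) -> int:
    """Index every tick with id > last_id and return the new high-water mark."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, pair, ts, ask, bid FROM ticks WHERE id > %s ORDER BY id",
            (last_id,),
        )
        rows = cur.fetchall()

    actions = (
        {
            "_index": "ticks",
            "_id": row[0],
            "_source": {
                "pair": row[1],
                "ts": row[2].isoformat(),
                "ask": float(row[3]),
                "bid": float(row[4]),
            },
        }
        for row in rows
    )
    helpers.bulk(es, actions)
    return rows[-1][0] if rows else last_id
```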
Eric from ZomboDB here. I wanted to try and answer your question as it relates to ZDB.
ZomboDB is really designed for full-text searching within Postgres. It's important to note that it's not a tool to synchronize your PG data to Elasticsearch. It's a fully-functional Postgres index type (akin to the built-in types like btree, gin, and gist) that happens to be backed by Elasticsearch. The fact that ZomboDB uses Elasticsearch is really an implementation detail.
While ZDB does provide a number of UDFs that expose access to ES' aggregate facilities, again, it's really designed for text searching.
So if your data is really just pairs of numbers, you're probably better off using ES directly -- especially if you're loading in one batch per day. There's no doubt that ZDB could provide superior aggregate performance compared to standard Postgres "GROUP BY" queries (because it passes it through to Elasticsearch), but you're paying a heavy operational penalty for a limited use-case.
If, on the other hand, your ask/bid data comes with a lot of related metadata, and:
You need PG to be your source of truth,
You need to text-search that metadata (with or without aggregation support), and
You don't want to learn ES and introduce another database system to your application, then...
... ZomboDB could be right for you.
I suspect Stack Overflow isn't the place to get into this, so feel free to contact me via the ways ZDB's github page recommends.

Is OLAP the right approach

I have a requirement to develop a reporting solution for a system which has a large number of data items, a significant number of these being free-text fields. Almost every value in the tables needs to be accessible to a team of analysts who carry out reporting, analysis and data provision.
It has been suggested that an OLAP solution would be appropriate for delivering this; however, the general need is to get records, not aggregates, and each cube would have a large number of dimensions (~150) and very few measures (number of records, length of time). I have been told that this approach will let us answer any question we ask of it, but we do not have many repeated business questions; mostly we need to list the raw records out.
Is OLAP really a logical way to go with this, or will the cubes take too long to process and limit the level of access to the data that the users require?

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write-heavy workload, but it still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents in the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RUN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so even if only a few users have actually changed ranks, we still need to update the rest of the users' ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush times and the slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the I/O) help? Data loss isn't something I'm worried about, as the database is updated frequently anyway.
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis: Redis has something called a "Sorted Set" designed just for this job; with it you can have 100 million entries and still fetch the 5,000,000th to 5,001,000th at sub-millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
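A minimal sketch with the redis-py client (the "ladder" key and player IDs are made up) showing the ZADD / ZREVRANGE / ZREVRANK flow, plus one way to handle ties:

```python
# Minimal sketch with the redis-py client; key name and player IDs are hypothetical.
import redis

r = redis.Redis()

# Submit or update scores; ZADD is O(log N) per member.
r.zadd("ladder", {"player:1": 2350, "player:2": 2350, "player:3": 1980})

# Top 10 players, highest score first, with their scores.
top10 = r.zrevrange("ladder", 0, 9, withscores=True)

# Position of a single player (0-based). Members with equal scores are ordered
# lexicographically, so for tie-aware "competition" ranking you can instead
# count how many players have a strictly higher score.
position = r.zrevrank("ladder", "player:1")
tie_aware_rank = r.zcount("ladder", "(2350", "+inf") + 1  # rank 1 here

print(top10, position, tie_aware_rank)
```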
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can set up a replica set and send your read queries there. This should help you a fair bit, but not as much as the disks.
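A rough sketch of that second option with pymongo (host names, the replica-set name, and the database/collection are hypothetical):

```python
# Rough sketch, assuming a three-member replica set named "rs0".
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0",
    readPreference="secondaryPreferred",  # serve reads from a secondary when possible
)

db = client.ladder
# Leaderboard-style reads can now hit a secondary, while the hourly bulk
# updates still go to the primary.
top_players = list(db.players.find().sort("rank", 1).limit(100))
```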

Raw Data or Pre-Calculated Values in Database?

In general, is it better to store raw data along with pre-calculated values in the database, concentrating on keeping the database up-to-date if I remove or delete a row, and use the pre-calculated values for display to the user,
OR
is it better to store only the raw data and calculate the correct display values on-the-fly?
An example (which is pertinent to my project) would be similar to the following:
You have a timer application. In my case it's using Core Data. It's not connected to the web, but a self-contained app that runs on a computer or mobile device (user's choice). The app stores a raw start time and a raw end time. The application needs to display the duration of the event and the interval at which the events are occurring. Would it be better to store a pre-calculated "duration" value, and even a pre-formatted duration string for output, or would it be better to calculate the duration on-the-fly, so to speak, for display?
The same goes for the interval, although there's another layer involved: when I create/delete/update a row in the database, I'll have to update the interval for the items that are affected by this. Or is it better to just calculate as the app executes?
For the record, I'm not trying to micro-optimize. I'm trying to figure out the best way to reduce the amount of code I have to maintain. If performance improves as a result, so be it.
Thoughts?
Generally, you would want to avoid computed values in the DB (derived from existing columns/tables), unless profiling absolutely dictates that they are necessary (i.e., the DB is underperforming or too great a load is being placed on the server). This is even more true for formatting of the data, which should almost always be performed on the client side instead of wasting DB server cycles.
Of course, any data that is absolutely mandatory to perform the calculations should be stored in the database.
When you speak of reducing the amount of code you need to maintain, keep in mind that the DBA needs to maintain stored-proc code and table schemas, too. Moving maintenance responsibilities from Developers to DBAs is not eliminating work, it is just shifting it.
Finally, database changes often cascade to many applications, whereas application changes only affect that application.
The only time I store calculated values in a database is if I need it for historical purposes. You'll see this all the time in accounting software.
For example if I'm dealing with an invoice, I will typically save the calculated invoice total because perhaps the way that total will get calculated later on will change.
I will also sometimes perform the actual calculation on the database server using views.
As with so many other things, "it depends". For your described case, I would lean towards keeping the calculation in code. If you do choose to use the database, you should use a view to dynamically calculate rather than put in a static value. The risk of changing the start time or end time and forgetting to change the duration would be too high otherwise :)
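Core Data doesn't expose SQL views directly, but to illustrate the idea with a plain database (Python's built-in sqlite3; the table and column names are made up):

```python
# Illustration only: duration is derived from the raw times by a view, so
# there is no stored value that can drift out of sync when a row changes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        start_time TEXT NOT NULL,  -- ISO-8601 timestamps
        end_time   TEXT NOT NULL
    );

    -- Duration is always computed from the raw columns at query time.
    CREATE VIEW event_durations AS
    SELECT id,
           start_time,
           end_time,
           (julianday(end_time) - julianday(start_time)) * 86400.0 AS duration_seconds
    FROM events;
""")

conn.execute("INSERT INTO events VALUES (1, '2024-01-01T10:00:00', '2024-01-01T10:25:30')")
for row in conn.execute("SELECT id, duration_seconds FROM event_durations"):
    print(row)  # (1, 1530.0)
```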
This really depends on whether you want to be pure (keep your data clean) or fast. Compute capacity on the desktop facilitates purity; high-speed cores and large memory spaces make string composition for table cells feasible even with large data sets.
However, on the phone (even an iPhone 4), computing a single NSString for a UITableViewCell over a set of 1000 objects takes a noticeable amount of time, and this can affect your user experience.
So, tune the balance for your use case. Duration doesn't sound like it will change, so I would pre-calculate and store the duration AND the display string (it feels awful from the perspective of a DBA, but it will render fast on the phone).
For the interval it sounds like you actually need another entity, to relate the interval to a set of events. It would then be easy enough to pre-compute / maintain this calculation as well each time the relationship changes (i.e. you add an entity to the relationship, update the interval).