Raw Data or Pre-Calculated Values in Database? - iphone

In general, is it better to store raw data along with pre-calculated values in the database, concentrating on keeping those values up to date whenever I update or delete a row, and use the pre-calculated values for display to the user,
OR
is it better to store only the raw data and calculate the correct display values on the fly?
An example (which is pertinent to my project) would be similar to the following:
You have a timer application. In my case it's using Core Data. It's not connected to the web; it's a self-contained app that runs on a computer or mobile device (user's choice). The app stores a raw start time and a raw end time. The application needs to display the duration of each event and the interval at which the events are occurring. Would it be better to store a pre-calculated "duration" (and even a pre-formatted duration string for output), or to calculate the duration on the fly, so to speak, for display?
Same goes for the interval, although there's another layer involved: when I create/delete/update a row in the database, I'll have to update the interval for the items that are affected by this. Or is it better to just calculate as the app executes?
For the record, I'm not trying to micro-optimize. I'm trying to figure out the best way to reduce the amount of code I have to maintain. If performance improves as a result, so be it.
Thoughts?

Generally, you would want to avoid computed values in the DB (derived from existing columns/tables), unless profiling absolutely dictates that they are necessary (i.e., the DB is underperforming or too great a load is being placed on the server). This is even more true for formatting of the data, which should almost always be performed on the client side instead of wasting DB server cycles.
Of course, any data that is absolutely mandatory to perform the calculations should be stored in the database.
When you speak of reducing the amount of code you need to maintain, keep in mind that the DBA needs to maintain stored-proc code and table schemas, too. Moving maintenance responsibilities from Developers to DBAs is not eliminating work, it is just shifting it.
Finally, database changes often cascade to many applications, whereas application changes only affect that application.
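For the timer example, a minimal sketch of this "store raw, compute at display time" approach might look like the following (TypeScript is used purely for illustration; the TimerEvent shape and formatDuration helper are made up, and a Core Data/Swift version would follow the same pattern):

// Hypothetical event record: only raw values are persisted.
interface TimerEvent {
  start: Date; // raw start time as stored
  end: Date;   // raw end time as stored
}

// Derived value: computed on demand, never stored.
function durationSeconds(event: TimerEvent): number {
  return (event.end.getTime() - event.start.getTime()) / 1000;
}

// Formatting happens only at the display layer.
function formatDuration(event: TimerEvent): string {
  const total = Math.round(durationSeconds(event));
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return `${h}h ${m}m ${s}s`;
}

// Example: a 1.5 hour event renders as "1h 30m 0s".
const e: TimerEvent = {
  start: new Date("2024-01-01T10:00:00Z"),
  end: new Date("2024-01-01T11:30:00Z"),
};
console.log(formatDuration(e));

Nothing derived is written back to the store, so there is nothing to keep in sync when a row changes.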

The only time I store calculated values in a database is if I need it for historical purposes. You'll see this all the time in accounting software.
For example, if I'm dealing with an invoice, I will typically save the calculated invoice total, because the way that total gets calculated may change later on.
I will also sometimes perform the actual calculation on the database server using views.
As with so many other things, "it depends". For your described case, I would lean towards keeping the calculation in code. If you do choose to use the database, you should use a view to dynamically calculate rather than put in a static value. The risk of changing the start time or end time and forgetting to change the duration would be too high otherwise :)

This really depends on whether you want to be pure (keep your data clean) or fast. Compute capacity on the desktop facilitates purity: high-speed cores and large memory spaces make string composition for table cells possible with large data sets.
However on the phone, an iPhone 4 even, computing a single NSString for a UITableViewCell over a set of 1000 objects takes a noticeable amount of time, and this can affect your user experience.
So, tune the balance for your use case. Duration doesn't sound like it will change, so I would precalculate and store the duration AND the display string (this feels awful from the perspective of a DBA, but it will render fast on the phone).
For the interval, it sounds like you actually need another entity to relate the interval to a set of events. It would then be easy enough to pre-compute and maintain this calculation each time the relationship changes (i.e., when you add an entity to the relationship, update the interval).
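A hedged sketch of this pre-computed variant (again in TypeScript for illustration only; the field names are made up, and in Core Data these would be extra attributes updated in the same save operation):

interface StoredEvent {
  start: Date;
  end: Date;
  // Denormalized, precomputed at write time so table cells render fast.
  durationSeconds: number;
  durationDisplay: string;
  // Seconds since the previous event; maintained whenever the set changes.
  intervalSeconds?: number;
}

function makeStoredEvent(start: Date, end: Date): StoredEvent {
  const durationSeconds = (end.getTime() - start.getTime()) / 1000;
  const m = Math.floor(durationSeconds / 60);
  const s = Math.round(durationSeconds % 60);
  return { start, end, durationSeconds, durationDisplay: `${m}m ${s}s` };
}

// Re-derive intervals for the affected rows whenever an event is
// inserted, updated, or deleted.
function recomputeIntervals(events: StoredEvent[]): void {
  events.sort((a, b) => a.start.getTime() - b.start.getTime());
  events.forEach((event, i) => {
    event.intervalSeconds =
      i === 0 ? undefined : (event.start.getTime() - events[i - 1].start.getTime()) / 1000;
  });
}

The read path stays trivial (just display the stored strings); the cost is that every create/update/delete has to call recomputeIntervals for the affected rows.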

Related

Should I store every data point or only changes in offline store for Feast offline feature retrieval?

I am implementing a Feature Engineering & Feature store solution with Feast on GCP.
I am using Bigquery for offline storage.
I have a question: say I have a feature on a user entity that does not change frequently (for example, address). I of course intend to use Feast to build a training dataset and the point-in-time join functionality. In that case I seem to have 2 options:
Saving the address for all my users in the BQ table at a given frequency (let's say every hour), even if there is no change in the feature value compared to the previous one stored, producing a lot of duplicates.
Saving only changes in the features, with potentially important gaps and sparsity in the storage.
The second option seems the most adequate, since we would not store too many duplicate data points. However, I know there is a ttl argument on the Feast FeatureView object which, in my understanding, sets the number of days Feast will search back for feature values when using get_historical_features. Thus, for data with large sparsity such as user location, I may need to set a very high ttl value, which may have performance & cost impacts according to the Feast documentation.
What is the way to approach this problem please?

PostgreSQL: Efficient way to get Average of records

I have a table called "ITEM_REVIEW" with a column named "score".
I know that I can get average of score by:
SELECT AVG(score) FROM "ITEM_REVIEW" WHERE "item_id"=1
Is this the right (and efficient) way to calculate the average even when records pile up? Or is it better to fetch the data and calculate the average on my NodeJS server?
I'd simplify it to
SELECT AVG(ir.score) FROM "ITEM_REVIEW" ir WHERE ir."item_id" = 1
Yes, it's massively more efficient to keep a million rows in the DB on a fast hard disk and calculate a single float that you pass over a network connection than it is to transmit a million floats over a network connection to a shared server with more varied responsibilities and use some slow JavaScript to calculate the average.
Databases are incredibly good at storing, collating, connecting and processing data; it's their sole purpose/entire reason for being. Teams of smart people spend all their time implementing and improving the performance of data processing features to ensure their db stays top of the pile. Leave your data in a database wherever possible, and drag the smallest number of bytes you can over slow network links (summarise and filter in the db always)
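As a rough sketch with the node-postgres (pg) client (table and column names are taken from the question; the connection settings are assumed to come from environment variables), the aggregate comes back as one tiny row:

import { Pool } from "pg";

const pool = new Pool(); // connection settings from PG* environment variables

async function averageScore(itemId: number): Promise<number | null> {
  // The database scans and aggregates; only one small row crosses the network.
  const result = await pool.query(
    'SELECT AVG(score) AS avg_score FROM "ITEM_REVIEW" WHERE "item_id" = $1',
    [itemId]
  );
  const avg = result.rows[0].avg_score;
  return avg === null ? null : Number(avg); // pg returns numeric aggregates as strings
}

averageScore(1).then((avg) => console.log(avg));

The NodeJS alternative would pull every score row over the wire and loop over it in JavaScript, which is strictly more work for the network, the driver, and the application process.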
If you really care, you should try it and see. That way you get the correct answer for your versions, hardware, configuration, etc. Pay special attention to memory usage, as NodeJS might store all the data in memory before it gets around to averaging it.
But in general trying to re-implement the database outside of the database is going to lose. Not always, but in general.

MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
It's not necessary that the data be exactly up to date on every page hit/load, so that's why I was thinking of maintaining the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or so).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the relation of reads vs. writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the queries remain the same with a certain chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
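If you do go the "keep counters in real time" route, a minimal sketch with the Node.js mongodb driver (the database name, collection name, and field names are made up for illustration, and the client is assumed to be connected at startup) could look like:

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");

// On every write, bump a per-day counter document (upsert creates it if missing).
async function recordDocCreated(createdAt: Date): Promise<void> {
  const day = createdAt.toISOString().slice(0, 10); // e.g. "2024-01-01"
  await client
    .db("app")
    .collection("daily_stats")
    .updateOne({ day }, { $inc: { docsCreated: 1 } }, { upsert: true });
}

// The dashboard reads a handful of small pre-aggregated documents,
// not the raw collection.
async function lastSevenDays() {
  return client
    .db("app")
    .collection("daily_stats")
    .find()
    .sort({ day: -1 })
    .limit(7)
    .toArray();
}

Whether this beats running indexed count queries on every page load depends on the read/write ratio and filtering questions above.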

Too much data duplication in mongodb?

I'm new to this whole NoSQL stuff and have recently been intrigued by MongoDB. I'm creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I've been reading up a lot about how to properly design a document model database, and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense: you stick embedded objects into your document so it's fast to read (no joins), but of course you can't always embed, so MongoDB has this concept of a DbReference, which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document. Users attend events, and Events have user attendees. I decided to embed a list of Events, with limited data, into the User objects. I also embedded a list of Users into the Event objects as their "attendees". The problem is that now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NoSQL way to do things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, possibly find all references to that user, and update those as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where normalization becomes a performance hit that you should break it and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes, which can make normalizing easier, as in a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick, and it will help you normalize your data better.
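A rough sketch of such a relation collection with the Node.js mongodb driver (in TypeScript rather than the question's C#/NoRM; the collection and field names are illustrative):

import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const attendance = client.db("app").collection("attendance");

// One small document per (user, event) pair: the join-table equivalent.
async function attend(userId: ObjectId, eventId: ObjectId): Promise<void> {
  await attendance.updateOne(
    { userId, eventId },
    { $setOnInsert: { joinedAt: new Date() } },
    { upsert: true }
  );
}

// Index both sides so "events for a user" and "attendees of an event"
// are each a single indexed lookup.
async function ensureIndexes(): Promise<void> {
  await attendance.createIndex({ userId: 1 });
  await attendance.createIndex({ eventId: 1 });
}

// User profile data now lives in exactly one place (the users collection),
// so renaming a user touches one document instead of every event.
async function attendeeIds(eventId: ObjectId): Promise<ObjectId[]> {
  const rows = await attendance.find({ eventId }).toArray();
  return rows.map((row) => row.userId);
}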
I like to weigh the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. reading out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically, what I do first is try to predict how often a record will be updated vs. how often it's read. Then I try to predict the cost of an update vs. a read when the data is normalized or flattened (or maybe some partial combination of the two... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor a solution where you flatten rather than normalize and take the hit on the update.
If you require updates to be quick and immediately accessible, then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, updates are rare, and you do not necessarily require an update to be accessible immediately or need transaction-level locking that guarantees 100% of the time that the update was written to disk, you can consider writing your updates to a queue and processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are a lot of other ideas that you can employ. There are a lot of great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front of my data layer. When I query Mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I invalidate anything in the cache that references what I'm updating. (Although you have to factor the time it takes to invalidate data, and the tracking of cached data that is getting updated, into your scaling considerations.) Someone once said, "The two hardest things in Computer Science are naming things and cache invalidation."
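A hedged cache-aside sketch of that pattern using the node-redis client (the key scheme and the loadAndFlattenUser / saveUserToMongo helpers are hypothetical; the client is assumed to be connected at startup):

import { createClient } from "redis";

const redis = createClient(); // redis.connect() is assumed to run at startup

// Read path: serve the flattened view from cache when possible.
async function getUserView(userId: string): Promise<unknown> {
  const key = `user-view:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const view = await loadAndFlattenUser(userId); // hypothetical: reads the normalized collections
  await redis.set(key, JSON.stringify(view), { EX: 300 }); // cache for 5 minutes
  return view;
}

// Write path: update the normalized data, then invalidate anything that references it.
async function updateUser(userId: string, changes: object): Promise<void> {
  await saveUserToMongo(userId, changes); // hypothetical persistence call
  await redis.del(`user-view:${userId}`);
}

// Hypothetical helpers, declared only so the sketch is self-contained.
declare function loadAndFlattenUser(userId: string): Promise<object>;
declare function saveUserToMongo(userId: string, changes: object): Promise<void>;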
Try adding a property of type IList&lt;UserEvent&gt; to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.

Reasons for & against a Database

I had a discussion with a coworker about the architecture of a program I'm writing, and I'd like some more opinions.
The Situation:
The program should update in near-realtime (+/- 1 minute).
It involves the movement of objects on a coordinate system.
Some events occur at regular intervals (e.g., creation of the objects).
Movements can change at any time through user input.
My solution was:
Build a server that runs continuously and stores the data internally.
The server dumps a state-of-the-program snapshot at regular intervals to protect against power failures and/or crashes.
He argued that the program requires a database and that I should use cron jobs to update the data. I can store movement information by storing start point, end point, and speed, and update the position in the cron job (and calculate collisions with other objects there) using direction and speed.
His reasons:
It requires more CPU & memory because it runs constantly.
Power failures/crashes might destroy data.
Databases are faster.
My reasons against this are mostly:
Not very precise as events can only occur at full minutes (wouldn't be that bad though).
Requires (possibly costly) transformation of data on every run from relational data to objects.
An RDBMS is a general solution; for a specialized problem, a specialized solution should be more efficient.
Power failures (or other crashes) can leave the data in an undefined state, with only partially updated data, unless (possibly costly) precautions (like transactions) are taken.
What are your opinions about that?
Which arguments can you add for any side?
Databases are not faster. How silly... How can a database be faster than writing a custom data structure and storing it in memory? Databases are generalized tools that persist data to disk for you so you don't have to write all that code yourself. Because they have to address the needs of numerous disparate (and sometimes inconsistent) business functions (persistence (durability), transactional integrity, caching, relational integrity, atomicity, etc.), and do it in a way that protects the application developer from having to worry about it so much, by definition they are going to be slower. That doesn't necessarily mean his conclusion is wrong, however.
Each of his other objections can be addressed by writing the code to handle that issue yourself... but you see where that is going... At some point, the development effort of writing custom code to address the issues that matter for your application outweighs the performance hit of just using a database, which already does all that stuff out of the box... How many of these issues are important? And do you know how to write the code necessary to address them?
From what you've described here, I'd say your solution does seem to be the better option. You say it runs once a minute, but how long does it take to run? If only a few seconds, then the transformation to relational data would likely be inconsequential, as would any other overhead; most of this would likely take 30 seconds. This is assuming, again, that the program is quite small.
However, if it is larger, and assuming that it will get larger, doing a straight dump is a better method. You might not want to do a full dump every run, but that's up to you, just remember that it could wind up taking a lot of space (same goes if you're using a database).
If you're going to dump the state, you would need some sort of redundancy system in place, along with quasi-transactions. You would want to store several copies in case something happens to the newest version; say the power goes out while you're storing, and you have no backups beyond that half-written one. As for transactions: you would need something to tell you that the file has been fully written, so if something does go wrong you can always tell which was the most recent successful save.
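A minimal sketch of that write-then-commit idea in TypeScript/Node (the file-name scheme and the keep-three-copies policy are arbitrary choices): write to a temporary file, flush it, then rename it into place so a half-written file can never be mistaken for a good dump.

import { promises as fs } from "fs";

// Dump state atomically: a crash mid-write leaves only a .tmp file behind,
// never a truncated state-<timestamp>.json.
async function dumpState(state: object, keep = 3): Promise<void> {
  const stamp = Date.now();
  const tmp = `state-${stamp}.json.tmp`;
  const final = `state-${stamp}.json`;

  const handle = await fs.open(tmp, "w");
  await handle.writeFile(JSON.stringify(state));
  await handle.sync(); // force the bytes to disk before we "commit"
  await handle.close();

  await fs.rename(tmp, final); // the rename is the commit point

  // Rotate: keep the newest `keep` successful dumps as redundancy.
  const dumps = (await fs.readdir("."))
    .filter((name) => /^state-\d+\.json$/.test(name))
    .sort();
  for (const old of dumps.slice(0, Math.max(0, dumps.length - keep))) {
    await fs.unlink(old);
  }
}

On restart, the newest state-<timestamp>.json that parses cleanly is the most recent successful save.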
Oh, and as for his argument about it running constantly: if you set it up as a cron job, or even a self-contained sleep statement or similar, it doesn't use any CPU time when it's not running; that's the same amount it would use if you were using an RDBMS.
If you're writing straight to disk, then this will be faster than a database, with faster retrieval too, since, as you pointed out, there is no overhead.
Summary: A database is a good idea if you have a lot of idle processor time or historical records, but if resources are a legitimate concern, then it can become too much overhead and a dump with precautions taken is better.
MySQL can now model spatial data.
http://dev.mysql.com/doc/refman/4.1/en/gis-introduction.html
http://dev.mysql.com/doc/refman/5.1/en/spatial-extensions.html
You could use the database to keep track of world locations, user locations, item locations, etc.