Is there a way to caching mechanism for Class::DBI? - perl

I have a set of rather complex ORM modules that inherit from Class::DBI. Since the data changes quite infrequently, I am considering using a Caching/Memoization layer on top of this to speed things up. I found a module: Class::DBI::Cacheable but no rating or any reviews on RT. I would appreciate hearing from people who have used this or any other Class::DBI caching scheme.
Thanks a ton.

I too have rolled my own ORM plenty of times I hate to say! Caching/Memoization is pretty easy if all your fetches happen through a single api (or subclasses thereof).
For any fetch based on a unique key you can just cache based on a concatenation of the keys. A naive approach might be:
my %_cache;
sub get_object_from_db {
my ($self, $table, %table_lookup_key) = #_;
# concatenate a unique key for this object
my $cache_key = join('|', map { "$_|$table_lookup_key{$_}" }
sort keys %table_lookup_key
return $_cache{$cache_key}
if exists $_cache{$cache_key};
# otherwise get the object from the db and cache it in the hash
# before returning
}
Instead of a hash, you can use the Cache:: suite of modules on CPAN to implement time and memory limits in your cache.
If you're going to cache for some time you might want to think about a way to expire objects in the cache. If for instance all your updates also go through the ORM you can clear (or update) the cache entry in your update() ORM method.
A final point to think carefully about - you're returning the same object each time which has implications. If, eg., one piece of code retrieves an object and updates a value but doesn't commit that change to the db, all other code retrieving that object will see that change. This can be very useful if you're stringing together a series of operations - they can all update the object and then you can commit it at the end - but it may not be what you intend. I usually set a flag on the object when it is fresh from the database and then in your setter method invalidate that flag if the object is updated - that way you can always check that flag if you really want a fresh object.

On a few occasions we've rolled our own, but we limited it to special cases where profiling indicated we needed a boost (for example large joins). Since our applications typically use a custom abstraction layer (akin to a home-grown ORM) on top of the DB access, that's where we implemented the caching. We achieved good results that we were satisfied with and it didn't take a whole lot of effort. Of course, since we weren't using a CPAN ORM, we didn't really have any choice about using a CPAN caching module, either.
It was strictly case-by-case and opt-in. Whether you end up using a CPAN solution or rolling your own, it's probably a good idea to restrict it to cases where profiling indicates you need help, and make sure that it's opt-in so your caching doesn't undermine your application in subtle ways by being active when you didn't expect it.

I have used memcached before to cache objects, but not with Class::DBI (ORM makes me feel dirty).

Related

Check for object ownership with Prisma

I'm new to working with Prisma. One aspect that is unclear to me is the right way to check if a user has permission on an object. Let's assume we have Book and Author models. Every book has an author (one-to-many). Only authors have permission to delete books.
An easy way to enforce this would be this:
prismaClient.book.deleteMany({
id: bookId, <-- id is known
author: {
id: userId <-- id is known
}
})
But this way it's very hard to show an UnauthorizedError to the user. Instead, the response will be a 500 status code since we can't know the exact reason why the query failed.
The other approach would be to query the book first and check the author of the book instance, which would result in one more query.
Is there a best practice for this in Prisma?
Assuming you are using PostgreSQL, the best approach would be to use row-level-security(RLS) - but unfortunately, it is not yet officially supported by Prisma.
There is a discussion about this subject here
https://github.com/prisma/prisma/issues/5128
As for the current situation, to my opinion, it is better to use an additional query and provide the users with informative feedback rather than using the other method you suggested without knowing why it was not deleted.
Eventually, it is up to you to decide based on your use case - whether or not it is important for you to know the reason for failure.
So this question is more generic than prisma - it is also true when running updates/deletes in raw SQL.
When you have extra where clauses to check for ownership, it's difficult to infer which of the clause(s) caused that if the update does not happen, without further queries.
You can achieve this with row level security in postgres, but even that does not come out the box and involves custom configuration to throw specific exceptions when rows are not found due to row level security rules. See this answer for more detail.
I tend to think that doing customised stuff like this is rarely worth the tradeoff, unless you need specialised UX for an uncommon circumstance.
What I would suggest instead in this case is to keep it simple and just use extra queries to check for ownership, but optimise the UX optimistically for the case where the user does own the entity and keep that common and legitimate usecase to a single query.
That is, catch the exception from primsa (or the fact that the update returns 0 rows, or whatever it is in different cases), and only then run a specific select for ownership, to check if that was the reason the update failed.
This is a nice tradeoff because it keeps things simple, and only runs extra queries in the (usually) far less common failure case.
Even having said all that, the broader caveat as always is that probably the extra queries simply won't matter regardless! It's, in 99% of cases, probably best to just always run the extra ownership query upfront as a pattern to keep things as simple as possible, which is what you really want to optimise for over performance until you're running at significant scale.

Automatically updating related rows with DBIx::Class

In the context of a REST API, I've been using DBIx::Class to create related rows, i.e.
POST /artist
{ "name":"Bob Marley", "cds":[{"title":"Exodus"}] }
That ultimately calls $artist->new($data)->insert() which creates an Artist AND creates the related row(s) in the CD table. It then send back the resulting object to the user (via DBIC::ResultClass::HashRefInflator), including the created primary keys and default values. The problem arises when the user makes changes to those objects and send them back to the API again:
POST /artist/7
{ "name":"Robert Nesta Marley", "artistid":"7",
"cds":[{"title":"Exodus", "cdid":"1", "artistid":"7", "year":"1977"}] }
Now what? From what I can see from testing, DBIC::Row::update doesn't deal with changes in related rows, so in this case the name change would work, but the update to the CD's year would not. DBIC::ResultSet::update_or_create just calls DBIC::Row::update. So I went searching for some alternatives, and they do seem to exist, i.e. DBIC::ResultSet::RecursiveUpdate, but it hasn't been updated in 4 years, and references to it seem to suggest it should/would be folded into DBIC. Did that happen?
Am I missing something easier?
Obviously, I can handle this particular situation, but I have many APIs to write and it would be easier to handle it generically for all of them. I'm certainly tempted to use RecursiveUpdate, but wary due to its apparent abandonment.
Ribasushi promised to core the RecursiveUpdate feature if someone steps up and migrates its test suite to the DBIC schema and comes up with a sane API because he didn't like how RU handles this currently.
Catalyst::Controller::DBIC::API is using RU under the hood as well as HTML::Formhandler::Model::DBIC both here in production since years without any issues.
So currently RecursiveUpdate is the way to go.

EF - multiple includes to eager load hierarchical data. Bad practice?

I am needing to eager load a hierarchy structure so that I can recursively iterate through it. The eager loading is necessary to prevent multiple db queries while traversing the tree. It seems the consensus is that you can't eager load infinite levels of the tree, so I did something like
var item= db.ItemHierarchies
.Include("Children.Children.Children.Children.Children")
.Where(x => x.condition == condition)
to load 5 levels of children. This seems to get the job done. I'm wondering what the drawback is to doing this? If there is none then theoretically could I add 50 levels of includes here without slowing things down?
I recommend taking a look at the SQL that is generated as you add eager loading to your query.
var item= db.ItemHierarchies
.Include("Children")
.Include("Children.Children")
.Include("Children.Children.Children")
.Include("Children.Children.Children.Children")
.Include("Children.Children.Children.Children.Children")
var sql = ((System.Data.Objects.ObjectQuery) item).ToTraceString()
// http://visualstudiomagazine.com/blogs/tool-tracker/2011/11/seeing-the-sql.aspx
You'll see that the SQL quickly gets very big and complicated and can potentially have serious performance implications. You'd do well to limit your eager loading to data that you are certain you will need and to consider using explicit loading for some of the related entities - especially if you're working with connected entities in which case you can explicitly load collection properties when they're needed.
Also note that you may not need multiple separate Includes. For example, the following needs to be separate Includes because they're addressing separate properties (Widgets and Spanners) of the root.
var item= db.ItemHierarchies
.Include("Widgets")
.Include("Spanners.Flanges")
But the following isn't necessary:
var item= db.ItemHierarchies
.Include("Widgets") //This isn't necessary.
.Include("Widgets.Flanges") //This loads both Widges and Flanges.
Well honestly.. It's an extremely bad practice.
Let's assume you had 50 objects in your root.. and 50 per level.
You may end up retrieving 312500000 "capsules" of information.
Now, one might ask: "So what is wrong with that?!",
I mean if that is what is required than why not do that..
Rule #1: we develop software that should be used by human beings.
And the fact is that no human capable of taking a glimpse at 312500000 items of information at once and learn or conclude something beneficial out of it. (except.. that it does not help him or her to watch it)
Rule #2: UI should be based on what is needed and not what is possible.
And since we already established that showing 312500000 capsules of data is not needed there is no reason to bring all that at once.
And now you might come forward and say - But I don't care about the UI, really! All I need is to iterate in that data in order to process some information!
In that case you would probably want to save your results somewhere for future reference, but that means that its a batch job.. so why not apply batch job rules upon it.. like process it item by item which will also may give you the benefit of splitting it between even more machines if needed.
So you see.. no matter which path you choose there should be no reason to do it.
(= definition of what is a bad practice.)
Update:
After reading interesting concerns in the comments, I would like to update this answer with more analysis:
Deciding what is a bad practice must always be in reference to what is to be achieved or what is the role of each part in the system. In the current situation (after reading the comments) it has been brought or implied that the data storage is actually a persistent medium for objects opposed to a different concept where the data is the 'heart' of the application.
We can define two data types:
1) Data-Center which is being used in data-centric applications such as banks, CRM, ERP, websites or other service based solutions.
VS.
2) Data-Persistence medium which is being used as data to be saved for when the application is not active, in example: any simple app save file or any game save file and etc.
The main difference is that a data persistence medium is to be accessed only by a single instance of the app at a single point in time.. meaning the data is not designed to be shared by many instances. if the data is to be shared - we are dealing with a data-center application.
If your app just need a data-persistence medium - loading all the information cannot be considered as a bad practice - but you still need to make sure you are not exploding the memory. and in that frame of work, SQL Server might not be what you need or the best tool to use.
In the other case of Data-Centric application - my original answer remains as it will be a bad practice to bring all the information per instance of the application.

In which scenario we need a ReadOnlyCollection?

Dotnet 4.5 has introduce ReadOnlyCollection. My question is what is the practical useage of it? What scenarios we may need this kind of data structure?
You need read-only collections when your API returns collection objects to your callers, copying is too expensive, and you would prefer to stay away from returning IEnumerable<T>. This is commonly desirable in situations when random access is required over the returned collection.
When you want to return a collection that the caller should not be able to modify, but you still want to have the guarantees that an IList gives over an IEnumerable, e.g. a free .Count property, an indexer and the ability to safely iterate over it multiple times, both which aren't guaranteed on an IEnumerable.
This class is useful in a multithreading application. In a multithreading environment can it be a real problem to have a collection of objects, which might be changed by some other thread. This assures threadsafety and lessens the complexity of the code.

Hashes vs Numeric id's

When creating a web application that some how displays the display of a unique identifier for a recurring entity (videos on YouTube, or book section on a site like mine), would it be better to use a uniform length identifier like a hash or the unique key of the item in the database (1, 2, 3, etc).
Besides revealing a little, what I think is immaterial, information about the internals of your app, why would using a hash be better than just using the unique id?
In short: Which is better to use as a publicly displayed unique identifier - a hash value, or a unique key from the database?
Edit: I'm opening up this question again because Dmitriy brought up the good point of not tying down the naming to db specific property. Will this sort of tie down prevent me from optimizing/normalizing the database in the future?
The platform uses php/python with ISAM /w MySQL.
Unless you're trying to hide the state of your internal object ID counter, hashes are needlessly slow (to generate and to compare), needlessly long, needlessly ugly, and needlessly capable of colliding. GUIDs are also long and ugly, making them just as unsuitable for human consumption as hashes are.
For inventory-like things, just use a sequential (or sharded) counter instead. If you migrate to a different database, you will just have to initialize the new counter to a value at least as large as your largest existing record ID. Pretty much every database server gives you a way to do this.
If you are trying to hide the state of your counter, perhaps because you're counting users and don't want competitors to know how many you have, I suggest avoiding the display of your internal IDs. If you insist on displaying them and don't want the drawbacks of a hash, you might consider using a maximal-period linear feedback shift register to generate IDs.
I typically use hashes if I don't want the user to be able to guess the next ID in the series. But for your book sections, I'd stick with numerical id's.
Using hashes is preferable in case you need to rebuild your database for some reason, for example, and the ordering changes. The ordinal numbers will move around -- but the hashes will stay the same.
Not relying on the order you put things into a box, but on properties of the things, just seems.. safer.
But watch out for collisions, obviously.
With hashes you
Are free to merge the database with a similar one (or a backup), if necessary
Are not doing something that could help some guessing attacks even a bit
Are not disclosing more private information about the user than necessary, e.g. if somebody sees a user number 2 in your current database log in, they're getting information that he is an oldie.
(Provided that you use a long hash or a GUID,) greatly helping youself in case you're bought by YouTube and they decide to integrate your databases.
Helping yourself in case there appears a search engine that indexes by GUID.
Please let us know if the last 6 months brought you some clarity on this question...
Hashes aren't guaranteed to be unique, nor, I believe, consistent.
will your users have to remember/use the value? or are you looking at it from a security POV?
From a security perspective, it shouldn't matter - since you shouldn't just be relying on people not guessing a different but valid ID of something they shouldn't see in order to keep them out.
Yeah, I don't think you're looking for a hash - you're more likely looking for a Guid.If you're on the .Net platform, try System.Guid.
However, the most important reason not to use a Guid is for performance. Doing database joins and lookups on (long) strings is very suboptimal. Numbers are fast. So, unless you really need it, don't do it.
Hashes have the advantage that you can check if they are valid or not BEFORE performing any check to your database whether they exist or not. This can help you to fend off attacks with random hashes as you don't need to burden your database with fake lookups.
Therefor, if your hash has some kind of well-defined format with for example a checksum at the end, you can check if it's correct without needing to go to the database.