How to create a caching layer on top of Slick that could be applied globally?

I am curious whether, using Scala and Slick, you could create a flexible caching layer (say using memcached) on top of Slick.
Ruby has a cool library called IdentityCache: https://github.com/Shopify/identity_cache
It allows you to simply extend your model class (a trait in Scala?) and tell it to use this cache layer.
You can then tell it to cache only by id, or to also cache associations, etc.
It sounds like a very cool thing; how could something like this fit into Slick's design?

I have been thinking about how to add this to Slick lately, but we don't have any resources assigned to it for the foreseeable future.
You could build a query cache on top of Slick. Invalidating the cache based on the observed write operations on the base data can be very hard for arbitrary queries. You would need to restrict the supported operations for conditions in cached queries, e.g. to only use equality. Oracle and others have similar restrictions in place for their materialized view maintenance features.
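For concreteness, here is a minimal sketch of such an id-keyed cache on top of Slick 3, in the spirit of IdentityCache. It is only a sketch under assumptions: the in-memory TrieMap stands in for memcached, the users table and repository are hypothetical, and only equality-on-primary-key lookups are cached so that invalidation on writes stays tractable, as described above.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}
import slick.jdbc.PostgresProfile.api._

final case class User(id: Long, name: String)

class Users(tag: Tag) extends Table[User](tag, "users") {
  def id   = column[Long]("id", O.PrimaryKey)
  def name = column[String]("name")
  def *    = (id, name) <> (User.tupled, User.unapply)
}

// Only equality-on-primary-key lookups are cached, which keeps invalidation simple.
class CachedUserRepo(db: Database)(implicit ec: ExecutionContext) {
  private val users = TableQuery[Users]
  private val cache = TrieMap.empty[Long, User]   // stand-in for memcached

  def findById(id: Long): Future[Option[User]] =
    cache.get(id) match {
      case hit @ Some(_) => Future.successful(hit)             // cache hit
      case None =>
        db.run(users.filter(_.id === id).result.headOption).map { row =>
          row.foreach(u => cache.put(id, u))                   // populate on miss
          row
        }
    }

  def update(user: User): Future[Int] =
    db.run(users.filter(_.id === user.id).update(user)).map { n =>
      cache.remove(user.id)                                    // invalidate on write
      n
    }
}
```

A real implementation would also have to cover deletes, writes issued outside the repository, and cached queries with richer conditions, which is exactly where invalidation gets hard.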

Related

Scala generic query predicate generation DSL (possibly similar to AlaSQL in JavaScript)?

I'm building a generic API that provides access to a ton of different data sets, and I would like to avoid creating a specific query API for each data type, i.e. for the Users endpoint having to manually implement filters like name="Joe". I would rather let the user express a query in some language, like an SQL WHERE predicate or something similar, to filter down these data sets. The set of data is ever growing, and we need a generic way to form query predicates.
Back when I was working with javascript, I used https://github.com/agershun/alasql to do simple predicates over objects in memory.
I'm looking for something similar in Scala. It doesn't need to be SQL; it could be JSON or some other DSL.
I've looked at Calcite and was able to get it to do WHERE predicates on data, but it took a lot of hacking. The Calcite library is incredibly large and complex, and it requires a lot of objects to be instantiated just to generate one query. I don't want to pull that kind of heavy dependency into the project.
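For concreteness, the kind of thing described here can start as a small predicate AST plus an evaluator over in-memory records; below is a minimal sketch (all names are hypothetical, and a real version would still need a parser from JSON or an SQL-like WHERE string into this AST).

```scala
// A tiny predicate AST with an in-memory evaluator over Map-shaped records.
sealed trait Predicate
final case class Eq(field: String, value: Any)           extends Predicate
final case class And(left: Predicate, right: Predicate)  extends Predicate
final case class Or(left: Predicate, right: Predicate)   extends Predicate

object Predicate {
  def eval(p: Predicate, record: Map[String, Any]): Boolean = p match {
    case Eq(f, v)  => record.get(f).contains(v)   // equality only, for simplicity
    case And(l, r) => eval(l, record) && eval(r, record)
    case Or(l, r)  => eval(l, record) || eval(r, record)
  }
}

object PredicateExample extends App {
  // The same predicate type works for any data set represented as Maps.
  val users = Seq(
    Map[String, Any]("name" -> "Joe", "age" -> 30),
    Map[String, Any]("name" -> "Ann", "age" -> 25)
  )
  val joes = users.filter(Predicate.eval(Eq("name", "Joe"), _))
  println(joes)   // List(Map(name -> Joe, age -> 30))
}
```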

Is it possible to use Spark to process complex entities with complex dependencies?

Consider a scenario (objects and dependencies are Scala classes):
There is a set of dependencies which themselves require a significant amount of data to be instantiated (data coming from a database).
There is a set of objects with complex nested hierarchy which store references to those dependencies.
The current workflow consists of:
1. Loading the dependency data from the database and instantiating the dependencies (in a pretty complex way, with interdependencies).
2. Loading object data from the database and instantiating objects using the previously loaded dependencies.
3. Running operations on a list of objects, such as:
a. Search with a complex predicate
b. Transform
c. Filter
d. Save to the database
e. Reload from the database
We are considering running these operations on multiple machines. One of the options is to use Spark, but it is not clear how to properly support data serialization and distribute/update the dependencies.
Even if we are able to separate the logic in the objects from the data (making objects easily serializable) the functions we want to run over the objects will still rely on the complex dependencies mentioned above.
Additionally, at least at the moment, we don't have plans to use any operations requiring shuffling of the data between machines and all we need is basically sharding.
Does Spark look like a good fit for such a scenario?
If yes, how should the complex dependencies be handled?
If no, I would appreciate any pointers to alternative systems that can handle this workflow.
I don't understand exactly what you mean by "complex interdependencies", but it seems that if you only need sharding you won't really get much from Spark - just run multiple copies of whatever you have and use a queue to synchronize the work and distribute to each copy the shard it needs to work on.
We did something similar converting a PySpark job to a Kubernetes setup where the queue holds the list of ids and multiple pods (we control the scale via kubectl) read from that queue; we got much better performance and a simpler solution - see https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
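A minimal sketch of that queue-based sharding idea, with the queue abstracted behind a trait (in practice Redis, SQS, or the Kubernetes work queue from the link above); all types and names here are hypothetical placeholders.

```scala
import scala.annotation.tailrec

final case class Dependencies(lookup: Map[String, String])   // stand-in for the heavy, DB-loaded deps
final case class DomainObject(id: Long, payload: String)

trait WorkQueue { def poll(): Option[Long] }                  // next object id to process, if any

class Worker(queue: WorkQueue, deps: Dependencies) {
  // Each worker instantiates the expensive dependencies once and reuses them
  // for every object in its shard.
  private def load(id: Long): DomainObject     = DomainObject(id, deps.lookup.getOrElse("key", ""))
  private def process(obj: DomainObject): Unit = println(s"saving ${obj.id}")   // transform/filter/save

  @tailrec final def run(): Unit = queue.poll() match {
    case Some(id) => process(load(id)); run()
    case None     => ()                                       // queue drained; worker exits
  }
}
```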

Unit test DAO layer without an in-memory database

Is there a way to unit test the DAO layer without an in-memory database like H2? Most of the examples out there use in-memory databases for DAO calls.
It depends on what you would like to test there.
If you would like to verify the result of query execution, you'll need some data, and that requires a physical or in-memory database. You can use JDBC connections for plain queries, or Hibernate to inject an EntityManager into your DAO layer, and then test it. One advantage of an in-memory database is that we define exactly what data will be used, instead of depending on data in a physical database that might change over time.
From another perspective, you might want to check the construction of the queries when there is some logic involved. For this you won't need any data: Mockito method interception and ArgumentCaptor can be used to obtain the constructed (native or HQL) query, which you can then compare with the expected query. I've seen such a solution use XML properties, which also serve as a kind of documentation and as a regression test against accidental changes to sensitive queries.
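A minimal sketch of that second approach - capturing the constructed query with Mockito instead of executing it - written against a hypothetical JPA-style UserDao; no database, in-memory or otherwise, is involved.

```scala
import javax.persistence.EntityManager
import org.mockito.{ArgumentCaptor, Mockito}

// Hypothetical DAO that builds an HQL string and hands it to an injected EntityManager.
class UserDao(em: EntityManager) {
  def findActiveByName(name: String): java.util.List[_] =
    em.createQuery("from User u where u.active = true and u.name = :name")
      .setParameter("name", name)
      .getResultList
}

object QueryConstructionSpec extends App {   // test-framework scaffolding omitted
  val em  = Mockito.mock(classOf[EntityManager], Mockito.RETURNS_DEEP_STUBS)
  val dao = new UserDao(em)
  dao.findActiveByName("Joe")

  // Capture the HQL the DAO built and assert on its shape; no data required.
  val captor = ArgumentCaptor.forClass(classOf[String])
  Mockito.verify(em).createQuery(captor.capture())
  assert(captor.getValue.contains("u.active = true"))
}
```

This only checks the query text the DAO built, so it is fast but will not catch queries that are syntactically valid yet semantically wrong; that is what the database-backed tests above are for.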

Business logic in stored procedures vs. middle layer

I'd like to use Postgres as a web API storage backend. I certainly need (at least some) glue code to implement my REST interface (and/or WebSocket). I am thinking about two options:
1. Implement most of the business logic as stored procedures (PL/pgSQL), with a very thin middle layer handling the REST/WebSocket part.
2. Have the middle layer implement most of the business logic and reach Postgres over its abstract DB interface.
My question is: what are the possible benefits/hindrances of the above designs compared to each other regarding flexibility, scalability, maintainability and availability?
I don't really care about the exact middle layer implementation (it can be PHP, node.js, Python or whatever); I'm interested in the benefits and pitfalls of the actual architectural design choice.
I'm aware that I lose some flexibility by choosing (1), since it would be difficult to port the system to anything other than maybe Oracle, and my users will be bound to Postgres. In my case that's not very important; the database is intended to be an integral part of the system anyway.
I'm especially interested in the benefits lost in case of choosing (2), and possible pitfalls of either case.
I think both options have their benefits and drawbacks.
Approach (2) is good and well known. Most simple applications and web services use it. But sometimes using stored procedures is much better than (2).
Here are some examples which, IMHO, are good to implement with stored procedures:
tracking changes of rows, i.e. you have a table with items that are regularly updated and you want another table with all changes and the dates of those changes for every item.
custom algorithms, if your functions can be used as expressions for indexing data.
sharing some logic between several micro-services. If every micro-service is implemented in a different language, you have to re-implement parts of the business logic for every language and micro-service. Using stored procedures obviously helps to avoid this (see the sketch below).
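As a rough illustration of the shared-logic point (and of option (1) generally), here is a hedged sketch of a thin middle layer in Scala delegating to a Postgres function over plain JDBC; the function name track_item_change and the connection details are hypothetical placeholders.

```scala
import java.sql.DriverManager

object ThinLayer {
  // All validation and change-tracking lives in the database function, so every
  // micro-service that calls it shares the same rules regardless of its language.
  def recordChange(itemId: Long, newPrice: BigDecimal): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret")
    try {
      val stmt = conn.prepareCall("{ call track_item_change(?, ?) }")
      stmt.setLong(1, itemId)
      stmt.setBigDecimal(2, newPrice.bigDecimal)
      stmt.execute()
      stmt.close()
    } finally conn.close()
  }
}
```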
Some benefits of the (2) approach (with some "however"s, of course, to confuse you :D):
You can use your favorite programming language to write business logic.
However: in the (1) approach you can write procedures using the pl/v8, pl/php, pl/python or pl/whatever extension, using your favorite language.
Maintaining application code is easier than maintaining stored procedures.
However: there are good ways to avoid such headaches with code maintenance, e.g. migrations, which are a good thing for either approach.
Also, you can put your functions into their own namespace, so to re-deploy the procedures into the database you just drop and re-create this namespace rather than each function. This can be done with a simple script.
You can use various ORMs to query data and get abstraction layers which can contain much more complex logic and inheritance. In (1) it would be hard to use OOP patterns.
I think this is the most powerful argument against the (1) approach, and I can't add any 'however' to it.

How to escape from ORMs' limitations, or should I avoid them?

In short, ORMs like Entity Framework provide a fast solution but with many limitations. When should they (ORMs) be avoided?
I want to create the engine of a DMS system, and I wonder how I should create the business logic layer.
I'll discuss the following options:
Use Entity Framework and provide it as the business layer for the engine's clients.
The problem is losing control over the properties and the validation, because it's generated code.
Create my own business layer classes manually, without using Entity Framework or any ORM:
The problem is that it's a hard task and feels like reinventing the wheel.
Create my own business layer classes on top of Entity Framework (using it).
The problem seems to be code repetition: creating new classes with the same names, where every property duplicates the one generated by the ORM.
Am I discussing the problem in the right way?
In short, ORMs should be avoided when:
your program will perform bulk inserts/updates/deletes (such as insert-selects, and updates/deletes that are conditional on something non-unique). ORMs are not designed to do these kinds of bulk operations efficiently; you will end up deleting each record one at a time.
you are using highly custom data types or conversions. ORMs are generally bad at dealing with BLOBs, and there are limits to how they can be told how to "map" objects.
you need the absolute highest performance in your communication with SQL Server. ORMs can suffer from N+1 problems and other query inefficiencies, and overall they add a layer of (usually reflective) translation between your request for an object and a SQL statement which will slow you down.
ORMs should instead be used in most cases of application-based record maintenance, where the user is viewing aggregated results and/or updating individual records, consisting of simple data types, one at a time. ORMs have the extreme advantage over raw SQL in their ability to provide compiler-checked queries using Linq providers; virtually all of the popular ORMs (Linq2SQL, EF, NHibernate, Azure) have a Linq query interface that can catch a lot of "fat fingers" and other common mistakes in queries that you don't catch when using "magic strings" to form SQLCommands. ORMs also generally provide database independence. Classic NHibernate HBM mappings are XML files, which can be swapped out as necessary to point the repository at MSS, Oracle, SQLite, Postgres, and other RDBMSes. Even "fluent" mappings, which are classes in code files, can be swapped out if correctly architected. EF has similar functionality.
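The compile-checked-query benefit is not LINQ-specific; for readers coming from the Scala/Slick side of this thread, lifted queries give the same protection. A minimal sketch with a hypothetical coffees table:

```scala
import slick.jdbc.PostgresProfile.api._

class Coffees(tag: Tag) extends Table[(String, Double)](tag, "coffees") {
  def name  = column[String]("name")
  def price = column[Double]("price")
  def *     = (name, price)
}

object CompileCheckedQuery {
  val coffees = TableQuery[Coffees]

  // Checked at compile time: comparing price to a String, or referencing a
  // misspelled column such as `_.prize`, is a compile error rather than a
  // runtime exception from a "magic string" query.
  val cheapNames = coffees.filter(_.price < 3.0).map(_.name).result
}
```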
So are you asking how to do "X" without doing "X"? ORM is an abstraction, and like any other abstraction it has disadvantages, but not the ones you mentioned.
Code (EFv4) can be generated by a T4 template, and a T4 template is code that can be modified.
The generated code is a partial class which can be combined with your own partial part containing your logic.
Writing classes manually is a very common case - using the designer available in Entity Framework is rarer.
Disclaimer: I work at Mindscape, which builds the LightSpeed ORM for .NET.
As you don't ask about a specific issue, but about approaches to solving the flexibility problem with an ORM, I thought I'd chime in with some views from a vendor perspective. It may or may not be of use to you, but it might give some food for thought :-)
When designing an O/R mapper it's important to take into consideration what we call "escape hatches". An ORM will inevitably push a certain set of default behaviours, which is one way that developers gain productivity.
One of the lessons we have learned with LightSpeed has been where developers need those escape hatches. For example, KeithS here states that ORMs are not good for bulk operations - and in most cases this is true. We had this scenario come up with some customers and added an overload to our Remove() operation that allows you to pass in a query that removes all matching records. This saves having to load entities into memory and delete them one by one. Listening to where developers are having pain and helping solve those problems quickly is important for building solid solutions.
All ORMs should batch queries efficiently. Having said that, we have been surprised to see that many don't. This is strange, given that batching can often be done rather easily: several queries can be bundled up and sent to the database at once to save round trips. This is something we've done since day one for any database that supports it. That's just an aside to the point about batching made in this thread; the quality of those batched queries is the real challenge and, frankly, there are some terrible SQL statements being generated by some ORMs.
Overall you should select an ORM that gives you immediate productivity gains (almost demo-ware style: 'see, I queried data in 30 seconds!') but that has also paid attention to larger-scale solutions, which is where escape hatches and some of the less demoed but hugely useful features are needed.
I hope this post hasn't come across as too salesy, but I wanted to draw attention to the thought process behind any product when you select it. If the philosophy matches the way you need to work, then you're probably going to be happier than with one that does not.
If you're interested, you can learn about our LightSpeed ORM for .NET.
In my experience you should avoid using an ORM when your application does the following kinds of data manipulation:
1) Bulk deletes: most ORM tools won't truly delete the data; they mark it with a garbage-collect ID (GC record) to keep the database consistent. Worse, the ORM collects all the data you want to delete before marking it as deleted. That means that if you want to delete 1,000,000 rows, the ORM will first fetch the data, load it into your application, mark it as GC and then update the database, which I believe is a huge waste of resources (see the sketch after this list).
2) Bulk inserts and data import: most ORM tools will run business-layer validations on the business classes. This is good if you want to validate one record, but if you are going to insert/import hundreds or even millions of records the process could take days.
3) Report generation: ORM tools are fine for simple list reports or simple table joins, as in an order/order_details scenario. In most cases, though, the ORM will only slow down the retrieval of the data and add more joins than you need for the report, which translates into giving the DB engine more work than you would with a plain SQL approach.
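As a rough illustration of points 1) and 2), the usual escape hatch is to bypass the ORM for bulk work and issue one set-based statement; below is a hedged sketch over plain JDBC, with the orders table and connection details as hypothetical placeholders.

```scala
import java.sql.DriverManager

object BulkDelete {
  def purgeArchived(): Int = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret")
    try {
      // One set-based DELETE: no fetching rows into the application, no per-row round trips.
      val stmt    = conn.prepareStatement("delete from orders where status = ?")
      stmt.setString(1, "archived")
      val deleted = stmt.executeUpdate()
      stmt.close()
      deleted
    } finally conn.close()
  }
}
```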