Unit testing a DAO layer without an in-memory database - JUnit 4

Is there a way to unit test a DAO layer without an in-memory database like H2? Most of the examples out there use in-memory databases for DAO calls.

It depends on what you want to test.
If you want to verify the result of query execution, you need some data, and that requires a physical or in-memory database. You can use plain JDBC connections for simple queries, or Hibernate with an EntityManager injected into your DAO layer, and then test against that. One advantage of an in-memory database is that you define exactly which data is used, instead of depending on the contents of a physical database, which may change over time.
On the other hand, you might want to check how the queries are constructed when some logic is involved. For this you don't need any data: Mockito method interception and ArgumentCaptor can be used to obtain the constructed (native or HQL) query, which you then compare with the expected query. I've seen such a solution using XML properties, which also serve as a kind of documentation and as a regression test against accidental changes to sensitive queries.
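The query-construction approach can be sketched without any database. This is a minimal illustration in Python, where `unittest.mock` plays the role that Mockito and ArgumentCaptor play in Java; `UserDao`, its `find_users` method and the `users` table are hypothetical names invented for the example, not part of any real project.

```python
from unittest.mock import Mock


class UserDao:
    """Hypothetical DAO that builds its query from optional filters."""

    def __init__(self, connection):
        self.connection = connection

    def find_users(self, name=None, min_age=None):
        clauses, params = [], []
        if name is not None:
            clauses.append("name = ?")
            params.append(name)
        if min_age is not None:
            clauses.append("age >= ?")
            params.append(min_age)
        sql = "SELECT id, name, age FROM users"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return self.connection.execute(sql, params).fetchall()


def captured_query(**filters):
    """Run the DAO against a mock connection and return (sql, params)."""
    connection = Mock()  # no real database behind this object
    UserDao(connection).find_users(**filters)
    connection.execute.assert_called_once()
    sql, params = connection.execute.call_args.args
    return sql, params
```

A test then compares the captured SQL text and parameters with the expected ones for each combination of filters. This only verifies how the query is built; whether the query returns the right rows still needs a real or in-memory database.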

Related

Is it possible to use Spark to process complex entities with complex dependencies?

Consider a scenario (objects and dependencies are Scala classes):
There is a set of dependencies which themselves require a significant amount of data to be instantiated (data coming from a database).
There is a set of objects with complex nested hierarchy which store references to those dependencies.
The current workflow consists of:
1. Loading the dependencies' data from a database and instantiating them (in a pretty complex way with interdependencies).
2. Loading object data from the database and instantiating objects using the previously loaded dependencies.
3. Running operations on a list of objects like:
a. Search with a complex predicate
b. Transform
c. Filter
d. Save to the database
e. Reload from the database
We are considering running these operations on multiple machines. One of the options is to use Spark, but it is not clear how to properly support data serialization and distribute/update the dependencies.
Even if we are able to separate the logic in the objects from the data (making objects easily serializable) the functions we want to run over the objects will still rely on the complex dependencies mentioned above.
Additionally, at least at the moment, we don't have plans to use any operations requiring shuffling of the data between machines and all we need is basically sharding.
Does Spark look like a good fit for such a scenario?
If yes, how should we handle the complex dependencies?
If not, I would appreciate any pointers to alternative systems that can handle the workflow.
I don't fully understand what you mean by "complex interdependencies", but it seems that if you only need sharding you won't really get much from Spark. Just run multiple copies of whatever you have and use a queue to synchronize the work and hand each copy the shard it needs to work on.
We did something similar when converting a PySpark job to a Kubernetes setup, where a queue holds the list of ids and multiple pods (we control the scale via kubectl) read from that queue. We got much better performance and a simpler solution - see https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
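The queue-and-workers idea can be sketched in a single process. This is only an illustration, under the assumption that threads stand in for the pods and an in-memory `queue.Queue` stands in for the external queue; `process_shard` is a hypothetical placeholder for the real per-object work.

```python
import queue
import threading


def process_shard(object_id):
    """Placeholder for the real per-object work (search, transform, save)."""
    return object_id * 2


def run_workers(ids, worker_count=4):
    work = queue.Queue()
    for object_id in ids:
        work.put(object_id)

    results = {}
    lock = threading.Lock()

    def worker():
        # A real worker would load its own copy of the dependencies once here.
        while True:
            try:
                object_id = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = process_shard(object_id)
            with lock:
                results[object_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each worker pulls the next id whenever it is free, so no data moves between workers and the load balances itself. Scaling up means starting more workers against the same queue.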

How to create a caching layer on top of Slick that could be applied globally?

I am curious whether, using Scala and Slick, you could create a flexible caching layer (say, using memcached) on top of Slick.
Ruby has a cool library called IdentityCache: https://github.com/Shopify/identity_cache
It allows you to simply extend your model class (a trait in Scala?) to tell it to use this cache layer.
You can then tell it to cache only by id, or to cache associations as well, etc.
It sounds like a very cool thing. How could something like this fit into Slick's design?
I was thinking about how to add this to Slick lately, but we don't have any resources assigned to it for the foreseeable future.
You could build a query cache on top of Slick. Invalidating the cache based on the observed write operations on the base data can be very hard for arbitrary queries. You would need to restrict the supported operations for conditions in cached queries, e.g. to only use equality. Oracle and others have similar restrictions in place for their materialized view maintenance features.
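The equality-only restriction can be illustrated with a small sketch. It is written in Python rather than against Slick's API, and `EqualityQueryCache` and its methods are hypothetical names; a dictionary of row lists stands in for the database, and only inserts are handled.

```python
class EqualityQueryCache:
    """Caches results of queries of the form: table WHERE column = value."""

    def __init__(self, rows_by_table):
        self.rows_by_table = rows_by_table  # stand-in for the database
        self.cache = {}
        self.hits = 0

    def find(self, table, column, value):
        key = (table, column, value)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = [row for row in self.rows_by_table[table] if row[column] == value]
        self.cache[key] = result
        return result

    def insert(self, table, row):
        self.rows_by_table[table].append(row)
        # Only cached queries whose equality condition matches the new row
        # can have changed, so only those keys are dropped.
        for column, value in row.items():
            self.cache.pop((table, column, value), None)
```

Because every cached query is a single equality test, a write maps directly to the cache keys it can affect. With range conditions or joins there is no such direct mapping, which is why invalidation becomes hard for arbitrary queries. Updates and deletes would also need the old row values to find the affected keys.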

How to escape from ORMs' limitations, or should I avoid them?

In short, ORMs like Entity Framework provide a fast solution but with many limitations. When should they (ORMs) be avoided?
I want to create the engine of a DMS system, and I wonder how I should create the business logic layer.
I'll discuss the following options:
Using Entity Framework and exposing it as the business layer for the engine's clients.
The problem is the lack of control over the properties and the validation, because it's generated code.
Creating my own business layer classes manually, without using Entity Framework or any ORM.
The problem is that it's a hard task and something like reinventing the wheel.
Creating my own business layer classes on top of Entity Framework (using it).
The problem seems to be code repetition: creating new classes with the same names, where every property wraps the corresponding one generated by the ORM.
Am I approaching the problem in the right way?
In short, ORMs should be avoided when:
your program will perform bulk inserts/updates/deletes (such as insert-selects, and updates/deletes that are conditional on something non-unique). ORMs are not designed to do these kinds of bulk operations efficiently; you will end up deleting each record one at a time.
you are using highly custom data types or conversions. ORMs are generally bad at dealing with BLOBs, and there are limits to how they can be told how to "map" objects.
you need the absolute highest performance in your communication with SQL Server. ORMs can suffer from N+1 problems and other query inefficiencies, and overall they add a layer of (usually reflective) translation between your request for an object and a SQL statement which will slow you down.
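The bulk-operation point can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module instead of an ORM and SQL Server, and the `orders` table and its `status` column are made up for the illustration; the first function imitates what a load-then-delete loop amounts to, the second issues one conditional statement.

```python
import sqlite3


def make_db(row_count):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        [(i, "stale" if i % 2 == 0 else "open") for i in range(row_count)],
    )
    return conn


def delete_one_at_a_time(conn):
    """Load the matching keys, then delete each record by key."""
    ids = [row[0] for row in conn.execute("SELECT id FROM orders WHERE status = 'stale'")]
    for order_id in ids:
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
    return 1 + len(ids)  # number of statements issued


def delete_set_based(conn):
    """The same delete expressed as a single conditional statement."""
    conn.execute("DELETE FROM orders WHERE status = 'stale'")
    return 1  # number of statements issued
```

Both functions leave the table in the same state, but the first issues one statement per matching row plus the initial query, while the second issues exactly one. Against a remote server each of those statements is also a round trip.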
ORMs should instead be used in most cases of application-based record maintenance, where the user is viewing aggregated results and/or updating individual records, consisting of simple data types, one at a time. ORMs have the extreme advantage over raw SQL in their ability to provide compiler-checked queries using Linq providers; virtually all of the popular ORMs (Linq2SQL, EF, NHibernate, Azure) have a Linq query interface that can catch a lot of "fat fingers" and other common mistakes in queries that you don't catch when using "magic strings" to form SQLCommands. ORMs also generally provide database independence. Classic NHibernate HBM mappings are XML files, which can be swapped out as necessary to point the repository at MSS, Oracle, SQLite, Postgres, and other RDBMSes. Even "fluent" mappings, which are classes in code files, can be swapped out if correctly architected. EF has similar functionality.
So are you asking how to do "X" without doing "X"? An ORM is an abstraction and, like any other abstraction, it has disadvantages, but not the ones you mentioned.
Code (EFv4) can be generated by a T4 template, and a T4 template is itself code that can be modified.
The generated code is a partial class, which can be combined with your own partial part containing your logic.
Writing classes manually is a very common case; using the designer available in Entity Framework is rarer.
Disclaimer: I work at Mindscape that builds the LightSpeed ORM for .NET
As you don't ask about a specific issue, but about approaches to solving the flexibility problem with an ORM I thought I'd chime in with some views from a vendor perspective. It may or may not be of use to you but might give some food for thought :-)
When designing an O/R mapper it's important to take into consideration what we call "escape hatches". An ORM will inevitably push a certain set of default behaviours, which is one way that developers gain productivity.
One of the lessons we have learned with LightSpeed has been where developers need those escape hatches. For example, KeithS here states that ORMs are not good for bulk operations - and in most cases this is true. We had this scenario come up with some customers and added an overload to our Remove() operation that allowed you to pass in a query that removes all records that match. This saved having to load entities into memory and delete them. Listening to where developers are having pain and helping solve those problems quickly is important for helping build solid solutions.
All ORMs should efficiently batch queries. Having said that, we have been surprised to see that many ORMs don't. This is strange given that batching can often be done rather easily, and several queries can be bundled up and sent to the database at once to save round trips. This is something we've done since day 1 for any database that supports it. That's just an aside to the point about batching made in this thread. The quality of those batched queries is the real challenge and, frankly, there are some TERRIBLE SQL statements being generated by some ORMs.
Overall you should select an ORM that gives you immediate productivity gains (almost demo-ware styled 'see I queried data in 30s!') but has also paid attention to larger scale solutions which is where escape hatches and some of the less demoed, but hugely useful features are needed.
I hope this post hasn't come across too salesy, but I wanted to draw attention to taking into account the thought process that goes behind any product when selecting it. If the philosophy matches the way you need to work then you're probably going to be happier than selecting one that does not.
If you're interested, you can learn about our LightSpeed ORM for .NET.
In my experience you should avoid using an ORM when your application does the following kinds of data manipulation:
1) Bulk deletes: most ORM tools won't truly delete the data; they mark it with a garbage-collection ID (GC record) to keep the database consistent. Worse, the ORM collects all the data you want to delete before it marks it as deleted. That means that if you want to delete 1,000,000 rows, the ORM will first fetch the data, load it into your application, mark it as GC and then update the database, which I believe is a huge waste of resources.
2) Bulk inserts and data import: most ORM tools run business-layer validations on the business classes. This is good if you want to validate one record, but if you are going to insert or import hundreds of thousands or even millions of records, the process could take days.
3) Report generation: ORM tools are good for creating simple list reports or simple table joins, as in an order/order_details scenario. In most cases, though, the ORM will only slow down data retrieval and add more joins than you need for a report. That translates into more work for the DB engine than a plain SQL approach would require.

Entity Framework Code First - Reducing round trips with .Load() and .Local

I'm setting up a new application using Entity Framework Code First and I'm looking at ways to reduce the number of round trips to the SQL Server as much as possible.
When I first read about the .Local property here I got excited about the possibility of bringing down entire object graphs early in my processing pipeline and then using .Local later without ever having to worry about incurring the cost of extra round trips.
Now that I'm playing around with it I'm wondering if there is any way to take down all the data I need for a single request in one round trip. If for example I have a web page that has a few lists on it, news and events and discussions. Is there a way that I can take down the records of their 3 unrelated source tables into the DbContext in one single round trip? Do you all out there on the interweb think it's perfectly fine when a single page makes 20 round trips to the db server? I suppose with a proper caching mechanism in place this issue could be mitigated against.
I did run across a couple of cracks at returning multiple results from EF queries in one round trip but I'm not sure the complexity and maturity of these kinds of solutions is worth the payoff.
In general in terms of composing datasets to be passed to MVC controllers do you think that it's best to simply make a separate query for each set of records you need and then worry about much of the performance later in the caching layer using either the EF Caching Provider or asp.net caching?
It is completely OK to make several DB calls if you need them. If you are afraid of multiple round trips, you can either write a stored procedure that returns multiple result sets (this doesn't work with default EF features) or execute your queries asynchronously (run multiple independent queries at the same time). Loading unrelated data with a single LINQ query is not possible.
One more note: if you decide to use the asynchronous approach, make sure that you use a separate context instance in each asynchronous execution. Asynchronous execution uses a separate thread, and the context is not thread safe.
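The "one context per asynchronous execution" rule can be sketched by analogy. This is not EF code: it uses Python's `sqlite3`, whose connections by default also refuse to be used from a thread other than the one that created them, and the `news`, `events` and `discussions` tables are stand-ins taken from the page described in the question.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

TABLES = ("news", "events", "discussions")


def create_sample_db(path):
    conn = sqlite3.connect(path)
    for table in TABLES:  # fixed names, never user input
        conn.execute(f"CREATE TABLE {table} (title TEXT)")
        conn.execute(f"INSERT INTO {table} VALUES (?)", (f"first {table} item",))
    conn.commit()
    conn.close()


def load_titles(path, table):
    # A fresh connection per task: it is never shared across threads.
    conn = sqlite3.connect(path)
    try:
        return [row[0] for row in conn.execute(f"SELECT title FROM {table}")]
    finally:
        conn.close()


def load_page(path):
    """Run the three unrelated queries concurrently, one connection each."""
    with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
        futures = {table: pool.submit(load_titles, path, table) for table in TABLES}
        return {table: future.result() for table, future in futures.items()}


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "page.db")
        create_sample_db(path)
        return load_page(path)
```

The queries overlap in time, so the page waits roughly for the slowest one instead of the sum of all three. The cost is one connection (or context) per concurrent query.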
I think you are doing a lot of work for little gain if you don't already have a performance problem. Yes, pay attention to what you are doing and don't make unnecessary calls. The actual connection and across the wire overhead for each query is usually really low so don't worry about it.
Remember "Premature optimization is the root of all evil".
My rule of thumb is that executing a call for each collection of objects you want to retrieve is ok. Executing a call for each row you want to retrieve is bad. If your web page requires 20 collections then 20 calls is ok.
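That being said, reducing this to one call would not be difficult if you use the Translate method. Code something like this would work:
// GetADataReader stands for whatever code runs the batched SQL and returns a DbDataReader.
// context is an ObjectContext (from a DbContext: ((IObjectContextAdapter)db).ObjectContext).
var reader = GetADataReader(sql);
// Translate is lazy, so materialize each result set before advancing the reader.
var firstCollection = context.Translate<whatever1>(reader).ToList();
reader.NextResult();
var secondCollection = context.Translate<whatever2>(reader).ToList();
// ...and so on for each further result set.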
The big down side to doing this is that if you place your sql into a stored proc then your stored procs become very specific to your web pages instead of being more general purpose. This isn't the end of the world as long as you have good access to your database. Otherwise you could just define your sql in code.

How would you use EF in a typical Business Layer/Data Access Layer/Stored Procedures set up?

Whenever I watch a demo regarding the Entity Framework the demonstrator simply sets up some tables and performs Inserts, Updates and Deletes using automatically created code stubs but never shows any use of stored procedures. It seems to me that this is executing SQL from the client.
In my experience this is not particularly good practice, so I am presuming that my understanding of the Entity Framework is wrong.
Similarly, WCF RIA Services demos use the EF and the demos are always the same. Can anyone shed any light on how you would use EF in a typical Business Layer/Data Access Layer/Stored Procedures setup?
I think I am confused and shouldn't be!!?
There's nothing wrong with executing SQL from the client. Most (if not all) of the problems that it might cause are in fact not there when using something like EF. For instance:
Client-generated SQL might cause runtime syntax errors. This is unlikely, since the description of your query is mostly checked at compile time (assuming that the generator itself doesn't generate invalid SQL, which is also unlikely).
Client-generated SQL might be inefficient. This is not true with modern database software, which has query caches. EF works in a way that's compatible with query caches, i.e. it generates the same SQL consistently (as long as you use the same code consistently) and uses parameters for varying data.
Client generated SQL might be insecure (SQL injections and whatnot). This is all handled by the generator, which uses parameters for your values and does not interpolate user input into the query itself.
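The difference between interpolating user input and passing it as a parameter can be shown with a short sketch. It uses Python's built-in `sqlite3` rather than EF, and the `users` table and both helper functions are made up for the illustration.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_interpolated(conn, name):
    # Unsafe: the user input becomes part of the SQL text itself.
    sql = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(sql)]


def find_parameterized(conn, name):
    # Safe: the SQL text is constant and the value travels as a parameter.
    sql = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(sql, (name,))]
```

Calling both with the input `x' OR '1'='1` shows the effect: the interpolated version returns every row, while the parameterized version looks for a user literally named that way and finds none. The constant SQL text is also what lets the database reuse a cached plan for every call.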
Back in the old Client / Server days, it used to be considered good practice to do all db updates using stored procedures.
But now, it's perfectly acceptable to have an O/RM generate SQL and run directly against DB.
Well, part of the reason why executing sql in stored procedures is a good idea is that it gives you a level of abstraction - when db changes inevitably occur, you make a change in a single place (the proc) rather than a dozen places (all the places where you were calling the client sql). Entity Framework provides this layer of abstraction through the data model, and you have the same advantage.
There are some other reasons why you might want to look at procs, like security granularity (only allowing certain users the right to execute), and some minor performance differences. Ultimately, you have to decide for yourself what the right trade-off is. EF is an attempt to dramatically reduce the developer time spent creating a data layer, with the trade-offs listed above.
"never shows any use of stored procedures"
Take a look at this video: Using Your Own Stored Procedures to Insert, Update and Delete Entities in Entity Framework.
Note that there are a lot of other videos on that topic there that are certainly worth watching!
The legend is that Scott Hanselman once said "It's not a real demo unless someone drags a datagrid" (pg 478 Silverlight 4 In Action, Pete Brown)
You have to remember that demos are all about selling software, and not at all about communicating best practice. So your observations about the demos are absolutely correct: they cover the basics and leave it to the observer to fill in the blanks.
As to your comment about stored procedures, and the various answers to your question about the generator: the generator is good, and getting better. However, there are certain circumstances in which it will generate completely unusable queries (see my SO question here, also discussed on the ADO.NET team blog).
Therefore there are occasions when hand-crafted queries are your only recourse (by way of stored procedures, table-valued functions, views, etc.).