JPA PersistenceContext in a distributed environment

Based on my understanding, transactions are not flushed immediately once they are completed. They sit in a cache in memory and only get written to the DB when the EntityManager determines that it is cost effective to do so. I believe the L1 cache is utilized in this case, but correct me if I'm wrong.
My question is, in a distributed environment, is the cache used by the Persistence Context distributed?

The L1 cache (session cache, persistence context) always works the same way, whether or not your environment is distributed. A session cache belongs to a single session, and you can have multiple sessions, either on the same machine or on different machines, so distribution doesn't change its behavior.
In a distributed environment you need to care about the second-level cache, if you use one.
If you run your application in a cluster, you need a cluster-capable L2 cache implementation, if your JPA provider supports it (see, for example, 21.2. The Second Level Cache in the Hibernate documentation).
If you have other applications accessing the same database, you need to carefully configure caching strategies to avoid inconsistency in critical cases and tolerate possible inconsistency in other cases.
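To make the distinction concrete, here is a minimal sketch in Java (the Product entity and persistence-unit name are illustrative assumptions, not from the question). It shows that the L1 cache is scoped to a single EntityManager, while the L2 cache, if enabled, is shared by all EntityManagers from the same factory:

import javax.persistence.*;

// Opt this entity into the shared (L2) cache. This assumes
// <shared-cache-mode>ENABLE_SELECTIVE</shared-cache-mode> in persistence.xml
// and a cache provider configured for your JPA implementation.
@Entity
@Cacheable
class Product {
    @Id
    Long id;
    String name;
}

public class CacheScopes {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("my-unit");
        EntityManager em1 = emf.createEntityManager();
        EntityManager em2 = emf.createEntityManager();

        Product a = em1.find(Product.class, 1L); // loads from DB (or L2 cache), kept in em1's L1 cache
        Product b = em1.find(Product.class, 1L); // same EntityManager: served from its L1 cache, same instance
        Product c = em2.find(Product.class, 1L); // different EntityManager: its own, separate L1 cache
        System.out.println((a == b) + " " + (a == c)); // true false
    }
}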

What is the correct way to enable query cache?

Based on the documentation, the SUPER privilege is not supported, which means that the following query:
SET GLOBAL query_cache_size = 1000000;
results in an error message
Access denied; you need (at least one of) the SUPER privilege(s) for this operation
and does not allow us to set the query cache size.
What's the correct way to accomplish the task?
Unfortunately, Cloud SQL does not support query caching and query_cache_size cannot be set.
If you are experiencing performance issues, you can try changing your instance tier to give your instance access to more resources. Also, it is preferable to use InnoDB over MyISAM tables. The reason is that when a Cloud SQL instance is started, it gives most of the available memory to the InnoDB buffer pool.
As mhalt hints at, there is a good reason not to use the query cache:
You should be using InnoDB rather than MyISAM, as MyISAM is not robust enough for the cloud environment.
InnoDB has built-in caching as part of its buffer pool. This caches individual pages of data, rather than entire result sets.
The buffer pool generally provides superior caching to the query cache: 1) it does not get flushed after writes; 2) multiple different queries can be served from the same cache entries; 3) it supports partial caching when the active set is larger than the available RAM.
The only workload where the query cache is superior is if you have a very low write rate and almost all your queries are exactly the same.
For this reason Cloud SQL is optimized by maximizing RAM allocated to the buffer pool instead of having a query cache.
Cloud SQL now supports query_cache flags:
https://cloud.google.com/sql/docs/mysql/flags
Note, however, that setting these flags may void the SLA coverage.

How does the storage backend influence Datomic?

How should I pick the backend storage service for Datomic?
Is it a matter of preference to select, say, DynamoDB instead of Postgres, or does each option have different tradeoffs? If so, what are they?
Storage Services Requirements
Datomic's storage services should generally meet 3 requirements:
Implement key-value store semantics: efficient read/write access to values by indexed key
Support consistent reads, e.g. read-your-own-writes; ideally, no-contention/lock-free reads
Support conditional puts, e.g. for optimistic locking + snapshot isolation
Datomic uses storage services to store blocks of sorted, compressed datoms, similar to the way traditional database systems use file systems, and the requirements above are pretty much the API between the underlying storage service and Datomic. So the choice of storage service depends on how well it supports those three requirements.
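The three requirements amount to a small storage contract. Here is a hypothetical sketch of that contract in Java (the interface and method names are illustrative, not Datomic's actual SPI):

import java.util.Optional;

// Minimal contract a Datomic-style storage backend must satisfy.
public interface BlockStore {
    // Requirement 1: efficient indexed read/write of opaque blocks.
    Optional<byte[]> get(String key);
    void put(String key, byte[] value);

    // Requirement 3: conditional put (compare-and-swap), enabling
    // optimistic locking; succeeds only if the currently stored value
    // still equals 'expected'.
    boolean putIfEquals(String key, byte[] expected, byte[] newValue);
}

// Requirement 2 (consistent reads) is a property of the implementation:
// a get() that follows a successful put() of the same key must observe it.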
Write Scalability
Datomic doesn't usually put a lot of write pressure on the underlying storage service, since there's only one component writing to it, the Transactor. Also, Datomic uses a background indexing job to integrate novelty into storage once enough of it has accumulated (by default ~32MB, but this is configurable), which further reduces the constant write load. The only thing Datomic writes immediately is the transaction log.
Read Scalability
Datomic uses multiple layers of caching (e.g. memcached and the peers' local caches), so in ideal circumstances, i.e. when the working set fits in memory, the system won't put a lot of read pressure on storage either.
System Load
If your system doesn't require huge write scalability and your application data tends to fit in memory, then the choice of a particular storage service is irrelevant except, of course, for its operational capabilities (backups, admin tools, etc.), which have nothing to do with Datomic.
If, on the other hand, your system does require huge write scalability, or you have a great number of peers, each of them working with more data than can fit in their memory (forcing a lot of data segments to be brought in from storage), you'll require a storage system that can scale horizontally, e.g. DynamoDB. As mentioned in one of the comments, if you need arbitrary write scalability, Datomic is not the right system for you anyway.

Data Synchronization in a Distributed system

We have a REST-based application built on the Restlet framework which supports CRUD operations. It uses a local file to store the data.
Now the requirement is to deploy this application on multiple VMs, and any update operation in one VM needs to be propagated to the application instances running on the other VMs.
Our idea for solving this was to send multiple POST messages (to all the other application instances) whenever an update operation happens in a given VM.
The assumption here is that each application instance has a list of the URLs of all the other instances.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short story: get your favorite RDBMS (for example, MySQL is popular) and have your app servers connect to it in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
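As a minimal sketch of that advice, here is a complex update wrapped in a JDBC transaction (the connection URL, the accounts table, and the transfer inputs are illustrative assumptions):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDao {
    // Moves 'amount' between two rows atomically: either both updates
    // commit, or neither does.
    static void transfer(long fromId, long toId, BigDecimal amount) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db-host/app", "user", "pass")) {
            conn.setAutoCommit(false); // group both updates into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setBigDecimal(1, amount);
                debit.setLong(2, fromId);
                debit.executeUpdate();
                credit.setBigDecimal(1, amount);
                credit.setLong(2, toId);
                credit.executeUpdate();
                conn.commit();   // both updates become visible atomically, on every app server
            } catch (SQLException e) {
                conn.rollback(); // on failure, neither update is applied
                throw e;
            }
        }
    }
}

Because all VMs talk to the same database, every instance sees the committed state; no POST fan-out between instances is needed.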
The long story: the three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than their write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more common reads can go to any of the read slaves.
For services with evenly mixed read/write traffic, you may be better served by dropping some of the conveniences (and accompanying restrictions) that formal SQL provides and instead using one of the various "nosql" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should find out more details about each and decide whether its facilities and trade-offs are appropriate for your purpose:
1. Perform the CRUD operations on a common RDBMS. Simplest and most consistent.
2. Perform the CRUD operations on a common RDBMS that runs as a fast in-memory RDBMS, e.g. TimesTen from Oracle.
3. Perform the CRUD operations on a distributed cache, or your own home-cooked distributed hash table, that can guarantee synchronization, e.g. Hazelcast/Ehcache and others.
4. Use a fast common state server like Redis/memcached, perform your updates on it in a synchronized manner, and write the successful operations out to a DB lazily if required.
5. Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (e.g. Postgres) that runs underneath and syncs all of your updates fairly fast.
6. Target eventual consistency and use a distributed data store like Cassandra, which lets you choose the consistency level you require.
7. Use distributed consensus algorithms like Paxos or Raft, or an implementation of the same (recommended) like ZooKeeper or etcd respectively, and take ownership of the item you want to change from each REST server before you perform the CRUD operation (see the sketch after this list). This might be a bit slow though, and is similar to what Cassandra can give you.
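As an illustration of option 7, here is a minimal sketch using Apache Curator's InterProcessMutex recipe on top of ZooKeeper (the connection string and lock path are illustrative assumptions):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class OwnedUpdate {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Take cluster-wide ownership of the entity before mutating it; every
        // REST server acquiring the same lock path is serialized by ZooKeeper.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/entity-42");
        lock.acquire();
        try {
            // perform the CRUD operation on the shared store here
        } finally {
            lock.release(); // hand ownership back so other servers can proceed
        }
        client.close();
    }
}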

Why memcached instead of a hashmap?

I am trying to understand what would be the need to go with a solution like memcached. It may seem like a silly question - but what does it bring to the table if all I need is to cache objects? Won't a simple hashmap do?
Quoting from the memcache web site, memcache is…
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering. Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.
At heart it is a simple Key/Value store.
A key word here is distributed. In general, quoting from the memcache site again,
Memcached servers are generally unaware of each other. There is no crosstalk, no synchronization, no broadcasting. The lack of interconnections means adding more servers will usually add more capacity as you expect. There might be exceptions to this rule, but they are exceptions and carefully regarded.
I would highly recommend reading the detailed description of memcache.
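The "no crosstalk" design works because the client, not the servers, decides where each key lives, typically by hashing the key over the server list. A toy sketch of the idea in Java (real clients use consistent hashing so that adding a server remaps as few keys as possible):

import java.util.List;

public class KeyRouter {
    // Toy client-side sharding: each key maps to exactly one server,
    // so the servers never need to talk to each other.
    static String pickServer(String key, List<String> servers) {
        int idx = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(idx);
    }

    public static void main(String[] args) {
        List<String> servers = List.of("cache1:11211", "cache2:11211");
        // The same key always routes to the same server.
        System.out.println(pickServer("user:42", servers));
    }
}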
Where are you going to put this hashmap? That's what it's doing for you. Any structure you implement on PHP is only there until the request ends. If you throw stuff in a persistent cache, you can fetch it back out for other requests, instead of rebuilding the data.
I know that this question is rather old, but in addition to being able to share a cache across multiple servers, there is another aspect not mentioned in the other answers: value expiration.
If you store the values in a HashMap bound to the application context, it will keep growing in size unless you expire items somehow. Memcached expires objects lazily for maximum performance.
When an item is added to memcached, it can have an expiration time, for instance 600 seconds. After the item expires it simply remains in memory, but if a client asks for it, memcached purges it and returns nothing (a cache miss).
Similarly, when memcached's memory is full, it looks for the first expired item of adequate size and evicts it to make room for the new item. Lastly, it can also happen that the cache is full and there isn't any expired item to evict, in which case it replaces the least recently used items.
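For example, with a Java client such as spymemcached (the server address and key are illustrative assumptions), storing a value with an expiration looks like this:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ExpiryDemo {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(
                new InetSocketAddress("localhost", 11211));

        // Store under "user:42" with a 600-second expiration.
        cache.set("user:42", 600, "{\"name\":\"Alice\"}");

        Object hit = cache.get("user:42"); // the cached value, until it expires
        // ...after 600 seconds, get() returns null and the entry is lazily purged
        System.out.println(hit);
        cache.shutdown();
    }
}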
Using a fully fledged cache system usually allows you to replicate the cache across many servers, or simply scale out to many servers to handle a lot of parallel requests, all while remaining acceptably fast in terms of response time.
There is an (old) article that compares different caching systems used by php:
https://www.percona.com/blog/2006/08/09/cache-performance-comparison/
Basically, file caching is faster than memcached.
So to answer the question, I believe you would get better performance using a file-based cache system.
Here are the results from the tests of the article:
Cache Type                          Cache Gets/sec
Array Cache                                 365000
APC Cache                                    98000
File Cache                                   27000
Memcached Cache (TCP/IP)                     12200
MySQL Query Cache (TCP/IP)                    9900
MySQL Query Cache (Unix Socket)              13500
Selecting from table (TCP/IP)                 5100
Selecting from table (Unix Socket)            7400

Are Rose::DB::Object::Cached objects cached across different processes?

Are RDBOC objects cached across different processes? I'm thinking of running it under mod_perl, and this would factor into things, even though it would mostly be used on things that don't change (much).
Also, do relationships referencing RDBOCs use the cache when they intuitively should?
Rose::DB::Object::Cached caches objects in plain-old (non-shared) memory. Under mod_perl, this means that each apache process has its own cache. You could, however, cache your objects on server startup. All of those cached objects would then be shared with each apache child process. This is most useful for read-only objects that you don't ever expect to change for the life of the server.
For more flexible caching options, check out Rose::DBx::Object::Cached::CHI.
As for your second question, Rose::DB::Object::Cached only reads from and writes to the cache on load() and save(). Most relationship methods use Manager queries to get objects and so will not read from the Rose::DB::Object::Cached cache.