Spring Batch remote partitioning: remote step performance

We are using remote partitioning in our POC, where we process around 20 million records. To process these records, each slave needs some static metadata, which is around 5000 rows. Our current POC uses EhCache: the slave loads this metadata from the db once and puts it in the cache, so subsequent calls just read it from the cache for better performance.
Now, since we are using remote partitioning, each slave has approximately 20 MDPs/threads, and each message listener first calls the db to get the metadata. So basically 20 threads are hitting the db at the same time on each remote machine. We have 2 machines for now, but this will grow to 4.
My question is: is there a better way to load this metadata only once, for example before the job starts, and make it accessible to all remote slaves?
Or can we use a step listener in the remote step? I don't think this is a good idea, as it will be executed for each remote step execution, but I would like expert thoughts on this.

You could set up an EhCache server running as a separate application, or use another caching product instead, such as Hazelcast. If commercial products are an option for you, Coherence might also work.
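Whichever cache product is chosen, the problem of 20 listener threads loading the same metadata at once on one machine can be avoided by guarding the load with a lock. The following is only a minimal sketch of that idea, written in Python for brevity. `MetadataCache` and `fake_db_loader` are hypothetical names, and in a Spring application the equivalent would more likely be a singleton bean shared by all listeners.

```python
import threading


class MetadataCache:
    """Loads static metadata at most once per process, even under concurrent access."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical callable that queries the db
        self._lock = threading.Lock()
        self._data = None

    def get(self):
        # Fast path: metadata is already loaded, no locking needed.
        if self._data is not None:
            return self._data
        with self._lock:
            # Re-check inside the lock: another thread may have loaded it meanwhile.
            if self._data is None:
                self._data = self._loader()
            return self._data


def demo(num_threads=20):
    calls = []

    def fake_db_loader():
        calls.append(1)                  # count how often the "db" is hit
        return {i: f"row-{i}" for i in range(5000)}

    cache = MetadataCache(fake_db_loader)
    threads = [threading.Thread(target=cache.get) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Returns (number of db loads, number of cached rows).
    return len(calls), len(cache.get())
```

With this guard, each machine hits the db once per process rather than once per thread. A shared cache server, as suggested above, would still be needed to reduce that to one load for all machines.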

Related

Why does fetching from PostgreSQL with Hibernate take extremely long on AWS?

I have an environment on AWS with an RDS PostgreSQL 9.6 instance and a Spring Boot v1.2.7.RELEASE application running on an EC2 instance. Now I want to fetch about 10,000 entries from a table of the PostgreSQL DB, which takes about 1 minute. If I do this locally, it takes about a second to fetch the entities.
I would expect the fetch on AWS to take only slightly longer than locally, something like 2 or 3 seconds.
Actually the request takes 1 minute.
To determine whether the problem is caused by a bad query, I ran
explain analyze SELECT * FROM view_name where uuid ='4e663553-4271-4d7d-8de9-d7b746787cc6' which tells me that the execution of the query itself takes just 300ms.
Therefore I think the performance issue comes from transmitting the data from the DB to the application. But I don't know how to evaluate this, or how to improve it.
To reproduce this, I guess you need an AWS environment with an RDS instance and an application that just uses Hibernate to fetch a table with approximately 10,000 entries from it.
EDIT 1
Persistence and DataSource Configuration.
We are using Hibernate with the following configuration:
hibernate.default_batch_fetch_size=8
hibernate.jdbc.fetch_size=10
hibernate.jdbc.batch_size=8
hibernate.cache.use_query_cache=true
hibernate.cache.use_second_level_cache=true
hibernate.cache.region.factory_class=org.hibernate.cache.redis.SingletonRedisRegionFactory
hibernate.cache.use_structured_entries=true
hibernate.max_fetch_depth=10
hibernate.transaction.factory_class=org.hibernate.engine.transaction.internal.jdbc.JdbcTransactionFactory
javax.persistence.sharedCache.mode=ENABLE_SELECTIVE
I should also note that we use ElastiCache Redis with version 2.8.24.

Can I keep two mongo databases synced?

I have an app that can run in offline mode. When offline, it uses a local mongo database; when it has a data connection, it uses a remote mongo database.
Is there an easy way to sync these two databases and make sure they both have the union of their collections and documents?
EDIT: Effectively there are two databases that could both have insertions and deletions happening on them that aren't happening on the other. At fixed points in time I would like to have both databases show the union of them both.
For example, over a period of time:
DB1.insert(A)
DB1.insert(B)
DB2.insert(C)
DB1.remove(A)
RUN SYNC
DB1 = DB2 = {B, C}
EDIT2: I've been doing some reading. It's not the intended purpose, but could the local databases be set up as replica set members (slaves) of the remote and used that way? The problem is that, as far as I understand, replica set hosts must be accessible by way of resolvable DNS. I'm not sure how the remote could reach a local host.

You could use a replica set, but MongoDB doesn't support master-master replication. Let's assume you have a setup like this:
two nodes with priority 1 which will be used as remote servers
single arbiter to ensure majority if one of remotes dies
5 local dbs with priority set as 0
When your application goes offline, its local db will stay secondary, so you won't be able to perform writes. When you go online, it will sync changes from the remote dbs, but you still need some way of syncing local changes. One way of dealing with this could be a local fallback db which is used for writes while you are offline. When you go online, you push all new records to the master. Dealing with updates could be a little trickier, but it is doable.
Another problem is that this won't scale if you need to add more applications. If I remember correctly, there is a limit of 12 nodes per replica set. For a small cluster, DNS resolution could be solved by using ssh tunnels.
Another way of dealing with the problem could be a small RESTful service and document timestamps. Whenever the app is online, it can periodically push local inserts to the remote db and pull data from it.
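The timestamp-based approach can be sketched roughly as follows. This is only an illustration in Python, not MongoDB code. It assumes each database keeps a local log of `(timestamp, op, key)` entries, and that clocks are comparable across both sides. The names `apply_op`, `sync` and `example` are hypothetical, and documents are modelled as plain keys in a set. Recording removals in the log is what keeps a deleted document from reappearing when the union is taken.

```python
def apply_op(db, log, ts, op, key):
    """Apply an insert or remove to a local store and record it in that store's log."""
    if op == "insert":
        db.add(key)
    else:
        db.discard(key)
    log.append((ts, op, key))


def sync(db1, db2, log1, log2):
    """Return the merged state both stores should hold after a sync."""
    latest = {}
    # Replay both logs in timestamp order; the latest operation per key wins.
    for ts, op, key in sorted(log1 + log2):
        latest[key] = op
    removed = {key for key, op in latest.items() if op == "remove"}
    # Union of both stores, minus everything whose latest operation is a removal.
    return (db1 | db2) - removed


def example():
    db1, db2, log1, log2 = set(), set(), [], []
    apply_op(db1, log1, 1, "insert", "A")
    apply_op(db1, log1, 2, "insert", "B")
    apply_op(db2, log2, 3, "insert", "C")
    apply_op(db1, log1, 4, "remove", "A")
    merged = sync(db1, db2, log1, log2)
    db1, db2 = set(merged), set(merged)   # both sides adopt the merged state
    return db1, db2
```

Running `example()` replays the sequence from the question and leaves both stores holding B and C. Updates to existing documents would need a per-document timestamp comparison on top of this.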

mybatis memcached cache failover

We're considering using memcached as a distributed cache for mybatis (the MyBatis-Memcached integration module).
Does anyone know how to configure it so that the memcached servers are not single points of failure? Currently, if I configure multiple memcached servers, the cache requests are hashed out to the servers, but each server is a single point of failure (i.e. if one goes down, the app will fail).
If one of the memcached servers goes down, we would like the mybatis client to treat that cache as lost, continue working, and build the cache back up on another available memcached server.
Does anyone have experience with this?

What is the memcached server?

I'm a beginner learning memcached. The memcached server confuses me most. Can I think of it as a single server computer, just like a web server? I'm also confused about the relationship between the memcached server and the client: are they located on different computers?

I agree with most of what @phihag has answered, but I must clarify some things.
Memcached stores data according to a key (phihag called it an id, not to be confused with database ids). Data can be of various sizes so you can store small bits (like 1 record pulled from the database) or you can store huge chunks of data (like hundreds of records, or entire finished html pages).
Memcached is not typically used on the same machine as the application server. The reason is that it is designed to be used via TCP (it would be accessible via sockets if it were designed to work on the same server), and it was designed as a pooling server.
The pooling part is interesting: you can have 10 machines running Memcached, each allocating a maximum of 10GB of RAM for this purpose. 10*10 = 100GB of RAM space.
When you write a value into Memcached, only one of the servers stores it; the client picks that server, typically by hashing the key. When you try to read a value from Memcached, only the server that stored it will send it to you.
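How a client might map a key to one server in the pool can be sketched as follows. This is a simplified illustration using plain modulo hashing. The server addresses and the `server_for` helper are hypothetical, and real memcached clients often use consistent hashing instead.

```python
import hashlib

# Hypothetical pool of memcached servers.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]


def server_for(key, servers):
    """Pick the server responsible for a key by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 4 bytes of the digest as an integer, then wrap around the pool size.
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


def distribution(num_keys, servers):
    """Count how many of num_keys sample keys land on each server."""
    counts = {server: 0 for server in servers}
    for i in range(num_keys):
        counts[server_for(f"key:{i}", servers)] += 1
    return counts
```

The same key always maps to the same server, so reads go to the server that stored the value. With plain modulo hashing, changing the server list remaps most keys, which is one reason consistent hashing is popular.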
So you can indeed put the database, memcached, the application and the file server all on the same machine, and typically you do that for your development sandbox. But you can also put each on a separate machine, or use any combination of the two.
If you only need one Memcached server you will probably be OK with hosting it on the same machine as the application code.
If you start using a front-end cache server such as varnish or you configure NginX as a front-end cache server, you will have to configure some Memcached servers to store the data that these front-end cache servers are caching.
If you distribute your database across multiple servers and move your file servers to a CDN, that means your application handles a lot of data in a short period of time, so you'll need a lot of RAM that may not be available on a single application server.
And since extending the memory pool for Memcached is as easy as adding the IP of the new server to the list, you will be scaling horizontally across many servers (which is Memcached's true typical use).

The memcached server is a program which manages the data that memcached stores (not to be confused with a machine, which may also be called a server). In theory, it can run on any computer. However, it is typically run on the same machine that the main application runs on.
The application then uses its memcached client to talk to the memcached server and ask for cached content. This is faster than querying data from a traditional database because:
A memcached server just maps IDs to values, and never needs to scan an entire table.
The memcached protocol is simpler. The server doesn't need to parse SQL or anything similar, and the client doesn't need to craft it.
Since memcached does not require the reliability of a database (think of backups, fault isolation, clustering, security etc.), it can be run on the same machine that the application runs on. While you could run a database on the same machine that the application runs on, doing so is frowned upon for the above reasons.

What are possible reasons for memcached to be significantly slower on a remote server?

I have a PHP/Apache server with 12GB of RAM. I have been running Memcached on the same machine with 6GB of allotted RAM.
I wanted to run Memcached on a separate server (same datacenter, vlan, subnet), just as I do for MySQL. I set up a separate, identical server with the same memcached configuration.
Page loads take roughly 10x longer when using Memcached on the remote server than when running it locally. I have primed both caches, and the remote setup is still 10x slower.
I'm having trouble troubleshooting this.
You're loading 500kb of data per page load, all in small keys? How many keys per page load is that?
Latency to a remote server is very low, but running many round trips is still a bad idea. Memcached clients support multi-get operations, where you batch many keys into a single request/response with much lower total latency.

Just for info, DDR3-1333 is about 10667 MB/s.
If you have, let's say, Gigabit Ethernet, which tops out at about 125 MB/s, I guess that can explain some of the problems you are experiencing...
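A rough back-of-envelope estimate shows why round trips tend to matter more than raw bandwidth here. The key count (1000) and the round-trip time (0.5 ms) below are hypothetical, since the question gives neither. The payload is assumed to be 500 kilobytes and the link to be Gigabit Ethernet.

```python
def sequential_ms(num_keys, rtt_ms):
    """Total latency when every key is fetched with its own round trip."""
    return num_keys * rtt_ms


def batched_ms(num_keys, rtt_ms, batch_size):
    """Total latency when keys are fetched with multi-get, batch_size keys per request."""
    batches = -(-num_keys // batch_size)   # ceiling division
    return batches * rtt_ms


def transfer_ms(payload_bytes, link_bits_per_s):
    """Time spent purely on moving the payload over the link."""
    return payload_bytes * 8 * 1000 / link_bits_per_s


# Hypothetical numbers: 1000 small keys per page load, 0.5 ms round trip.
one_by_one = sequential_ms(1000, 0.5)            # 500.0 ms
multi_get = batched_ms(1000, 0.5, 1000)          # 0.5 ms
wire_time = transfer_ms(500_000, 1_000_000_000)  # 4.0 ms
```

Under these assumptions the transfer itself costs only a few milliseconds, while fetching the keys one at a time costs hundreds. That is why batching with multi-get is usually the first thing to try.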