Ceph - CRUSH and failure domain changes?

If you deploy data into a Ceph pool, can the pool be altered (ideally online) without losing the data, and what limits are there on the alterations that can be made?
An example:
I set up Ceph on two hosts, and the pool has three OSDs on each host, with data stored using replication.
I then lob some data into the pool.
Some time down the road, I want to add another host (or two, or ten...) and switch the pool from replicated to erasure coding to reduce the storage overhead. Can this be done with the pool online and without losing data?
I suspect this is a simple question, but I haven't been able to find a clear answer on the limits and risks of making alterations to a pool.
And yes, I already know that officially you should start with three hosts :-)
Thanks

You cannot change a replicated pool into an erasure-coded pool. What you can do is create a different pool, copy all the data into the EC pool, and then rename it as you require (after renaming the replicated pool). If everything works as expected, you can then delete the first pool.
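A hedged sketch of that copy-and-rename route (pool and profile names here are hypothetical; PG counts, k/m values, and the failure domain must fit your cluster, and note that rados cppool is not atomic and has caveats such as not preserving omap data when the target is an EC pool, so quiesce clients first):

```shell
# 1. Create the destination EC pool (k=2, m=1 needs at least 3 hosts).
ceph osd erasure-code-profile set myprofile k=2 m=1 crush-failure-domain=host
ceph osd pool create mypool-ec 64 64 erasure myprofile
ceph osd pool set mypool-ec allow_ec_overwrites true   # needed for RBD/CephFS on EC

# 2. Copy the data while no clients are writing (cppool is not atomic).
rados cppool mypool mypool-ec

# 3. Swap the names so clients see the new pool under the old name.
ceph osd pool rename mypool mypool-old
ceph osd pool rename mypool-ec mypool

# 4. Once everything is verified, delete the old pool.
ceph osd pool delete mypool-old mypool-old --yes-i-really-really-mean-it
```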

Related

.Net Core connection pool exhausted (postgres) under heavy load spike, when new pod instances are created

I have an application which runs stably under heavy load, but only when the load increases gradually.
I run 3 or 4 pods at the same time, and it scales to 8 or 10 pods when necessary.
The standard load is about 4000 requests per minute (about 66 requests per second across the cluster, i.e. roughly 16 requests per second per pod).
There is a certain scenario where we receive a huge load spike (from 4k to 20k requests per minute). New pods are correctly created, and then they start to receive new load.
The problem is that in about 10-20% of cases a newly created pod struggles to handle the initial load: DB requests take over 5000 ms, pile up, and finally result in an exception that the connection pool was exhausted: The connection pool has been exhausted, either raise MaxPoolSize (currently 200) or Timeout (currently 15 seconds)
Screenshots from New Relic show that other pods are doing well, and also that after the initial struggle all pods handle the load without any issue.
Here is what I did when attempting to fix it:
Get rid of non-async calls. I had a few lines of blocking code inside async methods. I've changed everything to async; I no longer have non-async methods.
Removed long-running transactions. We had long-running transactions like this:
- beginTransactionAsync
- selectDataAsync
- saveDataAsync
- commitTransactionAsync
which I refactored to:
- selectDataAsync
- saveDataAsync // under the hood, EF Core uses a short-lived transaction
This helped a lot, but did not solve the problem completely.
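The original code is EF Core/C#; as a language-neutral sketch of the same refactor (Python with sqlite3 for illustration, table name "orders" is made up), the change replaces one transaction held across the read and the write with a short-lived transaction around the write only:

```python
import sqlite3

# Illustrative stand-in for the app's database; autocommit mode so we
# control transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders (status) VALUES ('new')")

# Before: one transaction held open across the read AND the write,
# pinning the pooled connection for the whole request.
def handle_request_long(conn):
    conn.execute("BEGIN")
    row = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()
    conn.execute("UPDATE orders SET status = 'processed' WHERE id = 1")
    conn.execute("COMMIT")
    return row

# After: the read runs outside any explicit transaction; only the write
# is wrapped in a short-lived transaction (roughly what EF Core's
# SaveChanges does under the hood).
def handle_request_short(conn):
    row = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()
    conn.execute("BEGIN")
    conn.execute("UPDATE orders SET status = 'done' WHERE id = 1")
    conn.execute("COMMIT")
    return row
```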
Ensure some connections are always open and ready. We added Minimum Pool Size=20 to the connection string to always keep at least 20 connections open.
This also helped, but sometimes pods still struggle.
Our pods start receiving traffic after the readiness probe returns success. The readiness probe checks the connection to the DB using the standard .NET Core health check.
Our connection string has MaxPoolSize=100;Timeout=15; set.
I am of course expecting that a new pod instance will initially need some spin-up time during which it operates at reduced capacity, but I do not expect a pod to suffocate and throw 90% errors this often.
Important note here:
I have 2 different DbContexts sharing the same connection string (and thus the same connection pool). Each DbContext accesses a different schema in the DB. This was done to have a modular architecture. The DbContexts never communicate with each other and are never used together in the same request.
My current guess is that when a pod is freshly created and immediately receives a huge load, it tries to open all 100 connections (this is visible in the DB open-sessions chart), which is too much at the beginning. What other reason could there be? How can I make sure that a pod operates at its optimal performance from the very beginning?
Final notes:
The DB processor is not at its max (about 30%-40% usage under the heaviest load).
Most of the SQL queries and commands are relatively simple SELECTs and INSERTs.
After the initial period, queries take no more than 10-20 ms each.
I don't want to patch the problem by increasing the number of connections in the pool beyond 100, because after the initial struggle the pods operate properly with around 50 connections in use.
I don't believe it is a connection leak, because in that case exceptions would also be thrown under normal load, after some time.
I use a scoped DbContext, which is disposed (so its connection is released to the pool) at the end of each request.
EDIT 25-Nov-2020
My current guess is that a newly created pod does not receive enough of either network bandwidth or CPU. This reasoning is supported by the fact that even requests which did NOT query the DB were struggling.
Question: is it possible that a newly created pod is granted insufficient resources (CPU or network bandwidth) at the beginning?
EDIT 2 26-Nov-2020 (answering Arca Artem)
App runs on k8s cluster on AWS.
App connects to DB using standard ADO.NET connection pool (max 100 connections per pod).
I'm monitoring DB connections and DB CPU (all within reasonable limits). Host CPU utilization is also around 20-25%.
I thought that when a pod starts and the /health endpoint responds successfully (it checks the DB connection with a simple SELECT probe), and the pod's max capacity is e.g. 200 rps, then the pod would be able to handle that traffic from the very first moment after the /health probe succeeded. However, from my logs I see that after the /health probe succeeds 4 times in a row in under 20 ms and traffic starts coming in, the pod spends its first few seconds taking more than 5 s per request (sometimes even 40 s per request).
I'm NOT monitoring hosts network.
At this point it's just speculation on my part without knowing more about the code and architecture, but one thing jumps out at me: the health check might not be using the normal code path that your other endpoints use, potentially leading to a false positive. If you have the option, a profiler could help you pinpoint exactly when and how this happens. If not, we can take educated guesses at where the problem might be. There could be a number of things at play here, and you may already be familiar with these, but I'm covering them for completeness' sake:
First of all, it's worth bearing in mind that connections in Postgres are very expensive (to put it simply, because each one is a fork of the database process), and your pods are consequently creating them in bulk when your app scales all at once. A relatively considerable time is needed to set each one up, and if you're creating them in bulk it adds up (how long depends on configuration, available resources, etc.).
Assuming you're using ASP.NET Core (because you mentioned DbContext), the initial request(s) will take the penalty of initialising the whole stack (creating the minimum required connections in the pool, initialising the ASP.NET stack, dependencies, etc.). Again, this all depends on how you structure your code and what your app actually does during initialisation. If your health endpoint connects to the DB directly (without utilising the connection pool), it skips the costly pool initialisation, leaving your initial requests to take the burden.
You're not observing the same behaviour when your load increases gradually, possibly because these things are usually an interplay between different components and generally a non-linear function of available resources, code behaviour, etc. Specifically, if it's just one new pod that spun up, it will require far fewer connections than, say, 5 new pods spinning up, and Postgres will be able to satisfy it much more quickly. Postgres is the shared resource here: creating 1 new connection is significantly faster than creating 100 new connections (5 pods x 20 min connections in a pool) for all pods waiting on a new connection.
There are a few things you can do to speed up this process with config changes, an external connection pooler like PgBouncer, etc., but they won't be effective unless your health endpoint represents the actual state of your pods.
Again, it's all based on assumptions, but if you're not doing so already, try using the DbContext in your health endpoint to ensure the pool is initialised and ready to take connections. As someone mentioned in the comments, it's worth looking at other types of probes that might be better suited to implementing this pattern.
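A minimal sketch of that idea (Python with a made-up Pool class standing in for the ADO.NET/Npgsql pool the app really uses): the readiness probe borrows a connection from the same pool the request path uses, so reporting "ready" implies the Min Pool Size connections have already been opened.

```python
import queue
import sqlite3

class Pool:
    """Tiny, illustrative stand-in for an ADO.NET-style connection pool."""
    def __init__(self, min_size: int, max_size: int):
        self._idle: queue.Queue = queue.Queue(maxsize=max_size)
        # Warm-up: eagerly open "Minimum Pool Size" connections, so the
        # first real requests don't all pay the connection-setup cost.
        for _ in range(min_size):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self):
        return self._idle.get()

    def release(self, conn):
        self._idle.put(conn)

def health_check(pool: Pool) -> bool:
    # Go through the *same* pool the app uses, not a fresh ad-hoc
    # connection -- otherwise the probe can pass while the pool is cold.
    conn = pool.acquire()
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    finally:
        pool.release(conn)

pool = Pool(min_size=20, max_size=100)
ready = health_check(pool)  # only True once the warm pool answers a query
```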
I found the ultimate reason for the above issue: insufficient CPU resources assigned to the pod.
The problem was hard to isolate because the New Relic APM CPU usage charts are calculated differently than expected (please refer to the New Relic docs). The real pod CPU usage vs. CPU limit can only be seen in the New Relic Kubernetes cluster explorer (which probably uses a different algorithm to chart CPU usage).
What is more, a pod needs a little extra CPU when it starts up. On top of that, the pods were starting because of high traffic, and there simply wasn't enough CPU to handle those requests.

Apache Spark Auto Scaling properties - Add Worker on the Fly

During the execution of a Spark program, let's say:
reading 10 GB of data into memory, doing just a filter and a map, and then saving to another storage.
Can I auto-scale the cluster based on the load, for instance adding more worker nodes to the program if it eventually needs to handle 1 TB instead of 10 GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on direct usage of a particular resource.
You have to remember that in general, Spark can handle data larger than memory just fine; memory problems are usually caused by user mistakes or vicious garbage-collection cycles. Neither of these can easily be solved by "adding more resources".
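For reference, dynamic allocation is enabled per job with configuration like the following (the keys are real Spark properties; the executor counts and script name are placeholders, and standalone/YARN deployments also need the external shuffle service):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  my_job.py
```

With this, Spark requests more executors when tasks queue up and releases idle ones, which is how newly added worker nodes get picked up by a running job.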
If you are using any of the cloud platforms to create the cluster, you can use their auto-scaling functionality; it will scale the cluster horizontally (the number of nodes will change).
Agree with #user8889543 - you can read much more data than you have memory.
As for adding more resources on the fly: it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and my cluster then has more cores and memory.
If you only have one job/program in the cluster, it is pretty simple. Just set
spark.cores.max
to a very high number and the job will always take all the cores of the cluster.
If you have several jobs in the cluster it becomes complicated, as mentioned in #user8889543's answer.
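For instance (the job name is a placeholder; 1000 simply exceeds any core count the cluster will realistically reach, so the job also grabs cores on workers that join later):

```shell
spark-submit --conf spark.cores.max=1000 my_job.py
```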

Will PgBouncer reuse postgresql session sequence cache?

I want to use a Postgres sequence with a cache: CREATE SEQUENCE serial CACHE 100.
The goal is to improve the performance of about 3000 SELECT nextval('serial'); calls per second, issued by ~500 connections/application threads concurrently.
The issue is that I am doing intensive autoscaling, so connections will occasionally be disconnected and reconnected, leaving "holes" of unused IDs in the sequence each time a connection is disconnected.
Well, the good news might be that I am using a PgBouncer Heroku buildpack with transaction pool mode.
My question is: will transaction pool mode solve the "holes" issue that I described? Will it reuse the session in such a way that the next application connection takes this session from the pool and continues using the sequence's cache?
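The "holes" behavior can be sketched with a toy model of a cached sequence (illustrative Python only, not Postgres internals; the class and names are made up): each connection pre-allocates a block of CACHE values, and whatever is left in the block when the connection dies is simply skipped.

```python
class CachedSequence:
    """Toy model of CREATE SEQUENCE ... CACHE n."""
    def __init__(self, cache):
        self._cache = cache
        self._next = 1  # next value the "server" will hand out

    def new_session(self):
        start = self._next
        self._next += self._cache  # server-side counter jumps a whole block
        return iter(range(start, start + self._cache))

seq = CachedSequence(cache=100)

s1 = seq.new_session()
used = [next(s1) for _ in range(3)]  # connection uses only 3 of its 100 ids
# ...connection drops here: ids 4..100 become permanent "holes"

s2 = seq.new_session()  # a reconnecting client gets a fresh block
first_id = next(s2)     # 101, not 4
```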
This depends on the setting of server_reset_query. If you have it set to DISCARD ALL, then sequence caches are discarded before a server connection is handed out to a client. But for transaction pooling, the recommended server_reset_query is empty, so you will be able to reuse sequence caches in that case. You can also use a different DISCARD command, depending on your needs.
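In pgbouncer.ini terms, the two behaviors look roughly like this (pool_mode and server_reset_query are real PgBouncer settings; which line you keep is exactly this trade-off):

```ini
[pgbouncer]
pool_mode = transaction

; With this, sequence caches (and all other session-local state) are
; thrown away before a server connection is reused:
;server_reset_query = DISCARD ALL

; Recommended for transaction pooling: no reset query, so a reused
; server connection keeps its sequence cache:
server_reset_query =
```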

In Oracle RAC, will an application be faster, if there is a subset of the code using a separate Oracle service to the same database?

For example, I have an application that does a lot of audit-trail writing. A lot. It slows things down. If I create a separate service on my Oracle RAC just for audit CRUD, would that help speed up my application?
In other words, I point most of the application at the main service listening on my RAC via SCAN, and point the subset of the application that does the audit-trail data manipulation at a separate service that listens against the same schema as the main one.
As with anything else, it depends. You'd need to be a lot more specific about your application, what services you'd define, your workloads, your goals, etc. Realistically, you'd need to test it in your environment to know for sure.
A separate service could allow you to segregate the workload of one application (the one writing the audit trail) from the workload of other applications by having different sets of nodes in the cluster running each service (under normal operation). That can help ensure that the higher priority application (presumably not writing the audit trail) has a set amount of hardware to handle its workload even if the lower priority thread is running at full throttle. Of course, since all the nodes are sharing the same disk, if the bottleneck is disk I/O, that segregation of workload may not accomplish much.
Separating the services on different sets of nodes can also affect how frequently a particular service gets blocks from the local node's buffer cache rather than requesting them from another node and waiting for them to be shipped over the interconnect. It's quite possible that an application constantly writing to log tables ends up spending quite a bit of time waiting for a small number of hot blocks (such as the right-most block in the primary key index for the log table) to be shipped back and forth between nodes. If all the audit records are written on just one node (or a smaller number of nodes), that hot block will always be available in the local buffer cache. On the other hand, if writing the audit trail involves querying the database to get information about a change, separating the workload may mean that blocks that were in the local cache (because they were just changed) now get shipped across the interconnect, and you could end up hurting performance.
Separating the services even if they're running on the same set of nodes may also be useful if you plan on managing them differently. For example, you can configure Oracle Resource Manager rules to give priority to sessions that use one service over another. That can be a more fine-grained way to allocate resources to different workloads than running the services on different nodes. But it can also add more overhead.
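A hedged sketch of the service setup with srvctl (database, service, and instance names here are all hypothetical; options follow the 12c+ syntax): the audit service is pinned to one preferred instance so its hot blocks stay in a single node's buffer cache, with another instance available for failover.

```shell
# Main application service, preferred on both instances:
srvctl add service -db orcl -service app_svc -preferred orcl1,orcl2

# Separate audit-trail service, pinned to one instance:
srvctl add service -db orcl -service audit_svc -preferred orcl1 -available orcl2
srvctl start service -db orcl -service audit_svc
```

The application's audit code would then connect with audit_svc as the SERVICE_NAME while everything else keeps using app_svc.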

How does mongodb replica compare with amazon ebs?

I am new to MongoDB and Amazon EC2.
It seems to me that Mongo replicas are there to: 1/ avoid data loss and 2/ make reads and serving faster.
In Amazon they have this EBS thing. From what I understand it is a global persistent storage, like Dropbox for instance.
So is there a need for replicas if Amazon abstracts away the need for them with EBS?
Thanks in advance
Thomas
Let me clarify a couple of things.
EBS is essentially a SAN volume, if you are used to working with existing technologies. It can be attached to one instance, but it still has limited IO throughput. Using RAID can help maximize the IO; provisioned IOPS can help you maximize the throughput.
Ideally, however, with MongoDB you want enough memory that indexes can be accessed entirely in memory; performance drops if the disk needs to be hit.
Mongo has replica sets, which are primarily used for failover and replication (you can send reads to a secondary, but all writes need to hit the primary), and sharding, which is used to split a dataset to increase performance. You will still need these even if you are using EBS for storage.
Replicas are there not just for storage redundancy but also for server redundancy. What happens if your MongoDB server (which uses an EBS volume) suddenly disappears because, for example, the host on which it sits fails? You would need to do a whole bunch of things: clone a new instance to replace it, attach the volume to that instance, reroute traffic to it, etc. Mongo's replica sets mean you don't have to do that. They keep working even if one member fails, so you have essentially zero downtime.
Additionally, it's one more layer of redundancy. You can only trust EBS so far - what if AWS has a bug that erases your volume or makes it unavailable for an unacceptably long time? With replica sets you can replicate your data across availability zones or even to a completely different cloud provider.
Replica sets also let you read from multiple nodes, so you can theoretically increase your read throughput once you've maxed out what the EBS connection gives you on one instance.
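As a sketch in the mongo shell (host names are hypothetical), a three-member replica set spread across zones, with clients opting into secondary reads via the connection string:

```javascript
// Run against one member to form the set:
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-a.internal:27017" },  // e.g. us-east-1a
    { _id: 1, host: "mongo-b.internal:27017" },  // e.g. us-east-1b
    { _id: 2, host: "mongo-c.internal:27017" }   // e.g. us-east-1c
  ]
})

// Clients then opt in to reading from secondaries with a read preference:
// mongodb://mongo-a.internal,mongo-b.internal,mongo-c.internal/?replicaSet=rs0&readPreference=secondaryPreferred
```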