Deleting a Service Fabric Actor doesn't appear to clear state on D: (Temp Storage drive)

We appear to have encountered an issue within a Service Fabric cluster whereby the state of an Actor service has grown to the point where the Temporary Storage (D:) drive has filled up. As I understand it, Actor state and reliable collection state is persisted to disk on this drive. For one service we had amassed 190 GB of space taken up by the ActorStateStore file.
We were working on the assumption that space was taken up by a lot of stale actors that we no longer needed in our system, so we added a call to the service to purge the unwanted actors using the mechanism detailed here (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-actors-lifecycle).
// Deletes the actor and removes all of its state from the state store.
ActorId actorToDelete = new ActorId(id);

IActorService myActorServiceProxy = ActorServiceProxy.Create(
    new Uri("fabric:/MyApp/MyService"), actorToDelete);

await myActorServiceProxy.DeleteActorAsync(actorToDelete, cancellationToken);
We cycled through the full list of actors in state and called delete on everything we didn't want to keep, which was the vast majority of them, expecting this to reduce our space consumption on the temporary storage drive. However, the space did not seem to free up. Does anyone know what the lifecycle of this process is? Is there a lead time before we should expect the drive space to show as free?
Is there a mechanism that can be used to free up this drive space, or what is the best way to perform housekeeping to remove old actors that we are no longer interested in?
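For reference, the purge loop was essentially the following sketch (ShouldKeep stands in for our retention rule, and it assumes a single-partition service; our real code enumerates the partitions first):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;
using Microsoft.ServiceFabric.Actors.Query;

internal static class ActorPurger
{
    // Stand-in for our retention rule; everything else gets deleted.
    private static bool ShouldKeep(ActorId id) => false;

    public static async Task PurgeAsync(CancellationToken cancellationToken)
    {
        // Assumes a single partition for brevity.
        IActorService service = ActorServiceProxy.Create(
            new Uri("fabric:/MyApp/MyService"), 0L);

        ContinuationToken continuationToken = null;
        do
        {
            // Page through the actors the partition knows about.
            PagedResult<ActorInformation> page =
                await service.GetActorsAsync(continuationToken, cancellationToken);

            foreach (ActorInformation actor in page.Items)
            {
                if (!ShouldKeep(actor.ActorId))
                {
                    // Removes the actor and all of its persisted state.
                    await service.DeleteActorAsync(actor.ActorId, cancellationToken);
                }
            }

            continuationToken = page.ContinuationToken;
        } while (continuationToken != null);
    }
}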

Related

Ceph: What happens when enough disks fail to cause data loss?

Imagine that we have enough disk failures in Ceph to cause actual loss of data (e.g. all 3 replicas fail in a 3-replica pool, or more than m OSDs fail in k+m erasure coding). What happens now?
Is the cluster still stable? That is: we've obviously lost that data, but will other data, and new data, still work well?
Is there any way to get a list of the lost object ids?
In our use case, we could recover the lost data from offline backups. But, to do that, we'd need to know which data was actually lost - that is, get a list of the object ids that were lost.
Answer 1: what happens?
Ceph distributes your data in placement groups (PGs). Think of them as shards of your data pool. By default a PG is stored in 3 copies across your storage devices. Again by default, a minimum of 2 copies must be known by Ceph to exist for the PG to remain accessible. Should only 1 copy be available (because 2 OSDs, i.e. disks, are offline), writes to that PG are blocked until the minimum number of copies (2) is online again. If all copies of a PG are offline, your reads will be blocked until one copy comes online. All other PGs remain accessible as long as they are online with enough copies.
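You can inspect these thresholds, and which PGs are currently affected, on a live cluster; "mypool" below is a placeholder pool name:

ceph osd pool get mypool size        # number of copies kept (default 3)
ceph osd pool get mypool min_size    # copies required to allow I/O (default 2)
ceph health detail                   # lists degraded/undersized/down PGs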
Answer 2: what is affected?
You are probably referring to the S3-like object storage. This is modelled on top of the RADOS object store, which is the core storage layer of Ceph. Problematic PGs can be traced and associated with the affected RADOS objects. There is documentation about identifying blocked RadosGW requests and another section about getting from defective PGs to the RADOS objects.
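To get at the "list of lost object ids" part concretely: once health reporting names the affected PGs, each PG can be asked for its unfound objects. The PG id below is a placeholder, and subcommand names vary slightly between releases:

ceph health detail                    # names the PGs with unfound objects
ceph pg 2.4 list_unfound              # object ids Ceph knows it has lost
                                      # (older releases: list_missing)
ceph pg 2.4 mark_unfound_lost revert  # give up and roll back, or 'delete'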

File replication

I have a Web API stateless service that receives a file from a client and transfers it to an actor service (for deferred ETL operations). File size is limited to 20 MB.
Is it a good idea to transfer the file directly (in-memory, as a byte array) from one service to another? Or is there a feature like file-based state to replicate the file within the cluster for further processing?
P.S. It is impossible (due to legal reasons) to upload it anywhere before processing.
P.P.S. SF cluster is on-premises installation.
It is not a good idea to do that, for a few reasons:
1 - If you store your files in Reliable Collections, it will make your collections too big and slow down replication, as every update to your collections is replicated to other nodes; it also becomes expensive (in time) to move services around the cluster.
2 - If you don't store it in any collection and leave it in memory, Service Fabric could move your service around the cluster and you risk losing the data.
3 - When you upload the file, you should return a confirmation to the user quickly rather than making them wait until the processing is complete; keeping server resources locked up for that long is a bad idea.
4 - Saving it to disk won't replicate the file, and if your service moves to another node, you lose access to the file.
There are many reasons; if you can't save it somewhere else (like a file share), these are the risks you take.
If you still prefer to go this route, I would suggest the following (sketched in code below):
Send the content to the actor that will process it; the actor saves it to its actor state.
Register a timer (or a reminder, depending on your requirements) in the actor to trigger the processing of this file.
After processing, deactivate the timer and then save the output somewhere.
Deactivate the actor and delete the state.
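A minimal sketch of that flow, assuming a reminder rather than a timer (a reminder survives deactivation and node moves); the interface name, state names and RunEtlAsync are illustrative, not a real API:

using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Runtime;

// Hypothetical actor interface for this flow.
public interface IFileEtlActor : IActor
{
    Task AcceptFileAsync(byte[] content);
}

[StatePersistence(StatePersistence.Persisted)]
internal class FileEtlActor : Actor, IFileEtlActor, IRemindable
{
    public FileEtlActor(ActorService service, ActorId actorId)
        : base(service, actorId) { }

    public async Task AcceptFileAsync(byte[] content)
    {
        // Persist the payload in the actor state so it is replicated
        // and survives the actor moving to another node.
        await StateManager.SetStateAsync("file", content);

        // Fire once, shortly after the upload call returns to the caller.
        await RegisterReminderAsync("process-file", null,
            dueTime: TimeSpan.FromSeconds(5),
            period: TimeSpan.FromMilliseconds(-1)); // -1 ms = fire only once
    }

    public async Task ReceiveReminderAsync(
        string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
    {
        if (reminderName != "process-file") return;

        byte[] content = await StateManager.GetStateAsync<byte[]>("file");
        await RunEtlAsync(content); // your deferred ETL work goes here

        // Housekeeping: drop the reminder and the stored file.
        await UnregisterReminderAsync(GetReminder("process-file"));
        await StateManager.TryRemoveStateAsync("file");
    }

    private Task RunEtlAsync(byte[] content) => Task.CompletedTask; // stub
}

The cleanup at the end of ReceiveReminderAsync mirrors the last two steps above, so the state store does not keep growing with processed files.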
Using the actor state to store each file makes this more flexible: you might register the file in your actor on one node, and if the actor is moved, the actor state will still be available when it gets activated on another node.
Keep in mind that your cluster's nodes may fail, so you should not rely on their memory or disks to save state, unless it is replicated elsewhere with different reliability guarantees, like Azure Storage.

Is it practical to keep logging messages in a group communication service or Paxos?

In the case of a network partition or node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough for it to regain the current state.
My question is: if the partitioned/crashed node takes a really long time to rejoin the cluster, then eventually the logs will overflow. This seems to be a very practical issue, yet none of the papers seem to talk about it. Is there an obvious solution to this that I am missing? Or is my understanding incorrect?
You don't really need to remember the whole log. Imagine, for example, that the state you were synchronizing between the nodes was something like an SQL table with rows of the form (id: int, name: string), and the commands written into the logs were of the form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000", ...
Once such commands are committed, all you really care about is the final table. When a node that was offline for a long time comes back online, it only needs to download the table plus the few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out chapter 7 of the Raft paper for more info.
There are a few potential solutions to the infinite log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.
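As a toy sketch of the table example above (not a real consensus implementation), compaction amounts to dropping the already-applied prefix of the log once the table reflects it:

using System.Collections.Generic;

// Commands mutate a key->name table; once applied, the table itself
// is the snapshot and the applied prefix of the log is redundant.
public record Command(string Op, int Id, string Name = null);

public class CompactingLog
{
    private readonly List<Command> log = new();
    private readonly Dictionary<int, string> table = new(); // the state
    private int applied;                                    // log index reached

    public void Append(Command cmd) => log.Add(cmd);

    public void ApplyCommitted()
    {
        for (; applied < log.Count; applied++)
        {
            var c = log[applied];
            if (c.Op == "insert" || c.Op == "set") table[c.Id] = c.Name;
            else if (c.Op == "delete") table.Remove(c.Id);
        }
    }

    // Compaction: everything before `applied` is reflected in the table.
    public void Compact()
    {
        log.RemoveRange(0, applied);
        applied = 0;
    }

    // A node that was offline too long downloads the table (snapshot)
    // and replays only whatever remains in the log.
    public (Dictionary<int, string> Snapshot, List<Command> Tail) StateForLaggingNode()
        => (new Dictionary<int, string>(table), new List<Command>(log));
}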

In Oracle RAC, will an application be faster if a subset of the code uses a separate Oracle service to the same database?

For example, I have an application that does lots of audit-trail writing. Lots. It slows things down. If I create a separate service on my Oracle RAC just for audit CRUD, would that help speed up my application?
In other words, I point most of the application at the main service listening on my RAC via SCAN. I then take the subset of my application, the audit-trail data manipulation, and point it at a separate service that connects to the same database and schema as the main service.
As with anything else, it depends. You'd need to be a lot more specific about your application, what services you'd define, your workloads, your goals, etc. Realistically, you'd need to test it in your environment to know for sure.
A separate service could allow you to segregate the workload of one application (the one writing the audit trail) from the workload of other applications by having different sets of nodes in the cluster running each service (under normal operation). That can help ensure that the higher priority application (presumably the one not writing the audit trail) has a set amount of hardware to handle its workload even if the lower priority application is running at full throttle. Of course, since all the nodes share the same disks, if the bottleneck is disk I/O, that segregation of workload may not accomplish much.
Separating the services on different sets of nodes can also impact how frequently a particular service gets blocks from the local node's buffer cache rather than requesting them from another node and waiting for them to be shipped over the interconnect. It's quite possible that an application that is constantly writing to log tables might end up spending quite a bit of time waiting for a small number of hot blocks (such as the right-most block in the primary key index for the log table) to get shipped back and forth between different nodes. If all the audit records are being written on just one node (or a smaller number of nodes), that hot block will always be available in the local buffer cache. On the other hand, if writing the audit trail involves querying the database to get information about a change, separating the workload may mean that blocks that used to be in the local cache (because they were just changed) now have to be shipped across the interconnect, so you could end up hurting performance.
Separating the services even if they're running on the same set of nodes may also be useful if you plan on managing them differently. For example, you can configure Oracle Resource Manager rules to give priority to sessions that use one service over another. That can be a more fine-grained way to allocate resources to different workloads than running the services on different nodes. But it can also add more overhead.
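To make the mechanics concrete, creating and targeting such a service looks roughly like this; all names are placeholders, and the srvctl syntax shown is the pre-12c short form (newer releases use -db, -service, -preferred, -available):

srvctl add service -d ORCL -s audit_svc -r orcl2 -a orcl1   # preferred / available instances
srvctl start service -d ORCL -s audit_svc

The audit code then connects through a tnsnames.ora entry that names that service, while the rest of the application keeps using the main service:

AUDIT_SVC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = myrac-scan)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = audit_svc))
  )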

Why does the sequential write to a journal file speed up if it is on a different file system?

As per MongoDB documentation at http://docs.mongodb.org/manual/core/journaling,
To speed the frequent sequential writes that occur to the current journal file, you can ensure that the journal directory is on a different filesystem
So storing the journal file on a different file system speeds things up. Is it because two different hard disk spindles are at work? I just want to understand the mechanics of this optimization tip.
Yes. If you are using physical rotating hard drives, there is a significant performance benefit from separating the journal activity onto a separate (preferably dedicated) physical drive.
The benefits are not the same if you're using SAN hardware, and they are lessened to an extent by the larger drive caches available in modern hard drives. It's a different story again with SSDs.
The main factor with spinning disks is seek time - the time that it takes for the read/write head to get to the right part of the disk. Hard disks are arranged with circular tracks. To get to a specific block on the disk, the head moves to the right track, and the disk spins around to the right place (the disks keep spinning of course, so it's simply a matter of waiting for the right place to come around).
This doesn't take much time, but when it's happening a lot it adds up.
When you have the primary activity and the journal activity on the same drive, the head has to rapidly move between the two (many, really) locations that the system needs to look at.
If you have your journalling on another physical drive, then the head on that drive can be almost (or, more accurately, relatively) static, with the ability to more rapidly access the correct track/location. Meanwhile the other drive (with the primary activity on it) will also be more efficient, because its head will not be constantly seeking back to where the journal entries are being written in between the other activities required to keep the database running.
This benefit applies to most database systems and many other applications where there is a constant sequential writing to disk going on at the same time as other mixed disk activity.
You don't get the same profile if you're using a SAN, because even if the volumes appear to be separate file systems, the storage is actually likely to be striped across many drives which are both cached and shared.
SSDs have a different profile again, because there is no physical seek time.
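For what it's worth, the usual way to apply the tip from the question on Linux is to mount the dedicated drive and symlink the journal directory onto it; the device, paths and mongod user below are placeholders that vary by distribution:

sudo mkfs.ext4 /dev/sdb1                     # dedicated journal drive
sudo mkdir -p /mnt/mongo-journal
sudo mount /dev/sdb1 /mnt/mongo-journal

sudo service mongod stop                     # never touch dbPath while mongod runs
sudo mv /var/lib/mongo/journal/* /mnt/mongo-journal/
sudo rm -r /var/lib/mongo/journal
sudo ln -s /mnt/mongo-journal /var/lib/mongo/journal
sudo chown -R mongod:mongod /mnt/mongo-journal
sudo service mongod start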