I am upgrading an application on Service Fabric and one of the replicas is showing the following warning:
Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.ChangeRole(S)Duration', HealthState='Warning', ConsiderWarningAsError=false.
The api IStatefulServiceReplica.ChangeRole(S) on node _gtmsf1_0 is stuck. Start Time (UTC): 2018-03-21 15:49:54.326.
After some debugging, I suspect I'm not properly honoring a cancellation token. In the meantime, how do I safely force a restart of this stuck replica to get the service working again?
Partial results of Get-ServiceFabricDeployedReplica:
...
ReplicaRole : ActiveSecondary
ReplicaStatus : Ready
ServiceTypeName : MarketServiceType
...
ServicePackageActivationId :
CodePackageName : Code
...
HostProcessId : 6180
ReconfigurationInformation : {
PreviousConfigurationRole : Primary
ReconfigurationPhase : Phase0
ReconfigurationType : SwapPrimary
ReconfigurationStartTimeUtc : 3/21/2018 3:49:54 PM
}
You might be able to pipe that directly to Restart-ServiceFabricReplica. If that remains stuck, then you should be able to use Get-ServiceFabricDeployedCodePackage and Restart-ServiceFabricDeployedCodePackage to restart the surrounding process. Since Restart-ServiceFabricDeployedCodePackage has options for selecting random packages to simulate failure, just be sure to target the specific code package you're interested in restarting.
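For example, something along these lines (a sketch only; fabric:/MyApp, the manifest name MarketServicePkg, and the $partitionId/$replicaId variables are placeholders you would fill in from the Get-ServiceFabricDeployedReplica output above):

# Restart just the stuck replica; the IDs come from the deployed replica you listed
Restart-ServiceFabricReplica -NodeName "_gtmsf1_0" -PartitionId $partitionId -ReplicaOrInstanceId $replicaId

# If it stays stuck, restart the hosting process instead, targeting the exact
# code package rather than letting the cmdlet pick a random one
Restart-ServiceFabricDeployedCodePackage -NodeName "_gtmsf1_0" `
    -ApplicationName "fabric:/MyApp" `
    -ServiceManifestName "MarketServicePkg" `
    -CodePackageName "Code"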
I'm having a hard time deploying Graylog on Google Kubernetes Engine. I'm using this configuration https://github.com/aliasmee/kubernetes-graylog-cluster with some minor modifications. My Graylog server is up, but it shows this error in the interface:
Error message
Request has been terminated
Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.
Original Request
GET http://ES_IP:12900/system/sessions
Status code
undefined
Full error message
Error: Request has been terminated
Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.
Graylog logs show nothing in particular other than this:
org.graylog.plugins.threatintel.tools.AdapterDisabledException: Spamhaus service is disabled, not starting (E)DROP adapter. To enable it please go to System / Configurations.
at org.graylog.plugins.threatintel.adapters.spamhaus.SpamhausEDROPDataAdapter.doStart(SpamhausEDROPDataAdapter.java:68) ~[?:?]
at org.graylog2.plugin.lookup.LookupDataAdapter.startUp(LookupDataAdapter.java:59) [graylog.jar:?]
at com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
but at the end:
2019-01-16 13:35:00,255 INFO : org.graylog2.bootstrap.ServerBootstrap - Graylog server up and running.
The Elasticsearch health check is green, and there are no issues in the ES or Mongo logs.
I suspect a problem with the connection to Elasticsearch, though.
curl http://ip_address:9200/_cluster/health\?pretty
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
After reading the tutorial you shared, I was able to identify that kubelet needs to run with the argument --allow-privileged.
"Elasticsearch pods need for an init-container to run in privileged mode, so it can set some VM options. For that to happen, the kubelet should be running with args --allow-privileged, otherwise the init-container will fail to run."
On GKE it's not possible to customize or modify the kubelet's parameters/arguments. There is a feature request for this here: https://issuetracker.google.com/118428580, so it may be implemented in the future.
Also, if you modify the kubelet directly on the node(s), the master may reset the configuration, and the changes are not guaranteed to persist.
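If you want to confirm that this is what is blocking Elasticsearch, you can look at the init-container's status; a minimal sketch (the namespace, pod, and init-container names are assumptions, adjust them to the manifests you deployed):

# List the Elasticsearch pods and inspect why the init-container is failing
kubectl get pods --all-namespaces | grep -i elasticsearch
kubectl describe pod <elasticsearch-pod> -n <namespace>      # check Init Containers -> State/Reason
kubectl logs <elasticsearch-pod> -n <namespace> -c <init-container-name>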
I have BackupPC managed by Puppet, and I'm also using Foreman. Below is my init.pp file:
class backuppc::service {
  if $::operatingsystemcodename == 'squeeze' {
    service { 'backuppc' : ensure => running, hasstatus => false, pattern => '/usr/share/backuppc/bin/BackupPC' }
  } else {
    service { 'backuppc' : ensure => running, hasstatus => true }
  }
  service { 'apache2' : ensure => running }
}
When I run Puppet on the node, it throws this report in Foreman:
change from stopped to running failed: Could not start Service[backuppc]: Execution of '/etc/init.d/backuppc start' returned 1: Starting backuppc...2016-05-31 17:13:25 Another BackupPC is running (pid 6731); quitting...
The node is running Debian Squeeze 6.0.10.
Any help on this?
This ...
change from stopped to running failed: Could not start Service[backuppc]: Execution of '/etc/init.d/backuppc start' returned 1: Starting backuppc...2016-05-31 17:13:25 Another BackupPC is running (pid 6731); quitting...
... means that Puppet attempted to start BackupPC with /etc/init.d/backuppc start, which found that the process was already running. This indicates that Puppet is incorrectly determining the status of the BackupPC service.
I can't find a reference to a facter fact named operatingsystemcodename in the source. Does foreman provide this variable, or are you defining it elsewhere? Perhaps you meant lsbdistcodename instead?
If so, and $::operatingsystemcodename is undefined, your conditional will always fall through to the else branch, and the resource will be defined with hasstatus => true. Puppet will attempt to use /etc/init.d/backuppc status to check if the service is running. Therefore, if the init script's status action is broken in some way (by always returning a non-0 exit code, for example) puppet will attempt to start the service on every agent run.
So first things first, I'd verify that $::operatingsystemcodename returns 'squeeze' on the node in question.
If not, I'd check the exit code of /etc/init.d/backuppc status in its various states; it should return zero when the service is running and non-zero when it is stopped.
If on the other hand $::operatingsystemcodename is undefined, or some unexpected value, then I'd choose another expression to use in the if statement. In this case, you'll also want to verify that the pattern attribute is correct by inspecting the process table while the BackupPC service is running.
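Concretely, the checks above boil down to something like this on the node (plain shell; the pattern is the one from your manifest):

# What do the facts actually evaluate to on this node?
facter operatingsystemcodename
facter lsbdistcodename

# Does the init script's status action behave? It should exit 0 when the
# service is running and non-zero when it is stopped.
/etc/init.d/backuppc status; echo "exit code: $?"

# Does the pattern used in the squeeze branch match the process table?
ps aux | grep -v grep | grep /usr/share/backuppc/bin/BackupPC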
EDIT: Alternatively, you can provide a value for the status attribute, containing a custom command that Puppet will use to check the status of the BackupPC service. I would expect something like status => 'pgrep -f BackupPC' to work well enough. That said, Puppet already does almost exactly this in Ruby code, so I wouldn't expect it to solve your problem.
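For reference, that would look roughly like this in the manifest (same resource as above, just with an explicit status command):

service { 'backuppc':
  ensure => running,
  status => 'pgrep -f BackupPC',
}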
While a bit dated, this blog post covers some general tips for troubleshooting Puppet.
I know how to RESOLVE the problem, but I have no idea how to find the cause/source of the problem (e.g. which statement), or where (tables, tools, commands) to look.
Can I see something in the excerpt from db2diag.log?
2015-06-24-09.23.29.190320+120 ExxxxxxxxxE530 LEVEL: Error
PID : 15972 TID : 1 PROC : db2agent (XXX) 0
INSTANCE: db2inst2 NODE : 000 DB : XXX
APPHDL : 0-4078 APPID: xxxxxxxx.xxxx.xxxxxxxxxxxx
AUTHID : XXX
FUNCTION: DB2 UDB, data protection services, sqlpgResSpace, probe:2860
MESSAGE : ADM1823E The active log is full and is held by application handle
"3308". Terminate this application by COMMIT, ROLLBACK or FORCE
APPLICATION.
The db2diag.log shows you the agent ID (application handle) of the application causing the problem (3308).
Provided you are seeing this in real time (as opposed to looking at db2diag.log after the fact), you can:
Use db2top to view information about this connection
Query sysibmadm.snapstmt (looking at stmt_text and agent_id)
Use db2pd -activestatements and db2pd -dynamic (keying on AnchID and StmtUID)
Use good old get snapshot for application
There are also many third-party tools that can give you the information you are looking for.
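For example, with the handle from the diag entry (3308), something along these lines should show what the application is running and, as the message itself suggests, let you force it off as a last resort (sketched commands; replace <dbname> with your database name and double-check the syntax against your DB2 version):

db2 connect to <dbname>

# What statement is application handle 3308 currently running?
db2 "SELECT AGENT_ID, SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT FROM SYSIBMADM.SNAPSTMT WHERE AGENT_ID = 3308"
db2 "get snapshot for application agentid 3308"
db2pd -db <dbname> -activestatements -dynamic

# Last resort, per the ADM1823E message: terminate the connection holding the log
db2 "force application (3308)"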
Is CMS replication required for the ApplicationPool as well?
When I run the command Get-CsManagementStoreReplicationStatus, I get UpToDate : True for my domain, but it comes back False for my ApplicationPool.
UpToDate : True
ReplicaFqdn : ****.*****
LastStatusReport : 07-08-2014 11:42:26
LastUpdateCreation : 07-08-2014 11:42:26
ProductVersion : 5.0.8308.0
UpToDate : False
ReplicaFqdn : MyApplicationPool.****.*****
LastStatusReport :
LastUpdateCreation : 08-08-2014 15:16:03
ProductVersion :
UpToDate : False
ReplicaFqdn : ****.*****
LastStatusReport :
LastUpdateCreation : 08-08-2014 15:10:59
Am I on the right track? Have I created my ApplicationPool incorrectly?
Yes, UCMA applications running on an app server generally require access to the CMS, so replication should be enabled.
On the app server, you'd need to:
Ensure the "Lync Server Replica Replicator Agent" service is running
Run Enable-CsReplica in the management shell
Run Enable-CsTopology
Then run Invoke-CsManagementStoreReplication to force a replication
I've noticed that it often takes a while for the CMS to be replicated to the app server, so you might need to run Get-CsManagementStoreReplicationStatus a few times before you see UpToDate change to True.
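Put together, the sequence on the app server looks roughly like this (a sketch; the service display name is quoted as it appears in the Windows services list):

# Make sure the replica replicator agent is running
Get-Service -DisplayName "Lync Server Replica Replicator Agent" | Start-Service

# Enable the replica, refresh the topology, then kick off replication
Enable-CsReplica
Enable-CsTopology
Invoke-CsManagementStoreReplication

# Re-check until the application pool's entry shows UpToDate : True
Get-CsManagementStoreReplicationStatus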
I'm using MongoDB v2.2.2 on a single server (Ubuntu 12.04).
It crashed without writing anything to /var/log/mongodb/mongodb.log.
It seems to have crashed while logging (the last entry is cut off mid-character, and it is a normal query log entry).
I also checked syslog for memory issues (for example, a killed process), but couldn't find anything.
Then I found the following error in the mongo shell (from the db.printCollectionStats() command).
DLLConnectionResultData
{
"ns" : "UserData.DLLConnectionResultData",
"count" : 8215398,
"size" : 4831306500,
"avgObjSize" : 588.0794211065611,
"errmsg" : "exception: assertion src/mongo/db/database.cpp:300",
"code" : 0,
"ok" : 0
}
How can I figure out the problem?
Thank you,
I checked that line in the source code for 2.2.2 (see here for reference). That error is specifically related to enforcing quotas in MongoDB. You haven't mentioned whether you are enforcing quotas or what you have set the files limit to (the default is 8), but you could be running into that limit here.
First, I would recommend getting onto a more recent version of 2.2 (and upgrading to 2.4 eventually, but definitely 2.2.7+ initially). If you are using quotas, this fix, which went into 2.2.5, logs quota-exceeded messages (previously they were logged only at log level 1; the default is log level 0). Hence, if a quota violation is the culprit here, you would get an early warning.
If that is the root cause, then you have a couple of options:
After upgrading to the latest version of 2.2, if the issue happens repeatedly, file a bug report for the crash on 2.2
Upgrade to 2.4, verify that the issue still occurs, and file a bug (or add to the above report for 2.2)
In either case, turning off quotas in the interim would be the obvious way to prevent the crash.
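To see whether quotas are actually enabled (and what the files limit is) before deciding, something like this should work (assuming the default Ubuntu config path; adjust if mongod is started with command-line flags instead):

# Show the options mongod was started with (look for quota / quotaFiles)
mongo --eval "printjson(db.adminCommand('getCmdLineOpts'))"

# Or check the config file directly
grep -i quota /etc/mongodb.conf

# To disable quotas in the interim, remove or comment out the quota setting
# (or drop the --quota flag) and restart mongod.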