RDS connections stuck in idle state - Airflow Celery worker - PostgreSQL

I am using Airflow 1.10.9 with Celery workers. I have DAGs that run whenever a task comes in: each run spins up a new EC2 instance, which connects to RDS based on the task logic. However, the EC2 instance holds the database connection even when no task is running, and it keeps holding it until Auto Scaling scales the instance down.
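For illustration, here is a minimal sketch (using psycopg2 with placeholder connection details, not the actual DAG code) of the pattern I would expect to release the RDS session once the task's work is done:

import psycopg2

# Hypothetical sketch: endpoint, database and credentials are placeholders.
conn = psycopg2.connect(
    host="my-rds-endpoint.rds.amazonaws.com",
    dbname="mydb",
    user="app_user",
    password="secret",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # the task-specific queries would go here
        print(cur.fetchone())
finally:
    # Without an explicit close, the backend session stays "idle" on RDS
    # until the worker process exits or the instance is scaled down.
    conn.close()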
RDS Details -
Class : db.t3.xlarge
Engine : PostgreSQL
I have checked the RDS logs, but found nothing useful beyond this:
LOG: could not receive data from client: Connection reset by peer
Here is a summary of the RDS connections:
 state  |     wait_event      | wait_event_type | count
--------+---------------------+-----------------+-------
        | AutoVacuumMain      | Activity        |     1
        | BgWriterHibernate   | Activity        |     1
        | CheckpointerMain    | Activity        |     1
 idle   | ClientRead          | Client          |   525
        | LogicalLauncherMain | Activity        |     1
        | WalWriterMain       | Activity        |     1
 active |                     |                 |     1
All of these connections are from the Celery workers.
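For reference, the summary above can be reproduced with a query against pg_stat_activity; a minimal sketch using psycopg2 (connection details are placeholders):

import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect(
    host="my-rds-endpoint.rds.amazonaws.com",
    dbname="mydb",
    user="app_user",
    password="secret",
)
with conn, conn.cursor() as cur:
    # Group the current backends by state and wait event.
    cur.execute(
        """
        SELECT state, wait_event, wait_event_type, count(*)
        FROM pg_stat_activity
        GROUP BY state, wait_event, wait_event_type
        ORDER BY count(*) DESC
        """
    )
    for row in cur.fetchall():
        print(row)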
Any help is appreciated.

Related

Movement of the restart_lsn position of logical replication slots is very slow

We have two logical replication slots in our PostgreSQL (version 11) database instance, and we are using pgJDBC to stream data from both slots.
We regularly send feedback (every 10 minutes) and update the confirmed_flush_lsn of both slots to the same position. However,
from our data we have seen that the restart_lsn movement of the two slots is not in sync; most of the time one of them lags so far behind that WAL files are held unnecessarily.
Here are some data points that illustrate the problem:
Thu Dec 10 05:37:13 CET 2020
          slot_name          |  restart_lsn  | confirmed_flush_lsn
-----------------------------+---------------+---------------------
 db_dsn_metadata_src_private | 48FB/F3000208 | 48FB/F3000208
 db_dsn_metadata_src_shared  | 48FB/F3000208 | 48FB/F3000208
(2 rows)
Thu Dec 10 13:53:46 CET 2020
          slot_name          |  restart_lsn  | confirmed_flush_lsn
-----------------------------+---------------+---------------------
 db_dsn_metadata_src_private | 48FC/2309B150 | 48FC/233AA1D0
 db_dsn_metadata_src_shared  | 48FC/233AA1D0 | 48FC/233AA1D0
(2 rows)
Thu Dec 10 17:13:51 CET 2020
          slot_name          |  restart_lsn  | confirmed_flush_lsn
-----------------------------+---------------+---------------------
 db_dsn_metadata_src_private | 4900/B4C3AE8  | 4900/94FDF908
 db_dsn_metadata_src_shared  | 48FD/D2F66F10 | 4900/94FDF908
(2 rows)
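These snapshots correspond to pg_replication_slots; a minimal monitoring sketch with psycopg2 (connection details are placeholders):

import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect(host="db-host", dbname="mydb", user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots")
    for slot_name, restart_lsn, confirmed_flush_lsn in cur.fetchall():
        print(slot_name, restart_lsn, confirmed_flush_lsn)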
Even though we regularly call setFlushedLSN() and forceUpdateStatus() on both slots' streams, the restart_lsn of the private slot is far behind its confirmed_flush_lsn, and
the shared slot is also behind its confirmed_flush_lsn, though not as far. Because the restart_lsn is not advancing fast enough, WAL
file management becomes a problem: the old WAL segments cannot be removed to free up disk space.
How can this problem be solved? Are there any general guidelines to overcome this issue?
We have seen another thread with a similar question, but there was no response there either:
WALs getting pilled up - restart_lsn of logical replication not moving in PostgreSQL
I am using the sample program published by pgJDBC here:
https://jdbc.postgresql.org/documentation/head/replication.html
to get the streaming changes from PostgreSQL.
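For comparison, here is the same acknowledge-after-processing idea sketched in Python with psycopg2's logical replication support rather than the pgJDBC API from the question (slot name and connection details are placeholders):

import psycopg2
from psycopg2.extras import LogicalReplicationConnection

# Placeholder connection details and slot name.
conn = psycopg2.connect(
    host="db-host", dbname="mydb", user="postgres", password="secret",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="db_dsn_metadata_src_private", decode=True)

def consume(msg):
    # Process the decoded change first...
    print(msg.payload)
    # ...then acknowledge it. The server can only advance restart_lsn
    # (and recycle WAL) up to the position confirmed by this feedback.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)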

How to achieve fault tolerance on the producer end with Kafka

I am new to Kafka and data ingestion. I know Kafka is fault tolerant, as it keeps the data redundantly on multiple nodes. However, what I don't understand is how we can achieve fault tolerance on the source/producer end. For example, if I have netcat as the source, as in the example below:
nc -l [some_port] | ./bin/kafka-console-producer --broker-list [kafka_server]:9092 --topic [my_topic]
The producer would fail to push messages if the node running netcat goes down. I was wondering whether there is a mechanism by which Kafka can pull the input itself, so that, for example, if netcat fails on one node, another node can take over and start pushing messages.
My second question is how this is achieved in Flume, as it is a pull-based architecture. Would Flume work in this case, that is, if one node running netcat fails?
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental ID, known as its offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.
Example of 2 topics (each having 3 and 2 partitions respectively):
Broker 1:
+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|      Topic 2      |
|    Partition 1    |
+-------------------+
Broker 2:
+-------------------+
|      Topic 1      |
|    Partition 2    |
|                   |
|                   |
|      Topic 2      |
|    Partition 0    |
+-------------------+
Broker 3:
+-------------------+
|      Topic 1      |
|    Partition 1    |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can still serve the data of the topic. For instance, assume that we have a topic with 2 partitions and a replication-factor of 2, as shown below:
Broker 1:
+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Broker 2:
+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|      Topic 1      |
|    Partition 1    |
+-------------------+
Broker 3:
+-------------------+
|      Topic 1      |
|    Partition 1    |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication-factor of 3 is always a good idea, since it allows one broker to be taken down for maintenance and another to fail unexpectedly at the same time. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
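For instance, such a topic could be created with a replication-factor of 3 like this (a sketch using kafka-python's admin client; broker addresses and the topic name are placeholders):

from kafka.admin import KafkaAdminClient, NewTopic

# Placeholder broker addresses.
admin = KafkaAdminClient(bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"])

# 2 partitions, each replicated on 3 brokers.
admin.create_topics([NewTopic(name="my_topic", num_partitions=2, replication_factor=3)])
admin.close()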
Note about Leaders:
At any time, only one broker can be the leader of a partition, and only that leader can receive and serve data for that partition. The remaining brokers just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when the broker fails. In general, when all replicas of a partition fail or go offline, the leader is automatically set to -1.
That said, as long as your producer lists all the addresses of the Kafka brokers in the cluster (bootstrap_servers), you should be fine. Even when a broker is down, your producer will attempt to write the record to another broker.
Finally, make sure to set acks=all (it might have an impact on throughput, though) so that all in-sync replicas acknowledge that they received the message.
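A minimal producer sketch along those lines, using kafka-python (broker addresses and the topic name are placeholders):

from kafka import KafkaProducer

# List all brokers so the client can fail over if one of them is down.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    acks="all",   # wait until all in-sync replicas have the record
    retries=5,    # retry transient failures (e.g. a leader change)
)
producer.send("my_topic", b"some message")
producer.flush()
producer.close()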

Datadog Agent Error: Unable to collect configurations from provider docker: temporary failure in dockerutil

Datadog installation using Helm was successful (docs used).
Agent version: 7.19.0
Agent Status
$ kubectl exec -it datadog-release-7jttj agent status | egrep "OK|ERROR"
Defaulting container name to agent.
Use 'kubectl describe pod/datadog-release-7jttj -n default' to see all of the containers in this pod.
Instance ID: cpu [OK]
Instance ID: disk:e5dffb8bef24336f [OK]
Instance ID: docker [ERROR]
Instance ID: file_handle [OK]
Instance ID: io [OK]
Instance ID: kubelet:d884b5186b651429 [OK]
Instance ID: kubernetes_apiserver [OK]
Instance ID: load [OK]
Instance ID: memory [OK]
Instance ID: network:e0204ad63d43c949 [OK]
Instance ID: ntp:d884b5186b651429 [OK]
Instance ID: uptime [OK]
Error in Logs
$ kubectl logs -f datadog-release-7jttj agent | grep "ERROR"
2020-05-03 14:49:17 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:18 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:19 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:20 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:21 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:22 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
2020-05-03 14:49:23 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider docker: temporary failure in dockerutil, will retry later: try delay not elapsed yet
This is the error I see in the logs agent, and the same is true for all pods:
default/datadog-release-7jttj/init-config
-----------------------------------------
Type: file
Path: /var/log/pods/default_datadog-release-7jttj_b373c792-b5da-46b9-a906-1fd71e9a41bc/init-config/*.log
Status: Error: could not find any file matching pattern /var/log/pods/default_datadog-release-7jttj_b373c792-b5da-46b9-a906-1fd71e9a41bc/init-config/*.log, check that all its subdirectories are executable
0 files tailed out of 0 files matching
Not sure where I am going wrong. Any help is highly appreciated.

pcp_attach_node gives EOFError in pgpool

I have successfully set up replication for my Postgres database using pgpool.
Then I stopped the master server and checked the pool status. It is as shown below:
postgres=# show pool_nodes;
 node_id |  hostname  | port | status | lb_weight |  role
---------+------------+------+--------+-----------+--------
 0       | 10.140.0.9 | 5432 | 3      | 0.500000  | slave
 1       | 10.140.0.7 | 5432 | 2      | 0.500000  | master
(2 rows)
Then I started the server, but it still shows the same status for the slave. So I used the following command to start the node:
/usr/sbin/pcp_node_info 10 10.140.0.9 5432 postgres postgres 1
But it is giving an "EOFError". Please help me solve this issue,
or let me know a way to bring the status back from 3 to 2.
I solved the issue myself. In the configuration, the PCP port is 9898 (not the PostgreSQL port 5432 used above). Also, there should be no space before the password in the pcp.conf file.
The pcp command should be as follows:
/usr/sbin/pcp_node_info 10 localhost 9898 postgres postgres 1

ActiveMQ 5.8 / XMPP Federation (Support for dialback?)

I'm trying to set up XMPP federation between a Cisco UCM platform and ActiveMQ 5.8 (I would like to consume XMPP messages over JMS). I've verified XMPP is set up on ActiveMQ by attaching to it with iChat, and have sent messages through it that arrive on a JMS topic.
Cisco federation, however, is not working. I'm seeing the following in the ActiveMQ logs, and I'm not sure where to go with this. I do see dialback classes in the XMPP jar files in ActiveMQ:
2013-08-27 11:48:29,789 | DEBUG | Creating new instance of XmppTransport | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport Server Thread Handler: xmpp://0.0.0.0:61222
2013-08-27 11:48:29,796 | DEBUG | XMPP consumer thread starting | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:29,800 | DEBUG | Sending initial stream element | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ BrokerService[localhost] Task-106
2013-08-27 11:48:29,801 | DEBUG | Initial stream element sent! | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ BrokerService[localhost] Task-106
2013-08-27 11:48:29,852 | DEBUG | Unmarshalled new incoming event - jabber.server.dialback.Result | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:29,852 | WARN | Unkown command: jabber.server.dialback.Result#6b7acfe1 of type: jabber.server.dialback.Result | org.apache.activemq.transport.xmpp.ProtocolConverter | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,864 | DEBUG | Unmarshalled new incoming event - org.jabber.etherx.streams.Error | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,864 | WARN | Unkown command: org.jabber.etherx.streams.Error#69d2b85a of type: org.jabber.etherx.streams.Error | org.apache.activemq.transport.xmpp.ProtocolConverter | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,865 | DEBUG | Unmarshalled new incoming event - org.jabber.etherx.streams.Error | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,865 | WARN | Unkown command: org.jabber.etherx.streams.Error#94552fd of type: org.jabber.etherx.streams.Error | org.apache.activemq.transport.xmpp.ProtocolConverter | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,865 | DEBUG | XMPP consumer thread starting | org.apache.activemq.transport.xmpp.XmppTransport | ActiveMQ Transport: tcp:///10.67.55.53:50750#61222
2013-08-27 11:48:59,866 | DEBUG | Transport Connection to: tcp://10.67.55.53:50750 failed: java.io.IOException: Unexpected EOF in prolog