Configuring PostgreSQL HA on RHEL 7.0

We are experiencing a problem configuring PostgreSQL for HA using Corosync and Pacemaker.
The crm_mon output is:
Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local
Stack: corosync
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
4 Resources configured
Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
vip-rep (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
Node Attributes:
* Node umhtvappdpj05.arqiva.local:
+ master-pgsql : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000000097000168
* Node umhtvappdpj06.arqiva.local:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000094000090
+ pgsql-status : PRI
Migration summary:
* Node umhtvappdpj05.arqiva.local:
* Node umhtvappdpj06.arqiva.local:
Here node 06 (umhtvappdpj06.arqiva.local) has started as the primary and node 05 (umhtvappdpj05.arqiva.local) acts as the standby, but the two are not connected (the standby reports pgsql-status HS:alone).
recovery.conf on node 05
standby_mode = 'on'
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
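One quick way to confirm the disconnect from the database side is to query pg_stat_replication on the current primary (node 06); if the standby were streaming, a row with its application_name would show up. A minimal check, assuming local access as the postgres user:
# run on umhtvappdpj06 (current primary); an empty result means no standby is attached
/usr/pgsql-9.3/bin/psql -U postgres -x -c \
  "SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;"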
The resources created are:
pcs resource create vip-master IPaddr2 \
ip="10.52.6.94" \
nic="ens192" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
pcs resource create vip-rep IPaddr2 \
ip="10.52.6.95" \
nic="ens192" \
cidr_netmask="24" \
meta migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
pcs resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/pgsql-9.3/bin/pg_ctl" \
psql="/usr/pgsql-9.3/bin/psql" \
pgdata="/pgdata/data" \
rep_mode="sync" \
node_list="10.52.6.85 10.52.6.92" \
restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \
master_ip="10.52.6.95" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
restart_on_promote='true' \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
[root@umhtvappdpj05 data]# pcs resource show --all
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started
vip-rep (ocf::heartbeat:IPaddr2): Started
[root@umhtvappdpj05 data]#
The only anomaly is that Corosync and Pacemaker were first installed on node 06 while it was on a different subnet from node 05; node 06 was subsequently moved to the same subnet as node 05. Could this be the cause? Maybe a re-install on node 06? That does seem to make sense.
thank you
Sameer

Here is how you can reconnect your replica to the primary PostgreSQL instance:
1. Touch the PGSQL.lock file in /var/lib/pgsql/tmp/ on umhtvappdpj05.arqiva.local
2. Stop PostgreSQL on node umhtvappdpj05.arqiva.local using systemctl
3. Do a base backup (or rsync the data directory) from the primary server to the replica
4. Remove the PGSQL.lock file from the replica
5. Run pcs resource cleanup pgsql
These steps always work for me. Let me know if they don't.
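A minimal shell sketch of those steps on the standby (node 05), using the paths from the question; pg_basebackup is just one way of doing the base backup in step 3, and the postgresql-9.3 service name is an assumption about this install:
touch /var/lib/pgsql/tmp/PGSQL.lock                  # 1. create the lock file
systemctl stop postgresql-9.3                        # 2. stop PostgreSQL on the standby
rm -rf /pgdata/data/*                                # 3. clear the old data directory...
pg_basebackup -h 10.52.6.95 -U postgres -D /pgdata/data -X stream -P   # ...and pull a fresh base backup from the primary (vip-rep)
rm -f /var/lib/pgsql/tmp/PGSQL.lock                  # 4. remove the lock file
pcs resource cleanup pgsql                           # 5. let Pacemaker re-probe and restart the resource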

Related

Topic not created when using debezium/connect Docker image

A Kafka topic for my table is not created when using the debezium/connect Docker image. Here's how I'm starting the container:
docker run -it --rm --name debezium -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs \
-e OFFSET_STORAGE_TOPIC=my-connect-offsets -e BOOTSTRAP_SERVERS=192.168.56.1:9092 \
-e CONNECT_NAME=my-connector -e CONNECT_CONNECTOR_CLASS=io.debezium.connector.postgresql.PostgresConnector \
-e CONNECT_TOPIC_PREFIX=my-prefix -e CONNECT_DATABASE_HOSTNAME=host.docker.internal -e CONNECT_DATABASE_PORT=5432 \
-e CONNECT_DATABASE_USER=postgres -e CONNECT_DATABASE_PASSWORD=root -e DATABASE_SERVER_NAME=mydb \
-e CONNECT_DATABASE_DBNAME=mydb -e CONNECT_TABLE_INCLUDE_LIST=myschema.my_table -e CONNECT_PLUGIN_NAME=pgoutput \
debezium/connect
I've tried using CONNECT__ instead of CONNECT_, but I get the same result. A topic for the table is also not created if I use the API:
curl -H 'Content-Type: application/json' 127.0.0.1:8083/connectors --data '
{
"name": "prism",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"topic.prefix": my-connector",
"database.hostname": "host.docker.internal",
"database.port": "5432",
"database.user": "postgres",
"database.password": "root",
"database.server.name": "mydb",
"database.dbname" : "mydb",
"table.include.list": "myschema.my_table",
"plugin.name": "pgoutput"
}
}'
The topics my-connect-configs and my-connect-offsets, specified by CONFIG_STORAGE_TOPIC and OFFSET_STORAGE_TOPIC, are created.
http://localhost:8083/connectors/my-connector/status shows this:
{"name":"my-connector","connector":{"state":"RUNNING","worker_id":"172.17.0.3:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"172.17.0.3:8083"}],"type":"source"}
I was able to create a topic when using bin/connect-standalone.sh instead of the Docker image as per this question.
Automatic topic creation is enabled and I don't see any errors/warnings in the log.
Check the Kafka Connect container logs to see what message it shows when it tries to write the data to the Kafka cluster.
You need to enable auto topic creation in the Kafka broker config; check out this doc.
Make sure the table exists in the database and that "table.include.list": "myschema.my_table" is correct. For experimentation you can remove this config temporarily.
You can also use the UI built by the Redpanda team to manage topic, broker and Kafka Connect config - here.
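For reference, the broker-side setting behind that suggestion is auto.create.topics.enable (it defaults to true); a quick way to check it on the broker host, with the server.properties path being an assumption about your install:
grep auto.create.topics.enable /opt/kafka/config/server.properties   # set it to true and restart the broker if it is false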
The issue was that the underlying table didn't have any data, and, therefore, the topic was not created. The topic is created either if the table has data when the connector is started or if rows are added while the connector is running.
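A quick way to reproduce that behaviour is to insert a row while the connector is running and then list the topics. The INSERT below is only a placeholder (real column values depend on your table), and the connection details and kafka-topics.sh location are assumptions:
psql -h localhost -U postgres -d mydb -c "INSERT INTO myschema.my_table DEFAULT VALUES;"
kafka-topics.sh --bootstrap-server 192.168.56.1:9092 --list   # the new topic is named <topic.prefix>.myschema.my_table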

Atomic operation when writing on kafka using pyspark

I'm using PySpark v3.2.1 to write to a Kafka broker:
data_frame \
.selectExpr('CAST(id AS STRING) AS key', "to_json(struct(*)) AS value") \
.write \
.format('kafka') \
.option('topic', topic)\
.option('kafka.bootstrap.servers', 'localhost:9092') \
.mode('append') \
.save()
I'm facing a real engineering issue: how can I ensure an atomic write operation and an idempotent producer?
Any suggested solutions are welcome, thanks.

Loading CSV data into Kafka

I was working on event monitoring / microservice monitoring with Kafka, following the guide at https://rmoff.net/2020/06/17/loading-csv-data-into-kafka/.
When I reached this part:
kafkacat -b kafka:29092 -t orders_spooldir_00 \
-C -o-1 -J \
-s key=s -s value=avro -r http://schema-registry:8081 | \
jq '.payload'
I got an error and I'm not sure what went wrong - the Docker end or the server end?
I would appreciate any help on how to proceed, thanks.
% ERROR: Failed to query metadata for topic orders_spooldir_00: Local: Broker transport failure
kafka-connect | [2021-11-09 16:02:35,778] ERROR [source-csv-spooldir-00|worker] WorkerConnector{id=source-csv-spooldir-00} Error while starting connector (org.apache.kafka.connect.runtime.WorkerConnector:193)

Spark Error : executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue:
Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. Even after setting the executor count to 250, only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it.
Please help me understand what is causing the executors to be killed:
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
I am not sure why YARN is killing my executors.
I faced a similar issue, where investigating the NodeManager logs led me to the root cause.
You can access them via the web interface at:
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
1. Collect the logs (yarn logs ... command).
2. Identify the node and container emitting the error in those logs.
3. Search the NodeManager logs around the timestamp of the error for the root cause.
By the way, you can access the aggregated collection (XML) of all configuration affecting a node on the same port at:
nodeManagerAddress:PORT/conf
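For step 1 of that workflow, a minimal sketch (the application ID is a placeholder; take the real one from the ResourceManager UI or the spark-submit output):
yarn logs -applicationId application_1513000000000_0001 > app.log   # collect the aggregated logs
grep -n "Container killed" app.log                                   # locate the failing container and its node
grep -n "exitCode" app.log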
I believe this issue has more to do with memory and the dynamic allocation timeouts at the executor/container level; make sure you can change the config parameters at that level.
One of the ways you can resolve this issue is by changing this config value, either in your spark-shell or in your Spark job:
spark.dynamicAllocation.executorIdleTimeout
This thread has more detailed information on how to resolve the issue, and it worked for me:
https://jira.apache.org/jira/browse/SPARK-21733
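As an illustration of the setting above, it can be passed at submit time; the 300s value and the job name are placeholders (dynamic allocation removes executors that stay idle longer than this timeout, which defaults to 60s):
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=300s \
  my_job.py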
I had the same issue: my Spark job was using only one task node and killing the other provisioned nodes. This also happened when switching to EMR Serverless, where my job was being run on only one "thread". Please see below, as this fixed it for me:
spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar

Cannot create dataproc cluster

When creating a Dataproc cluster using the web console or gcloud, I get the same error:
gcloud dataproc --region us-west1 clusters create cluster-02eb --subnet default --zone us-west1-a --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 1.2 --scopes 'https://www.googleapis.com/auth/cloud-platform' --project projectxyz123 --initialization-actions 'gs://dataproc-initialization-actions/jupyter/jupyter.sh'
Waiting on operation [projects/projectxyz123/regions/us-west1/operations/53badbbc-41cc-4241-aedb-490981864bf9].
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/projectxyz123/regions/us-west1/operations/53badbbc-41cc-4241-aedb-490981864bf9] failed: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
Not Found.
What does this error mean?
This is currently a known issue for zone us-west1-a. Using an alternative zone should work (e.g. us-west1-b).
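For example, re-running the same command from the question with only the zone changed should work (everything else unchanged):
gcloud dataproc --region us-west1 clusters create cluster-02eb --subnet default --zone us-west1-b \
  --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 \
  --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 1.2 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' --project projectxyz123 \
  --initialization-actions 'gs://dataproc-initialization-actions/jupyter/jupyter.sh'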