I'm having trouble setting up a task that migrates data from an RDS database (PostgreSQL, engine 10.15) into an S3 bucket in initial migration + CDC mode.
Both endpoints are configured and tested successfully.
I have created the task twice, and both times it ran for a couple of hours at most. The first time, the initial dump went fine and some of the incremental dumps took place as well; the second time, only the initial dump finished and no incremental dump was performed before the task failed.
The error message is now:
Last Error: Task 'data-migration-bp-dev' was suspended after 9 successive recovery failures. Stop Reason: FATAL_ERROR. Error Level: FATAL
but just after it failed for the first time it was:
Last Error An internal WAL conversational protocol error has occurred. Task error notification received from subtask 0, thread 0 reptask/replicationtask.c:2859 1020452 Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated reptask/replicationtask.c:2866 1020452 Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
In the CloudWatch logs I see the following error messages:
SOURCE_CAPTURE I: Streaming initiated successfully (postgres_pglogical.c:274)
SOURCE_CAPTURE I: #1 : Non-monotonic LSN sequence: Current LSN '00000000/00000000' < Previous LSN '000001E3/94016430'. Event is ignored. (postgres_endpoint_wal_engine.c:710)
SOURCE_CAPTURE I: Unable to resolve attributes for relation id '28804'. Aborting action. (postgres_pglogical.c:1643)
SOURCE_CAPTURE I: End of CDC / CAPTURE events for POSTGRES endpoint. (postgres_endpoint_capture.c:520)
SOURCE_CAPTURE I: CAPTURE ended with exceptions. (postgres_endpoint_capture.c:527)
SOURCE_CAPTURE E: Could not find relation id '28804' in hash. 1020483 (postgres_pglogical.c:1470)
SOURCE_CAPTURE E: Failed to parse relation from dml command 1020483 (postgres_pglogical.c:2515)
SOURCE_CAPTURE E: Failed to find relation id on target while processing message from source 1020452 (postgres_endpoint_wal_engine.c:805)
SOURCE_CAPTURE E: WAL stream loop ended abnormally. (STATUS_PROTOCOL_ERROR) 1020452 (postgres_endpoint_wal_engine.c:992)
SOURCE_CAPTURE E: WAL reader terminated with irrecoverable error. 1020452 (postgres_endpoint_capture.c:496)
TASK_MANAGER I: Task - data-migration-bp-dev is in ERROR state, updating starting status to AR_NOT_APPLICABLE (repository.c:5102)
SOURCE_CAPTURE E: Error executing source loop 1020452 (streamcomponent.c:1870)
TASK_MANAGER E: Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev 1020452 (subtask.c:1409)
SOURCE_CAPTURE E: Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (subtask.c:1578)
TASK_MANAGER E: Task error notification received from subtask 0, thread 0 1020452 (replicationtask.c:2859)
TASK_MANAGER E: Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (replicationtask.c:2866)
TASK_MANAGER E: Task 'data-migration-bp-dev' encountered a recoverable error, retry attempt # 0 (repository.c:5184)
At this point I should mention that we had to configure the pglogical plugin and restart the database, and we got an error at the end, which we ignored since the DMS task started after that operation:
ERROR: current database is not configured as pglogical node
HINT: create pglogical node first
Is the failure of our DMS task related to the pglogical plugin configuration? If so, how can we configure it so that it works (our DB engine should be compatible with it, shouldn't it)? And if not, how can we fix it?
Thank you in advance!
Should anyone get the same error in the future, here is what we were told by the AWS tech specialist:
There is a known (to AWS) issue with the pglogical plugin. The solution requires using the test_decoding plugin instead.
Enforce using the test_decoding plugin on the DMS endpoint by specifying pluginName=test_decoding in the endpoint's Extra Connection Attributes (see the sketch after these steps).
Create a new DMS task using this endpoint (reusing the old task may cause it to fail due to desynchronization between the task and the logs).
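For reference, the same attribute can also be set outside the console; here is a minimal boto3 sketch (the endpoint ARN below is a placeholder, and you can equally just paste pluginName=test_decoding into the endpoint's Extra connection attributes field in the console):

# Sketch: set pluginName=test_decoding on the source endpoint via boto3.
# The endpoint ARN is a placeholder; adjust region/credentials as needed.
import boto3

dms = boto3.client("dms")
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:EXAMPLE",  # placeholder
    ExtraConnectionAttributes="pluginName=test_decoding",
)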
It did resolve the issue, but we still don't know what the actual problem was with the plugin, which (at the moment) is strongly recommended throughout the DMS documentation.
I have created an S3 bucket backed by Ceph, and through the Java S3 client (via the S3 object gateway) I am listing the directory in a paginated fashion. The listing randomly fails, sometimes after listing 1100 blobs in batches, sometimes after 2000, and I am not able to figure out how to debug this issue. Below is the exception I am getting. If you notice, there is a request ID in the exception, so I think I can filter the logs based on it, but the question is where I can find those logs. I have checked the S3 gateway pod logs, but I couldn't find any such entries there. Please let me know where I should look.
com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 500; Error Code: UnknownError; Request ID: tx00000000000000000e7df-005e626049-1146-rook-ceph-store; S3 Extended Request ID: 1146-rook-ceph-store-rook-ceph-store), S3 Extended Request ID: 1146-rook-ceph-store-rook-ceph-store
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1799)
Here is my code to iterate through the blobs. This is the non-paginated version; both the paginated and non-paginated versions throw the same exception after listing a few hundred blobs.
ObjectListing objects = conn.listObjects(bucket.getName());
do {
    for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
        System.out.println(objectSummary.getKey() + "\t" +
                objectSummary.getSize() + "\t" +
                StringUtils.fromDate(objectSummary.getLastModified()));
    }
    objects = conn.listNextBatchOfObjects(objects);
} while (objects.isTruncated());
So, any pointers on how to debug this would be helpful. Thanks.
Try ListObjectsV2.
Returns some or all (up to 1,000) of the objects in a bucket.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
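For illustration, here is a minimal paginated listing using the V2 API; this sketch uses Python/boto3 with placeholder endpoint, credentials, and bucket name (in the Java SDK v1 the equivalent is ListObjectsV2Request with AmazonS3.listObjectsV2):

# Sketch: paginated V2 listing against an S3-compatible gateway such as Ceph RGW.
# Endpoint URL, credentials, and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.local:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])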
I'm new to Apache Airflow. I want to call a REST endpoint using a DAG.
For example, the REST endpoint is:
@PostMapping(path = "/api/employees", consumes = "application/json")
Now I want to call this REST endpoint from an Airflow DAG and schedule it. I'm using SimpleHttpOperator to call it:
t1 = SimpleHttpOperator(
    task_id='post_op',
    endpoint='http://localhost:8084/api/employees',
    data=json.dumps({"department": "Digital", "id": 102, "name": "Rakesh", "salary": 80000}),
    headers={"Content-Type": "application/json"},
    dag=dag,
)
When I trigger the DAG, the task fails:
[2019-12-30 09:09:06,330] {{taskinstance.py:862}} INFO - Executing <Task(SimpleHttpOperator):
post_op> on 2019-12-30T08:57:00.674386+00:00
[2019-12-30 09:09:06,331] {{base_task_runner.py:133}} INFO - Running: ['airflow', 'run',
'example_http_operator', 'post_op', '2019-12-30T08:57:00.674386+00:00', '--job_id', '6', '--pool',
'default_pool', '--raw', '-sd', 'DAGS_FOLDER/ExampleHttpOperator.py', '--cfg_path',
'/tmp/tmpf9t6kzxb']
[2019-12-30 09:09:07,446] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,445] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2019-12-30 09:09:07,446] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,446] {{dagbag.py:92}} INFO - Filling up the DagBag from
/usr/local/airflow/dags/ExampleHttpOperator.py
[2019-12-30 09:09:07,473] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,472] {{cli.py:545}} INFO - Running <TaskInstance: example_http_operator.post_op 2019-12-
30T08:57:00.674386+00:00 [running]> on host 855dbc2ce3a3
[2019-12-30 09:09:07,480] {{http_operator.py:87}} INFO - Calling HTTP method
[2019-12-30 09:09:07,483] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,483]
{{base_hook.py:84}} INFO - Using connection to: id: http_default. Host: https://www.google.com/,
Port: None, Schema: None, Login: None, Password: None, extra: {}
[2019-12-30 09:09:07,484] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,484]
{{http_hook.py:131}} INFO - Sending 'POST' to url:
https://www.google.com/http://localhost:8084/api/employees
[2019-12-30 09:09:07,501] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,501]
{{http_hook.py:181}} WARNING - HTTPSConnectionPool(host='www.google.com', port=443): Max retries
exceeded with url: /http://localhost:8084/api/employees (Caused by SSLError(SSLError("bad handshake:
SysCallError(-1, 'Unexpected EOF')"))) Tenacity will retry to execute the operation
[2019-12-30 09:09:07,501] {{taskinstance.py:1058}} ERROR -
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url:
/http://localhost:8084/api/employees (Caused by SSLError(SSLError("bad handshake: SysCallError(-1,
'Unexpected EOF')")))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
cnx.do_handshake()
File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1934, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1664, in _raise_ssl_error
raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 394, in connect
ssl_context=context,
File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 370, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 491, in wrap_socket
raise ssl.SSLError("bad handshake: %r" % e)
ssl.SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
Airflow is running in Docker, using the puckel/docker-airflow image.
Why is it calling the host from http_default (Host: https://www.google.com/)?
You need to consider both the Operator you are using and the underlying Hook which it uses to connect.
The Hook fetches connection information from an Airflow Connection, which is just a container used to store credentials and other connection information. You can configure Connections in the Airflow UI (Admin -> Connections).
So in this case, you need to first configure your HTTP Connection.
From the http_hook documentation:
http_conn_id (str) – connection that has the base API url i.e https://www.google.com/
It so happens that for the HttpHook, you should configure the Connection by setting the host field to the base URL of your service: http://localhost:8084/.
Since your operator uses the default http_conn_id, the hook will use the Airflow Connection called "http_default" in the Airflow UI.
If you don't want to change the default one, you can create another Airflow Connection in the UI and pass its conn_id to your operator via the http_conn_id argument.
See the source code to get a better idea how the Connection object is used.
Lastly, according to the http_operator documentation:
endpoint (str) – The relative part of the full url. (templated)
You should pass only the relative part of your URL to the operator; the rest it will get from the underlying HttpHook.
In this case, the value of endpoint for your Operator should be api/employees (not the full URL).
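Putting it together, a minimal sketch of the corrected task, assuming the "http_default" Connection's host has been set to http://localhost:8084/ (or that you created your own Connection and pass its id via http_conn_id):

# Sketch: endpoint is relative; the base URL comes from the Airflow Connection.
import json
from airflow.operators.http_operator import SimpleHttpOperator

t1 = SimpleHttpOperator(
    task_id='post_op',
    http_conn_id='http_default',   # Connection holding host http://localhost:8084/
    endpoint='api/employees',      # relative part only
    data=json.dumps({"department": "Digital", "id": 102,
                     "name": "Rakesh", "salary": 80000}),
    headers={"Content-Type": "application/json"},
    dag=dag,
)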
The Airflow project documentation is unfortunately not very clear in this case. Please consider contributing an improvement, they are always welcome :)
I think you need to set the connection string as an environment variable in your Dockerfile or docker run command:
ENV AIRFLOW__CORE__SQL_ALCHEMY_CONN my_conn_string
see this and this
Connections
The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin -> Connections). A conn_id is defined there, and hostname / login / password / schema information is attached to it. Airflow pipelines can simply refer to the centrally managed conn_id without having to hard-code any of this information anywhere.
Many connections with the same conn_id can be defined, and when that is the case, and when the hooks use the get_connection method from BaseHook, Airflow will choose one connection randomly, allowing for some basic load balancing and fault tolerance when used in conjunction with retries.
Airflow also has the ability to reference connections via environment variables from the operating system. The environment variable needs to be prefixed with AIRFLOW_CONN_ to be considered a connection. When referencing the connection in the Airflow pipeline, the conn_id should be the name of the variable without the prefix. For example, if the conn_id is named POSTGRES_MASTER, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER. Airflow assumes the value returned from the environment variable to be in a URI format (e.g. postgres://user:password@localhost:5432/master).
see this
Therefore you are currently using the default:
Using connection to: id: http_default. Host: https://www.google.com/
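To double-check which connection a given conn_id resolves to at runtime (whether it comes from the metadata database or from an AIRFLOW_CONN_ environment variable), here is a quick sketch using the Airflow 1.10-era import path:

# Prints the connection Airflow resolves for a conn_id, regardless of whether it
# is stored in the metadata DB or supplied via an AIRFLOW_CONN_<CONN_ID> variable.
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection("http_default")  # or your own conn_id
print(conn.host, conn.port, conn.schema)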
I am creating 4 mount-point disks on Windows. I need to copy files up to a threshold value (say 50 GB).
I tried vdbench. It works fine, but it throws an exception at the end.
compratio=4
dedupratio=1
dedupunit=256k
* Host Definition section
hd=default,user=Administator,shell=vdbench,jvms=1
hd=localhost,system=localhost
********************************************************************************
* Storage Definition section
fsd=fsd1,anchor=C:\UnMapTest-Volume1\disk1\,depth=1,width=1,files=1,size=5g
fsd=fsd2,anchor=C:\UnMapTest-Volume2\disk2\,depth=1,width=1,files=1,size=5g
fwd=fwd1,fsd=fsd*,operation=write,xfersize=1m,fileio=sequential,fileselect=random,threads=10
rd=rd1,fwd=fwd1,fwdrate=max,format=yes,elapsed=1h,interval=1
Below is the exception from vdbench. Because of it, my calling script fails.
05:29:14.287 Message from slave localhost-0:
05:29:14.289 file=C:\UnMapTest-Volume1\disk1\\vdb.1_1.dir\vdb_f0001.file,busy=true
05:29:14.290 Thread: FwgThread write C:\UnMapTest-Volume1\disk1\ rd=rd1 For loops: None
05:29:14.291
05:29:14.292 last_ok_request: Thu Dec 28 05:28:57 PST 2017
05:29:14.292 Duration: 16.92 seconds
05:29:14.293 consecutive_blocks: 10001
05:29:14.294 last_block: FILE_BUSY File busy
05:29:14.294 operation: write
05:29:14.295
05:29:14.296 Do you maybe have more threads running than that you have
05:29:14.296 files and therefore some threads ultimately give up after 10000 tries?
05:29:14.300 *
05:29:14.301 ******************************************************
05:29:14.302 * Slave localhost-0 aborting: Too many thread blocks *
05:29:14.302 ******************************************************
05:29:14.303 *
05:29:21.235
05:29:21.235 Slave localhost-0 prematurely terminated.
05:29:21.235
05:29:21.235 Slave aborted. Abort message received:
05:29:21.235 Too many thread blocks
05:29:21.235
05:29:21.235 Look at file localhost-0.stdout.html for more information.
05:29:21.735
05:29:21.735 Slave localhost-0 prematurely terminated.
05:29:21.735
java.lang.RuntimeException: Slave localhost-0 prematurely terminated.
at Vdb.common.failure(common.java:335)
at Vdb.SlaveStarter.startSlave(SlaveStarter.java:198)
at Vdb.SlaveStarter.run(SlaveStarter.java:47)
I am using PowerShell on a Windows machine. If some other tool, such as Diskspd, has a way to fill data up to a threshold, please let me know.
I found the answer myself.
I did this using Diskspd.exe, as shown below.
The following command fills 50 GB of data in the specified disk folder (-c50G creates a 50 GB test file, -b4K uses a 4 KiB block size, -t2 runs two threads):
.\diskspd.exe -c50G -b4K -t2 C:\UnMapTest-Volume1\disk1\testfile1.dat
For my requirement it is much simpler than vdbench.
Caution: it does not write real data, so the consumed size on the array side does not match the file size.
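If all you need is to fill a mount point with data up to a threshold, and you want data the array cannot dedupe or compress away, a plain script also works; here is a minimal Python sketch, with the path and sizes as placeholders:

# Sketch: fill a file with random (incompressible) data up to a target size.
# The path and sizes below are placeholders.
import os

TARGET_BYTES = 50 * 1024**3        # 50 GiB threshold
CHUNK = 4 * 1024 * 1024            # write in 4 MiB chunks
path = r"C:\UnMapTest-Volume1\disk1\testfile1.dat"

with open(path, "wb") as f:
    written = 0
    while written < TARGET_BYTES:
        f.write(os.urandom(CHUNK))  # os.urandom() data cannot be deduplicated away
        written += CHUNK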
When a cron job runs, I get an email that says
HTTP request sent, awaiting response... 200 OK
Length: 19 [text/html]
Saving to: “filefeed.16”
0K 100% 4.93M=0s
2017-03-23 10:10:04 (4.93 MB/s) - “filefeed.16” saved [19/19]
So it's my understanding that Saving to: “filefeed.16” means it is storing this file somewhere on my server. Where is it?
After looking for a few hours, I found it is quite simple: the file is stored in the home directory of the user running the cron job. For example, if I am using user_03, it will be saved in /home/user_03/.