GAE Socket API error - ApplicationError: 4 Unknown error - sockets

We have an App Engine cron job that checks the liveness of a number of DNS servers using dnspython. It had been working without issue until [12/Nov/2014:13:28:12 -0800] (about 12 hours ago), when it started failing 100% of the time with the following:
DNS Lookup failed: 'ApplicationError: 4 Unknown error.'. Traceback (most recent call last):
File ".../handlers/tasks.py", line 150, in _checkDNSServer
answers = resolver.query(domain, 'A', source='')
File "lib/dns/resolver.py", line 830, in query
source_port=source_port)
File "lib/dns/query.py", line 213, in udp
s.bind(source)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/socket.py", line 222, in meth
return getattr(self._sock,name)(*args)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 660, in bind
self._CreateSocket(bind_address=address)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 611, in _CreateSocket
raise _SystemExceptionFromAppError(e)
ApplicationError: ApplicationError: 4 Unknown error.
The code in question is fairly simple ...
def _checkDNSServer(self, ip):
    """Return True if the server is up and responds within 1 second,
    False if the server is down or responded slowly.
    """
    domain = 'www.testdomain.com'
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    starttime = datetime.now()
    try:
        answers = resolver.query(domain, 'A', source='')
        duration = datetime.now() - starttime
        logging.debug("DNS Lookup Time %s" % duration)
        # Max delay of 1 second
        if duration > timedelta(seconds=1):
            return False
        return True
    except Exception as e:
        tb = traceback.format_exc()
        logging.error("DNS Lookup failed: '%s'. %s", e, tb)
        return False
- the code continues to work on the local development server
- billing is enabled
- quota is sufficient
- no changes to the App Engine release version (1.9.16) before/after the error appeared
- target servers are live and responding OK
Suggestions?

Seems it was a transient issue and is now resolved.
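One detail worth noting from the traceback: passing source='' makes dnspython call bind() on the UDP socket, and that bind is exactly the call the App Engine remote socket API rejected. For what it's worth, here is a more defensive variant of the check (a sketch only, assuming dnspython 1.x attribute names) that omits source and bounds the lookup with the resolver's own timeout settings:

import logging
import dns.resolver
import dns.exception

def check_dns_server(ip, domain='www.testdomain.com'):
    """Return True if the DNS server answers an A query within 1 second."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.timeout = 1.0    # per-nameserver timeout (seconds)
    resolver.lifetime = 1.0   # total time allowed for the whole query
    try:
        resolver.query(domain, 'A')   # no source= kwarg, so dnspython never calls bind()
        return True
    except dns.exception.Timeout:
        return False
    except Exception:
        logging.exception("DNS lookup failed")
        return False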


Socket conflict while running VAEs

I am currently trying to run VAEs on my personal laptop, following the steps from: https://github.com/AntixK/PyTorch-VAE
I am only trying to run the simplest VAE with configs/vae.yaml.
Since my personal device only has one GPU, I changed the gpus parameter in the config to [0], and I also added:
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
to run.py to make the file runnable. However, I am now getting the following error:
Global seed set to 1265
break point 3!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
break point 4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======= Training VanillaVAE =======
Global seed set to 1265
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-NMQ53KV]:50425 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-NMQ53KV]:50425 (system error: 10049 - The requested address is not valid in its context.).
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\core\datamodule.py:469: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------
0 | model | VanillaVAE | 3.9 M
-------------------------------------
3.9 M Trainable params
0 Non-trainable params
3.9 M Total params
15.751 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]break point 1!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
break point 2!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Global seed set to 1265
break point 3!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
break point 4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======= Training VanillaVAE =======
Global seed set to 1265
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:401] [c10d] The server socket has failed to bind to [DESKTOP-NMQ53KV]:50425 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:401] [c10d] The server socket has failed to bind to DESKTOP-NMQ53KV:50425 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
[E ..\torch\csrc\distributed\c10d\socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\huklab\Desktop\odin\PyTorch-VAE\run.py", line 67, in <module>
runner.fit(experiment, datamodule=data)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\trainer\trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\trainer\trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\trainer\trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\trainer\trainer.py", line 1132, in _run
self.accelerator.setup_environment()
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\accelerators\gpu.py", line 39, in setup_environment
super().setup_environment()
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\accelerators\accelerator.py", line 83, in setup_environment
self.training_type_plugin.setup_environment()
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 185, in setup_environment
self.setup_distributed()
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 272, in setup_distributed
init_dist_connection(self.cluster_environment, self.torch_distributed_backend)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pytorch_lightning\utilities\distributed.py", line 386, in init_dist_connection
torch.distributed.init_process_group(
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\rendezvous.py", line 257, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "C:\Users\huklab\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\rendezvous.py", line 188, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [DESKTOP-NMQ53KV]:50425 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.). The server socket has failed to bind to DESKTOP-NMQ53KV:50425 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
Does anyone have any idea what is happening here and how to fix it?
Thanks a lot for your help!
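One thing the traceback makes visible is that the failing bind happens in a child process: the multiprocessing spawn_main / _fixup_main_from_path frames show Windows re-importing run.py, so the top-level training code executes a second time and tries to bind the same rendezvous port, which would explain error 10048. Below is a minimal, self-contained sketch of that re-import behaviour; it is my own example and assumes nothing about the repo's run.py:

# Illustration of the "spawn" start method: the child process re-imports the main
# script, so any unguarded top-level code runs again in the child.
import multiprocessing as mp

print("this top-level line runs in the parent AND in every spawned child")

def worker():
    print("child process body")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # Windows default; forced here for clarity
    p = mp.Process(target=worker)
    p.start()
    p.join()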

How to use `client.start_ipython_workers()` in dask-distributed?

I am trying to get workers to output some information from their ipython kernel and execute various commands in the ipython session. I tried the examples in the documentation and the ipyparallel example works, but not the second example (with ipython magics). I cannot get the workers to execute any commands. For example, I am stuck on the following issue:
from dask.distributed import Client
client = Client()
info = client.start_ipython_workers()
list_workers = info.keys()
%remote info[list_workers[0]]
The last line returns an error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-9118451af441> in <module>
----> 1 get_ipython().run_line_magic('remote', "info['tcp://127.0.0.1:50497'] worker.active")
~/miniconda/envs/dask/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
2334 kwargs['local_ns'] = self.get_local_scope(stack_depth)
2335 with self.builtin_trap:
-> 2336 result = fn(*args, **kwargs)
2337 return result
2338
~/miniconda/envs/dask/lib/python3.7/site-packages/distributed/_ipython_utils.py in remote_magic(line, cell)
115 info_name = split_line[0]
116 if info_name not in ip.user_ns:
--> 117 raise NameError(info_name)
118 connection_info = dict(ip.user_ns[info_name])
119
NameError: info['tcp://127.0.0.1:50497']
I would appreciate any examples of how to get any information from the ipython kernel running on workers.
Posting here just to keep track of it: I raised an issue for this on GitHub: https://github.com/dask/distributed/issues/4522
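For what it's worth, the NameError in the traceback comes from _ipython_utils.py requiring the first token after %remote to be a plain variable name present in the IPython user namespace (see the info_name lookup above), not a subscript expression. A workaround sketch, with first_worker and worker_info being names I made up and worker.active simply reusing the expression from the traceback:

from dask.distributed import Client

client = Client()
info = client.start_ipython_workers()

first_worker = list(info)[0]       # info.keys() is a dict view in Python 3, so go through list()
worker_info = info[first_worker]   # connection-info dict for a single worker

# In the IPython session, pass the *name* to the magic rather than a subscript:
# %remote worker_info worker.active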

Airflow tasks time out in one hour even though the setting is larger than 1 hour

Currently I'm using Airflow with the Celery executor and Redis to run DAGs, and I have set execution_timeout to 12 hours in an S3 key sensor, but it fails within one hour on each retry.
I have tried updating visibility_timeout = 64800 in airflow.cfg, but the issue still exists.
file_sensor = CorrectedS3KeySensor(
    task_id = 'listen_for_file_drop', dag = dag,
    aws_conn_id = 'aws_default',
    poke_interval = 15,
    timeout = 64800, # 18 hours
    bucket_name = EnvironmentConfigs.S3_SFTP_BUCKET_NAME,
    bucket_key = dag_config[ConfigurationConstants.FILE_S3_PATTERN],
    wildcard_match = True,
    execution_timeout = timedelta(hours=12)
)
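For context, here is a sketch of the assumed DAG-level settings referred to below (the retries = 3 mentioned there is not shown above; names and dates are illustrative). In Airflow, execution_timeout is enforced per try, so four attempts of up to 12 hours each would be the expected ceiling:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'start_date': datetime(2019, 8, 1),
    'retries': 3,                               # 1 initial run + 3 retries = 4 attempts
    'execution_timeout': timedelta(hours=12),   # per-attempt hard limit
}

dag = DAG('s3_file_watch', default_args=default_args, schedule_interval=None)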
My understanding is that execution_timeout should let each run last up to 12 hours, across a total of four attempts (retries = 3). But the issue is that each retry fails within an hour, so the task only lasts a little over 4 hours in total.
[2019-08-06 13:00:08,597] {{base_task_runner.py:101}} INFO - Job 9: Subtask listen_for_file_drop [2019-08-06 13:00:08,595] {{timeout.py:41}} ERROR - Process timed out
[2019-08-06 13:00:08,612] {{models.py:1788}} ERROR - Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1652, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/sensors/base_sensor_operator.py", line 97, in execute
while not self.poke(context):
File "/usr/local/airflow/dags/ProcessingStage/sensors/sensors.py", line 91, in poke
time.sleep(30)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 42, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
I figured this out a few days ago.
Since I'm using AWS to deploy Airflow with the Celery executor, a few improperly configured CloudWatch alarms kept scaling the workers and the webserver/scheduler up and down :(
After those alarms were updated, it works well now!!

TypeError: must be str, not bytes, Python 3, Raspberry Pi

I am trying to send video from a Raspberry Pi to my laptop
and save the frames as pictures, so I found the code below online,
but I get the following errors when I run it.
I run the client code on the Pi using the Thonny IDE that comes preinstalled.
I apologize for the way the code is formatted below and would be very grateful if anybody could help me sort this out.
The server on the laptop is run using Python 3.6 IDLE:
import sys
import numpy as np
import cv2
import socket


class VideoStreamingTest(object):
    def __init__(self):
        self.server_socket = socket.socket()
        self.server_socket.bind(('0.0.0.0', 9006))
        self.server_socket.listen(0)
        self.connection, self.client_address = self.server_socket.accept()
        self.connection = self.connection.makefile('rb')
        self.streaming()

    def streaming(self):
        try:
            print("Connection from: ", self.client_address)
            print("Streaming...")
            print("Press 'q' to exit")
            stream_bytes = ' '
            while True:
                stream_bytes += self.connection.read(1024)
                first = stream_bytes.find('\xff\xd8')
                last = stream_bytes.find('\xff\xd9')
                if first != -1 and last != -1:
                    jpg = stream_bytes[first:last + 2]
                    stream_bytes = stream_bytes[last + 2:]
                    #image = cv2.imdecode(np.fromstring(jpg, dtype=np.uint8), cv2.CV_LOAD_IMAGE_GRAYSCALE)
                    image = cv2.imdecode(np.fromstring(jpg, dtype=np.uint8), cv2.CV_LOAD_IMAGE_UNCHANGED)
                    cv2.imshow('image', image)
                    if cv2.waitKey(1) & 0xFF == ord('q'):
                        break
        finally:
            self.connection.close()
            self.server_socket.close()


if __name__ == '__main__':
    VideoStreamingTest()
I get the following error
Connection from: ('192.168.43.3', 47518)
Streaming...
Press 'q' to exit
Traceback (most recent call last):
File "C:\Users\John Doe\d-ff\Desktop\AutoRCCar-master
3\test\stream_server_test.py", line 46, in <module>
VideoStreamingTest()
File "C:\Users\John Doe\d-ff\Desktop\AutoRCCar-master
3\test\stream_server_test.py", line 16, in __init__
self.streaming()
File "C:\Users\John Doe\d-ff\Desktop\AutoRCCar-master
3\test\stream_server_test.py", line 28, in streaming
stream_bytes += self.connection.read(1024)
TypeError: must be str, not bytes
The client-side code on the Pi:
import io
import socket
import struct
import time
import picamera

# create socket and bind host
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_socket.connect(('ToM', 9006))
connection = client_socket.makefile('wb')

try:
    with picamera.PiCamera() as camera:
        camera.resolution = (320, 240)  # pi camera resolution
        camera.framerate = 5            # 10 frames/sec
        time.sleep(2)                   # give 2 secs for camera to initilize
        start = time.time()
        stream = io.BytesIO()
        # send jpeg format video stream
        for foo in camera.capture_continuous(stream, 'jpeg', use_video_port = True):
            connection.write(struct.pack('<L', stream.tell()))
            connection.flush()
            stream.seek(0)
            connection.write(stream.read())
            if time.time() - start > 600:
                break
            stream.seek(0)
            stream.truncate()
    connection.write(struct.pack('<L', 0))
finally:
    connection.close()
    client_socket.close()
I get the following error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pi/Desktop/stream_client.py", line 40, in <module>
connection.close()
File "/usr/lib/python3.5/socket.py", line 594, in write
return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe
I first thought it might be because of limited bandwidth, since I was running VNC Viewer (remote desktop) over Wi-Fi on the Pi, but I don't think that's it.
I also had the same problem. After some searching I found a solution.
In Python 3 we have to specify whether a string is a regular (text) string or a bytes string. That's why we use b'string' instead of just 'string'.
Change
stream_bytes = ' '
to
stream_bytes = b' '
Also change
first = stream_bytes.find('\xff\xd8')
last = stream_bytes.find('\xff\xd9')
to
first = stream_bytes.find(b'\xff\xd8')
last = stream_bytes.find(b'\xff\xd9')
Note that you are using cv2.CV_LOAD_IMAGE_UNCHANGED, which is not available in OpenCV 3.0.
Use cv2.IMREAD_COLOR to show the image in color.
Make these changes and your stream should run smoothly.
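Putting those changes together, the decoding loop inside streaming() would look roughly like this (a sketch; np.frombuffer is my substitution for the deprecated np.fromstring, everything else mirrors the original server code):

stream_bytes = b' '
while True:
    stream_bytes += self.connection.read(1024)
    first = stream_bytes.find(b'\xff\xd8')   # JPEG start-of-image marker
    last = stream_bytes.find(b'\xff\xd9')    # JPEG end-of-image marker
    if first != -1 and last != -1:
        jpg = stream_bytes[first:last + 2]
        stream_bytes = stream_bytes[last + 2:]
        image = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('image', image)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break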
connection.write(struct.pack('<L', 0))
Also check whether keeping the line above inside the try block on the client fixes the BrokenPipeError.

MongoDB Assertion Error: starting_from == self.__retrieved (pymongo driver)

MongoDB Question:
We're using a sharded replicaset, running pymongo 2.2 against mongo (version: 2.1.1-pre-). We're getting a traceback when a query returns more than one result document.
Traceback (most recent call last):
File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/opt/DCM/mods/plugin.py", line 25, in run
self._mod.collect_metrics_dcm()
File "/opt/DCM/plugins/res.py", line 115, in collect_metrics_dcm
ms.updateSpecificMetric(metricName, value, timestamp)
File "/opt/DCM/mods/mongoSaver.py", line 155, in updateSpecificMetric
latestDoc = self.getLatestDoc(metricName)
File "/opt/DCM/mods/mongoSaver.py", line 70, in getLatestDoc
for d in dlist:
File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 747, in next
if len(self.__data) or self._refresh():
File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 698, in _refresh
self.__uuid_subtype))
File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 668, in __send_message
assert response["starting_from"] == self.__retrieved
AssertionError
The code that produces dlist is a simple find(). I've tried reIndex(), no joy. I've tried stopping and starting the mongo server, no joy.
This is easily replicable for me. Any ideas?
OK, so I traced this down a bit, and I have a SOLUTION for this assertion error.
There is a BUG in Mongo. When querying a sharded replica set, Mongo returns an incorrect value for 'starting_from'. Instead of returning 0 on the first query, it returns the number of records received rather than the offset value. I have a patch for pymongo to protect against this bad info:
File is site-packages/pymongo/cursor.py.
[user#hostname]$ diff cursor.py.orig cursor.py
631,632c631,634
< if not self.__tailable:
< assert response["starting_from"] == self.__retrieved
---
> if ((not self.__tailable) and (self.__retrieved != 0) and (response["starting_from"] != self.__retrieved)):
> from pprint import pformat
> msg = "Server response of 'starting_from' is '%s', but self__retrieved (which is only set to nonzero below here) is '%s'." % (pformat(response), pformat(self.__retrieved))
> assert False, msg
The 'starting_from' comes from helpers.py decoding the response from Mongo:
result["starting_from"] = struct.unpack("<i", response[12:16])[0]
So it's bytes 12 through 15 of Mongo's response.
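To illustrate that decoding, here is a small self-contained sketch with a fabricated OP_REPLY body (not a real server response); the layout is responseFlags (4 bytes), cursorId (8 bytes), startingFrom (4 bytes), numberReturned (4 bytes), followed by the BSON documents:

import struct

# Fake reply body: flags = 0, cursorId = 0, startingFrom = 0, numberReturned = 25
reply = struct.pack("<iqii", 0, 0, 0, 25)

starting_from = struct.unpack("<i", reply[12:16])[0]    # the value the assertion checks
number_returned = struct.unpack("<i", reply[16:20])[0]
print(starting_from, number_returned)                   # 0 25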
This is a bug in the 2.1.1 development release of mongos. See https://jira.mongodb.org/browse/SERVER-5844