I have a simple piece of PySpark code:
import pyarrow as pa
fs = pa.hdfs.connect()
If I run this using spark-submit in "client" mode, it works fine, but in "cluster" mode it throws the following error:
Traceback (most recent call last):
File "t3.py", line 17, in <module>
fs = pa.hdfs.connect()
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 181, in connect
kerb_ticket=kerb_ticket, driver=driver)
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 37, in __init__
self._connect(host, port, user, kerb_ticket, driver)
File "io-hdfs.pxi", line 99, in pyarrow.lib.HadoopFileSystem._connect
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
All the necessary Python libraries are installed on every node in my Hadoop cluster. I have verified this by running the code under pyspark on every node individually.
But I cannot make it work through spark-submit in cluster mode.
Any ideas?
shankar
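For reference, here is a minimal sketch of the same connection with the host, port and user spelled out explicitly (the values below are placeholders, not taken from the cluster above). pyarrow's libhdfs driver resolves the NameNode through JAVA_HOME, HADOOP_HOME and the Hadoop CLASSPATH of the calling process, and in cluster mode that process is the YARN container rather than the edge node:
import pyarrow as pa

# Placeholder values: substitute your NameNode host, RPC port and HDFS user.
fs = pa.hdfs.connect(
    host="namenode.example.com",
    port=8020,
    user="hdfs",
)
print(fs.ls("/"))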
I am running Airflow 1.10.12 on Ubuntu. Airflow was running fine using the LocalExecutor and MySQL.
In order to conduct some tests, I have moved to the CeleryExecutor with RabbitMQ.
Based on a tutorial, here is my config file:
[core]
executor = CeleryExecutor
[celery]
broker_url = pyamqp://rabbitmq:rabbitmq@localhost/
result_backend = db+mysql+pymysql://airflow:airflow@localhost:3306/airflow_db
But when I run:
airflow webserver
The following error is thrown:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/bin/cli.py", line 1076, in webserver
app = cached_app_rbac(None) if settings.RBAC else cached_app(None)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 300, in cached_app
app, _ = create_app(config, session, testing)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 65, in create_app
app.config.from_pyfile(settings.WEBSERVER_CONFIG, silent=True)
File "/usr/local/lib/python3.8/dist-packages/flask/config.py", line 132, in from_pyfile
exec(compile(config_file.read(), filename, "exec"), d.__dict__)
File "/home/helia/airflow/webserver_config.py", line 21, in <module>
from airflow.www.fab_security.manager import AUTH_DB
ModuleNotFoundError: No module named 'airflow.www.fab_security'
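For context, airflow.www.fab_security only exists in newer Airflow releases, so this traceback suggests the webserver_config.py was generated by a later version than 1.10.12. A minimal sketch of what the 1.10-era file imports instead (an assumption about the default template, not the poster's actual config):
# webserver_config.py (sketch of the Airflow 1.10.x default)
# In 1.10.x the auth constant comes straight from Flask-AppBuilder;
# airflow.www.fab_security is a later addition.
from flask_appbuilder.security.manager import AUTH_DB

# Authenticate web UI users against the Airflow metadata database.
AUTH_TYPE = AUTH_DB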
I am getting an error while running the following command. I have updated the gcloud SDK but am still facing the same error.
kubectl get nodes
Unable to connect to the server: error executing access token command "/Users/salayhin/google-cloud-sdk/bin/gcloud config config-helper --format=json": err=exit status 1 output= stderr=Traceback (most recent call last):
File "/Users/salayhin/google-cloud-sdk/lib/gcloud.py", line 95, in <module>
main()
File "/Users/salayhin/google-cloud-sdk/lib/gcloud.py", line 54, in main
from googlecloudsdk.core.util import encoding
File "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/__init__.py", line 23, in <module>
from googlecloudsdk.core.util import lazy_regex
File "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/core/util/lazy_regex.py", line 25, in <module>
from googlecloudsdk.core.util import lazy_regex_patterns
ImportError: cannot import name lazy_regex_patterns
It seems like it's just an error from the Google Cloud SDK, indicating that you are probably missing this file in the same directory.
I would recommend re-installing the Google Cloud SDK on whatever system you are using.
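If you want to confirm that the file really is missing before reinstalling, a quick check from Python (just a sketch; the directory is the one shown in the traceback):
# lazy_regex_patterns.py should sit next to lazy_regex.py in this directory.
import os

sdk_util_dir = "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/core/util"
print(sorted(name for name in os.listdir(sdk_util_dir) if name.startswith("lazy_regex")))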
I have installed the Google Cloud SDK but am having an issue running "gsutil". Here is the error I'm getting:
~/gcloud/google-cloud-sdk#> gsutil
Traceback (most recent call last):
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 13, in <module>
import bootstrapping
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 40, in <module>
from googlecloudsdk.core import execution_utils
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/execution_utils.py", line 33, in <module>
from googlecloudsdk.core import log
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 810, in <module>
_log_manager = _LogManager()
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 526, in __init__
self._file_formatter = _LogFileFormatter()
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 315, in __init__
super(_LogFileFormatter, self).__init__(fmt=_LogFileFormatter.FORMAT)
TypeError: must be type, not classobj
~/gcloud/google-cloud-sdk#>
I have Python 2.7 installed. I also tried installing the SDK with "brew cask" as well as downloading it directly from the Google Cloud site and installing it, but no luck either.
I googled this error, but it seems like I'm the only one with it.
"gcloud" command works just fine; but it's just "gsutil" that's not working.
Thank you
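For what it's worth, "TypeError: must be type, not classobj" is Python 2's way of saying that super() was handed an old-style class. A tiny illustration of the same failure mode (this is not the SDK's code, just a sketch of the error under Python 2.7):
# Python 2 only: OldStyle has no `object` base, so it is an old-style class
# and super() refuses to accept it.
class OldStyle:
    def __init__(self):
        pass

class Child(OldStyle):
    def __init__(self):
        # Raises: TypeError: must be type, not classobj
        super(Child, self).__init__()

Child()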
I have followed this tutorial in an attempt to build an Airflow cluster on localhost with my own DAGs. When I ran airflow scheduler after having set executor = CeleryExecutor in the config file, I received the following traceback:
Traceback (most recent call last):
File "/home/yurii/Tools/anaconda3/bin/airflow", line 28, in
args.func(args)
File"/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/bin/cli.py", line 839, in scheduler job.run()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 200, in run
self._execute()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1309, in _execute
self._execute_helper(processor_manager)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1441, in _execute_helper
self.executor.heartbeat()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/executors/base_executor.py", line 124, in heartbeat
self.execute_async(key, command=command, queue=queue)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 80, in execute_async
args=[command], queue=queue)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/task.py", line 573, in apply_async
**dict(self._get_exec_options(), **options)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/base.py", line 354, in send_task
reply_to=reply_to or self.oid, **options
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/amqp.py", line 310, in publish_task
**kwargs
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/messaging.py", line 172, in publish
routing_key, mandatory, immediate, exchange, declare)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/connection.py", line 449, in _ensured
return fun(*args, **kwargs)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/messaging.py", line 188, in _publish
mandatory=mandatory, immediate=immediate,
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/librabbitmq/init.py", line 122, in basic_publish
mandatory or False, immediate or False,
TypeError: an integer is required (got type NoneType)
Some additional information:
I am using Airflow 1.8.0 with Celery 3.1.25 and RabbitMQ 3.5.7 as the broker and backend, but I also tried Airflow 1.9.0 with Celery 4.2.
Airflow with the SequentialExecutor works without any problems.
airflow test "dag_name" "task_name" "exec_date" runs successfully.
I am new to Airflow/Celery/RabbitMQ/SQL, so any help would be appreciated!
To add to the previous answer: using py-amqp involves either changing broker_url = amqp://XXXXX to broker_url = pyamqp://XXXXX, OR running
pip uninstall librabbitmq.
Additionally, you may need to rename the celery_result_backend variable to result_backend in your airflow.cfg; the celery_ prefix has been removed from variables in the [celery] section in recent versions.
It seems you are using librabbitmq as the amqp client library, which is not recommended by the Celery core team. Use py-amqp to talk to RabbitMQ and you should get rid of this error.
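If you want to check which client library Celery will actually use, a quick probe (a sketch; Celery 3.x prefers librabbitmq for amqp:// URLs whenever it is importable):
# If this import succeeds, amqp:// URLs go through librabbitmq;
# switching to pyamqp:// (or uninstalling librabbitmq) forces py-amqp.
try:
    import librabbitmq  # noqa: F401
    print("librabbitmq is installed - amqp:// will use the C client")
except ImportError:
    print("librabbitmq is not installed - amqp:// falls back to py-amqp")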
I'm getting the following error when trying to run psycopg2 in an AWS Lambda:
/var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size: ImportError
Traceback (most recent call last):
File "/var/task/functions/refresh_mv.py", line 64, in execute
session = SessionFactoryGraphQL.get_session(app=item['app'])
File "/var/task/lib/session_factory.py", line 22, in get_session
engine = create_engine(conn_string, poolclass=NullPool)
File "/var/task/functions/../vendored/sqlalchemy/engine/__init__.py", line 387, in create_engine
return strategy.create(*args, **kwargs)
File "/var/task/functions/../vendored/sqlalchemy/engine/strategies.py", line 80, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "/var/task/functions/../vendored/sqlalchemy/dialects/postgresql/psycopg2.py", line 554, in dbapi
import psycopg2
File "/var/task/functions/../vendored/psycopg2/__init__.py", line 50, in <module>
from psycopg2._psycopg import ( # noqa
ImportError: /var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size
The weird thing is: everything was working fine until yesterday (for more than 5 months), and it suddenly stopped working. None of the libraries have been updated.
I tried to build from scratch, as in https://github.com/jkehler/awslambda-psycopg2, but I am still getting the same error.
Can someone help me with it?
The problem is in the latest version of the Serverless Framework. I assume that you are using Serverless to deploy your Lambda function. Remove the deployed stack and reinstall an older release of the CLI:
serverless remove
npm install serverless@1.20.2 -g
This should work.
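If you want to verify that the redeployed package contains an intact binary, a quick sanity check of the shared object's ELF header (a sketch; the relative path mirrors the traceback and is an assumption about your layout):
# A healthy _psycopg.so for Lambda starts with the ELF magic bytes and is a
# 64-bit binary (EI_CLASS == 2); a corrupted or wrong-platform file fails this.
with open("vendored/psycopg2/_psycopg.so", "rb") as f:
    header = f.read(5)

print("ELF magic ok:", header[:4] == b"\x7fELF")
print("64-bit:", header[4:5] == b"\x02")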