pyarrow through spark-submit in cluster mode fails - pyspark

I have a simple PySpark script:
import pyarrow
fs = pyarrow.hdfs.connect()
If I run this using spark-submit in "client" mode, it works fine, but in "cluster" mode it throws the following error:
Traceback (most recent call last):
File "t3.py", line 17, in <module>
fs = pa.hdfs.connect()
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 181, in connect
kerb_ticket=kerb_ticket, driver=driver)
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 37, in __init__
self._connect(host, port, user, kerb_ticket, driver)
File "io-hdfs.pxi", line 99, in pyarrow.lib.HadoopFileSystem._connect
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
All the necessary Python libraries are installed on every node in my Hadoop cluster. I have verified this by testing the code in the pyspark shell on each node individually.
But I cannot make it work through spark-submit in cluster mode.
Any ideas?
shankar
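For what it's worth, pyarrow's libhdfs driver depends on environment variables such as JAVA_HOME, HADOOP_HOME (or ARROW_LIBHDFS_DIR) and CLASSPATH being set in the Python process, and in cluster mode the driver runs on a worker node rather than on the submitting host. A small diagnostic sketch (an assumption about the likely difference between the two modes, not a confirmed fix) that could be dropped into the script to see what the cluster-mode driver actually sees:
import os

# Print the variables pyarrow/libhdfs relies on, as seen by the Python
# process that spark-submit actually starts in cluster mode.
for var in ("JAVA_HOME", "HADOOP_HOME", "ARROW_LIBHDFS_DIR", "CLASSPATH"):
    print(var, "=", os.environ.get(var, "<not set>"))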

Related

Airflow webserver No module named 'airflow.www.fab_security'

I am running Airflow 1.10.12 on Ubuntu. Airflow was running fine using the local executor and MySQL.
In order to conduct some tests, I have moved to the Celery executor with RabbitMQ.
Based on a tutorial, here is my config file:
[core]
executor = CeleryExecutor
[celery]
broker_url = pyamqp://rabbitmq:rabbitmq@localhost/
result_backend = db+mysql+pymysql://airflow:airflow@localhost:3306/airflow_db
But when I run:
airflow webserver
The following error is thrown:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/bin/cli.py", line 1076, in webserver
app = cached_app_rbac(None) if settings.RBAC else cached_app(None)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 300, in cached_app
app, _ = create_app(config, session, testing)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 65, in create_app
app.config.from_pyfile(settings.WEBSERVER_CONFIG, silent=True)
File "/usr/local/lib/python3.8/dist-packages/flask/config.py", line 132, in from_pyfile
exec(compile(config_file.read(), filename, "exec"), d.__dict__)
File "/home/helia/airflow/webserver_config.py", line 21, in <module>
from airflow.www.fab_security.manager import AUTH_DB
ModuleNotFoundError: No module named 'airflow.www.fab_security'
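For context, airflow.www.fab_security is a module that only exists in Airflow 2.x, so the webserver_config.py being loaded here (/home/helia/airflow/webserver_config.py) looks like one generated by a newer Airflow release. On Airflow 1.10.x the generated config imports the auth constant from Flask-AppBuilder directly; a minimal sketch of that 1.10-style header (an illustration, not a drop-in replacement for a full config):
import os

# Airflow 1.10-era webserver_config.py imports AUTH_DB from Flask-AppBuilder,
# not from airflow.www.fab_security (which only ships with Airflow 2.x).
from flask_appbuilder.security.manager import AUTH_DB

basedir = os.path.abspath(os.path.dirname(__file__))
AUTH_TYPE = AUTH_DB  # authenticate against the Airflow metadata database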

Getting error on kubectl get nodes command

I am getting an error while running the following command. I have updated the gcloud SDK but am still facing the same error.
kubectl get nodes
Unable to connect to the server: error executing access token command "/Users/salayhin/google-cloud-sdk/bin/gcloud config config-helper --format=json": err=exit status 1 output= stderr=Traceback (most recent call last):
File "/Users/salayhin/google-cloud-sdk/lib/gcloud.py", line 95, in <module>
main()
File "/Users/salayhin/google-cloud-sdk/lib/gcloud.py", line 54, in main
from googlecloudsdk.core.util import encoding
File "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/__init__.py", line 23, in <module>
from googlecloudsdk.core.util import lazy_regex
File "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/core/util/lazy_regex.py", line 25, in <module>
from googlecloudsdk.core.util import lazy_regex_patterns
ImportError: cannot import name lazy_regex_patterns
It seems like it's just an error from the Google Cloud SDK, indicating that you are probably missing the file it is trying to import (lazy_regex_patterns.py) from that same directory.
I would recommend re-installing the Google Cloud SDK on whatever system you are using.
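A quick way to confirm the missing-file theory before reinstalling (the path below is taken from the traceback above):
import os

# Check whether lazy_regex_patterns.py sits next to lazy_regex.py in the SDK.
util_dir = "/Users/salayhin/google-cloud-sdk/lib/googlecloudsdk/core/util"
print(os.path.isfile(os.path.join(util_dir, "lazy_regex_patterns.py")))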

Google Cloud SDK installation - gsutil error

I have installed the Google Cloud SDK but am having an issue running "gsutil". Here is the error I'm getting:
~/gcloud/google-cloud-sdk#> gsutil
Traceback (most recent call last):
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 13, in <module>
import bootstrapping
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 40, in <module>
from googlecloudsdk.core import execution_utils
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/execution_utils.py", line 33, in <module>
from googlecloudsdk.core import log
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 810, in <module>
_log_manager = _LogManager()
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 526, in __init__
self._file_formatter = _LogFileFormatter()
File "/Users/gonyi/Desktop/gonyyi/gcloud/google-cloud-sdk/lib/googlecloudsdk/core/log.py", line 315, in __init__
super(_LogFileFormatter, self).__init__(fmt=_LogFileFormatter.FORMAT)
TypeError: must be type, not classobj
~/gcloud/google-cloud-sdk#>
I have Python 2.7 installed. I also tried installing this with "brew cask", as well as downloading it directly from the Google Cloud site and installing it, but no luck either.
I googled this error, but it seems like I'm the only one with it.
The "gcloud" command works just fine; it's only "gsutil" that's not working.
Thank you
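For context, "TypeError: must be type, not classobj" is what Python 2 raises when super() is handed an old-style class, which usually means gsutil is being executed by an interpreter older than the Python 2.7 you installed. A minimal Python 2 illustration of the same failure (illustration only, not gcloud code):
# Python 2 only: reproduces "TypeError: must be type, not classobj"
class OldStyle:                        # no 'object' base, so an old-style class
    def __init__(self):
        pass

class Child(OldStyle):
    def __init__(self):
        super(Child, self).__init__()  # super() requires a new-style class

Child()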

Airflow scheduler failure

I have followed this tutorial in an attempt to build an Airflow cluster on localhost with my own DAGs. When I ran airflow scheduler after setting executor = CeleryExecutor in the config file, I received the following traceback:
Traceback (most recent call last):
File "/home/yurii/Tools/anaconda3/bin/airflow", line 28, in
args.func(args)
File"/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/bin/cli.py", line 839, in scheduler job.run()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 200, in run
self._execute()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1309, in _execute
self._execute_helper(processor_manager)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1441, in _execute_helper
self.executor.heartbeat()
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/executors/base_executor.py", line 124, in heartbeat
self.execute_async(key, command=command, queue=queue)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 80, in execute_async
args=[command], queue=queue)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/task.py", line 573, in apply_async
**dict(self._get_exec_options(), **options)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/base.py", line 354, in send_task
reply_to=reply_to or self.oid, **options
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/celery/app/amqp.py", line 310, in publish_task
**kwargs
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/messaging.py", line 172, in publish
routing_key, mandatory, immediate, exchange, declare)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/connection.py", line 449, in _ensured
return fun(*args, **kwargs)
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/kombu/messaging.py", line 188, in _publish
mandatory=mandatory, immediate=immediate,
File "/home/yurii/Tools/anaconda3/lib/python3.6/site-packages/librabbitmq/init.py", line 122, in basic_publish
mandatory or False, immediate or False,
TypeError: an integer is required (got type NoneType)
Some additional information:
I am using Airflow 1.8.0 along with Celery 3.1.25 and RabbitMQ 3.5.7 as a broker and backend, but also tried Airflow 1.9.0 with Celery 4.2.
Airflow with the sequential executor works without any problems.
airflow test "dag_name" "task_name" "exec_date" runs successfully.
I am new to Airflow/Celery/RabbitMQ/SQL, so any help would be appreciated!
It seems you are using librabbitmq as the AMQP library, which is not recommended by the Celery core team. Use py-amqp as the RabbitMQ transport instead and you should get rid of this error.
To add to the previous answer: switching to py-amqp means either changing broker_url = amqp://XXXXX to broker_url = pyamqp://XXXXX, or running pip uninstall librabbitmq.
Additionally, you may need to rename the celery_result_backend variable to result_backend in your airflow.cfg. The celery_ prefix has been removed for variables in the [celery] section of airflow.cfg in recent versions.
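For illustration, the relevant airflow.cfg entries would then look something like the following (host, credentials, vhost and database name are placeholders, not values from this setup):
[celery]
broker_url = pyamqp://user:password@localhost:5672/myvhost
result_backend = db+mysql://airflow:airflow@localhost:3306/airflow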

Issue running psycopg2 inside AWS Lambda Function

I'm getting the following error when trying to run psycopg2 in an AWS Lambda function:
/var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size: ImportError
Traceback (most recent call last):
File "/var/task/functions/refresh_mv.py", line 64, in execute
session = SessionFactoryGraphQL.get_session(app=item['app'])
File "/var/task/lib/session_factory.py", line 22, in get_session
engine = create_engine(conn_string, poolclass=NullPool)
File "/var/task/functions/../vendored/sqlalchemy/engine/__init__.py", line 387, in create_engine
return strategy.create(*args, **kwargs)
File "/var/task/functions/../vendored/sqlalchemy/engine/strategies.py", line 80, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "/var/task/functions/../vendored/sqlalchemy/dialects/postgresql/psycopg2.py", line 554, in dbapi
import psycopg2
File "/var/task/functions/../vendored/psycopg2/__init__.py", line 50, in <module>
from psycopg2._psycopg import ( # noqa
ImportError: /var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size
The weird thing is that everything had been working fine for more than 5 months until yesterday, when it suddenly stopped working. None of the libraries has been updated.
I tried building it from scratch, as in https://github.com/jkehler/awslambda-psycopg2, but I am still getting the same error.
Can someone help me with it?
The problem is in the latest version of the Serverless framework. I assume that you are using Serverless to deploy your Lambda function. Remove the deployed service and pin the older version:
serverless remove
npm install serverless@1.20.2 -g
This should work.
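If it helps, after redeploying you can sanity-check that the bundled binary loads at all with a minimal handler along these lines (a sketch based on the vendored layout visible in the traceback, with a hypothetical handler name):
import os
import sys

# The project appears to vendor its dependencies under functions/../vendored.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "vendored"))

def handler(event, context):
    # Raises the same ELF ImportError if _psycopg.so was built for the wrong platform.
    import psycopg2
    return {"psycopg2": psycopg2.__version__}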