VS Code No Module Found Error - Need help running PySpark code locally - visual-studio-code

I've been trying to switch over from PyCharm to VS Code full time, and while I've figured out most things, I'm having a hell of a time trying to run Spark jobs locally (OS X). As far as I can tell I have set up the same configuration (virtualenv and environment variables) as I had working on PyCharm. Here's the configuration I've got on VS Code (defined in launch.json):
{
    "name": "Python: spark sql query (local)",
    "type": "python",
    "request": "launch",
    "program": "${workspaceRoot}/scripts/my_script.py",
    "console": "integratedTerminal",
    "cwd": "${fileDirname}",
    "args": [
        .
        .
        .
    ],
    "terminal.integrated.env.osx": {
        "SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2"
    },
    "env": {
        "PYTHONUNBUFFERED": "1",
        "APP_NAME": "Local Script",
        "LOGFILE": "output.log",
        "SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2",
        "JAVA_HOME": "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/"
    }
},
When I run this I just get ModuleNotFound errors even though I haven't changed any other piece of code from what was working in PyCharm. Any ideas for me to try?
Edit:
Traceback (most recent call last):
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
run()
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 263, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/evan***/px_seed_model/scripts/sql_query.py", line 10, in <module>
from model import SparkModel
ModuleNotFoundError: No module named 'model'

The answer to this was to add PYTHONPATH to the env dict:
"env": {
    "PYTHONUNBUFFERED": "1",
    "APP_NAME": "Local Script",
    "LOGFILE": "output.log",
    "SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2",
    "JAVA_HOME": "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/",
    "PYTHONPATH": "/path/to/project/root:${env:PYTHONPATH}"
}
I believe this is equivalent to checking the box Add content roots to PYTHONPATH in PyCharm.
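To see what the PYTHONPATH entry changes, a quick check like the one below can be run from inside the debug session. It is only a minimal sketch: the helper name ensure_on_sys_path is hypothetical and not part of VS Code or PySpark. It mimics in code what PYTHONPATH does at interpreter start-up, which is to put the project root on sys.path so that an import such as "from model import SparkModel" can resolve.

```python
import os
import sys


def ensure_on_sys_path(project_root):
    """Put project_root at the front of sys.path if it is missing.

    Hypothetical helper for illustration; setting PYTHONPATH in launch.json
    achieves the same effect without touching the code.
    Returns True if the path had to be added, False if it was already there.
    """
    root = os.path.abspath(project_root)
    if root in (os.path.abspath(p) for p in sys.path if p):
        return False
    sys.path.insert(0, root)
    return True


# Show what the interpreter was started with, then the resulting search path.
print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<not set>"))
for entry in sys.path:
    print("  ", entry)
```

If the project root does not show up in the printed list, the env block was probably not applied to the session that ran the script.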

Related

Setting Environment Variables for VSCode Debug session

I must be missing something very obvious here, but I cannot seem to get this to work.
I want to set an environment variable FOO to be available in the VSCode debug sessions started on the current file by hitting the Debug button in the top right corner.
I tried setting the env dictionary in the launch.json file like so:
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "env": {
                "FOO": "BAR"
            }
        }
    ]
}
But when I try to read the variable in my code, I get a KeyError since the variable hasn't been set.
import os
print(os.environ["FOO"])
yields this:
Traceback (most recent call last):
File ".../save_model.py", line 74, in <module>
print(os.environ["FOO"])
File ".../lib/python3.10/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'FOO'

Trying to run the crawler with a Docker image

I'm trying to run the crawler with a Docker image, but it returns this error.
PS C:\Users\Santosgab\Desktop\documentation> docker run -it --env-file=.env -e "CONFIG=$(cat config.json | jq -r tostring)" algolia/docsearch-scraper
Traceback (most recent call last):
File "/root/src/config/config_loader.py", line 101, in _load_config
data = json.loads(config, object_pairs_hook=OrderedDict)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I used the config provided by Algolia and made only small changes to run it in my project; my config.json has no problems.
{
    "index_name": "Sequor",
    "sitemap_urls": ["http://localhost:3000/sitemap.xml"],
    "sitemap_alternate_links": true,
    "stop_urls": ["/tests"],
    "selectors": {
        "lvl0": {
            "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
            "type": "xpath",
            "global": true,
            "default_value": "Documentation"
        },
        "lvl1": "header h1",
        "lvl2": "article h2",
        "lvl3": "article h3",
        "lvl4": "article h4",
        "lvl5": "article h5, article td:first-child",
        "lvl6": "article h6",
        "text": "article p, article li, article td:last-child"
    },
    "strip_chars": " .,;:#",
    "custom_settings": {
        "separatorsToIndex": "_",
        "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
        "attributesToRetrieve": [
            "hierarchy",
            "content",
            "anchor",
            "url",
            "url_without_anchor",
            "type"
        ]
    },
    "conversation_id": ["833762294"],
    "nb_hits": 46250
}
Two things I notice:
There's no path on your config file. Can you try it with a path (i.e. ./config.json)?
Your config is missing any start_urls to serve as entry points for crawling.
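The second point, and JSON validity in general, can be checked before starting the container. The sketch below is only an illustration: check_config is a hypothetical helper, and the only rules it encodes are that the text parses as a JSON object and lists at least one start_urls entry. It does not reproduce the scraper's own validation.

```python
import json


def check_config(text):
    """Return a list of problems found in a DocSearch-style config string.

    Hypothetical pre-flight check; it only covers JSON validity and the
    presence of a non-empty start_urls entry.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: {}".format(exc)]
    if not isinstance(config, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    if not config.get("start_urls"):
        problems.append("missing start_urls")
    return problems


# A config without start_urls, shortened from the one in the question.
sample = '{"index_name": "Sequor", "sitemap_urls": ["http://localhost:3000/sitemap.xml"]}'
print(check_config(sample))  # ['missing start_urls']

# A string with unquoted keys is reported as a JSON error.
print(check_config("{index_name: Sequor}"))
```

Running the same check on the exact string that ends up in the CONFIG variable would show whether the scraper receives the JSON intact.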

Unable to use environment variables in VS Code launch configuration

I have the following in my workspace settings.json file:
"terminal.integrated.env.osx": {
    "AUTH_TOKEN": "secret_XXXXXX"
}
However, when trying to pass this via a launch command (defined in launch.json):
{
    "name": "Example: Query",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/examples/query.py",
    "args": [ "${env:AUTH_TOKEN}" ]
}
The resulting command contains an empty string for the argument:
/usr/bin/env /.../.venv/bin/python /.../debugpy/launcher 58644 -- /.../examples/query.py ""
However, if I print the variable from within the script, it is set properly.
I believe there is an ordering issue, such that the launch.json commands are generated before the terminal environment is set up - resulting in empty vars. Any ideas how to propagate the env value to the command line?
Update: I have also tried using a .env file for the variables (rather than settings.json), but the result is the same.
Try using "env" in launch.json:
{
    "name": "Example: Query",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/examples/query.py",
    "args": ["${AUTH_TOKEN}"], // using var from env on args
    "env": {
        "AUTH_TOKEN": "XXXX",
        "ENV2": "XXX"
    }
}
You can also load the variables from a file:
{
    // ...
    "args": ["${AUTH_TOKEN}"],
    "envFile": "${workspaceFolder}/local.env"
}
You can create a .env file, and then put the variable in there, and read it from the environmental variables in the program instead of it being an argument.
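That last suggestion might look like the sketch below. It assumes the variable is called AUTH_TOKEN, as in the question. The function name get_auth_token and the fallback to the first command-line argument are illustrative choices, not something the original query.py is known to do.

```python
import os
import sys


def get_auth_token(environ=None, argv=None):
    """Return the token from the environment, falling back to argv[1].

    Hypothetical helper: prefers AUTH_TOKEN from the environment (set via
    "env" or "envFile" in launch.json) and only then looks at the first
    command-line argument. Raises RuntimeError if neither is usable.
    """
    environ = os.environ if environ is None else environ
    argv = sys.argv if argv is None else argv
    token = environ.get("AUTH_TOKEN", "")
    if not token and len(argv) > 1:
        token = argv[1]
    if not token:
        raise RuntimeError("AUTH_TOKEN is not set and no argument was given")
    return token
```

Calling get_auth_token() at the top of the script means the empty "" argument produced by the launch command no longer matters, as long as the variable reaches the process environment.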

Unable to run airflow scheduler

I recently installed Airflow on an AWS server by using this guide for Ubuntu 16.04. After a painful but successful install, I started the webserver. I tried a sample DAG as follows:
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import timedelta
from airflow import DAG
import airflow

# DEFAULT ARGS
default_args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'depends_on_past': False}

dag = DAG('init_run', default_args=default_args, description='DAG SAMPLE',
          schedule_interval='@daily')


def print_something():
    print("HELLO AIRFLOW!")


with dag:
    task_1 = PythonOperator(task_id='do_it', python_callable=print_something)
    task_2 = DummyOperator(task_id='dummy')
    task_1 << task_2
But when I open the UI, the tasks in the DAG still show "No Status", no matter how many times I trigger it manually or refresh the page.
Later I found out that the Airflow scheduler is not running and shows the following error:
{celery_executor.py:228} ERROR - Error sending Celery task:No module named 'MySQLdb'
Celery Task ID: ('init_run', 'dummy', datetime.datetime(2019, 5, 30, 18, 0, 24, 902499, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/executors/celery_executor.py", line 118, in send_task_to_executor
result = task.apply_async(args=[command], queue=queue)
File "/usr/local/lib/python3.7/site-packages/celery/app/task.py", line 535, in apply_async
**options
File "/usr/local/lib/python3.7/site-packages/celery/app/base.py", line 728, in send_task
amqp.send_task_message(P, name, message, **options)
File "/usr/local/lib/python3.7/site-packages/celery/app/amqp.py", line 552, in send_task_message
**properties
File "/usr/local/lib/python3.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/usr/local/lib/python3.7/site-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kombu/messaging.py", line 194, in _publish
[maybe_declare(entity) for entity in declare]
File "/usr/local/lib/python3.7/site-packages/kombu/messaging.py", line 194, in <listcomp>
[maybe_declare(entity) for entity in declare]
File "/usr/local/lib/python3.7/site-packages/kombu/messaging.py", line 102, in maybe_declare
return maybe_declare(entity, self.channel, retry, **retry_policy)
File "/usr/local/lib/python3.7/site-packages/kombu/common.py", line 121, in maybe_declare
return _maybe_declare(entity, channel)
File "/usr/local/lib/python3.7/site-packages/kombu/common.py", line 145, in _maybe_declare
entity.declare(channel=channel)
File "/usr/local/lib/python3.7/site-packages/kombu/entity.py", line 608, in declare
self._create_queue(nowait=nowait, channel=channel)
File "/usr/local/lib/python3.7/site-packages/kombu/entity.py", line 617, in _create_queue
self.queue_declare(nowait=nowait, passive=False, channel=channel)
File "/usr/local/lib/python3.7/site-packages/kombu/entity.py", line 652, in queue_declare
nowait=nowait,
File "/usr/local/lib/python3.7/site-packages/kombu/transport/virtual/base.py", line 531, in queue_declare
self._new_queue(queue, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kombu/transport/sqlalchemy/__init__.py", line 82, in _new_queue
self._get_or_create(queue)
File "/usr/local/lib/python3.7/site-packages/kombu/transport/sqlalchemy/__init__.py", line 70, in _get_or_create
obj = self.session.query(self.queue_cls) \
File "/usr/local/lib/python3.7/site-packages/kombu/transport/sqlalchemy/__init__.py", line 65, in session
_, Session = self._open()
File "/usr/local/lib/python3.7/site-packages/kombu/transport/sqlalchemy/__init__.py", line 56, in _open
engine = self._engine_from_config()
File "/usr/local/lib/python3.7/site-packages/kombu/transport/sqlalchemy/__init__.py", line 51, in _engine_from_config
return create_engine(conninfo.hostname, **transport_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/__init__.py", line 443, in create_engine
return strategy.create(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 87, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/dialects/mysql/mysqldb.py", line 104, in dbapi
return __import__("MySQLdb")
ModuleNotFoundError: No module named 'MySQLdb'
Here is the setting in the config file (airflow.cfg):
sql_alchemy_conn = postgresql+psycopg2://airflow@localhost:5432/airflow
broker_url = sqla+mysql://airflow:airflow@localhost:3306/airflow
result_backend = db+postgresql://airflow:airflow@localhost/airflow
I have been struggling with this issue for two days now. Please help.
In your airflow.cfg, there should also be a config option for celery_result_backend. Are you able to let us know what this value is set to? If it is not present in your config, set it to the same value as the result_backend
i.e.:
celery_result_backend = db+postgresql://airflow:airflow@localhost/airflow
And then restart the airflow stack to ensure the configuration changes apply.
(I wanted to leave this as a comment but don't have enough rep to do so)
I think the guide you are following didn't tell you to install MySQL, yet you are using it in the broker URL.
You can install MySQL and then configure it (for Python 3.5+):
pip install mysqlclient
Alternatively, for a quick fix, you can use RabbitMQ (a message broker, which you need in order to run Airflow DAGs with Celery) with the guest user login.
Your broker_url will then be:
broker_url = amqp://guest:guest@localhost:5672//
If it is not already installed, RabbitMQ can be installed with the following command:
sudo apt install rabbitmq-server
Set NODE_IP_ADDRESS=0.0.0.0 in the configuration file located at
/etc/rabbitmq/rabbitmq-env.conf
Then start the RabbitMQ service:
sudo service rabbitmq-server start
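Before restarting the scheduler, it can help to confirm which Python driver a given connection URL needs and whether that module can be imported. The following is a rough sketch: driver_for and its small lookup tables are hypothetical and only cover the URL schemes that appear in this thread. The module names MySQLdb and psycopg2 are the ones SQLAlchemy imports by default for the mysql and postgresql dialects.

```python
import importlib.util
from urllib.parse import urlsplit

# Default driver modules for the SQL dialects seen in this thread.
DEFAULT_DRIVERS = {"mysql": "MySQLdb", "postgresql": "psycopg2"}
# Explicit driver names whose importable module is spelled differently.
DRIVER_MODULES = {"mysqldb": "MySQLdb"}


def driver_for(url):
    """Return the Python module a connection URL is expected to import.

    Hypothetical helper. Strips prefixes such as "sqla+" or "db+" and
    honours an explicit driver such as "postgresql+psycopg2". Returns None
    for schemes it does not know (for example "amqp").
    """
    parts = urlsplit(url).scheme.split("+")
    if parts[0] in ("sqla", "db"):
        parts = parts[1:]
    if not parts:
        return None
    if len(parts) > 1:
        return DRIVER_MODULES.get(parts[1], parts[1])
    return DEFAULT_DRIVERS.get(parts[0])


# Report whether the driver behind each URL is importable on this machine.
for url in (
    "sqla+mysql://airflow:airflow@localhost:3306/airflow",
    "db+postgresql://airflow:airflow@localhost/airflow",
):
    module = driver_for(url)
    found = module is not None and importlib.util.find_spec(module) is not None
    print(url.split("://")[0], "->", module, "installed" if found else "missing")
```

A "missing" result for the broker_url line corresponds to the ModuleNotFoundError in the traceback, which either installing mysqlclient or switching the broker to RabbitMQ would avoid.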

MongoDB Heroku - not authorized on chatterbot-database to execute command { createIndexes

I provisioned a MongoDB for my Heroku app and I can connect to the database fine (using a third-party MongoDB UI tool). I created a new user and got the following message back:
Successfully added user: {
    "user" : "username",
    "roles" : [
        {
            "role" : "dbAdmin",
            "db" : "heroku_340171"
        }
    ]
}
However, when running my Django app I get the following error:
File "g:\Git\ChatterbotTest\chatterbottest\urls.py", line 3, in <module>
from views import FbView
File "g:\Git\ChatterbotTest\chatterbottest\views.py", line 33, in <module>
database_uri="mongodb://username:password@ds23123.mlab.com:13938/heroku_340171"
File "g:\Python\lib\site-packages\chatterbot\chatterbot.py", line 37, in __init__
self.storage = utils.initialize_class(storage_adapter, **kwargs)
File "g:\Python\lib\site-packages\chatterbot\utils.py", line 33, in initialize_class
return Class(**kwargs)
File "g:\Python\lib\site-packages\chatterbot\storage\mongodb.py", line 102, in __init__
self.statements.create_index('text', unique=True)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 1529, in create_index
self.__create_index(keys, kwargs)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 1430, in __create_index
parse_write_concern_error=True)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 232, in _command
collation=collation)
File "g:\Python\lib\site-packages\pymongo\pool.py", line 419, in command
collation=collation)
File "g:\Python\lib\site-packages\pymongo\network.py", line 116, in command
parse_write_concern_error=parse_write_concern_error)
File "g:\Python\lib\site-packages\pymongo\helpers.py", line 210, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: not authorized on chatterbot-database to execute command { createIndexes: "statements", indexes: [ { unique: true, name: "text_1", key: { text: 1 } } ] }
dbAdmin should have the permissions to run createIndexes. What's going wrong?