Trying to run the crawler with Docker image - Algolia

I'm trying to run the crawler with the Docker image, but it returns this error:
PS C:\Users\Santosgab\Desktop\documentation> docker run -it --env-file=.env -e "CONFIG=$(cat config.json | jq -r tostring)" algolia/docsearch-scraper
Traceback (most recent call last):
File "/root/src/config/config_loader.py", line 101, in _load_config
data = json.loads(config, object_pairs_hook=OrderedDict)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I used the config provided by Algolia and only made some small changes to run it in my project. As far as I can tell, my config.json has no problem:
{
"index_name": "Sequor",
"sitemap_urls": ["http://localhost:3000/sitemap.xml"],
"sitemap_alternate_links": true,
"stop_urls": ["/tests"],
"selectors": {
"lvl0": {
"selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
"type": "xpath",
"global": true,
"default_value": "Documentation"
},
"lvl1": "header h1",
"lvl2": "article h2",
"lvl3": "article h3",
"lvl4": "article h4",
"lvl5": "article h5, article td:first-child",
"lvl6": "article h6",
"text": "article p, article li, article td:last-child"
},
"strip_chars": " .,;:#",
"custom_settings": {
"separatorsToIndex": "_",
"attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
"attributesToRetrieve": [
"hierarchy",
"content",
"anchor",
"url",
"url_without_anchor",
"type"
]
},
"conversation_id": ["833762294"],
"nb_hits": 46250
}

Two things I notice:
There's no path on your config file. Can you try it with an explicit path (i.e. ./config.json)?
Your config is missing start_urls, which the crawler needs as entry points.
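To make the second point and the error message itself more concrete, here is a minimal Python sketch. It is not part of the scraper, and `check_config` is a hypothetical helper. It parses a CONFIG string with `json.loads`, as the traceback shows the scraper doing, and then checks for `start_urls`. It also shows that the error in the question is what `json.loads` reports when the character right after the opening brace is not a double quote. That suggests the string that reached the container may not be identical to the file on disk, for example because the shell stripped the quotes (an assumption, not confirmed by the question).

```python
import json


def check_config(raw):
    """Return a list of problems found in a raw CONFIG string (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Same exception type as in the traceback from the question.
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    if not data.get("start_urls"):
        # The answer above points out that start_urls is missing.
        problems.append("missing start_urls")
    return problems


if __name__ == "__main__":
    # Keys without double quotes reproduce the error position from the question.
    print(check_config("{index_name: Sequor}"))
    # Valid JSON, but no entry points for the crawler.
    print(check_config('{"index_name": "Sequor"}'))
    # Valid JSON with a (made-up) start URL.
    print(check_config('{"index_name": "Sequor", "start_urls": ["http://localhost:3000/"]}'))
```

Printing the value of CONFIG before passing it to docker run and feeding it through a check like this would show whether the JSON survives the shell intact.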

Related

How can I correct this error with AWS CloudFormation template

Team, I'm stressing out because I cannot find the errors with the following JSON script I'm trying to run in AWS Cloudformation; I'm receiving the following error:
(Cannot render the template because of an error.: YAMLException: end of the stream or a document separator is expected at line 140, column 65: ... e" content="{"version": "4", "rollouts& ... ^
<meta name="optimizely-datafile" content="{"version": "4", "rollouts": [], "typedAudiences": [], "anonymizeIP": true, "projectId":
Please help!!!

VS Code No Module Found Error - Need help running PySpark code locally

I've been trying to switch over from PyCharm to VS Code full time, and while I've figured out most things, I'm having a hell of a time trying to run Spark jobs locally (OS X). As far as I can tell I have set up the same configuration (virtualenv and environment variables) as I had working on PyCharm. Here's the configuration I've got on VS Code (defined in launch.json):
{
"name": "Python: spark sql query (local)",
"type": "python",
"request": "launch",
"program": "${workspaceRoot}/scripts/my_script.py",
"console": "integratedTerminal",
"cwd": "${fileDirname}",
"args": [
.
.
.
],
"terminal.integrated.env.osx": {
"SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2"
},
"env": {
"PYTHONUNBUFFERED": "1",
"APP_NAME": "Local Script",
"LOGFILE": "output.log",
"SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2",
"JAVA_HOME": "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/"
}
},
When I run this I just get ModuleNotFound errors even though I haven't changed any other piece of code from what was working in PyCharm. Any ideas for me to try?
Edit:
Traceback (most recent call last):
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
run()
File "/Users/evan***/.vscode/extensions/ms-python.python-2021.9.1191016588/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 263, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/evan***/px_seed_model/scripts/sql_query.py", line 10, in <module>
from model import SparkModel
ModuleNotFoundError: No module named 'model'
The answer to this was to add PYTHONPATH to the env dict:
"env": {
"PYTHONUNBUFFERED": "1",
"APP_NAME": "Local Script",
"LOGFILE": "output.log",
"SPARK_HOME": "/usr/local/spark-3.1.2-bin-hadoop3.2",
"JAVA_HOME": "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/",
"PYTHONPATH": "/path/to/project/root:$PYTHONPATH"
}
I believe this is equivalent to checking the box Add content roots to PYTHONPATH in PyCharm.
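For anyone who wants to see the effect in isolation, here is a small sketch of why the PYTHONPATH entry matters. It is only an illustration: `can_import`, `demo` and the module name `demo_model_xyz` are made up, and the temporary directory stands in for the project root. It starts a fresh interpreter from an unrelated working directory and tries the import with and without PYTHONPATH pointing at the directory that contains the module.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, pythonpath=None):
    """Try `import module_name` in a fresh interpreter; return True on success."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start without any inherited PYTHONPATH
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath  # what the "env" block in launch.json sets
    with tempfile.TemporaryDirectory() as neutral_cwd:
        # Run from an unrelated directory so the current directory
        # cannot make the module importable by accident.
        result = subprocess.run(
            [sys.executable, "-c", f"import {module_name}"],
            env=env,
            cwd=neutral_cwd,
            capture_output=True,
        )
    return result.returncode == 0


def demo():
    """Return (importable without PYTHONPATH, importable with PYTHONPATH)."""
    with tempfile.TemporaryDirectory() as project_root:
        # Hypothetical module standing in for the project's model.py.
        with open(os.path.join(project_root, "demo_model_xyz.py"), "w") as fh:
            fh.write("class SparkModel:\n    pass\n")
        without = can_import("demo_model_xyz")
        with_path = can_import("demo_model_xyz", pythonpath=project_root)
    return without, with_path


if __name__ == "__main__":
    print(demo())
```

The import only succeeds once the directory holding the module is on PYTHONPATH, which matches the ModuleNotFoundError for 'model' in the traceback above.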

use jq to replace cloud formation parameter value

I am trying to set the ParameterValue of the entry whose ParameterKey is Project to test in the CloudFormation parameters file below:
[{
"ParameterKey": "Project",
"ParameterValue": "<changeMe>"
},
{
"ParameterKey": "DockerInstanceType",
"ParameterValue": "m3.medium"
}]
I am trying to execute the jq command below:
cat config.json |
jq "map(if .ParameterKey == "Project"
then . + {\"ParameterValue\":\"test\"}
else .
end)" > populated_config.json
I am getting the error below:
jq: error: Project/0 is not defined at <top-level>, line 1:
map(if .ParameterKey == Project
jq: 1 compile error
You're prematurely closing the string passed to jq by not escaping the quotes for "Project" in the equality.
You can simplify by enclosing the expression with single quotes, and no escaping is necessary:
$ cat config.json | jq 'map(if .ParameterKey == "Project" then . + {"ParameterValue":"test"} else . end)'
[
{
"ParameterKey": "Project",
"ParameterValue": "test"
},
{
"ParameterKey": "DockerInstanceType",
"ParameterValue": "m3.medium"
}
]
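The quoting problem can also be reproduced without jq. The sketch below uses Python's `shlex.split`, which approximates POSIX shell word splitting (it is not the shell itself), to show what program text jq would receive in each case. The command lines are shortened versions of the ones above, and `jq_program` is a made-up helper.

```python
import shlex

# Double-quoted program with unescaped inner quotes, as in the question.
broken = 'jq "map(if .ParameterKey == "Project" then . else . end)"'

# Single-quoted program, as in the answer.
fixed = """jq 'map(if .ParameterKey == "Project" then . else . end)'"""


def jq_program(command_line):
    """Return the jq program argument roughly as a POSIX shell would pass it."""
    return shlex.split(command_line)[1]


if __name__ == "__main__":
    # The inner quotes end and reopen the string, so they disappear:
    print(jq_program(broken))
    # The single quotes keep the inner double quotes intact:
    print(jq_program(fixed))
```

In the broken case jq sees a bare Project, which it tries to resolve as a function with zero arguments, hence "Project/0 is not defined".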

MongoDB Heroku - not authorized on chatterbot-database to execute command { createIndexes

I provisioned a MongoDB for my Heroku app and I can connect to the database fine (using a third-party MongoDB UI tool). I created a new user and got the following message back:
Successfully added user: {
"user" : "username",
"roles" : [
{
"role" : "dbAdmin",
"db" : "heroku_340171"
}
]
}
However, when running my Django app I get the following error:
File "g:\Git\ChatterbotTest\chatterbottest\urls.py", line 3, in <module>
from views import FbView
File "g:\Git\ChatterbotTest\chatterbottest\views.py", line 33, in <module>
database_uri="mongodb://username:password@ds23123.mlab.com:13938/heroku_340171"
File "g:\Python\lib\site-packages\chatterbot\chatterbot.py", line 37, in __init__
self.storage = utils.initialize_class(storage_adapter, **kwargs)
File "g:\Python\lib\site-packages\chatterbot\utils.py", line 33, in initialize_class
return Class(**kwargs)
File "g:\Python\lib\site-packages\chatterbot\storage\mongodb.py", line 102, in __init__
self.statements.create_index('text', unique=True)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 1529, in create_index
self.__create_index(keys, kwargs)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 1430, in __create_index
parse_write_concern_error=True)
File "g:\Python\lib\site-packages\pymongo\collection.py", line 232, in _command
collation=collation)
File "g:\Python\lib\site-packages\pymongo\pool.py", line 419, in command
collation=collation)
File "g:\Python\lib\site-packages\pymongo\network.py", line 116, in command
parse_write_concern_error=parse_write_concern_error)
File "g:\Python\lib\site-packages\pymongo\helpers.py", line 210, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: not authorized on chatterbot-database to execute command { createIndexes: "statements", indexes: [ { unique: true, name: "text_1", key: { text: 1 } } ] }
dbAdmin should have the permissions to run createIndexes. What's going wrong?

Mongo connector unable to connect to mongos

I am connecting to mongo with a user with clusterAdmin and backup roles, but I get the error:
2017-02-09 17:51:23,254 [ERROR] mongo_connector.util:96 - Fatal Exception
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 94, in wrapped
func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/mongo_connector/connector.py", line 370, in run
'listShards')['shards']:
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 78, in retry_until_ok
return func(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/pymongo/database.py", line 494, in command
codec_options, **kwargs)
File "/usr/lib64/python2.7/site-packages/pymongo/database.py", line 406, in _command
parse_write_concern_error=parse_write_concern_error)
File "/usr/lib64/python2.7/site-packages/pymongo/pool.py", line 419, in command
collation=collation)
File "/usr/lib64/python2.7/site-packages/pymongo/network.py", line 116, in command
parse_write_concern_error=parse_write_concern_error)
File "/usr/lib64/python2.7/site-packages/pymongo/helpers.py", line 210, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
OperationFailure: not authorized on admin to execute command { listShards: 1 }
The "Required Permissions" section of this page says: "The simplest way to get mongo-connector running is to create a user with the backup role":
https://github.com/mongodb-labs/mongo-connector/wiki/Usage-with-Authentication
db.getSiblingDB("admin").createUser({ user:"backup",pwd:"password_here", roles: ["backup"] })
But I can't even connect with such a user (authentication error):
2017-02-10 16:52:01,448 [ERROR] mongo_connector.util:96 - Fatal Exception
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 94, in wrapped
func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/mongo_connector/connector.py", line 398, in run
hosts, replicaSet=repl_set)
File "/usr/lib/python2.7/site-packages/mongo_connector/connector.py", line 299, in create_authed_client
client['admin'].authenticate(self.auth_username, self.auth_key)
File "/usr/lib64/python2.7/site-packages/pymongo/database.py", line 1048, in authenticate
connect=True)
File "/usr/lib64/python2.7/site-packages/pymongo/mongo_client.py", line 505, in _cache_credentials
sock_info.authenticate(credentials)
File "/usr/lib64/python2.7/site-packages/pymongo/pool.py", line 523, in authenticate
auth.authenticate(credentials, self)
File "/usr/lib64/python2.7/site-packages/pymongo/auth.py", line 470, in authenticate
auth_func(credentials, sock_info)
File "/usr/lib64/python2.7/site-packages/pymongo/auth.py", line 450, in _authenticate_default
return _authenticate_scram_sha1(credentials, sock_info)
File "/usr/lib64/python2.7/site-packages/pymongo/auth.py", line 201, in _authenticate_scram_sha1
res = sock_info.command(source, cmd)
File "/usr/lib64/python2.7/site-packages/pymongo/pool.py", line 419, in command
collation=collation)
File "/usr/lib64/python2.7/site-packages/pymongo/network.py", line 116, in command
parse_write_concern_error=parse_write_concern_error)
File "/usr/lib64/python2.7/site-packages/pymongo/helpers.py", line 210, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
OperationFailure: Authentication failed.
When I log into mongos with both these users and run the command
db.getSiblingDB("admin").runCommand( { listShards: 1 } )
I get a shard listing no probs
{
"shards" : [
{
"_id" : "shard001",
"host" : "shard001/timgrhlmdb01:27020,timgrhlmdb02:27020",
"state" : 1
},
{
"_id" : "shard002",
"host" : "shard002/timgrhlmdb03:27020,timgrhlmdb04:27020",
"state" : 1
}
],
"ok" : 1
}
So what does this mean:
OperationFailure: not authorized on admin to execute command { listShards: 1 }
Update
I rebuilt the cluster from scratch and still have the same problem: OperationFailure: not authorized on admin to execute command { listShards: 1 }
I have also tried the user 'backup' with only the roles 'clusterManager' and 'readAnyDatabase'. This allows the user to list shards, but now mongo-connector fails with 'Authentication failed':
{ "_id" : "admin.backup", "user" : "backup", "db" : "admin", "credentials" : { "SCRAM-SHA-1" : { "iterationCount" : 10000, "salt" : "pWcEU7uFqfHPgGe8z+E9Wg==", "storedKey" : "k2tapXQPtM2dHlxYnJiWVxO/rtg=", "serverKey" : "EGG8M4i27OYBy+fLYaL13+Nn4mc=" } }, "roles" : [ { "role" : "readAnyDatabase", "db" : "admin" }, { "role" : "clusterManager", "db" : "admin" } ] }
Check out users by running this command:
db.system.users.find({})
Make sure that the user you created has the backup role. If you can log in as the backup user and run those commands, the user was created and has been granted the role and its privileges.
Make sure that the user also has the clusterManager role, which is described as follows:
Provides management and monitoring actions on the cluster. A user with
this role can access the config and local databases, which are used in
sharding and replication, respectively.
Provides the following actions on the cluster as a whole:
addShard
appendOplogNote
applicationMessage
cleanupOrphaned
flushRouterConfig
listShards
removeShard
etc
Have a look at built-in-roles.
By the way, have a look at this issue. Hope this helps.
Response from bug submitted to mongodb-labs/mongo-connector:
This is indeed a subtle bug introduced in #563. We changed a find on
config.shards into a call to listShards assuming that it would have no
change in behavior. Unfortunately (and annoyingly), the backup role
has privileges to read the list of shards in the config.shards
collection but, as you can see, does not have the privilege to run the
listShards command. I'll revert this change to fix the problem in
the upcoming 2.5.1 bug-fix release.
In the meantime, you will need to grant the mongo-connector user the
backup AND clusterMonitor roles.
An important point that is not yet mentioned in the documentation is
that the user must be created on a mongos and all the shards. This
enables mongo-connector to authenticate to the cluster as a whole and
to each shard individually.
This now works! yay
That will teach me for following the manual lol!
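For reference, here is a rough sketch of the workaround from the quoted response: a user with both the backup and clusterMonitor roles. It only builds the createUser command document and prints it; it does not connect to anything. `build_create_user_command` is a made-up helper, and the user name and password are the placeholders from the wiki snippet above.

```python
import json


def build_create_user_command(user, password):
    """Build a createUser command document with the two roles from the workaround."""
    return {
        "createUser": user,  # the command name has to be the first key
        "pwd": password,
        "roles": [
            {"role": "backup", "db": "admin"},
            {"role": "clusterMonitor", "db": "admin"},
        ],
    }


if __name__ == "__main__":
    cmd = build_create_user_command("backup", "password_here")
    # Assumption: with PyMongo this document could be sent via
    # client["admin"].command(cmd). Per the quoted response, it would have
    # to be run against a mongos and against every shard.
    print(json.dumps(cmd, indent=2))
```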