How can I run NLTK on App Engine or Kubernetes?

I am writing a model to predict types of text, such as names or dates, in a PDF document.
The model uses nltk.word_tokenize and nltk.pos_tag.
When I try to use this on Kubernetes on Google Cloud Platform, I get the following error:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

tokenized_word = word_tokenize('x')
tagged_word = pos_tag(['x'])
stacktrace:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/env/nltk_data'
- '/env/share/nltk_data'
- '/env/lib/nltk_data'
- ''
Obviously, downloading the data to my local machine will not solve the problem when it has to run on Kubernetes, and we do not have NFS set up on the project yet.

I ended up solving this problem by downloading the NLTK packages in an init function:
import logging
import nltk
from nltk import word_tokenize, pos_tag

LOGGER = logging.getLogger(__name__)
LOGGER.info('Catching broad nltk errors')

DOWNLOAD_DIR = '/usr/lib/nltk_data'
LOGGER.info(f'Saving files to {DOWNLOAD_DIR}')

try:
    tokenized = word_tokenize('x')
    LOGGER.info(f'Tokenized word: {tokenized}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('punkt', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')

try:
    tagged_word = pos_tag(['x'])
    LOGGER.info(f'Tagged word: {tagged_word}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('averaged_perceptron_tagger', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')
I realize that this many try/except blocks are not needed. I also specify the download directory because it seemed that, if you do not, NLTK downloads and unzips the tagger to /usr/lib and then does not look for the files there.
This downloads the files on the first run in each new pod, and they persist until the pod dies.
The error was solved on a stateless Kubernetes Deployment, which means this approach can also handle non-persistent platforms like App Engine, but it is not the most efficient, because the data has to be downloaded every time a new instance spins up.
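A tidier variant of the same idea (a sketch, not the exact code I deployed) is to probe for each resource with nltk.data.find and only download what is missing, instead of wrapping the tokenizer calls in try/except:
```python
import logging

import nltk

LOGGER = logging.getLogger(__name__)
DOWNLOAD_DIR = '/usr/lib/nltk_data'  # already on NLTK's default search path

# Resource name -> path that nltk.data.find expects once it is unzipped.
REQUIRED = {
    'punkt': 'tokenizers/punkt',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
}


def ensure_nltk_data():
    """Download missing NLTK resources once per pod; a no-op when already present."""
    for package, resource_path in REQUIRED.items():
        try:
            nltk.data.find(resource_path)
        except LookupError:
            LOGGER.info('Downloading missing NLTK resource: %s', package)
            nltk.download(package, download_dir=DOWNLOAD_DIR)
```
If you control the container image, you can avoid the per-pod download entirely by baking the data in at build time, e.g. running python -m nltk.downloader punkt averaged_perceptron_tagger -d /usr/lib/nltk_data in the Dockerfile.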

Related

Beam SDK harness still trying to launch docker when I set environment_type to be `PROCESS`

According to the beam harness documentation:
PROCESS: User code is executed by processes that are automatically started by the runner on each worker node.
args = [
    "--runner=portableRunner",
    "--streaming",
    "--sdk_worker_parallelism=2",
    "--environment_type=PROCESS",
    "--environment_config={\"command\": \"/opt/apache/beam/boot\"}",
]
consumer_config = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
    "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
    "sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    "bootstrap.servers": bootstrap_servers,
}
with beam.Pipeline(options=PipelineOptions(args)) as p:
    data = p | "Reading messages from Kafka" >> ReadFromKafka(
        consumer_config=consumer_config,
        topics=topics,
        with_metadata=True
    )
    data | 'Writing to stdout' >> beam.Map(logging.info)
But when I run the code (deployed to k8s using flinkk8soperator), it is complaining:
Caused by: java.io.IOException: Cannot run program "docker": error=2, No such file or directory
Wondering if I misunderstand anything? Thanks!
After some digging, I finally made cross-language transforms work without using DinD or DooD. Here are the steps:
Ensure both the job manager and task manager mount a shared volume for artifact staging. (This is required; otherwise the task manager will complain that it is unable to find the submitted jar.)
Ensure your Docker image can run both Java and Python Beam code; here's what I did:
# python SDK
COPY --from=apache/beam_python3.7_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam/
# java SDK
COPY --from=apache/beam_java8_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam_java/
In the job, you'll need to start the expansion service with extra args, for example for KafkaIO:
from apache_beam.io.kafka import ReadFromKafka, default_io_expansion_service

ReadFromKafka(
    consumer_config=consumer_config,
    topics=[topic],
    with_metadata=False,
    expansion_service=default_io_expansion_service(
        append_args=[
            '--defaultEnvironmentType=PROCESS',
            "--defaultEnvironmentConfig={\"command\":\"/opt/apache/beam_java/boot\"}",
        ]
    )
)
Your portable execution relies on xLang support, which requires starting a Java SDK harness with Docker. Your cluster doesn't have Docker installed.
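Putting the pieces together, a sketch of the full wiring looks like the following. Here consumer_config, bootstrap_servers and topic are assumed to be defined as in the question, and the two boot paths match the Dockerfile COPY lines above; the pipeline options select PROCESS for the Python SDK harness, while the expansion-service args do the same for the Java harness that runs the Kafka transform:
```python
import logging

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, default_io_expansion_service
from apache_beam.options.pipeline_options import PipelineOptions

args = [
    "--runner=portableRunner",
    "--streaming",
    "--sdk_worker_parallelism=2",
    "--environment_type=PROCESS",
    "--environment_config={\"command\": \"/opt/apache/beam/boot\"}",  # Python SDK harness
]

with beam.Pipeline(options=PipelineOptions(args)) as p:
    data = p | "Reading messages from Kafka" >> ReadFromKafka(
        consumer_config=consumer_config,  # same SASL/MSK dict as in the question
        topics=[topic],
        with_metadata=False,
        expansion_service=default_io_expansion_service(
            append_args=[
                "--defaultEnvironmentType=PROCESS",
                "--defaultEnvironmentConfig={\"command\":\"/opt/apache/beam_java/boot\"}",  # Java SDK harness
            ]
        ),
    )
    data | "Writing to stdout" >> beam.Map(logging.info)
```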

PyPI error when trying to install gcsfs in Google Composer (Airflow)

I use composer-1.0.0-airflow-1.9.0. I used Dask in one of my DAGs and wanted to set up Composer to use Dask. One of the required packages for this DAG is gcsfs. When I tried to install it via the web UI I got the error below:
Composer Backend timed out. Currently running tasks are [stage: CP_COMPOSER_AGENT_RUNNING description: "Composer Agent Running. Latest Agent Stage: stage: DEPLOYMENTS_UPDATED\n ." response_timestamp { seconds: 1540331648 nanos: 860000000 } ].
Update:
The error comes from this line of code when Dask tries to read a file from a GCS bucket: dd.read_csv(bucket)
Log:
[2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
[2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask: fs, fs_token = get_fs(protocol, options)
[2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
[2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask: "Need to install `gcsfs` library for Google Cloud Storage support\n"
[2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
[2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask: raise RuntimeError(error_msg)
[2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
[2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask: conda install gcsfs -c conda-forge
[2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask: or
[2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask: pip install gcsfs
When I tried to install gcsfs through the Google Composer UI using PyPI, I got the error below:
{
insertId: "17ks763f726w1i"
logName: "projects/xxxxxxxxx/logs/airflow-worker"
receiveTimestamp: "2018-10-25T15:42:24.935880717Z"
resource: {…}
severity: "ERROR"
textPayload: "Traceback (most recent call last):
File "/usr/local/bin/gcsfuse", line 7, in <module>
from gcsfs.cli.gcsfuse import main
File "/usr/local/lib/python2.7/site-
packages/gcsfs/cli/gcsfuse.py", line 3, in <module>
fuse import FUSE
ImportError: No module named fuse
"
timestamp: "2018-10-25T15:41:53Z"
}
Unfortunately, your error message doesn't mean much to me.
gcsfs is pure python code, so it is very unlikely that anything is going wrong with installing it - as is done very commonly with pip or conda. The dependency libraries are a bunch of google ones, some of which may require compilation (I don't know), so I would suggest trying to find out from logs which one is stalling and taking it up with them. On the other hand, this kind of thing can often be a network/intermittent problem, so waiting may also fix things.
For the future, I recommend basing installations around conda, which never needs to compile anything and is generally better at dependency tracking.
This has to do with the fact that Composer and Airflow have silent dependencies that are not kept in sync, so if the gcsfs installation conflicts with an Airflow dependency, we get this error. More details here. The only workarounds (other than updating to the Nov 28 release of Composer) are:
Source: Thanks to Jake Biesinger (jake.biesinger#infusionsoft.com)
Use a separate Kubernetes Pod for running various jobs, but it's a large change and requires infra we're not very familiar with (GKE).
This particular issue can also be solved by installing dbt in a PythonVirtualEnvOperator, then having the python_callable re-use the virtualenv's bin dir, something like:
```
import os
import subprocess
import sys

def _run_cmd_in_virtual_env(cmd):
    # Re-use the bin dir of the virtualenv the operator created for this task.
    subprocess.check_call(os.path.join(os.path.split(sys.argv[0])[0], cmd))

# Calls the temporarily-installed dbt binary, e.g. /tmp/virtualenv-asdasd/bin/dbt.
task = PythonVirtualEnvOperator(python_callable=_run_cmd_in_virtual_env,
                                op_args=('dbt',))
```
I haven't tried this, but this might help you out.
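For the gcsfs case specifically, the same trick can be written as a self-contained DAG. This is a sketch under a few assumptions: the class in Airflow releases is actually spelled PythonVirtualenvOperator (lowercase "env"), your Composer image is new enough to ship it, and the task name, bucket and file pattern below are made up for illustration:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator


def read_from_gcs():
    # Imports must live inside the callable: they are resolved in the task's
    # throwaway virtualenv, not in Composer's own (conflicting) environment.
    import dask.dataframe as dd
    df = dd.read_csv('gs://my-bucket/data/*.csv')  # hypothetical bucket/path
    print(df.head())


with DAG('dask_gcsfs_example',
         start_date=datetime(2018, 10, 1),
         schedule_interval=None) as dag:
    read_csv_task = PythonVirtualenvOperator(
        task_id='read_csv_from_gcs',
        python_callable=read_from_gcs,
        requirements=['dask[dataframe]', 'gcsfs'],  # installed into the isolated venv
    )
```
Because gcsfs is installed into the throwaway virtualenv rather than into Composer's shared environment, it cannot conflict with Airflow's pinned dependencies.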
In general, installing arbitrary system packages (like fuse, or whatever else turns out to be a dependency of the package you are trying to install) is not supported by Google Composer, as discussed here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!searchin/cloud-composer-discuss/sugimiyanto%7Csort:date/cloud-composer-discuss/jpxAGCPFkZo/mCx_P1LPCQAJ
However, you may be able to work around this by uploading the package folder you have installed locally (i.e. fuse) into your Google Cloud Storage bucket, for example gs://<your_bucket_name>/libs, so that it becomes a shared library location.
Then set the LD_LIBRARY_PATH environment variable in Google Composer to /home/airflow/gcs/libs, so that the dynamic linker looks for shared libraries in that directory.
Then try to reinstall gcsfs through the Composer PyPI page.

What could cause a JSPM bundled module to seek to load from jspm_packages?

On a bundled production build I see a network request for systemjs-plugin-babel#0.0.21.json (served from cache) on the same instance where the page is unable to load all the way through to my app.
It fails to load anything being routed from http://thehost/myapp/jspm_packages/npm
My understanding is that in a bundled JSPM module it shouldn't ever load from this kind of route since everything is bundled into the build.js file.
I currently bundle my application with the following gulp task:
gulp.task('jspm-bundle', plugins.shell.task([
'node node_modules/jspm/jspm.js bundle myapp/index.jsx' +
' + myapp/things/**/*.jsx + myapp/otherthings/**/*.jsx' +
' + systemjs-plugin-babel + babel-preset-stage-0 + transform-react-jsx' +
' + transform-decorators-legacy' +
' --minify --skip-source-maps'
])
I'm not sure where to start... could someone tell me a few of the reasons why a bundled app would even attempt to make this kind of request to the browser?
Update 1:
At one point we saw a promise rejection that seemed related as well for this singular client:
Unhandled promise rejection Error: Syntax Error
Instantiating http://myhost/myapp/jspm_packages/npm/systemjs-plugin-babel#0.0.21.json
Loading http://myhost/myapp/jspm_packages/npm/systemjs-plugin-babel#0.0.21.json
Unable to fetch package configuration file http://myhost/myapp/jspm_packages/npm/systemjs-plugin-babel#0.0.21.json
Resolving plugin-babel to http://myhost/myapp/app/index.jsx
Resolving myapp/index.jsx
Loading myapp/index.jsx
Update 2:
My FULL SystemJS config: https://pastebin.com/aJFPqNGn
Update 3 (last update?):
I can recreate the issue in production if I explicitly import from 'npm:systemjs-plugin-babel', but I can't yet explain why this import would occur in production at the client's installation. The syntax error occurs because the request for the non-existent file returns the login HTML, and the parser fails on the first '<' in that HTML.

volttron.platform.vip.agent.core ERROR: Possible conflicting identity

I have been working towards building my agent development skills in Volttron. I am completely new to the platform and am trying to understand how to create basic agents that publish and subscribe to the Volttron bus. I'm not alone in this venture and get help from a few other people with experience, but even they are stumped. We are using the same agent files, shared through GitHub, but the agent works on their computers and not on mine.
The publishing agent reads from a CSV file in the same directory as the agent and is supposed to publish information from the file. I have been careful to map the file directory in my source code to match my setup. I receive the following messages when I start running my publishing agent from Eclipse Mars running on Linux Mint 18.1 Serena:
2017-02-02 14:27:22,290 volttron.platform.agent.utils DEBUG: missing file /home/edward/.volttron/keystores/f9d18589-d62b-42b7-bac8-3498a0c37220/keystore.json
2017-02-02 14:27:22,290 volttron.platform.agent.utils INFO: creating file /home/edward/.volttron/keystores/f9d18589-d62b-42b7-bac8-3498a0c37220/keystore.json
2017-02-02 14:27:22,292 volttron.platform.vip.agent.core DEBUG: address: ipc://#/home/edward/.volttron/run/vip.socket
2017-02-02 14:27:22,292 volttron.platform.vip.agent.core DEBUG: identity: None
2017-02-02 14:27:22,292 volttron.platform.vip.agent.core DEBUG: agent_uuid: None
2017-02-02 14:27:22,292 volttron.platform.vip.agent.core DEBUG: severkey: None
2017-02-02 14:27:32,324 volttron.platform.vip.agent.core ERROR: No response to hello message after 10 seconds.
2017-02-02 14:27:32,324 volttron.platform.vip.agent.core ERROR: A common reason for this is a conflicting VIP IDENTITY.
2017-02-02 14:27:32,324 volttron.platform.vip.agent.core ERROR: Shutting down agent.
2017-02-02 14:27:32,324 volttron.platform.vip.agent.core ERROR: Possible conflicting identity is: f9d18589-d62b-42b7-bac8-3498a0c37220
I have done the following:
Created missing file "/home/edward/.volttron/keystores/f9d18589-d62b-42b7-bac8-3498a0c37220/keystore.json". The only thing that happens when I run the agent again is it gives me the same DEBUG message but with a different file name.
I looked into the "volttron.platform.vip.agent.core" file and have no idea what to do in there. I don't want to create more problems for myself.
I have been using Volttron's documentation to try to troubleshoot, but I always get the same message whenever I try to run any agent. I have had success when testing the platform and running "make-listener" through the terminal, but that's all.
I have been searching the web for the last couple of days and have seen similar issues, but when attempting to follow the advice posted to remedy the situation I had no luck. Error: volttron.platform.web INFO: Web server not started
Reinstalled Volttron, Mint, and Eclipse on my VM a few times to overcome any compatibility issues...
The source code for the agent is as follows:
#testcodeisforpublishingandprinting
import logging
import sys
#import json
from volttron.platform.vip.agent import Agent, Core, PubSub, compat
#from volttron.platform.vip.agent import *
#from volttron.platform.vip.agent import compat
from volttron.platform.agent import utils
from volttron.platform.messaging import headers as headers_mod
from datetime import datetime
#import numpy as NP
#from numpy import linalg as LA
import csv

outdata = open("/home/edward/volttron/testagent/Agent/PredictionfileP.csv", "rb")
Pdata = csv.DictReader(outdata)
Price = []
for row in Pdata:
    Price.append(float(row['Price'])*0.01)

#from volttron.platform.agent import BaseAgent, PublishMixin, periodic, matching, utils
#from volttron.platform.agent import BaseAgent, PublishMixin, periodic

utils.setup_logging()
_log = logging.getLogger(__name__)


class testagent1(Agent):
    def __init__(self, config_path, **kwargs):
        self.config = utils.load_config(config_path)
        super(testagent1, self).__init__(**kwargs)
        self.step = 0
        #print('TestAgent example agent start-up function')

    #Core.receiver('onsetup')
    def onsetup(self, sender, **kwargs):
        self._agent_id = self.config['agentid']

    #Core.receiver('onstart')
    def onstart(self, sender, **kwargs):
        pass

    #Core.receiver('onstop')
    def onstop(self, sender, **kwargs):
        pass

    #Core.receiver('onfinish')
    def onfinish(self, sender, **kwargs):
        pass

    #Core.periodic(5)
    def simulate(self):
        self.step = self.step + 1  # timestep increase
        print('Simulationrunning')
        now = datetime.utcnow().isoformat(' ')  # time now
        headers = {
            'AgentID': self._agent_id,
            headers_mod.CONTENT_TYPE: headers_mod.CONTENT_TYPE.PLAIN_TEXT,
            headers_mod.DATE: now,
        }
        print(self.step)
        self.vip.pubsub.publish('pubsub', 'testcase1/Step', headers, self.step)
        print('Simulationupdatingloopingindex')


def main(argv=sys.argv):
    '''Main method called by the eggsecutable.'''
    try:
        utils.vip_main(testagent1)
    except Exception as e:
        _log.exception('unhandled exception')


if __name__ == '__main__':
    # Entry point for script
    sys.exit(main())
I installed my version of Volttron using the 3.5RC1 manual published Jan. 2017.
I am assuming that you are running this from Eclipse and not through the installation process. Installing the agent would specify an identity that remains for the life of the agent.
The remainder of this answer is specific to running within the Eclipse environment.
def main(argv=sys.argv):
    '''Main method called by the eggsecutable.'''
    try:
        # This is where the change is.
        utils.vip_main(testagent1, identity='Thisidentity')
    except Exception as e:
        _log.exception('unhandled exception')
You will have to authorize the agent to be able to connect to the message bus by adding the agent publickey through the auth mechanism. Or you can add the wildcard /.*/ to the credentials of an entry through volttron-ctl auth add.
Thank you for asking this question. We are updating the documentation to highlight this.
You will need to do the following at the command line:
volttron-ctl auth add
domain []:
address []:
user_id []:
capabilities (delimit multiple entries with comma) []:
roles (delimit multiple entries with comma) []:
groups (delimit multiple entries with comma) []:
mechanism [CURVE]:
credentials []: /.*/
comments []:
enabled [True]:
added entry domain=None, address=None, mechanism='CURVE', credentials=u'/.*/', user_id='ff6fea8e-53bd-4506-8237-fbb718aca70d'

FATAL org.apache.hadoop.conf.Configuration - error parsing conf file: org.xml.sax.SAXParseException

I'm trying to run Pig locally (installed using Homebrew) to test a script. However, I get the following error when I attempt to run a simple dump from the interactive prompt (pig -x local):
2012-07-16 23:20:40,447 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
[Fatal Error] :63:85: Character reference "&#2" is an invalid XML character.
2012-07-16 23:20:40,688 [Thread-7] FATAL org.apache.hadoop.conf.Configuration - error parsing conf file: org.xml.sax.SAXParseException: Character reference "&#2" is an invalid XML character.
The same load/dump works fine on Elastic MapReduce.
I can't find any XML config files, and I've tried with both version 0.9.2 and 0.10.0
What am I missing?
Edit: Just checked a direct download (vs. homebrew) and it doesn't seem to work either
You should check that your Hadoop configuration files have correct configuration data.
Have a look in your hadoop/conf directory.
Have a look inside:
hdfs-site.xml
mapred-site.xml
core-site.xml
I finally worked out what the problem was. I ended up having to use dtruss -p on the pig/java process. This revealed a temporary directory and dynamically generated XML files. Once the temporary directory was discovered, it all fell quickly into place.
It was picking up the proxy excludes from my network connections, which had, as far as I can tell, &#2 (http://www.fileformat.info/info/unicode/char/02/index.htm) embedded in them. How this invalid value came to be in my network preferences in the first place, I haven't the faintest clue.
The value was then being pulled into dynamically generated files, for example /tmp/hadoop-vertis/mapred/staging/vertis-1005847898/.staging/job_local_0001/job.xml.
The offending lines:
<property><name>ftp.nonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
<property><name>socksNonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
<property><name>http.nonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
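If you run into something similar, one quick way to confirm which generated file contains the offending character is to scan it for characters that XML 1.0 forbids, either as raw bytes or as decimal character references like &#2;. This is a hypothetical helper, not something from the original debugging session:
```python
import re
import sys

# XML 1.0 forbids control characters below 0x20 except tab (0x09), LF (0x0A)
# and CR (0x0D), whether they appear as raw bytes or as numeric character
# references such as &#2; in a generated job.xml.
RAW_CONTROL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')
CONTROL_REFERENCE = re.compile(r'&#0*(?:[0-8]|1[12]|1[4-9]|2[0-9]|3[01]);')


def scan(path):
    """Print every line of `path` containing a character XML 1.0 rejects."""
    with open(path, errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            if RAW_CONTROL.search(line) or CONTROL_REFERENCE.search(line):
                print('%s:%d: %s' % (path, lineno, line.rstrip()))


if __name__ == '__main__':
    for config_file in sys.argv[1:]:  # e.g. the generated job.xml paths
        scan(config_file)
```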