Loading a MongoDB dump (*.bson.gz) file using PySpark - pyspark

I am getting a snapshot of MongoDB with a .bson.gz extension in an S3 location.
I am trying to read and load the file using PySpark, but it is giving me an error. I am
using a Jupyter notebook for this activity. Has anyone handled anything like this? Please suggest.
from pyspark import SparkContext, SparkConf
sc.install_pypi_package("pymongo==3.2.2")
import pymongo_spark
pymongo_spark.activate()
conf = SparkConf().setAppName("pyspark-bson")
file_path = "s3://location/users.bson.gz"
bsonFileRdd = sc.BSONFileRDD(file_path)
bsonFileRdd.take(5)
Error
An error was encountered:
No module named 'pymongo_spark'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'pymongo_spark'
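For what it's worth, one possible workaround (a minimal sketch, not part of the original question; the bucket and key names are placeholders) is to skip pymongo_spark and decode the dump with the bson package that ships with pymongo, then hand the documents to Spark. Note this reads the whole dump on the driver, so it only suits files that fit in memory:
import gzip
import boto3
import bson  # ships with pymongo
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-bson").getOrCreate()

# Download the dump locally; bucket/key are placeholders for the real S3 location
s3 = boto3.client("s3")
s3.download_file("location", "users.bson.gz", "/tmp/users.bson.gz")

# Decompress and decode the BSON documents, then build a DataFrame
with gzip.open("/tmp/users.bson.gz", "rb") as f:
    docs = bson.decode_all(f.read())

df = spark.createDataFrame(docs)
df.show(5)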

Related

Pyspark: SPARK_HOME may not be configured correctly

I'm trying to run pyspark using a notebook in a conda environment.
$ which python
inside the environment 'env', returns:
/Users/<username>/anaconda2/envs/env/bin/python
and outside the environment:
/Users/<username>/anaconda2/bin/python
My .bashrc file has:
export PATH="/Users/<username>/anaconda2/bin:$PATH"
export JAVA_HOME=`/usr/libexec/java_home`
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2
export PYTHONPATH=$SPARK_HOME/libexec/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
But still, when I run:
import findspark
findspark.init()
I'm getting the error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
Any ideas?
Full traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
142 try:
--> 143 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
144 except IndexError:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
/var/folders/dx/dfb8h2h925l7vmm7y971clpw0000gn/T/ipykernel_72686/1796740182.py in <module>
1 import findspark
2
----> 3 findspark.init()
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
144 except IndexError:
145 raise Exception(
--> 146 "Unable to find py4j, your SPARK_HOME may not be configured correctly"
147 )
148 sys.path[:0] = [spark_python, py4j]
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
EDIT:
If I run the following in the notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
I get the error:
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0
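One thing worth checking (a sketch, assuming the Homebrew layout shown above): Homebrew keeps Spark's python/ and bin/ directories under libexec, so pointing SPARK_HOME at that subdirectory lets findspark locate py4j:
import os
import findspark

# Homebrew puts python/lib/py4j-*.zip under libexec, not under the formula root (assumed layout)
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/3.1.2/libexec"
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()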

Issue running psycopg2 inside AWS Lambda Function

I'm getting the following error when trying to run psycopg2 in an AWS Lambda:
/var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size: ImportError
Traceback (most recent call last):
File "/var/task/functions/refresh_mv.py", line 64, in execute
session = SessionFactoryGraphQL.get_session(app=item['app'])
File "/var/task/lib/session_factory.py", line 22, in get_session
engine = create_engine(conn_string, poolclass=NullPool)
File "/var/task/functions/../vendored/sqlalchemy/engine/__init__.py", line 387, in create_engine
return strategy.create(*args, **kwargs)
File "/var/task/functions/../vendored/sqlalchemy/engine/strategies.py", line 80, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "/var/task/functions/../vendored/sqlalchemy/dialects/postgresql/psycopg2.py", line 554, in dbapi
import psycopg2
File "/var/task/functions/../vendored/psycopg2/__init__.py", line 50, in <module>
from psycopg2._psycopg import ( # noqa
ImportError: /var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size
The weird thing is that everything was working fine until yesterday (for more than 5 months), and it suddenly stopped working. None of the libraries have been updated.
I tried to build from scratch, as in https://github.com/jkehler/awslambda-psycopg2, but still having the same error.
Can someone help me with it?
The problem is in the latest version of the Serverless Framework. I assume that you are using Serverless to deploy your Lambda function. Remove the deployment and downgrade:
serverless remove
npm install -g serverless@1.20.2
This should work.

Error invoking plotly's init_notebook_mode with Jupyter (Apache Toree PySpark)

I'm running Jupyter (v4.2.1) with Apache Toree - PySpark. When I try to invoke plotly's init_notebook_mode function, I run into the following error:
import numpy as np
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
Error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
File "/tmp/kernel-PySpark-6415c581-01c4-4c90-b4d9-81773c2bc03f/pyspark_runner.py", line 134, in <module>
eval(compiled_code)
File "<string>", line 7, in <module>
File "/usr/local/lib/python3.4/dist-packages/plotly/offline/offline.py", line 151, in init_notebook_mode
display(HTML(script_inject))
File "/usr/local/lib/python3.4/dist-packages/IPython/core/display.py", line 158, in display
format = InteractiveShell.instance().display_formatter.format
File "/usr/local/lib/python3.4/dist-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 499, in __init__
self.init_io()
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 658, in init_io
io.stdout = io.IOStream(sys.stdout)
File "/usr/local/lib/python3.4/dist-packages/IPython/utils/io.py", line 34, in __init__
raise ValueError("fallback required, but not specified")
ValueError: fallback required, but not specified
StackTrace: org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
scala.Option.foreach(Option.scala:236)
org.apache.toree.interpreter.broker.BrokerState.markFailure(BrokerState.scala:139)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
py4j.Gateway.invoke(Gateway.java:259)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:209)
java.lang.Thread.run(Thread.java:745)
I'm unable to find any info about this on the web. When I dug into the code where this is failing (io.py in IPython utils), I see that the stream being passed must have both a write and a flush attribute. But for some reason, the stream passed in this case, sys.stdout, has only the "write" attribute and not the "flush" attribute.
I believe this happens because plotly's notebook mode assumes that it is running inside an IPython Jupyter kernel doing the notebook communication; you can see in the stack trace that it's trying to call into IPython packages.
Toree, however, is a different Jupyter kernel and has its own protocol handling for communicating with the notebook server. Even when you use Toree to run a PySpark interpreter, you get a "plain" PySpark (just like when you start it from a shell), and Toree drives the input/output of that interpreter.
So the IPython machinery is not set up, and calling init_notebook_mode() in that environment will fail, just like it would when you run it in a PySpark shell started directly from the command line, which knows nothing about notebooks.
To my knowledge, there is currently no way to get plotting output from a PySpark session run via Toree -- we recently faced the same problem. Instead of running Python via Toree, you need to run an IPython kernel, import the PySpark libs there, and connect to your Spark cluster. See https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook for a dockerized example of how to do that.
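As a rough sketch of that suggestion (assuming findspark is installed, SPARK_HOME is set for the IPython kernel, and spark://master:7077 stands in for your real cluster URL), the notebook side looks something like:
import findspark
findspark.init()  # picks up SPARK_HOME set for this kernel

from pyspark.sql import SparkSession
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

# Connect to the cluster from a plain IPython kernel instead of Toree
spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("plotly-demo")
         .getOrCreate())

init_notebook_mode()  # works here because IPython's display machinery is available

# Trivial example: bucket 0..99 into 10 groups and plot the counts
df = spark.range(100)
counts = df.groupBy((df["id"] % 10).alias("bucket")).count().toPandas()
iplot([go.Bar(x=counts["bucket"], y=counts["count"])])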

Ming 0.3.2 Installs and Imports but Crashes

After installing Ming 0.3.2, I tested the installation by running the following code:
>>> from ming.datastore import DataStore
>>> bind = DataStore('mongodb://localhost:27017/', database='tutorial')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'database'
>>> ^D
I looked at the installation files and in the datastore.py file I found that the class's constructor did not contain a "database" argument.
class DataStore(object):
    def __init__(self, bind, name, authenticate=None):
        self.bind = bind
        self.name = name
        self._authenticate = authenticate
        self._db = None
I then installed Ming 0.3.0 to look at its datastore.py file and found that the DataStore class matched the documentation (it contained a database argument), but when I tried that version I ran into other complications.
I use easy_install to install Ming, and I have a working install of MongoDB and pymongo. I run these on OS X Lion. Any advice on getting Ming running would be appreciated.
I think there may be a conflict between the newest versions of pymongo and Ming.
bind = DataStore('mongodb://localhost:27017/', name='test') gets me a bit further along, but I ended up just using pymongo by itself.
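For reference, the pymongo-only route mentioned above looks roughly like this (a sketch assuming pymongo 3+, a local mongod, and the 'tutorial' database from the example):
from pymongo import MongoClient

# Plain pymongo, no Ming layer on top
client = MongoClient('mongodb://localhost:27017/')
db = client['tutorial']
db.users.insert_one({'name': 'alice'})
print(db.users.find_one({'name': 'alice'}))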
I've run into the same issue. Here are the steps I tried, and they work! Hopefully they work for your environment too.
Uninstall Ming 0.3.2: pip uninstall Ming
Install 0.3.0: pip install -Iv http://downloads.sourceforge.net/project/merciless/0.3.0/Ming-0.3.0.tar.gz
Try the example on the official Ming website again. There will be another error:
Traceback (most recent call last):
File "tutorial.py", line 1, in <module>
from ming.datastore import DataStore
File "/home/me/work/deploy/test/local/lib/python2.7/site-packages/ming/init.py", line 3, in <module>
from session import Session
File "/home/me/work/deploy/test/local/lib/python2.7/site-packages/ming/session.py", line 7, in <module>
from pymongo.son import SON
ImportError: No module named son
Change line 7 of "/home/me/work/deploy/test/local/lib/python2.7/site-packages/ming/session.py" to from bson.son import SON.
Try again, and it will work.
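In other words, the one-line patch to ming/session.py is:
# line 7 of ming/session.py, before:
from pymongo.son import SON
# after (SON now lives in the standalone bson package):
from bson.son import SON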
Here is the link I referenced. It's a Japanese webpage, but you can translate it to English with Google Translate.
http://ryooo321.blogspot.com/2012/05/macsleepymongoose.html
Try removing the database= keyword:
In [8]: from ming.datastore import DataStore
In [9]: bind = DataStore('mongodb://grid:27017/', 'tutorial')
In [10]: bind.name
Out[10]: 'tutorial'

Python: ImportError: DLL load failed: The specified module could not be found

I am working on Python plugins for QGIS. I set my variables as:
PATH = C:\OSGeo4W\apps\qgis;%PATH%
PYTHONPATH = C:\OSGeo4W\apps\qgis\python
But when I try to run my .py file in IDLE, I get an error saying:
Traceback (most recent call last):
File "C:\rt_sql_layer_working\DlgQueryBuilder.py", line 30, in
from qgis.core import *
ImportError: DLL load failed: The specified module could not be found.
What should I set for the PATH variable?
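One common pattern (a sketch only, assuming a default OSGeo4W install under C:\OSGeo4W; the exact folders vary by QGIS version) is to extend PATH and sys.path before importing qgis.core:
import os
import sys

# Assumed default OSGeo4W locations; adjust to your install
OSGEO4W = r"C:\OSGeo4W"
os.environ["PATH"] = r"{0}\bin;{0}\apps\qgis\bin;".format(OSGEO4W) + os.environ["PATH"]
sys.path.append(r"{0}\apps\qgis\python".format(OSGEO4W))

from qgis.core import QgsApplication
QgsApplication.setPrefixPath(r"{0}\apps\qgis".format(OSGEO4W), True)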