SQLAlchemy cannot connect to PostgreSQL on localhost - postgresql

I'm sure this is such an easy error to fix, if I could only find where it is. This is the error from the Flask app:
11:58:18 web.1 | ERROR:xxxxxx.core:Exception on / [GET]
11:58:18 web.1 | Traceback (most recent call last):
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
11:58:18 web.1 | response = self.full_dispatch_request()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
11:58:18 web.1 | rv = self.handle_user_exception(e)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
11:58:18 web.1 | reraise(exc_type, exc_value, tb)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
11:58:18 web.1 | rv = self.dispatch_request()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
11:58:18 web.1 | return self.view_functions[rule.endpoint](**req.view_args)
11:58:18 web.1 | File "xxxxxxx/web.py", line 202, in home
11:58:18 web.1 | d = {'featured': cached_apps.get_featured_front_page(),
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_cache/__init__.py", line 245, in decorated_function
11:58:18 web.1 | rv = f(*args, **kwargs)
11:58:18 web.1 | File "/Users/xxxxxxx/Desktop/PythonProjects/xxxxxx/xxxxxx2/xxxxxxx/cache/apps.py", line 35, in get_featured_front_page
11:58:18 web.1 | results = db.engine.execute(sql)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 780, in engine
11:58:18 web.1 | return self.get_engine(self.get_app())
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 797, in get_engine
11:58:18 web.1 | return connector.get_engine()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 470, in get_engine
11:58:18 web.1 | self._sa.apply_driver_hacks(self._app, info, options)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 739, in apply_driver_hacks
11:58:18 web.1 | if info.drivername.startswith('mysql'):
11:58:18 web.1 | AttributeError: 'NoneType' object has no attribute 'drivername'
From what I've been able to find online, it seems the problem is that I'm not connecting correctly to the database. The app works fine on Heroku, but not when I run it on localhost.
which psql:
/Applications/Postgres.app/Contents/MacOS/bin/psql
which postgres:
/Applications/Postgres.app/Contents/MacOS/bin/postgres
Postgres.app is running on 5432.
I don't know what else to check.
If it's supposed to connect to the same Postgres DB on Heroku regardless, why would it work on Heroku but not from localhost?
Maybe the app on localhost is using the wrong version of Postgres? I've tried uninstalling the other installations (leaving only Postgres.app), but I'm not sure whether anything left on my computer is causing conflicts. How would I check that? I'd appreciate any help.
EDIT: More info
Segment from the alembic.ini file:
[alembic]
# path to migration scripts
script_location = alembic
# template used to generate migration files
# file_template = %%(rev)s_%%(slug)s
# under Heroku, the line below needs to be inferred from
# the environment
sqlalchemy.url = postgres://xxxxxxxxxx:xxxxxxxxxxxx#xxxxxx.compute-1.amazonaws.com:5432/xxxxxxxxx
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic
I have a short script that produces the same error. I run
python cli.py db_create
on the following:
#!/usr/bin/env python
import os
import sys
import optparse
import inspect

import xxxxxxx.model as model
from xxxxxx.core import db
import xxxxx.web as web

from alembic.config import Config
from alembic import command

def setup_alembic_config():
    if "DATABASE_URL" not in os.environ:
        alembic_cfg = Config("alembic.ini")
    else:
        dynamic_filename = "alembic-heroku.ini"
        with file("alembic.ini.template") as f:
            with file(dynamic_filename, "w") as conf:
                for line in f.readlines():
                    if line.startswith("sqlalchemy.url"):
                        conf.write("sqlalchemy.url = %s\n" %
                                   os.environ['DATABASE_URL'])
                    else:
                        conf.write(line)
        alembic_cfg = Config(dynamic_filename)
    command.stamp(alembic_cfg, "head")

def db_create():
    '''Create the db'''
    db.create_all()
    # then, load the Alembic configuration and generate the
    # version table, "stamping" it with the most recent rev:
    setup_alembic_config()
    # finally, add a minimum set of categories: Volunteer Thinking, Volunteer Sensing, Published and Draft
    categories = []
    categories.append(model.Category(name="Thinking",
                                     short_name='thinking',
                                     description='Volunteer Thinking apps'))
    categories.append(model.Category(name="Volunteer Sensing",
                                     short_name='sensing',
                                     description='Volunteer Sensing apps'))
    db.session.add_all(categories)
    db.session.commit()
and I get:
Traceback (most recent call last):
File "cli.py", line 111, in <module>
_main(locals())
File "cli.py", line 106, in _main
_methods[method](*args[1:])
File "cli.py", line 33, in db_create
db.create_all()
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 856, in create_all
self._execute_for_all_tables(app, bind, 'create_all')
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 848, in _execute_for_all_tables
op(bind=self.get_engine(app, bind), tables=tables)
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 797, in get_engine
return connector.get_engine()
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 470, in get_engine
self._sa.apply_driver_hacks(self._app, info, options)
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 739, in apply_driver_hacks
if info.drivername.startswith('mysql'):
AttributeError: 'NoneType' object has no attribute 'drivername'

My guess is that you haven't configured Flask-SQLAlchemy correctly. You have a lot of code that appears to configure it, but without going through all of it, my guess is that it is either setting up your configuration incorrectly or setting it up too late.
Make sure that before you call anything that hits the database (like db.create_all()), your app.config["SQLALCHEMY_DATABASE_URI"] is set to the correct URI. It is probably still None, and that is what is causing your issue.
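For illustration, here is a minimal sketch of the ordering that matters; the module layout, the DATABASE_URL variable and the fallback URI are placeholders, not your actual project:
import os

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)

# The URI must be set before anything asks Flask-SQLAlchemy for an engine;
# if it is still None, apply_driver_hacks fails with the
# "'NoneType' object has no attribute 'drivername'" error shown above.
app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get(
    "DATABASE_URL", "postgresql://user:password@localhost:5432/mydb")

db = SQLAlchemy(app)

if __name__ == "__main__":
    with app.app_context():
        db.create_all()  # safe now that the URI is configured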

I agree with Mark Hildreth, whose answer is 7 years old; it took me two days to get here.
I was getting errors like these:
FAILED tests/test_models.py::test_endpoint - AttributeError: 'NoneType' object has no attribute 'drivername'
ERROR tests/test_models.py::test_endpoint2 - AttributeError: 'NoneType' object has no attribute 'response_class'
The traceback showed something like:
self = <[AttributeError("'NoneType' object has no attribute 'drivername'") raised in repr()] SQLAlchemy object at 0x1fb8a622848>, app = <Flask 'app'>, sa_url = None, options = {}
Finally, with this answer, I remembered that Flask sometimes needs its config variables set in place.
This helped a lot, thanks - app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get('mysql://root:1234#127.0.0.1/kvadrat?charset=utf8')
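As an aside, os.environ.get() expects the name of an environment variable rather than a URI, so the usual pattern looks roughly like the sketch below; the DATABASE_URL name and the fallback URI are only examples, not the poster's actual setup:
import os

from flask import Flask

app = Flask(__name__)

# os.environ.get() takes the *name* of an environment variable; the usual
# pattern is a named variable with a hard-coded fallback for local runs
# (DATABASE_URL and the fallback URI below are only examples):
app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get(
    "DATABASE_URL", "mysql://root:1234@127.0.0.1/kvadrat?charset=utf8")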

Related

psycopg2: I got the error message "where syntax 'column' is not in list error" using pyspark

Same as the title.
I'm using AWS Glue (a Glue 3, Python 3 script) together with the psycopg2 library.
The problem happens when I delete AWS Aurora (PostgreSQL) records.
My syntax is very simple, like this:
tuple_list: [(value1, value2), (value1, value2), ...]
value1: string, value2: timestamp
query = "DELETE FROM {table} WHERE (col1, col2) IN ( VALUES %s)".format(table=table)
extras.execute_values(db.cur(), query, tuple_list, template="(%s, %s)", page_size=2000)
db_conn.commit()
I got the message "where syntax 'col2' is not in list error".
I have no idea why it didn't work.
Thanks for your interest, and have a good day.
Here is my full error message:
Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 24) (IP executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1571, in __getattr__
idx = self.__fields__.index(item)
ValueError: 'col2' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
out_iter = func(split_index, iterator)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 418, in func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 932, in func
File "/tmp/job_test", line 136, in process_partition
File "/tmp/job_test", line 88, in execute_batch_delete
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1576, in __getattr__
raise AttributeError(item)
AttributeError: col2
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:

I cannot save a Spark DataFrame to CSV in a Jupyter notebook

Py4JJavaError Traceback (most recent call last)
Untitled-2.ipynb Cell 13' in <cell line: 1>()
----> 1 dt_clean.write.csv('C:/Users/K/Desktop/Mywork/Cleaned_data.csv',)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pyspark\sql\readwriter.py:1372, in DataFrameWriter.csv(self, path, mode, compression, sep, quote, escape, header, nullValue, escapeQuotes, quoteAll, dateFormat, timestampFormat, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, charToEscapeQuoteEscaping, encoding, emptyValue, lineSep)
1364 self.mode(mode)
1365 self._set_opts(compression=compression, sep=sep, quote=quote, escape=escape, header=header,
1366 nullValue=nullValue, escapeQuotes=escapeQuotes, quoteAll=quoteAll,
1367
Please help me.
I want to save a Spark DataFrame to a CSV file.
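For reference, a minimal sketch of how DataFrameWriter.csv is usually called; the SparkSession setup, the toy dt_clean and the output path here are placeholders, not the notebook's actual data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for dt_clean (just an example)
dt_clean = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Note that write.csv produces a directory of part files, not a single .csv
dt_clean.write.csv("C:/tmp/cleaned_data", mode="overwrite", header=True)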

Round off the data frame in Pyspark

I am trying to round off the "perc_of_count_total" column in PySpark, but I could not do it. Below is my script:
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed('count', 'cnt_per_group') \
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
    .show(10)
Auto_data1.select(round(col('cnt_per_group'),2)).show(5)
Output
+-----------+----+-------------+--------------------+
| Make|Fuel|cnt_per_group| perc_of_count|
+-----------+----+-------------+--------------------+
| C | I| 34748|0.027960585487965286|
| P | D| 489| 3.93482396213164E-4|
Error message
An error was encountered:
'NoneType' object has no attribute 'select'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'select'
Remove the last show function from the chain; show() doesn't return anything, so Auto_data1 ends up being None.
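A sketch of the corrected version, assuming Auto_data and tot are defined as in the question; the changes are dropping .show(10) from the assignment so the chain returns a DataFrame, and rounding the perc_of_count_total column the question asks about (via F.round / F.col):
from pyspark.sql import functions as F

Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed('count', 'cnt_per_group') \
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100)

# Now Auto_data1 is a DataFrame, so rounding works:
Auto_data1.select(F.round(F.col('perc_of_count_total'), 2)).show(5)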

pyspark filter with parameter value is not working

Below is the PySpark code that I tried to run. I am not able to substitute the value into the filter. Please advise.
>>> coreWordFilter = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter
"crawlResult.url.like('%furniture%')"
>>> preFilter = crawlResult.filter(coreWordFilter)
20/02/11 09:19:54 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line 1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name 'crawlResult.url.like'(line 1, pos 0)\n\n== SQL ==\ncrawlResult.url.like('%furniture%')\n^^^\n"
>>> preFilter = crawlResult.filter(crawlResult.url.like('%furniture%'))
>>>
I need some help with how to add more crawlResult.url.like logic:
Code from today 2/12/2020:
>>> coreWordFilter = crawlResult.url.like('%{}%'.format(IncoreWords[0]))
>>> coreWordFilter
Column<url LIKE %furniture%>
>>> InmoreWords
['couch', 'couches']
>>> for a in InmoreWords:
... coreWordFilter=coreWordFilter+" | crawlResult.url.like('%"+a+"%')"
>>> coreWordFilter
Column<((((((url LIKE %furniture% + | crawlResult.url.like('%) + couch) + %')) + | crawlResult.url.like('%) + couches) + %'))>
preFilter = crawlResult.filter(coreWordFilter) does not work with the above coreWordFilter.
I was hoping I could do the below, but I was not able to; I got an error:
>>> coreWordFilter2 = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter2
"crawlResult.url.like('%furniture%')"
>>> for a in InmoreWords:
... coreWordFilter2=coreWordFilter2+" | crawlResult.url.like('%"+a+"%')"
...
>>> coreWordFilter2
"crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')"
>>> preFilter = crawlResult.filter(coreWordFilter2)
20/02/12 08:55:26 INFO execution.SparkSqlParser: Parsing command:
crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line
1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-
src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in
deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name
'crawlResult.url.like'(line 1, pos 0)\n\n== SQL
==\ncrawlResult.url.like('%furniture%') |
crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')\n^^^\n"
I think the correct syntax is:
preFilter = crawlResult.filter(crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%'))
Since you want a dynamic OR condition, I think filtering with a SQL string expression (AND, OR, NOT, etc.) is easier than combining Column-based logical operators (&, |, ~, etc.).
Dummy dataframe and lists:
crawlResult.show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| table|
| 1| test-test|
| 1| couch|
+---+--------------+
# IncoreWords
# ['furniture', 'office-table', 'counch', 'blah']
# InmoreWords
# ['couch', 'couches']
Now, I am just following the sequence from your post for building the dynamic filter clause, but it should give you the broad idea.
coreWordFilter2 = "url like ('%"+IncoreWords[0]+"%')"
# coreWordFilter2
#"url like ('%furniture%')"
for a in InmoreWords:
coreWordFilter2=coreWordFilter2+" or url like('%"+a+"%')"
# coreWordFilter2
# "url like ('%furniture%') or url like('%couch%') or url like('%couches%')"
crawlResult.filter(coreWordFilter2).show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| couch|
+---+--------------+
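If you would rather stay with Column objects instead of a SQL string, one common pattern is to OR the like() expressions together with functools.reduce; this is only a sketch, assuming the same crawlResult, IncoreWords and InmoreWords as above:
from functools import reduce

# One like() Column per word, then combine them with the | operator
words = [IncoreWords[0]] + InmoreWords   # e.g. ['furniture', 'couch', 'couches']
conditions = [crawlResult.url.like('%{}%'.format(w)) for w in words]
coreWordFilter = reduce(lambda a, b: a | b, conditions)

crawlResult.filter(coreWordFilter).show()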

Bag of words with pySpark reduceByKey

I am trying to do some text mining tasks with PySpark. I am new to Spark and I've been following this example http://mccarroll.net/blog/pyspark2/index.html to build the bag of words for my data.
Originally my data looked something like this:
df.show(5)
+------------+---------+----------------+--------------------+
|Title |Month | Author | Document|
+------------+---------+----------------+--------------------+
| a | Jan| John |This is a document |
| b | Feb| Mary |A book by Mary |
| c | Mar| Luke |Newspaper article |
+------------+---------+----------------+--------------------+
So far I have extracted the terms of each document with
bow0 = df.rdd\
    .map(lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
    .flatMap(lambda x: x.split())\
    .map(lambda x: (x, 1))
Which gives me
[('This', 1),
('is', 1),
('a', 1),
('document', 1)]
But when I try to compute the frequencies with reduceByKey and look at the result
bow0.reduceByKey(lambda x,y:x+y).take(50)
I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-53-966f90775397> in <module>()
----> 1 bow0.reduceByKey(lambda x,y:x+y).take(50)
/usr/local/spark/python/pyspark/rdd.py in take(self, num)
1341
1342 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
1344
1345 items += res
/usr/local/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
990 # SparkContext#runJob.
991 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 992 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
993 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
994
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 31.0 failed 4 times, most recent failure: Lost task 1.3 in stage 31.0 (TID 84, 9.242.64.15, executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:455)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
To expand on my comment, the error you are receiving is due to the presence of a null value in your Document column. Here's a small example to demonstrate:
data = [
    ['a', 'Jan', 'John', 'This is a document'],
    ['b', 'Feb', 'Mary', 'A book by Mary'],
    ['c', 'Mar', 'Luke', 'Newspaper article'],
    ['d', 'Apr', 'Mark', None]
]
columns = ['Title', 'Month', 'Author', 'Document']
df = spark.createDataFrame(data, columns)
df.show()
#+-----+-----+------+------------------+
#|Title|Month|Author| Document|
#+-----+-----+------+------------------+
#| a| Jan| John|This is a document|
#| b| Feb| Mary| A book by Mary|
#| c| Mar| Luke| Newspaper article|
#| d| Apr| Mark| null|
#+-----+-----+------+------------------+
For the last row, the value in the Document column is null. When you compute bow0 as in your question, the map function operates on that row and tries to call x.Document.replace where x.Document is None. This results in AttributeError: 'NoneType' object has no attribute 'replace'.
One way to overcome this is to filter out the bad values before calling map:
bow0 = df.rdd\
    .filter(lambda x: x.Document)\
    .map(lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
    .flatMap(lambda x: x.split())\
    .map(lambda x: (x, 1))
bow0.reduceByKey(lambda x,y:x+y).take(50)
#[(u'a', 2),
# (u'this', 1),
# (u'is', 1),
# (u'newspaper', 1),
# (u'article', 1),
# (u'by', 1),
# (u'book', 1),
# (u'mary', 1),
# (u'document', 1)]
Or you can build the check for the None condition into your map function itself. In general, it is good practice to make your map function robust to bad inputs.
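A sketch of that variant, keeping the rest of the pipeline from the question and substituting an empty string when Document is null:
bow0 = df.rdd\
    .map(lambda x: (x.Document or '').replace(',',' ').replace('.',' ').replace('-',' ').lower())\
    .flatMap(lambda x: x.split())\
    .map(lambda x: (x, 1))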
As an aside, you can do the same thing using the DataFrame API functions. In this case:
from pyspark.sql.functions import explode, split, regexp_replace, col, lower
df.select(explode(split(regexp_replace("Document", "[,.-]", " "), "\s+")).alias("word"))\
    .groupby(lower(col("word")).alias("lower"))\
    .count()\
    .show()
#+---------+-----+
#| lower|count|
#+---------+-----+
#| document| 1|
#| by| 1|
#|newspaper| 1|
#| article| 1|
#| mary| 1|
#| is| 1|
#| a| 2|
#| this| 1|
#| book| 1|
#+---------+-----+