My question is the same as the title.
I'm using AWS Glue (Glue 3.0, Python 3) with a script job, and the psycopg2 library.
I'm deleting records from AWS Aurora (PostgreSQL).
My syntax is very simple, like this:
tuple_list: [(value1, value2), (value1, value2), ...]
value1: string, value2: timestamp
query = "DELETE FROM {table} WHERE (col1, col2) IN ( VALUES %s)".format(table=table)
extras.execute_values(db.cur(), query, tuple_list, template="(%s, %s)", page_size=2000)
db_conn.commit()
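For reference, a minimal self-contained sketch of this pattern (the connection parameters, table name, and sample values are hypothetical placeholders, not my real ones):
from datetime import datetime

import psycopg2
from psycopg2 import extras

# Hypothetical connection details, for illustration only.
conn = psycopg2.connect(host="my-aurora-endpoint", dbname="mydb",
                        user="myuser", password="mypassword")
cur = conn.cursor()

table = "my_table"  # hypothetical table name
tuple_list = [("key1", datetime(2023, 1, 1)), ("key2", datetime(2023, 1, 2))]

# Delete every row whose (col1, col2) pair appears in tuple_list.
query = "DELETE FROM {table} WHERE (col1, col2) IN (VALUES %s)".format(table=table)
extras.execute_values(cur, query, tuple_list, template="(%s, %s)", page_size=2000)
conn.commit()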
I got the error message "ValueError: 'col2' is not in list".
I have no idea why it didn't work.
Thanks for your interest, and have a good day.
Here is my full error message:
Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 24) (IP executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1571, in __getattr__
idx = self.__fields__.index(item)
ValueError: 'col2' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
out_iter = func(split_index, iterator)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 418, in func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 932, in func
File "/tmp/job_test", line 136, in process_partition
File "/tmp/job_test", line 88, in execute_batch_delete
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1576, in __getattr__
raise AttributeError(item)
AttributeError: col2
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
Py4JJavaError Traceback (most recent call last)
Untitled-2.ipynb Cell 13' in <cell line: 1>()
----> 1 dt_clean.write.csv('C:/Users/K/Desktop/Mywork/Cleaned_data.csv',)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pyspark\sql\readwriter.py:1372, in DataFrameWriter.csv(self, path, mode, compression, sep, quote, escape, header, nullValue, escapeQuotes, quoteAll, dateFormat, timestampFormat, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, charToEscapeQuoteEscaping, encoding, emptyValue, lineSep)
1364 self.mode(mode)
1365 self._set_opts(compression=compression, sep=sep, quote=quote, escape=escape, header=header,
1366 nullValue=nullValue, escapeQuotes=escapeQuotes, quoteAll=quoteAll,
1367
Please help me.
I want to save a Spark DataFrame to a CSV file.
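For reference, a generic sketch of the DataFrameWriter CSV pattern (the output path is a placeholder; note that Spark writes a directory of part files rather than a single .csv file):
# Write the DataFrame as CSV; Spark creates a directory containing part files.
dt_clean.write \
    .mode("overwrite") \
    .option("header", True) \
    .csv("file:///C:/Users/K/Desktop/Mywork/Cleaned_data_csv")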
I am trying to round off the "perc_of_count_total" column in PySpark, but I could not do it. Below is my script:
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
.count() \
.withColumnRenamed('count', 'cnt_per_group') \
.withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
.show(10)
Auto_data1.select(round(col('cnt_per_group'),2)).show(5)
Output
+-----------+----+-------------+--------------------+
| Make|Fuel|cnt_per_group| perc_of_count|
+-----------+----+-------------+--------------------+
| C | I| 34748|0.027960585487965286|
| P | D| 489| 3.93482396213164E-4|
Error message
An error was encountered:
'NoneType' object has no attribute 'select'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'select'
Remove the last show function; it doesn't return anything, so Auto_data1 ends up as None instead of a DataFrame.
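With that change, a sketch of the corrected code (keeping the question's Auto_data and tot, and rounding the perc_of_count_total column the question actually asks about) would be:
from pyspark.sql import functions as F

# Build the aggregated DataFrame; note there is no .show() here, so
# Auto_data1 is a DataFrame rather than None.
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed("count", "cnt_per_group") \
    .withColumn("perc_of_count_total", (F.col("cnt_per_group") / tot) * 100)

Auto_data1.show(10)

# Round the percentage column to 2 decimal places.
Auto_data1.select(F.round(F.col("perc_of_count_total"), 2)).show(5)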
Below is the PySpark code that I tried to run. I am not able to substitute the value into the filter. Please advise.
>>> coreWordFilter = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter
"crawlResult.url.like('%furniture%')"
>>> preFilter = crawlResult.filter(coreWordFilter)
20/02/11 09:19:54 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line 1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name 'crawlResult.url.like'(line 1, pos 0)\n\n== SQL ==\ncrawlResult.url.like('%furniture%')\n^^^\n"
>>> preFilter = crawlResult.filter(crawlResult.url.like('%furniture%'))
>>>
I need some help with how to add more crawlResult.url.like logic:
Code from today 2/12/2020:
>>> coreWordFilter = crawlResult.url.like('%{}%'.format(IncoreWords[0]))
>>> coreWordFilter
Column<url LIKE %furniture%>
>>> InmoreWords
['couch', 'couches']
>>> for a in InmoreWords:
... coreWordFilter=coreWordFilter+" | crawlResult.url.like('%"+a+"%')"
>>> coreWordFilter
Column<((((((url LIKE %furniture% + | crawlResult.url.like('%) + couch) + %')) + | crawlResult.url.like('%) + couches) + %'))>
preFilter = crawlResult.filter(coreWordFilter) does not work with the above coreWordFilter.
I was hoping I could do the below, but it did not work and I got an error:
>>> coreWordFilter2 = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter2
"crawlResult.url.like('%furniture%')"
>>> for a in InmoreWords:
... coreWordFilter2=coreWordFilter2+" | crawlResult.url.like('%"+a+"%')"
...
>>> coreWordFilter2
"crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')"
>>> preFilter = crawlResult.filter(coreWordFilter2)
20/02/12 08:55:26 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line
1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-
src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in
deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name
'crawlResult.url.like'(line 1, pos 0)\n\n== SQL
==\ncrawlResult.url.like('%furniture%') |
crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')\n^^^\n"
I think the correct syntax is:
preFilter = crawlResult.filter(crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%'))
Since you want a dynamic OR condition, I think filtering with a SQL string expression (AND, OR, NOT, etc.) is easier than composing Column-based logical operators (&, |, ~, etc.).
Dummy dataframe and lists:
crawlResult.show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| table|
| 1| test-test|
| 1| couch|
+---+--------------+
# IncoreWords
# ['furniture', 'office-table', 'counch', 'blah']
# InmoreWords
# ['couch', 'couches']
Now, I am just following your original post's sequence for building the dynamic filter clause, but it should give you the broad idea.
coreWordFilter2 = "url like ('%"+IncoreWords[0]+"%')"
# coreWordFilter2
#"url like ('%furniture%')"
for a in InmoreWords:
coreWordFilter2=coreWordFilter2+" or url like('%"+a+"%')"
# coreWordFilter2
# "url like ('%furniture%') or url like('%couch%') or url like('%couches%')"
crawlResult.filter(coreWordFilter2).show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| couch|
+---+--------------+
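If you do prefer the Column-based operators, a rough sketch using functools.reduce to OR the like() conditions together (reusing your crawlResult, IncoreWords and InmoreWords) would be:
from functools import reduce

# Combine the first core word with the extra words, then OR the
# like() conditions together into a single Column expression.
words = [IncoreWords[0]] + InmoreWords
condition = reduce(
    lambda acc, w: acc | crawlResult.url.like('%{}%'.format(w)),
    words[1:],
    crawlResult.url.like('%{}%'.format(words[0])),
)
preFilter = crawlResult.filter(condition)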
I am trying to do some text mining tasks with PySpark. I am new to Spark, and I've been following this example http://mccarroll.net/blog/pyspark2/index.html to build a bag of words for my data.
Originally, my data looked something like this:
df.show(5)
+------------+---------+----------------+--------------------+
|Title |Month | Author | Document|
+------------+---------+----------------+--------------------+
| a | Jan| John |This is a document |
| b | Feb| Mary |A book by Mary |
| c | Mar| Luke |Newspaper article |
+------------+---------+----------------+--------------------+
So far I have extracted the terms of each document with
bow0 = df.rdd\
.map( lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
.flatMap(lambda x: x.split())\
.map(lambda x: (x, 1))
Which gives me
[('This', 1),
('is', 1),
('a', 1),
('document', 1)]
But when I try to compute the frequencies with reduceByKey and view the result,
bow0.reduceByKey(lambda x,y:x+y).take(50)
I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-53-966f90775397> in <module>()
----> 1 bow0.reduceByKey(lambda x,y:x+y).take(50)
/usr/local/spark/python/pyspark/rdd.py in take(self, num)
1341
1342 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
1344
1345 items += res
/usr/local/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
990 # SparkContext#runJob.
991 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 992 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
993 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
994
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 31.0 failed 4 times, most recent failure: Lost task 1.3 in stage 31.0 (TID 84, 9.242.64.15, executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:455)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
To expand on my comment, the error you are receiving is due to the presence of a null value in your Document column. Here's a small example to demonstrate:
data = [
['a', 'Jan', 'John', 'This is a document'],
['b', 'Feb', 'Mary', 'A book by Mary'],
['c', 'Mar', 'Luke', 'Newspaper article'],
['d', 'Apr', 'Mark', None]
]
columns = ['Title', 'Month', 'Author', 'Document']
df = spark.createDataFrame(data, columns)
df.show()
#+-----+-----+------+------------------+
#|Title|Month|Author| Document|
#+-----+-----+------+------------------+
#| a| Jan| John|This is a document|
#| b| Feb| Mary| A book by Mary|
#| c| Mar| Luke| Newspaper article|
#| d| Apr| Mark| null|
#+-----+-----+------+------------------+
For the last row, the value in the Document column is null. When you compute bow0 as in your question and the map function operates on that row, it tries to call x.Document.replace where x.Document is None. This results in AttributeError: 'NoneType' object has no attribute 'replace'.
One way to overcome this is to filter out the bad values before calling map:
bow0 = df.rdd\
.filter(lambda x: x.Document)\
.map( lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
.flatMap(lambda x: x.split())\
.map(lambda x: (x, 1))
bow0.reduceByKey(lambda x,y:x+y).take(50)
#[(u'a', 2),
# (u'this', 1),
# (u'is', 1),
# (u'newspaper', 1),
# (u'article', 1),
# (u'by', 1),
# (u'book', 1),
# (u'mary', 1),
# (u'document', 1)]
Or you can build the check for None into your map function, for instance as in the sketch below. In general, it is good practice to make your map function robust to bad inputs.
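A minimal sketch of such a guard:
def tokenize(row):
    # Treat a missing Document as an empty string instead of failing.
    doc = row.Document or ""
    return doc.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower()

bow0 = df.rdd \
    .map(tokenize) \
    .flatMap(lambda x: x.split()) \
    .map(lambda x: (x, 1))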
As an aside, you can do the same thing using the DataFrame API functions. In this case:
from pyspark.sql.functions import explode, split, regexp_replace, col, lower
df.select(explode(split(regexp_replace("Document", "[,.-]", " "), r"\s+")).alias("word"))\
.groupby(lower(col("word")).alias("lower"))\
.count()\
.show()
#+---------+-----+
#| lower|count|
#+---------+-----+
#| document| 1|
#| by| 1|
#|newspaper| 1|
#| article| 1|
#| mary| 1|
#| is| 1|
#| a| 2|
#| this| 1|
#| book| 1|
#+---------+-----+