psycopg2 i got the error message "where syntax 'column' is not in list error " using pyspark - postgresql

Same as title
I'm Using AWS Glue, Script, Glue3 and Python3 also using psycopg2 library
When i delete AWS Aurora(postgresql) record
my syntax is very simple, like this
tuple_list : [(value1,value2),(value1,value2), ....]
value 1 : String, value 2 : timestamp
query = "DELETE FROM {table} WHERE (col1, col2) IN ( VALUES %s)".format(table=table)
extras.execute_values(db.cur(), query, tuple_list, template="(%s, %s)", page_size=2000)
db_conn.commit()
i got the message "where syntax 'col2' is not in list error"
i have no idea why i didn't work....
thx for your interesting and have a good day
all my error message
Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 24) (IP executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1571, in __getattr__
idx = self.__fields__.index(item)
ValueError: 'col2' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
out_iter = func(split_index, iterator)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 418, in func
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 932, in func
File "/tmp/job_test", line 136, in process_partition
File "/tmp/job_test", line 88, in execute_batch_delete
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1576, in __getattr__
raise AttributeError(item)
AttributeError: col2
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:

Related

Py4JJavaError: An error occurred while calling o10495.join. : java.lang.StackOverflowError

In Azure Synapse notebook, after running quite a number of functions, I'm trying to do a semi join of two dataframes where DF1 has one column called ID and the DF2 has five columns: ID, SID, Name, Term, Desc. Now the issue is everytime I start the session, I get this error. But when I run the code cell 5-6 times, it starts working. Not sure why it keeps happening.
df1 is a union of all distinct IDs from two other dataframes.
df2 = ogdata.select('SID', 'ID', 'Name', 'Term', 'Desc').distinct()
My Join:
df3 = df2.join(df1, ["uid"], "semi")
I've tried:
Changing it to left, different join syntax where I do df1.id = df2.id but always get the error everytime I start a session. Then when I run the cell 5-6 times, it works.
My error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-127-9d80a53> in <module>
1 ##teaching_data_current = teaching_data_current.join(uid_agg_course_teacher, teaching_data_current.uid == uid_agg_course_teacher.uid, "semi").drop(uid_agg_course_teacher.uid)
2 teaching_data_c = coursedata.select('SubjectID','uid').distinct()
----> 3 teaching_data_curr = teaching_data_c.join(uid_agg_course_teacher, ["uid"], "semi")
4 #teaching_data_curr = teaching_data_c.alias("t1").join(uid_agg_course_teacher.alias("t2"), teaching_data_c.uid==uid_agg_course_teacher.uid, "semi")
/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in join(self, other, on, how)
1337 on = self._jseq([])
1338 assert isinstance(how, str), "how should be a string"
-> 1339 jdf = self._jdf.join(other._jdf, on, how)
1340 return DataFrame(jdf, self.sql_ctx)
1341
~/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o10428.join.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$findAliases$1.applyOrElse(Analyzer.scala:1763)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$findAliases$1.applyOrElse(Analyzer.scala:1763)
at scala.PartialFunction.$anonfun$runWith$1$adapted(PartialFunction.scala:145)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.collect(TraversableLike.scala:359)
at scala.collection.TraversableLike.collect$(TraversableLike.scala:357)
at scala.collection.immutable.List.collect(List.scala:327)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.findAliases(Analyzer.scala:1763)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1388)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$10(Analyzer.scala:1464)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1464)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$10(Analyzer.scala:1464)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1464)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$10(Analyzer.scala:1464)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1464)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$10(Analyzer.scala:1464)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)

Is there a pyspark function to give me 2 decimal places on multiple dataframe columns?

I'm new to coding and am new to pyspark and python (by new I mean I am a student and am learning it).
I keep getting error in my code and I can't figure out why. what I'm trying to do is get my code to give me a 2 decimal output that looks like this. Below is a sample output of what I want my output to look like:
+------+--------+------+------+
|col_ID| f.name |bal | avg. |
+------+--------+------+------+
|1234 | Henry |350.45|400.32|
|3456 | Sam |75.12 | 50.60|
+------+--------+------+------+
But instead here's my code and here's the error I'm getting with it:
My Code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col #import col function for column manipulation
#import pyspark.sql.functions as func
spark=SparkSession.builder.getOrCreate()
df = spark.read.csv("/user/cloudera/Default2_Data.csv", header = True, inferSchema = True) \
.withColumn("income",round(df["income"],2)) \
.withColumn("balance",func.col("balance").cast('Float'))
#df.select(col("income").alias("income")),
#col("balance").alias("balance"),
#func.round(df["income"],2).alias("income1")
df.show(15)
Output:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/`spark2`/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in `get_return_value`(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o707.withColumn.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) income#1101 missing from student#1144,income#1146,default#1143,RecordID#1142,balance#1145 in operator !Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]. Attribute(s) with the same name appear in the operation: income. Please check if the right attribute(s) are used.;;
!Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]
+- Relation[RecordID#1142,default#1143,student#1144,balance#1145,income#1146] csv
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:326)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1334)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2252)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2219)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
AnalysisException Traceback (most recent call last)
<ipython-input-102-13a967925c21> in <module>
1 df = spark.read.csv("/user/cloudera/Default2_Data.csv", header = True, inferSchema = True) \
----> 2 .withColumn("income",round(df["income"],2)) \
3 .withColumn("balance",func.col("balance").cast('Float'))
4 #df.select(col("income").alias("income")),
5 #col("balance").alias("balance"),
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/dataframe.py in withColumn(self, colName, col)
1987 """
1988 assert isinstance(col, Column), "col should be Column"
-> 1989 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
1990
1991 #ignore_unicode_prefix
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/cloudera/parcels/SPARK2-2.4.0.cloude`enter code here`ra2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'Resolved attribute(s) income#1101 missing from student#1144,income#1146,default#1143,RecordID#1142,balance#1145 in operator !Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]. Attribute(s) with the same name appear in the operation: income. Please check if the right attribute(s) are used.;;\n!Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]\n+- Relation[RecordID#1142,default#1143,student#1144,balance#1145,income#1146] csv\n'
Replace the following line with:
withColumn("income",round(df["income"],2))
with the following:
withColumn("income",round(col('income'),2))

Bag of words with pySpark reduceByKey

I am trying to do some text mining tasks with pySpark. I am new to Spark and I've been following this example http://mccarroll.net/blog/pyspark2/index.html to build the bag of words for my data.
Originally my data looked something like this
df.show(5)
+------------+---------+----------------+--------------------+
|Title |Month | Author | Document|
+------------+---------+----------------+--------------------+
| a | Jan| John |This is a document |
| b | Feb| Mary |A book by Mary |
| c | Mar| Luke |Newspaper article |
+------------+---------+----------------+--------------------+
So far I have extracted the terms of each document with
bow0 = df.rdd\
.map( lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
.flatMap(lambda x: x.split())\
.map(lambda x: (x, 1))
Which gives me
[('This', 1),
('is', 1),
('a', 1),
('document', 1)]
But when I try to compute the frequency with reduceByKey and try to see the result
bow0.reduceByKey(lambda x,y:x+y).take(50)
I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-53-966f90775397> in <module>()
----> 1 bow0.reduceByKey(lambda x,y:x+y).take(50)
/usr/local/spark/python/pyspark/rdd.py in take(self, num)
1341
1342 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
1344
1345 items += res
/usr/local/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
990 # SparkContext#runJob.
991 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 992 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
993 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
994
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 31.0 failed 4 times, most recent failure: Lost task 1.3 in stage 31.0 (TID 84, 9.242.64.15, executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:455)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/usr/local/spark/python/pyspark/rdd.py", line 1842, in combineLocally
merger.mergeValues(iterator)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "<ipython-input-48-5c0753c6b152>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'replace'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
To expand on my comment, the error you are receiving is due to the presence of a null value in your Document column. Here's a small example to demonstrate:
data = [
['a', 'Jan', 'John', 'This is a document'],
['b', 'Feb', 'Mary', 'A book by Mary'],
['c', 'Mar', 'Luke', 'Newspaper article'],
['d', 'Apr', 'Mark', None]
]
columns = ['Title', 'Month', 'Author', 'Document']
df = spark.createDataFrame(data, columns)
df.show()
#+-----+-----+------+------------------+
#|Title|Month|Author| Document|
#+-----+-----+------+------------------+
#| a| Jan| John|This is a document|
#| b| Feb| Mary| A book by Mary|
#| c| Mar| Luke| Newspaper article|
#| d| Apr| Mark| null|
#+-----+-----+------+------------------+
For the last row, the value in the Document column is null. When you compute bow0 as in your question, when the map function operates on that row it tries to call x.Document.replace where x is None. This results in AttributeError: 'NoneType' object has no attribute 'replace'.
One way to overcome this is to filter out the bad values before calling map:
bow0 = df.rdd\
.filter(lambda x: x.Document)\
.map( lambda x: x.Document.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
.flatMap(lambda x: x.split())\
.map(lambda x: (x, 1))
bow0.reduceByKey(lambda x,y:x+y).take(50)
#[(u'a', 2),
# (u'this', 1),
# (u'is', 1),
# (u'newspaper', 1),
# (u'article', 1),
# (u'by', 1),
# (u'book', 1),
# (u'mary', 1),
# (u'document', 1)]
Or you can build in the check for None condition inside of your map function. In general, it is good practice to make your map function robust to bad inputs.
As an aside, you can do the same thing using the DataFrame API functions. In this case:
from pyspark.sql.functions import explode, split, regexp_replace, col, lower
df.select(explode(split(regexp_replace("Document", "[,.-]", " "), "\s+")).alias("word"))\
.groupby(lower(col("word")).alias("lower"))\
.count()\
.show()
#+---------+-----+
#| lower|count|
#+---------+-----+
#| document| 1|
#| by| 1|
#|newspaper| 1|
#| article| 1|
#| mary| 1|
#| is| 1|
#| a| 2|
#| this| 1|
#| book| 1|
#+---------+-----+

Getting error while using collect_list function with struct datatype in Spark 1.6.0

While executing below statement I am getting error in Spark 1.6.0. grouped_df statement is not working for me.
from pyspark.sql import functions as F
from pyspark import SQLContext
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
df = sc.parallelize(data).toDF(['id','date','value'])
df.show()
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
_to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'
You have to use HiveContext instead of SQLContext
from pyspark import SparkContext, HiveContext
sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id','date','value'])
# ...

SQLAlchemy cannot connect to Postgresql on localhost

I'm sure this is such an easy error to fix, if I could only find where it is. This is the error from the Flask app:
11:58:18 web.1 | ERROR:xxxxxx.core:Exception on / [GET]
11:58:18 web.1 | Traceback (most recent call last):
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
11:58:18 web.1 | response = self.full_dispatch_request()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
11:58:18 web.1 | rv = self.handle_user_exception(e)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
11:58:18 web.1 | reraise(exc_type, exc_value, tb)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
11:58:18 web.1 | rv = self.dispatch_request()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
11:58:18 web.1 | return self.view_functions[rule.endpoint](**req.view_args)
11:58:18 web.1 | File "xxxxxxx/web.py", line 202, in home
11:58:18 web.1 | d = {'featured': cached_apps.get_featured_front_page(),
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_cache/__init__.py", line 245, in decorated_function
11:58:18 web.1 | rv = f(*args, **kwargs)
11:58:18 web.1 | File "/Users/xxxxxxx/Desktop/PythonProjects/xxxxxx/xxxxxx2/xxxxxxx/cache/apps.py", line 35, in get_featured_front_page
11:58:18 web.1 | results = db.engine.execute(sql)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 780, in engine
11:58:18 web.1 | return self.get_engine(self.get_app())
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 797, in get_engine
11:58:18 web.1 | return connector.get_engine()
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 470, in get_engine
11:58:18 web.1 | self._sa.apply_driver_hacks(self._app, info, options)
11:58:18 web.1 | File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 739, in apply_driver_hacks
11:58:18 web.1 | if info.drivername.startswith('mysql'):
11:58:18 web.1 | AttributeError: 'NoneType' object has no attribute 'drivername'
From what I've been able to find online, it seems like the problem is that I might not be connecting correctly to the database. The app works fine in Heroku, but not when I run on localhost.
which psql:
/Applications/Postgres.app/Contents/MacOS/bin/psql
which postgres:
/Applications/Postgres.app/Contents/MacOS/bin/postgres
Postgres.app is running on 5432.
I don't know what else to check.
If it's supposed to connect to the same postgres DB on heroku regardless, why would it work on heroku, but not from localhost?
Maybe the app on localhost is using a wrong version of Postgres? I've tried uninstalling them (and only leaving Postgres.app), but I'm not sure if there's anything left on my computer that's causing conflicts. How would I check that? I'd appreciate any help.
EDIT: More info
Segment from the alembic.ini file:
[alembic]
# path to migration scripts
script_location = alembic
# template used to generate migration files
# file_template = %%(rev)s_%%(slug)s
# under Heroku, the line below needs to be inferred from
# the environment
sqlalchemy.url = postgres://xxxxxxxxxx:xxxxxxxxxxxx#xxxxxx.compute-1.amazonaws.com:5432/xxxxxxxxx
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic
I have a short script that produces the same error:
I run python cli.py db_create
on
#!/usr/bin/env python
import os
import sys
import optparse
import inspect
import xxxxxxx.model as model
from xxxxxx.core import db
import xxxxx.web as web
from alembic.config import Config
from alembic import command
def setup_alembic_config():
if "DATABASE_URL" not in os.environ:
alembic_cfg = Config("alembic.ini")
else:
dynamic_filename = "alembic-heroku.ini"
with file("alembic.ini.template") as f:
with file(dynamic_filename, "w") as conf:
for line in f.readlines():
if line.startswith("sqlalchemy.url"):
conf.write("sqlalchemy.url = %s\n" %
os.environ['DATABASE_URL'])
else:
conf.write(line)
alembic_cfg = Config(dynamic_filename)
command.stamp(alembic_cfg, "head")
def db_create():
'''Create the db'''
db.create_all()
# then, load the Alembic configuration and generate the
# version table, "stamping" it with the most recent rev:
setup_alembic_config()
# finally, add a minimum set of categories: Volunteer Thinking, Volunteer Sensing, Published and Draft
categories = []
categories.append(model.Category(name="Thinking",
short_name='thinking',
description='Volunteer Thinking apps'))
categories.append(model.Category(name="Volunteer Sensing",
short_name='sensing',
description='Volunteer Sensing apps'))
db.session.add_all(categories)
db.session.commit()
and I get:
Traceback (most recent call last):
File "cli.py", line 111, in <module>
_main(locals())
File "cli.py", line 106, in _main
_methods[method](*args[1:])
File "cli.py", line 33, in db_create
db.create_all()
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 856, in create_all
self._execute_for_all_tables(app, bind, 'create_all')
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 848, in _execute_for_all_tables
op(bind=self.get_engine(app, bind), tables=tables)
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 797, in get_engine
return connector.get_engine()
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 470, in get_engine
self._sa.apply_driver_hacks(self._app, info, options)
File "/Library/Python/2.7/site-packages/flask_sqlalchemy/__init__.py", line 739, in apply_driver_hacks
if info.drivername.startswith('mysql'):
AttributeError: 'NoneType' object has no attribute 'drivername'
My guess is that you haven't correctly configured Flask-SQLAlchemy. You have a lot of code that seems like it tries to configure it, but without going through all of it, my guess is that it either is setting up your configuration incorrectly, or setting up your configuration too late.
Make sure that before you call anything that will hit the database (like the db.create_all()) that your app.config["SQLALCHEMY_DATABASE_URI"] is set to the correct URI. It is probably set to None, and that is causing your issue.
agree with Mark Hildreth whose answer is 7 years old, took me two days to get here -
I was getting such errors:
FAILED tests/test_models.py::test_endpoint - AttributeError: 'NoneType' object has no attribute 'drivername'
ERROR tests/test_models.py::test_endpoint2 - AttributeError: 'NoneType' object has no attribute 'response_class'
traceback shows smth like:
self = <[AttributeError("'NoneType' object has no attribute 'drivername'") raised in repr()] SQLAlchemy object at 0x1fb8a622848>, app = <Flask 'app'>, sa_url = None, options = {}
Finaly with tihs answer i remembered that Flask sometimes need to have set config variables in place.
helped a lot thanks - app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get('mysql://root:1234#127.0.0.1/kvadrat?charset=utf8')