I am trying to run pyspark on my local machine, but SparkContext is not being imported; somehow the error is thrown from one of the Spark SQL modules.
(dataplot) name#name:~$ pyspark
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 31, in <module>
from pyspark import SparkConf
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/context.py", line 43, in <module>
from pyspark.profiler import ProfilerCollector, BasicProfiler
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/profiler.py", line 18, in <module>
import cProfile
File "/usr/lib/python3.6/cProfile.py", line 10, in <module>
import profile as _pyprofile
File "/home/inkadmin/profile.py", line 12, in <module>
from pyspark.sql import SparkSession
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/sql/__init__.py", line 45, in <module>
from pyspark.sql.types import Row
File "/home/inkadmin/virtualenvs/dataplot/lib/python3.6/site-packages/pyspark/sql/types.py", line 36, in <module>
from pyspark import SparkContext
ImportError: cannot import name 'SparkContext'
>>>
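Reading the traceback bottom-up, the stdlib cProfile module resolves "import profile" to /home/inkadmin/profile.py (a file in the home directory) instead of /usr/lib/python3.6/profile.py, and that file in turn imports pyspark.sql while pyspark itself is still initializing, which appears to be what produces the circular ImportError. A minimal check, run from the same directory and virtualenv (only the paths shown in the traceback above are assumed):
import importlib.util
# shows which file Python will actually load for the "profile" module;
# per the traceback it resolves to /home/inkadmin/profile.py rather than the stdlib module
print(importlib.util.find_spec("profile").origin)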
When I try to update my code to the new version of ruamel.yaml, I run into issues.
code:
import sys
import ruamel.yaml
print('Python', tuple(sys.version_info), ', ruamel.yaml', ruamel.yaml.version_info)
yaml_str = """\
number_to_name:
  1: name1
  2: name2
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print('before:', data)
data.update({4: 'name4'})
print('after: ', data)
print('==========')
yaml.dump(data, sys.stdout)
output with ruamel.yaml (0, 17, 4):
Python (3, 6, 13, 'final', 0) , ruamel.yaml (0, 17, 4)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
Traceback (most recent call last):
File "/home/lib/python3.6/site-packages/ruamel/yaml/comments.py", line 779, in update
self._ok.update(vals.keys()) # type: ignore
AttributeError: 'tuple' object has no attribute 'keys'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bin/runamel.py", line 15, in <module>
data.update({4: 'name4'})
File "/home/lib/python3.6/site-packages/ruamel/yaml/comments.py", line 783, in update
self._ok.add(x[0])
TypeError: 'int' object is not subscriptable
The same code works fine with the old version.
output with ruamel.yaml (0, 16, 10):
Python (3, 6, 13, 'final', 0) , ruamel.yaml (0, 16, 10)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
after: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')])), (4, 'name4')])
==========
number_to_name:
  1: name1
  2: name2
  4: name4
What am I doing wrong? (I also suspect that vals.keys() at line 779 will always raise an AttributeError, since vals is a tuple.)
This is an issue introduced between ruamel.yaml versions 0.16.10 and 0.17.4. It has been fixed in version 0.17.9:
import sys
import ruamel.yaml
print('Python', tuple(sys.version_info), ', ruamel.yaml', ruamel.yaml.version_info)
yaml_str = """\
number_to_name:
  1: name1
  2: name2
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print('before:', data)
data.update({4: 'name4'})
print('after: ', data)
print('==========')
yaml.dump(data, sys.stdout)
which gives:
Python (3, 9, 4, 'final', 0) , ruamel.yaml (0, 17, 9)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
after: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')])), (4, 'name4')])
==========
number_to_name:
  1: name1
  2: name2
  4: name4
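If upgrading is not an option, a minimal workaround sketch (my assumption, not part of the original answer) is to avoid CommentedMap.update() on the affected 0.17.x releases and assign the keys one by one, which sidesteps the broken code path in comments.py:
# same effect as data.update({4: 'name4'}), but without calling the broken update()
for key, value in {4: 'name4'}.items():
    data[key] = value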
PySpark does not allow me to create a bucketed table.
(
df
.write
.partitionBy('Source')
.bucketBy(8,'destination')
.saveAsTable('flightdata')
)
AttributeError Traceback (most recent call last)
in ()
----> 1 df.write.bucketBy(2,"Source").saveAsTable("table")
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'
It looks like bucketBy is only supported in Spark 2.3.0 and later:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.bucketBy
You could try creating a new bucket column:
from pyspark.ml.feature import Bucketizer
# Bucketizer needs a numeric input column and at least three strictly increasing split points
bucketizer = Bucketizer(splits=[-float('inf'), 0.0, float('inf')], inputCol="destination", outputCol="buckets")
df_with_buckets = bucketizer.setHandleInvalid("keep").transform(df)
and then using partitionBy(*cols)
df_with_buckets.write.partitionBy('buckets').saveAsTable("table")
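If upgrading Spark is an option instead, the bucketBy call from the question should work as written on Spark 2.3.0 and later; a sketch (the sortBy step is optional and only illustrative):
# requires Spark >= 2.3.0; bucketed output has to be written with saveAsTable
(
    df
    .write
    .partitionBy('Source')
    .bucketBy(8, 'destination')
    .sortBy('destination')
    .saveAsTable('flightdata')
)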
I am trying to put together a data pipeline on the HDP 2.6.3 sandbox (Docker). I am using PySpark with Phoenix (4.7) and HBase.
I have installed the Phoenix project from Maven and successfully created a table with test records. I can see the data in HBase as well.
Now I am trying to read data from the table using PySpark with the following code:
import phoenix
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(appName="Phoenix test")
sqlContext = SQLContext(sc)
table = sqlContext.read.format("org.apache.phoenix.spark").option("table", "INPUT_TABLE").option("zkUrl", "localhost:2181:/hbase-unsecure").load()
Phoenix DDL:
CREATE TABLE INPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (1, 'test_row_1',111);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (2, 'test_row_2',111 );
call:
spark-submit --class org.apache.phoenix.spark --jars /usr/hdp/current/phoenix-server/phoenix-4.7.0.2.5.0.0-1245-client.jar --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/spark2/conf/hbase-site.xml phoenix_test.py
Traceback (most recent call last):
File "/root/hdp/process_data.py", line 42, in
.format(data_source_format)\
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 593, in save
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/java_gateway.py", line 1160, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.UnsupportedOperationException: empty.tail
thanks,
clairvoyant
While executing the statement below I am getting an error in Spark 1.6.0; the grouped_df statement is not working for me.
from pyspark.sql import functions as F
from pyspark.sql import SQLContext
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
df = sc.parallelize(data).toDF(['id','date','value'])
df.show()
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
_to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'
You have to use HiveContext instead of SQLContext:
from pyspark import SparkContext
from pyspark.sql import HiveContext  # HiveContext lives in pyspark.sql, not in the top-level pyspark package
sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id','date','value'])
# ...
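Assuming the HiveContext setup above, the aggregation from the question should then run unchanged; a sketch completing the # ... placeholder:
from pyspark.sql import functions as F
# the collect_list(struct(...)) aggregation from the question, now on the HiveContext-backed DataFrame
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
grouped_df.show(truncate=False)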
I am trying to connect my PySpark cluster to a Cassandra cluster. I did the following to set up the connector from Spark to Cassandra:
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 ./examples/testing.py
I set the following in my Python file:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
SPARK_IP = "ip-111-11-1-1.us-west-2.compute.internal"
SPARK_PORT = "7077"
CASSANDRA_PORT = "222.22.2.22"
conf = SparkConf() \
.setMaster("spark://%s:%s" % (SPARK_IP, SPARK_PORT)) \
.set("spark.cassandra.connection.host", CASSANDRA_PORT)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
In my Cassandra cluster I created a keyspace and a table. I then try to read from Cassandra in pyspark and do the following:
sqlContext.read \
.format("org.apache.spark.sql.cassandra") \
.options(table="poop", keyspace="demo") \
.load().show()
I get the following error and I'm not sure how to fix this:
Traceback (most recent call last):
File "/usr/local/spark/examples/testing.py", line 37, in
.options(table="poop", keyspace="demo") \
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 155, in load
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html
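A hedged guess, not from the thread itself: this ClassNotFoundException usually means the Cassandra connector package was not resolved for the Spark/Scala build in use. The py4j-0.10.4 paths in the traceback point at a Spark 2.x installation (typically built against Scala 2.11), while spark-cassandra-connector_2.10:1.5.0-M2 targets Spark 1.5 on Scala 2.10. Assuming a Spark 2.x / Scala 2.11 build, a submit line along these lines might resolve the data source (the exact connector version is an assumption, not from the thread):
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 ./examples/testing.py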