I am trying to put together a data pipeline on the HDP 2.6.3 sandbox (Docker). I am using PySpark with Phoenix (4.7) and HBase.
I have installed the Phoenix project from Maven and successfully created a table with test records. I can see the data in HBase as well.
Now I am trying to read data from the table using PySpark with the following code:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Phoenix test")
sqlContext = SQLContext(sc)

table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "INPUT_TABLE") \
    .option("zkUrl", "localhost:2181:/hbase-unsecure") \
    .load()
Phoenix DDL:
CREATE TABLE INPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (1, 'test_row_1', 111);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (2, 'test_row_2', 111);
spark-submit call:
spark-submit --class org.apache.phoenix.spark --jars /usr/hdp/current/phoenix-server/phoenix-4.7.0.2.5.0.0-1245-client.jar --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/spark2/conf/hbase-site.xml phoenix_test.py
Traceback (most recent call last):
File "/root/hdp/process_data.py", line 42, in
.format(data_source_format)\
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 593, in save
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/java_gateway.py", line 1160, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.UnsupportedOperationException: empty.tail
thanks,
clairvoyant
I want to create a Snowflake table from my PySpark code, as below:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
sqlContext.sql("create or replace table NEW_TABLE (id integer, desc varchar)")
I am getting an error when I run this.
Your code will create a Hive table, not a Snowflake table. You would have to write via a DataFrame instead, like this:
sfOptions = {
'sfUrl': '...',
'sfUser': '...',
'sfPassword': '...',
...
}
(df
.write
.format('snowflake')
.mode(mode)
.options(**sfOptions)
.save()
)
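Note that the connector also needs to know which table to write to; with the Snowflake Spark connector that is normally passed as the dbtable option. A sketch (reusing sfOptions and mode from above; NEW_TABLE is the table name from the question):
(df
    .write
    .format('snowflake')
    .options(**sfOptions)
    .option('dbtable', 'NEW_TABLE')  # target table name
    .mode(mode)
    .save()
)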
Or, if you really want to trigger a single Snowflake query from Spark, you can use the Snowflake runQuery API:
query = "create or replace table NEW_TABLE (id integer, desc varchar)"
spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, query)
I am quite new to PySpark and I am still trying to figure out how things work. What I am trying to do is load a Parquet file into memory using pyarrow and then turn it into a PySpark DataFrame, but I am getting an error.
I should mention that I am not reading directly through PySpark, because the file is in S3 and that gives me another error about "no filesystem for scheme s3", so I am trying to work around it. Below is a reproducible example.
import pyarrow.parquet as pq
import s3fs
from pyspark import SparkContext
from pyspark.sql import SparkSession

s3 = s3fs.S3FileSystem()
parquet_file = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

spark.createDataFrame(parquet_file)
------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)

/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read Parquet files into a Spark DataFrame:
df = sqlC.read.parquet('path_to_file_or_dir')
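If you do want to stay on the pyarrow route from the question, one workaround (just a sketch, reusing the parquet_file ParquetDataset and the spark session defined in the question) is to materialise the dataset as a pandas DataFrame first and hand that to Spark:
# A ParquetDataset is not iterable, hence the TypeError above; read it into
# an Arrow Table, convert to pandas, and let Spark build the DataFrame.
pandas_df = parquet_file.read().to_pandas()
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()
Keep in mind this pulls the whole file into driver memory, so it only makes sense for data that fits on a single machine.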
You can read data from S3 via Spark as long as you have the access and secret keys for the S3 bucket. This would be more efficient than going through Arrow via pandas and then converting to a Spark DataFrame, because the pyarrow/pandas route reads serially on the driver and you would then have to parallelize that data yourself.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly
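Outside of Databricks the bare s3:// scheme (and the s3n settings above) may not be available at all, which matches the "no filesystem for scheme s3" error from the question. A sketch of the usual alternative, assuming the hadoop-aws package and its AWS SDK dependency are on Spark's classpath, is to use s3a with the corresponding keys:
# ACCESS_KEY, SECRET_KEY and the bucket path are placeholders
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
df = spark.read.parquet("s3a://bucket/path/to/parquet/files")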
PySpark does not allow me to create buckets:
(
df
.write
.partitionBy('Source')
.bucketBy(8,'destination')
.saveAsTable('flightdata')
)
AttributeError                            Traceback (most recent call last)
in <module>()
----> 1 df.write.bucketBy(2,"Source").saveAsTable("table")
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'
It looks like bucketBy is only supported from Spark 2.3.0 onwards:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.bucketBy
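On Spark 2.3.0 or later a call along the lines of the original should work, for example (a sketch using the table and column names from the question):
(df
    .write
    .partitionBy('Source')
    .bucketBy(8, 'destination')   # bucketBy(numBuckets, col) exists from 2.3.0
    .sortBy('destination')        # optional, keeps each bucket sorted
    .saveAsTable('flightdata')    # bucketed writes require saveAsTable
)
If upgrading is not an option, a rough workaround follows.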
You could try creating a new bucket column:
from pyspark.ml.feature import Bucketizer
bucketizer = Bucketizer(splits=[ 0, float('Inf') ],inputCol="destination", outputCol="buckets")
df_with_buckets = bucketizer.setHandleInvalid("keep").transform(df)
and then using partitionBy(*cols)
df_with_buckets.write.partitionBy('buckets').saveAsTable("table")
While executing the statement below I am getting an error in Spark 1.6.0; the grouped_df statement is not working for me.
from pyspark.sql import functions as F
from pyspark.sql import SQLContext
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
df = sc.parallelize(data).toDF(['id','date','value'])
df.show()
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
_to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'
You have to use HiveContext instead of SQLContext
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id','date','value'])
# ...
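If HiveContext does resolve the UDAF issue as this answer suggests, the aggregation from the question should then run unchanged, e.g.:
from pyspark.sql import functions as F

# same collect_list/struct expression as in the question
grouped_df = df.groupby("id").agg(
    F.collect_list(F.struct("date", "value")).alias("list_col")
)
grouped_df.show(truncate=False)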
I am trying to connect my PySpark cluster to a Cassandra cluster. I did the following to set up the connector from Spark to Cassandra:
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 ./examples/testing.py
I set the following in my Python file:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
SPARK_IP = "ip-111-11-1-1.us-west-2.compute.internal"
SPARK_PORT = "7077"
CASSANDRA_PORT = "222.22.2.22"
conf = SparkConf() \
.setMaster("spark://%s:%s" % (SPARK_IP, SPARK_PORT)) \
.set("spark.cassandra.connection.host", CASSANDRA_PORT)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
In my Cassandra cluster I created a keyspace and a table. I then try to read from Cassandra in PySpark with the following:
sqlContext.read \
.format("org.apache.spark.sql.cassandra") \
.options(table="poop", keyspace="demo") \
.load().show()
I get the following error and I'm not sure how to fix this:
Traceback (most recent call last):
File "/usr/local/spark/examples/testing.py", line 37, in
.options(table="poop", keyspace="demo") \
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 155, in load
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html