Getting error while using collect_list function with struct datatype in Spark 1.6.0 - pyspark

While executing the statement below I get an error in Spark 1.6.0; the grouped_df statement is not working for me.
from pyspark.sql import functions as F
from pyspark.sql import SQLContext
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
df = sc.parallelize(data).toDF(['id','date','value'])
df.show()
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
_to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'

You have to use HiveContext instead of SQLContext:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id','date','value'])
# ...
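With the HiveContext-backed DataFrame in place, the aggregation from the question should then go through; a short sketch of the remaining lines (the same statement as in the question, repeated here for completeness):
from pyspark.sql import functions as F

grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
grouped_df.show()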

Related

separating dates and getting all permutations of products in Pandas UDF

I am trying to get all possible pairs of dates using a pandas_udf. As I understand it, the DataFrame has to be grouped before it can be sent to a pandas_udf, so I am adding an ID and grouping by it, but I get an error. Here is a small example to recreate the error:
import itertools
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame([('11/30/15,11/30/18', '11/30/18,11/30/18'),
                            ('11/30/15,11/30/18', '11/30/15,11/30/18')],
                           ['colname1', 'colname2'])
schema = StructType([StructField('Product', StringType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_courses_final_df(this_row):
    this_row_course_date_obj_list = this_row['colname1']
    this_row_course_date_obj_list1 = this_row['colname2']
    return pd.DataFrame(list(itertools.product(this_row_course_date_obj_list.str.split(','),
                                               this_row_course_date_obj_list1.str.split(','))))

df1 = df.withColumn("id", monotonically_increasing_id())
df2 = df1.groupby('id')
df3 = df2.apply(calculate_courses_final_df)
df3.show()
Here is what df1 looks like:
+-----------------+-----------------+-----------+
| colname1| colname2| id|
+-----------------+-----------------+-----------+
|11/30/15,11/30/18|11/30/18,11/30/18|25769803776|
|11/30/15,11/30/18|11/30/15,11/30/18|60129542144|
+-----------------+-----------------+-----------+
So the output should look like the following:
+----------------------------------------------------------------------------------------------------------+
|[('11/30/15', '11/30/15'), ('11/30/15', '11/30/18'), ('11/30/18', '11/30/15'), ('11/30/18', '11/30/18')]   |
+----------------------------------------------------------------------------------------------------------+
Here is the error that I'm getting:
PythonException: An exception was thrown from a UDF: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'. Full traceback below:
Traceback (most recent call last):
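The traceback is truncated above, but the error message itself points at a mismatch between the declared schema (a single StringType column) and the returned pandas DataFrame, whose cells are Python lists. A minimal sketch of one way to make the returned frame match the declared schema, assuming each group holds a single row; this function is illustrative and not from the original post:
import itertools
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Same schema as in the question: one string column.
schema = StructType([StructField('Product', StringType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_courses_final_df(this_row):
    # Each group contains one row, so take the only value of each column,
    # split the comma-separated dates, and build the cross product as strings.
    dates1 = this_row['colname1'].iloc[0].split(',')
    dates2 = this_row['colname2'].iloc[0].split(',')
    pairs = itertools.product(dates1, dates2)
    return pd.DataFrame({'Product': [str(p) for p in pairs]})
Applying this with the same df1.groupby('id').apply(...) call as above should then yield one string row per date pair instead of the Arrow type error.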

Unable to create RDD from another RDD

I am using PySpark in a Jupyter Notebook and I want to create an RDD from another RDD. Can anybody help?
dataRDD = sc.parallelize("welcome to pySpark by brightrace academy".split(" "))
newDataRDD = dataRDD.map(lambda line: line.upper())
newDataRDD.collect()
I get the following error when I try to create the new RDD from the existing one:
Py4JJavaError Traceback (most recent call last)
<ipython-input-10-de8c35a40e72> in <module>
2
3 newDataRDD = dataRDD.map(lambda line: line.upper())
----> 4 newDataRDD.collect()
C:\SPARK\python\pyspark\rdd.py in collect(self)

Create pyspark dataframe from parquet file

I am quite new to PySpark and I am still trying to figure out how things work. What I am trying to do is load a parquet file into memory using pyarrow and then convert it to a PySpark DataFrame, but I am getting an error.
I should mention that I am not reading the file directly through PySpark because it is in S3, which gives me another error about "no filesystem for scheme s3", so I am trying to work around that. Below is a reproducible example.
import pyarrow.parquet as pq
import s3fs
from pyspark import SparkContext
from pyspark.sql import SparkSession

s3 = s3fs.S3FileSystem()
parquet_file = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

spark.createDataFrame(parquet_file)
------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)

/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read parquet files into a Spark DataFrame:
df = sqlC.read.parquet('path_to_file_or_dir')
You can read data from S3 via Spark as long as you have the public and secret keys for the S3 bucket. This is more efficient than going through arrow via pandas and then converting to a Spark DataFrame, because otherwise you would have to parallelize a serial read yourself.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly
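If you still want to go through pyarrow as in the question, one workaround is to materialize the dataset as a pandas DataFrame first, since createDataFrame accepts a pandas DataFrame but not a ParquetDataset. A sketch, assuming the ParquetDataset loads correctly:
import pyarrow.parquet as pq
import s3fs
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3 = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)

# ParquetDataset -> pyarrow Table -> pandas DataFrame -> Spark DataFrame
pandas_df = dataset.read().to_pandas()
df = spark.createDataFrame(pandas_df)
df.show()
Note that this pulls the whole dataset onto the driver before Spark parallelizes it, which is exactly the inefficiency the direct S3 read above avoids.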

Pyspark does not allow me to create bucket

PySpark does not allow me to create buckets when writing a table:
(
df
.write
.partitionBy('Source')
.bucketBy(8,'destination')
.saveAsTable('flightdata')
)
AttributeError Traceback (most recent call last)
in ()
----> 1 df.write.bucketBy(2,"Source").saveAsTable("table")
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'
It looks like bucketBy is only supported from Spark 2.3.0 onwards:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.bucketBy
You could try creating a new bucket column
from pyspark.ml.feature import Bucketizer
bucketizer = Bucketizer(splits=[ 0, float('Inf') ],inputCol="destination", outputCol="buckets")
df_with_buckets = bucketizer.setHandleInvalid("keep").transform(df)
and then using partitionBy(*cols)
df_with_buckets.write.partitionBy('buckets').saveAsTable("table")
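For reference, on Spark 2.3.0 or later the bucketBy call from the question does work; the one constraint to keep in mind is that bucketing is only supported together with saveAsTable, not with a plain save. A sketch, assuming the same df and column names as in the question:
(
    df
    .write
    .partitionBy('Source')
    .bucketBy(8, 'destination')
    .saveAsTable('flightdata')
)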

Connecting pyspark cluster to Cassandra cluster ERROR o64.load

I am trying to connect my PySpark cluster to a Cassandra cluster. I did the following to set up the Spark-to-Cassandra connector:
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 ./exaples/testing.py
I set the following in my python file:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
SPARK_IP = "ip-111-11-1-1.us-west-2.compute.internal"
SPARK_PORT = "7077"
CASSANDRA_PORT = "222.22.2.22"
conf = SparkConf() \
    .setMaster("spark://%s:%s" % (SPARK_IP, SPARK_PORT)) \
    .set("spark.cassandra.connection.host", CASSANDRA_PORT)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
In my Cassandra cluster I created a keyspace and a table. I then try to read from Cassandra in pyspark and do the following:
sqlContext.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="poop", keyspace="demo") \
    .load().show()
I get the following error and I'm not sure how to fix this:
Traceback (most recent call last):
File "/usr/local/spark/examples/testing.py", line 37, in
.options(table="poop", keyspace="demo") \
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 155, in load
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html
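One possible cause worth checking, offered as an assumption rather than a confirmed fix: this ClassNotFoundException typically means the connector never made it onto the classpath, for example because the --packages artifact does not match the cluster's Spark/Scala build. The py4j-0.10.4 paths in the traceback suggest a Spark 2.x install built against Scala 2.11, which would need a 2.x connector rather than spark-cassandra-connector_2.10:1.5.0-M2, along these lines (the version number below is illustrative; pick the one matching your Spark release):
./bin/spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 \
  ./exaples/testing.py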