PySpark does not allow me to create a bucketed table:
(
df
.write
.partitionBy('Source')
.bucketBy(8,'destination')
.saveAsTable('flightdata')
)
AttributeError Traceback (most recent call last)
in ()
----> 1 df.write.bucketBy(2,"Source").saveAsTable("table")
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'
It looks like bucketBy is only available in the PySpark DataFrameWriter API from Spark 2.3.0 onwards:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.bucketBy
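If upgrading is an option, the code from the question should work essentially unchanged on Spark 2.3.0+ (a sketch; sortBy is optional, and bucketing requires saveAsTable because bucketed tables must be backed by the metastore):
(
    df
    .write
    .partitionBy('Source')
    .bucketBy(8, 'destination')
    .sortBy('destination')   # optional; sortBy only works together with bucketBy
    .saveAsTable('flightdata')
)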
If upgrading is not an option, you could try creating a new bucket column
from pyspark.ml.feature import Bucketizer

# Note: Bucketizer expects a numeric input column, and splits of [0, Inf) produce a
# single bucket; adjust the splits (and the input column) to get a useful bucketing.
bucketizer = Bucketizer(splits=[0, float('Inf')], inputCol="destination", outputCol="buckets")
df_with_buckets = bucketizer.setHandleInvalid("keep").transform(df)
and then using partitionBy(*cols)
df_with_buckets.write.partitionBy('buckets').saveAsTable("table")
Related
I am trying to get the permutations of all possible pairs of dates using a pandas_udf. As I understand it, the dataframe has to be grouped before it can be sent to a pandas_udf, so I am adding an ID column and grouping by it, but I get an error. Here is a small example that reproduces it:
import pandas as pd
import itertools
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import pandas_udf, PandasUDFType, monotonically_increasing_id

df = spark.createDataFrame([('11/30/15,11/30/18', '11/30/18,11/30/18'), ('11/30/15,11/30/18', '11/30/15,11/30/18')], ['colname1', 'colname2'])

schema = StructType([StructField('Product', StringType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_courses_final_df(this_row):
    this_row_course_date_obj_list = this_row['colname1']
    this_row_course_date_obj_list1 = this_row['colname2']
    return pd.DataFrame(list(itertools.product(this_row_course_date_obj_list.str.split(','), this_row_course_date_obj_list1.str.split(','))))

df1 = df.withColumn("id", monotonically_increasing_id())
df2 = df1.groupby('id')
df3 = df2.apply(calculate_courses_final_df)
df3.show()
Here is what df1 looks like:
+-----------------+-----------------+-----------+
| colname1| colname2| id|
+-----------------+-----------------+-----------+
|11/30/15,11/30/18|11/30/18,11/30/18|25769803776|
|11/30/15,11/30/18|11/30/15,11/30/18|60129542144|
+-----------------+-----------------+-----------+
So the output should look like the following:
+----------------------------------------------------------------------------------------------------------+
|[('11/30/15', '11/30/15'), ('11/30/15', '11/30/18'), ('11/30/18', '11/30/15'), ('11/30/18', '11/30/18')]   |
+----------------------------------------------------------------------------------------------------------+
Here is the error that I'm getting:
PythonException: An exception was thrown from a UDF: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'. Full traceback below:
Traceback (most recent call last):
Can anybody help?
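The error suggests the UDF's output does not match its declared schema: the schema promises a single string column, but the returned pandas DataFrame contains list objects. Below is a minimal sketch of a GROUPED_MAP UDF whose output does match its schema; the column names (date1, date2) and the flattening into one row per pair are assumptions about the desired result, not something stated in the question:
import pandas as pd
import itertools
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Hypothetical output schema: one row per (date1, date2) combination.
pair_schema = StructType([
    StructField('date1', StringType(), True),
    StructField('date2', StringType(), True),
])

@pandas_udf(pair_schema, PandasUDFType.GROUPED_MAP)
def cross_dates(pdf):
    # Each group contains a single row (the id column is unique), so split the
    # comma-separated strings and emit the cartesian product as plain strings.
    dates1 = pdf['colname1'].iloc[0].split(',')
    dates2 = pdf['colname2'].iloc[0].split(',')
    return pd.DataFrame(list(itertools.product(dates1, dates2)), columns=['date1', 'date2'])

df1.groupby('id').apply(cross_dates).show(truncate=False)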
I am using PySpark on Jupyter Notebook. I want to create an RDD from another RDD.
dataRDD = sc.parallelize("welcome to pySpark by brightrace academy".split(" "))
newDataRDD = dataRDD.map(lambda line: line.upper())
newDataRDD.collect()
I get an error when I try to create an RDD from an existing RDD:
The error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-10-de8c35a40e72> in <module>
2
3 newDataRDD = dataRDD.map(lambda line: line.upper())
----> 4 newDataRDD.collect()
C:\SPARK\python\pyspark\rdd.py in collect(self)
If I "overwrite" a df using the same naming convention in PySpark such as in the example below, am I able to reference it later on using the rdd id?
df = spark.createDataFrame([('Abraham','Lincoln')], ['first_name', 'last_name'])
df.checkpoint()
print(df.show())
print(df.rdd.id())
from pyspark.sql.functions import *
df = df.select(df.first_name, df.last_name, concat_ws(' ', df.first_name, df.last_name).alias('full_name'))
df.checkpoint()
print(df.show())
print(df.rdd.id())
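For what it's worth, DataFrame.checkpoint() returns a new checkpointed DataFrame rather than modifying df in place, and rdd.id() is just an integer tag (as far as I know, PySpark does not expose a way to look an RDD back up by its id), so the simplest way to keep the earlier version reachable is to hold a separate Python reference to it. A sketch under those assumptions, with illustrative variable names:
from pyspark.sql.functions import concat_ws

# checkpoint() needs a checkpoint directory and returns the checkpointed frame.
spark.sparkContext.setCheckpointDir('/tmp/checkpoints')

people = spark.createDataFrame([('Abraham', 'Lincoln')], ['first_name', 'last_name'])
people_ckpt = people.checkpoint()   # capture the returned DataFrame

full = people_ckpt.select(
    'first_name',
    'last_name',
    concat_ws(' ', 'first_name', 'last_name').alias('full_name'),
)

# Both versions stay usable because we kept separate references,
# independent of whatever rdd id each one reports.
people_ckpt.show()
full.show()
print(people_ckpt.rdd.id(), full.rdd.id())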
I am quite new to PySpark and I am still trying to figure out how things work. What I am trying to do is load a parquet file into memory using pyarrow and then convert it into a PySpark dataframe, but I am getting an error.
I should mention that I am not reading the file directly through PySpark because it is in S3, which gives me another error about "no filesystem for scheme s3", so I am trying to work around that. Below is a reproducible example.
import pyarrow.parquet as pq
import s3fs
from pyspark import SparkContext
from pyspark.sql import SparkSession

s3 = s3fs.S3FileSystem()
parquet_file = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

spark.createDataFrame(parquet_file)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)

/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read parquet files into a Spark df:
df = sqlC.read.parquet('path_to_file_or_dir')
You can read data from S3 via Spark directly, as long as you have the access and secret keys for the S3 bucket ... this would be more efficient than going through arrow and pandas and then converting to a Spark dataframe, because that route does a serial read that you then have to parallelize.
# Note: the URI scheme in the path (s3, s3n, s3a) has to correspond to a filesystem
# implementation that is actually available on the cluster; outside managed platforms
# this usually means s3a together with the hadoop-aws package.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly
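If you do want to stay on the pyarrow route instead, the TypeError at the bottom of the question is simply because createDataFrame cannot consume a ParquetDataset object directly. A minimal sketch of that workaround, assuming the data fits in driver memory and reusing the placeholder path from the question:
import pyarrow.parquet as pq
import s3fs
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3 = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)

# Materialize the dataset as a pyarrow Table on the driver, convert it to
# pandas, and hand the pandas DataFrame to Spark.
pandas_df = dataset.read().to_pandas()
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()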
While executing the statement below I am getting an error in Spark 1.6.0; the grouped_df statement is not working for me:
from pyspark.sql import functions as F
from pyspark.sql import SQLContext
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
df = sc.parallelize(data).toDF(['id','date','value'])
df.show()
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
_to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'
You have to use HiveContext instead of SQLContext
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)
data = [[1,'2014-01-03', 10],[1,'2014-01-04', 5],[1,'2014-01-05', 15],[1,'2014-01-06' , 20],[2,'2014-02-10', 100],[2,'2014-03-11', 500],[2,'2014-04-15',1500]]
rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id','date','value'])
# ...
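With the HiveContext-backed df, the aggregation from the question should then go through:
from pyspark.sql import functions as F

grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
grouped_df.show()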