Create a PySpark DataFrame from a Parquet file

I am quite new to PySpark and still trying to figure out how things work. After loading a Parquet file into memory with pyarrow, I try to turn it into a PySpark DataFrame, but I get an error.
I should mention that I am not reading the file directly through PySpark because it is in S3, and that gives me a different error: "no filesystem for scheme s3".
So I am trying to work around it. Below is a reproducible example.
import pyarrow.parquet as pq
import s3fs
from pyspark import SparkContext
from pyspark.sql import SparkSession
s3 = s3fs.S3FileSystem()
parquet_file = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.createDataFrame(parquet_file)
------------------------------------------------------------------
TypeError                         Traceback (most recent call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)
/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
TypeError: 'ParquetDataset' object is not iterable
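The traceback shows createDataFrame trying to iterate over its input, which a ParquetDataset does not support. A minimal sketch of one workaround, assuming the data fits in driver memory: read the dataset into a pyarrow Table, convert that to pandas, and pass the pandas DataFrame to Spark. The helper name `dataset_to_spark` is hypothetical and not part of pyarrow or PySpark.

```python
def dataset_to_spark(spark, dataset):
    """Convert a pyarrow ParquetDataset into a Spark DataFrame via pandas.

    Hypothetical helper: `spark` is a SparkSession and `dataset` is a
    pyarrow.parquet.ParquetDataset such as `parquet_file` in the question.
    """
    table = dataset.read()         # pyarrow.Table with the dataset's contents
    pandas_df = table.to_pandas()  # materialized in the driver's memory
    return spark.createDataFrame(pandas_df)
```

With the objects from the question, the call would be dataset_to_spark(spark, parquet_file). Everything passes through the driver, so this only suits data that fits in its memory.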

import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read Parquet files into a Spark DataFrame:
df = sqlC.read.parquet('path_to_file_or_dir')

You can read the data from S3 directly with Spark as long as you have the access key and secret key for the S3 bucket. This is more efficient than going through arrow and pandas and then converting to a Spark DataFrame, because that route reads the file serially on the driver and then has to parallelize the result.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
Source: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly
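On plain Apache Spark outside Databricks, the "no filesystem for scheme s3" error often means that no S3 connector is on the classpath. The following is a hedged sketch, assuming the hadoop-aws module (which provides the s3a:// scheme) is available to Spark and that credentials are passed as spark.hadoop.* settings. The helper names `s3a_conf` and `build_session` are hypothetical.

```python
def s3a_conf(access_key, secret_key):
    """Return Spark settings that hand credentials to the S3A connector."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(access_key, secret_key, app_name="s3a-example"):
    """Create (or reuse) a SparkSession configured for s3a:// paths.

    Assumes the hadoop-aws jar matching the cluster's Hadoop version is
    already on the classpath; this sketch does not add it.
    """
    from pyspark.sql import SparkSession  # imported lazily, only when called

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf(access_key, secret_key).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A read would then look like build_session(ACCESS_KEY, SECRET_KEY).read.parquet("s3a://bucket/prefix"), where the bucket and prefix are placeholders.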

Related

How do you have AWS Glue ETL job return a single file with all the results in it using PySpark?

I have a very basic AWS Glue ETL job that I created to select some fields from a data catalog, which was built by a crawler I have pointing at an RDS database. Once the dataset is returned, I export the results in CSV format. This works; however, the job writes around 20 separate output files. The dataset only has two rows in it right now, so only two files contain data and the rest just show the column headers with no second row. My requirement is a single CSV file that contains all of the data selected from the dataset. I have tried both the repartition and coalesce functions unsuccessfully: I am able to generate a single file, but my data is missing. I am new to AWS Glue and have been unable to figure this out, so any suggestions will be much appreciated.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node PostgreSQL
PostgreSQL_node1644981751584 = glueContext.create_dynamic_frame.from_catalog(
    database="newApp",
    table_name="database_schema_staging_hdr",
    transformation_ctx="PostgreSQL_node1644981751584",
)
# Script generated for node SQL
SqlQuery0 = """
select * from myDataSource
"""
SQL_node1644981807578 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": PostgreSQL_node1644981751584},
    transformation_ctx="SQL_node1644981807578",
)
# Script generated for node Amazon S3
AmazonS3_node1644981816657 = glueContext.write_dynamic_frame.from_options(
    frame=SQL_node1644981807578,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://awsglueetloutput/", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1644981816657",
)
job.commit()
You have to repartition the DynamicFrame into a single partition before writing it, because each partition is written out as its own file.
Example to end up with 1 file: SQL_node1644981807578 = SQL_node1644981807578.repartition(1)
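Applied to the script in the question, the repartition would sit between the SQL node and the S3 sink. A minimal sketch, assuming `frame` is a Glue DynamicFrame and `glue_context` is a GlueContext; the helper name `write_single_csv` and the transformation_ctx value are hypothetical.

```python
def write_single_csv(glue_context, frame, path):
    """Collapse `frame` to one partition, then write it as CSV to `path`.

    Hypothetical helper built from the calls already used in the question's
    script; one partition means one output file.
    """
    single = frame.repartition(1)  # all rows end up in a single partition
    return glue_context.write_dynamic_frame.from_options(
        frame=single,
        connection_type="s3",
        format="csv",
        connection_options={"path": path},
        transformation_ctx="single_csv_sink",  # hypothetical name
    )
```

In the question's script this would replace the final write, for example write_single_csv(glueContext, SQL_node1644981807578, "s3://awsglueetloutput/"). The file name itself is still generated by the job.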

Is it possible to reference a PySpark DataFrame using its rdd id?

If I "overwrite" a df using the same naming convention in PySpark such as in the example below, am I able to reference it later on using the rdd id?
from pyspark.sql.functions import concat_ws
df = spark.createDataFrame([('Abraham','Lincoln')], ['first_name', 'last_name'])
df.checkpoint()
df.show()
print(df.rdd.id())
df = df.select(df.first_name, df.last_name, concat_ws(' ', df.first_name, df.last_name).alias('full_name'))
df.checkpoint()
df.show()
print(df.rdd.id())

SparkDataFrame.dtypes fails if a column has special chars: how to bypass this and read the CSV with inferSchema?

Inferring the schema of a Spark DataFrame throws an error if the CSV file has a column with special characters.
Test sample
foo.csv
id,comment
1, #Hi
2, Hello
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("footest").getOrCreate()
df = spark.read.load("foo.csv", format="csv", inferSchema="true", header="true")
print(df.dtypes)
raise ValueError("Could not parse datatype: %s" % json_value)
I found a comment from Dat Tran about inferSchema in the spark-csv package on how to resolve this, but can't we still infer the schema before cleaning the data?
Use it like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("test19.csv")
print(df.dtypes)
Output:
[('id', 'int'), ('comment', 'string')]

How to create child dataframe from xml file using Pyspark?

I have all the supporting libraries in PySpark and I am able to create the DataFrame for the parent:
def xmlReader(root, row, filename):
    df = spark.read.format("com.databricks.spark.xml").options(rowTag=row,rootTag=root).load(filename)
    xref = df.select("genericEntity.entityId", "genericEntity.entityName","genericEntity.entityType","genericEntity.inceptionDate","genericEntity.updateTimestamp","genericEntity.entityLongName")
    return xref
df1 = xmlReader("BOBML","entityTransaction","s3://dev.xml")
df1.head()
I am unable to create the child DataFrame:
def xmlReader(root, row, filename):
    df2 = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df2.select("genericEntity.entityDetail", "genericEntity.entityDetialId","genericEntity.updateTimestamp")
    return xref
df3 = xmlReader("BOBML","s3://dev.xml")
df3.head()
I am not getting any output. I was planning to do a union between the parent and child DataFrames. Any help will be truly appreciated!
After more than 24 hours I was able to solve the problem. Thanks to everyone who at least looked at it.
Solution:
Step 1: Import a couple of libraries.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Step 2 (parent): Read the XML files, print the schema, register temp tables, and create the DataFrame.
Step 3 (child): Repeat step 2.
Step 4: Create the final DataFrame by joining the child and parent DataFrames.
Step 5: Load the data into S3 (write.csv/S3://Path) or a database.
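The read and join steps above can be sketched roughly as follows. This uses the DataFrame API instead of the temp tables the answer mentions, and it assumes that entityId identifies an entity in both the parent and child views; that join key is an assumption, not something the answer states. The helper names `read_entities` and `parent_and_child` are hypothetical.

```python
XML_FORMAT = "com.databricks.spark.xml"  # spark-xml data source from the question


def read_entities(spark, path, row_tag, root_tag):
    """Load the XML file; each `row_tag` element becomes one row."""
    return (
        spark.read.format(XML_FORMAT)
        .options(rowTag=row_tag, rootTag=root_tag)
        .load(path)
    )


def parent_and_child(raw):
    """Build parent and child views of `raw` and join them.

    Assumption: `entityId` is present in both views and works as the key.
    Selecting a nested field names the column after its last segment, so
    both views expose a column called `entityId`.
    """
    parent = raw.select("genericEntity.entityId", "genericEntity.entityName")
    child = raw.select("genericEntity.entityId", "genericEntity.entityDetail")
    return parent.join(child, on="entityId", how="left")
```

With the values from the question, this would be called as parent_and_child(read_entities(spark, "s3://dev.xml", "entityTransaction", "BOBML")).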

Spark - UnsupportedOperationException: collect_list is not supported in a window operation

I am using Spark 1.6. I have a DataFrame generated from a Parquet file with 6 columns. I am trying to group (partitionBy) and order (orderBy) the rows in the DataFrame, and then collect those columns into an array.
I wasn't sure whether these operations were possible in Spark 1.6, but the following answers show how it can be done:
https://stackoverflow.com/a/35529093/1773841 #zero323
https://stackoverflow.com/a/45135012/1773841 #Ramesh Maharjan
Based on those answers I wrote the following code:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, struct}
import org.apache.spark.sql.hive.HiveContext
val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"
val dfSource: DataFrame = sqlContext.read.format("parquet").option("dateFormat", "DDMONYY").option("timeFormat", "HH24:MI:SS").load(dataPath)
val w = Window.partitionBy("code").orderBy(col("date").desc)
val dfCollec = dfSource.withColumn("collected", collect_list(struct("col1","col2","col3","col4","col5","col6")).over(w))
So, I followed the pattern written by Ramesh, and I created the sqlContext based on Hive as Zero recommended. But I am still getting the following error:
java.lang.UnsupportedOperationException: 'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1052)
What am I missing still?