Pyarrow parquet can't read dataset with large metadata

I used Petastorm's row_group_indexer to build an index for a column in a Petastorm dataset. After that, the size of the metadata file increased significantly and Pyarrow can no longer load the dataset, failing with this error:
OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
Here is the code I'm using to load the dataset:
from pyarrow import parquet as pq
dataset_path = "path/to/dataset/"
dataset = pq.ParquetDataset(path_or_paths=dataset_path)
Code used for indexing the materialized petastorm dataset:
from pyspark.sql import SparkSession
from petastorm.etl.rowgroup_indexers import SingleFieldIndexer
from petastorm.etl.rowgroup_indexing import build_rowgroup_index
dataset_url = "file:///path/to/dataset"
spark = SparkSession.builder.appName("demo").config("spark.jars").getOrCreate()
indexer = [SingleFieldIndexer(index_name="my_index",index_field="COLUMN1")]
build_rowgroup_index(dataset_url, spark.sparkContext, indexer)
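A hedged workaround sketch: recent pyarrow releases (roughly 9.0 and later) expose the Thrift deserialization limits that this error hits as optional parameters on ParquetDataset and ParquetFile; whether raising them is enough depends on how large the Petastorm index made the metadata, and the exact parameter names should be checked against your installed version.
from pyarrow import parquet as pq
# Raise the Thrift limits while opening the dataset; the values below are
# arbitrary upper bounds, not recommendations (assumes pyarrow >= 9.0).
dataset = pq.ParquetDataset(
    path_or_paths="path/to/dataset/",
    thrift_string_size_limit=1 << 30,      # bytes
    thrift_container_size_limit=1 << 30,   # number of elements
)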

Related

How do you have AWS Glue ETL job return a single file with all the results in it using PySpark?

I have a very basic AWS Glue ETL job that I created to select some fields from a data catalog that was built from a crawler I have pointing to an RDS database. Once the dataset is returned, I export the results in CSV format. This works; however, the output generates around 20 unique files. The dataset only has two rows in it right now, so only two files have data and the rest just show the column headers with no second row. My requirement is to have a single CSV file that contains all of the data selected from the dataset. I have tried both the repartition and coalesce functions unsuccessfully. I am able to generate the single file, but my data is missing. I am new to AWS Glue and have been unable to figure this out, so any suggestions will be much appreciated.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node PostgreSQL
PostgreSQL_node1644981751584 = glueContext.create_dynamic_frame.from_catalog(
    database="newApp",
    table_name="database_schema_staging_hdr",
    transformation_ctx="PostgreSQL_node1644981751584",
)
# Script generated for node SQL
SqlQuery0 = """
select * from myDataSource
"""
SQL_node1644981807578 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": PostgreSQL_node1644981751584},
    transformation_ctx="SQL_node1644981807578",
)
# Script generated for node Amazon S3
AmazonS3_node1644981816657 = glueContext.write_dynamic_frame.from_options(
    frame=SQL_node1644981807578,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://awsglueetloutput/", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1644981816657",
)
job.commit()
You have to repartition the DynamicFrame to achieve that.
For example, to end up with a single file, call SQL_node1644981807578 = SQL_node1644981807578.repartition(1) before the write, as sketched below.
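A minimal sketch of where that call would go in the generated script above, assuming the node names from the question; repartition returns a new DynamicFrame, so the result has to be reassigned before the S3 write node uses it:
# Collapse the DynamicFrame to a single partition so Glue writes one CSV file
# instead of one file per partition (node names reused from the question).
SQL_node1644981807578 = SQL_node1644981807578.repartition(1)

AmazonS3_node1644981816657 = glueContext.write_dynamic_frame.from_options(
    frame=SQL_node1644981807578,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://awsglueetloutput/", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1644981816657",
)
Note that a single output file also means a single writer task, so this only makes sense for small result sets like the two-row dataset described above.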

Read files with different column order

I have a few csv files with headers, but I found out that some files have different column orders. Is there a way to handle this with Spark where I can define the select order for each file, so that the master DF doesn't have a mismatch where col x might have values from col y?
My current read -
val masterDF = spark.read.option("header", "true").csv(allFiles:_*)
Extract all the file names and store them in a list variable.
Then define a schema with all the required columns.
Iterate through each file with header=true, so each file is read separately.
unionAll the new dataframe with the existing dataframe.
Example:
file_lst=['<path1>','<path2>']
from pyspark.sql.functions import *
from pyspark.sql.types import *
#define schema for the required columns
schema = StructType([StructField("column1",StringType(),True),StructField("column2",StringType(),True)])
#create an empty dataframe
df=spark.createDataFrame([],schema)
for i in file_lst:
    tmp_df = spark.read.option("header", "true").csv(i).select("column1", "column2")
    df = df.unionAll(tmp_df)
#display results
df.show()
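As a follow-up note to the answer above: from Spark 2.3 onward, unionByName resolves columns by name rather than by position, so a hedged alternative sketch (assuming every file has the same header names and only the column order differs) is to skip the per-file select entirely.
# Read the first file, then union the remaining files by column name.
df = spark.read.option("header", "true").csv(file_lst[0])
for path in file_lst[1:]:
    df = df.unionByName(spark.read.option("header", "true").csv(path))
df.show()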

Create pyspark dataframe from parquet file

I am quite new to pyspark and I am still trying to figure out how things work. What I am trying to do is, after loading a parquet file in memory using pyarrow, make it into a pyspark dataframe. But I am getting an error.
I should mention that I am not reading directly through pyspark, because the file is in s3, which gives me another error about "no filesystem for scheme s3",
so I am trying to work around it. Below I have a reproducible example.
import pyarrow.parquet as pq
import s3fs
from pyspark import SparkContext
from pyspark.sql import SparkSession

s3 = s3fs.S3FileSystem()
parquet_file = pq.ParquetDataset('s3filepath.parquet', filesystem=s3)
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.createDataFrame(parquet_file)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)

/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read parquet files into a Spark dataframe:
df = sqlC.read.parquet('path_to_file_or_dir')
You can read data from S3 via Spark as long as you have the access and secret keys for the S3 bucket ... this would be more efficient compared to going through arrow via pandas and then converting to a spark dataframe, because you would have to parallelize the serial read.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly
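For what it is worth, on newer Hadoop builds the s3a connector is generally used instead of s3n; a hedged variant of the same idea (the property names below are standard Hadoop s3a settings, not taken from the answer above, and ACCESS_KEY/SECRET_KEY are placeholders) looks like this:
# Configure the s3a connector with your AWS credentials, then read directly.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
df = spark.read.parquet("s3a://path/to/parquet/files")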

streamWrite with append option and window function

I'm trying to writeStream using the append option, but I get an error.
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window
from pyspark.sql.functions import col, column, count, when
spark = SparkSession\
    .builder\
    .appName("get_sensor_data")\
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
Sensor = lines.select(lines.value.alias('Sensor'),
                      lines.timestamp)
windowedCounts = Sensor.withWatermark('timestamp', '10 seconds').groupBy(
        window(Sensor.timestamp, windowDuration, slideDuration)).\
    agg(count(when(col('Sensor') == "LR1 On", True)).alias('LR1'),
        count(when(col('Sensor') == "LR2 On", True)).alias('LR2'),
        count(when(col('Sensor') == "LD On", True)).alias('LD')).\
    orderBy('window')
query = windowedCounts\
.writeStream\
.outputMode('append')\
.format("console")\
.start()
Error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
The reason for using the append option is to save as a CSV file later.
I think this problem is caused by the window function, but I don't know how to solve it.
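A hedged sketch, reusing lines, windowDuration and slideDuration from the snippet above (none of which are defined there): sorting a streaming DataFrame is only supported in complete output mode, so the usual way to let append mode run is to drop the final orderBy while keeping the watermark on the same event-time column that window() uses.
# Same aggregation as above, without the orderBy that append mode cannot handle.
windowedCounts = Sensor \
    .withWatermark('timestamp', '10 seconds') \
    .groupBy(window(Sensor.timestamp, windowDuration, slideDuration)) \
    .agg(count(when(col('Sensor') == "LR1 On", True)).alias('LR1'),
         count(when(col('Sensor') == "LR2 On", True)).alias('LR2'),
         count(when(col('Sensor') == "LD On", True)).alias('LD'))

query = windowedCounts \
    .writeStream \
    .outputMode('append') \
    .format("console") \
    .start()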

How to create child dataframe from xml file using Pyspark?

I have all the supporting libraries in pyspark and I am able to create the dataframe for the parent:
def xmlReader(root, row, filename):
    df = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df.select("genericEntity.entityId", "genericEntity.entityName", "genericEntity.entityType", "genericEntity.inceptionDate", "genericEntity.updateTimestamp", "genericEntity.entityLongName")
    return xref
df1 = xmlReader("BOBML","entityTransaction","s3://dev.xml")
df1.head()
I am unable to create the child dataframe:
def xmlReader(root, row, filename):
    df2 = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df2.select("genericEntity.entityDetail", "genericEntity.entityDetialId", "genericEntity.updateTimestamp")
    return xref
df3 = xmlReader("BOBML","s3://dev.xml")
df3.head()
I am not getting any output, and I was planning to do a union between the parent and child dataframes. Any help will be truly appreciated!
After more than 24 hours, I was able to solve the problem, and thanks to everyone who at least looked at my problem.
Solution:
Step 1: Import a couple of libraries
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
sc = SparkSession.builder.getOrCreate().sparkContext
sqlContext = SQLContext(sc)
Step 2 (Parent): Read the xml files, print the schema, register temp tables, and create the dataframe.
Step 3 (Child): Repeat Step 2.
Step 4: Create the final dataframe by joining the child and parent dataframes.
Step 5: Load the data into S3 (write.csv to an s3:// path) or a database. A sketch of Steps 2 to 5 follows below.
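A minimal sketch of Steps 2 to 5, assuming the same rootTag/rowTag values and column names as the question; the join key (entityId) and the output path are illustrative assumptions, not part of the original answer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml_parent_child").getOrCreate()

# Step 2 (parent): read the XML and select the parent-level fields.
parent_df = spark.read.format("com.databricks.spark.xml") \
    .options(rowTag="entityTransaction", rootTag="BOBML") \
    .load("s3://dev.xml") \
    .select("genericEntity.entityId", "genericEntity.entityName")
parent_df.createOrReplaceTempView("parent")

# Step 3 (child): read again and select the child-level fields.
child_df = spark.read.format("com.databricks.spark.xml") \
    .options(rowTag="entityTransaction", rootTag="BOBML") \
    .load("s3://dev.xml") \
    .select("genericEntity.entityId", "genericEntity.entityDetail")
child_df.createOrReplaceTempView("child")

# Step 4: join child and parent on the assumed shared key.
final_df = spark.sql(
    "select p.entityId, p.entityName, c.entityDetail "
    "from parent p join child c on p.entityId = c.entityId")

# Step 5: write the joined result to S3 as CSV (assumes scalar columns).
final_df.write.csv("s3://path/to/output", header=True)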