Save pyspark dataframe entries as separate html files on s3 - pyspark

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save the dataframe (bucketed or partitioned) as parquet, csv, text table, etc., but I can't find any straightforward method to perform a simple parallel write without a udf that initializes a separate boto client for each file, which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
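One pattern that avoids creating a client per file is to write from foreachPartition, so each partition reuses a single client. This is only a minimal sketch: boto3 being available on the executors and save_path holding full s3://bucket/key.html URIs are assumptions, not details from the question.
import boto3
from urllib.parse import urlparse

def write_partition(rows):
    # one client per partition instead of one per row
    s3 = boto3.client("s3")
    for row in rows:
        parsed = urlparse(row["save_path"])  # assumed form: s3://bucket/prefix/plot.html
        s3.put_object(
            Bucket=parsed.netloc,
            Key=parsed.path.lstrip("/"),
            Body=row["plot_string"].encode("utf-8"),
            ContentType="text/html",
        )

df3.select("save_path", "plot_string").foreachPartition(write_partition)
This still uses boto inside the job, but the client setup happens once per partition rather than once per row, which is usually enough to remove the per-file overhead the question describes.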

Related

Spark Dataset - "edit" parquet file for each row

Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can have upwards of 100 parquet files that each are between 50mb and 500mb in size.
Inputs
We are given a spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path                                                                     | ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet  | Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet  | Set("b")
Expected Behaviour
Given filesToModify I want to take advantage of parallelism in Spark to do the following for each row:
Load the parquet file located at row.s3path
Filter so that we exclude any row whose id is in the set row.ids
Count the number of deleted/excluded rows per id in row.ids (optional)
Save the filtered data back to the same row.s3path to overwrite the file
Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import spark.implicits._

  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]

  val clean = data
    .filter(not(col("id").isInCollection(ids)))

  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)

  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However, this leads to a NullPointerException when executed within the map operation. If I execute it alone outside of the map block it works, but I can't understand why it doesn't work inside it (something to do with lazy evaluation?).
You get a NullPointerException because you try to retrieve your spark session from an executor.
It is not explicit, but to perform a Spark action your deleteIDs function needs to retrieve the active Spark session. To do so, it calls the getActiveSession method on the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
And thus a NullPointerException is thrown when your code starts using this None Spark session.
More generally, you can't recreate a dataset and use spark transformations/actions in transformations of another dataset.
So I see two solutions for your problem:
either rewrite the deleteIDs function without using Spark, and modify your parquet files with a library such as parquet4s, for instance;
or collect filesToModify into a Scala collection and use Scala's map instead of Spark's (this pattern is sketched at the end of this answer thread).
The s3path and ids parameters that are passed to deleteIDs are not actually strings and sets respectively; they are columns.
To operate on these values you can instead create a UDF that accepts columns rather than intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you want to take advantage of Spark's parallelism.
You can read about UDFs here
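For completeness, the collect-and-loop idea from the first answer translates directly to PySpark as well. This is a rough sketch under stated assumptions: filesToModify is small enough to collect to the driver, the data has an id column, and the temporary output prefix is hypothetical.
from pyspark.sql import functions as F

# driver-side loop over the (small) collected list of files to edit;
# each read/filter/write below is still a distributed Spark job
for row in filesToModify.collect():
    data = spark.read.parquet(row["s3path"])
    clean = data.filter(~F.col("id").isin(list(row["ids"])))
    # write to a temporary prefix first; overwriting the path you are still
    # reading from within the same job would fail or lose data
    clean.write.mode("overwrite").parquet(row["s3path"] + "_tmp")
Each file is handled sequentially on the driver, but every read and write is still executed by the cluster, which is what the "Scala collection + map" suggestion amounts to.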

Adding an additional column containing file name to pyspark dataframe

I am iterating through csv files in a folder using a for loop and performing some operations on each csv (getting the count of rows for each unique id and storing all these outputs in a pyspark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?
You can get the file name as a column using the function pyspark.sql.functions.input_file_name. If your files have the same schema and you want to apply the same processing pipeline, then you don't need to loop over the files; you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
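If the end goal is still the per-id row counts from the original loop, the added file_name column lets you compute them for all files in one grouped aggregation. A small sketch, assuming the unique-id column is literally named id (that name is an assumption, not given in the question):
# row counts per unique id, per source file, in a single pass over all csv files
counts = df.groupBy("file_name", "id").count()
counts.show(truncate=False)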

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an s3 file. The number of fields could be different in each monthly run. Based on this data in s3, we load the data into a table and run a SQL query for a few metrics manually (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starter I'm presenting the simpler version of the use case.
Approach:
Considering the schema-less nature of the data (the number of fields in the s3 file can differ in each run with the addition/deletion of a few fields, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from s3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/spark-SQL/dataframes? The s3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from s3; the dataframe takes care of that. The issue is how to generate the SQL or dataframe-API/spark-SQL code to handle them.
I can read the s3 file into a dataframe and register it with createOrReplaceTempView to write SQL, but I don't think that avoids manually changing the spark-SQL when a new field is added in s3 in the next run. What is the best way to dynamically generate the SQL, or are there better ways to handle the issue?
Usecase-1:
First-run
dataframe: customer,1st_month_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
Im new to Spark/Scala, would be helpful if you can provide the direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you here:
DataFrames store their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as in column_search
The agg function accepts multiple expressions as a dictionary, as explained here, which is what I pass in as columns
Spark is lazy, so it doesn't change data state or perform operations until you call an action like show(). This means creating a temporary dataframe just to get at one piece of the dataframe, such as its column list, as I do here is not costly, even though it may seem inefficient if you're used to SQL.
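If you prefer the spark-SQL route mentioned in the question, the same column discovery can be used to build the query text itself. A small sketch using the dataframe from the question; the monthly_data view name and the _total aliases are assumptions added for illustration.
# discover the month columns at runtime and build the SQL text from them
month_cols = [c for c in dataframe.columns if "month" in c]
select_list = ", ".join("sum({0}) AS {0}_total".format(c) for c in month_cols)

dataframe.createOrReplaceTempView("monthly_data")
query = "SELECT customer, " + select_list + " FROM monthly_data GROUP BY customer"
spark.sql(query).show()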

Scala - Writing dataframe to a file as binary

I have a Hive table stored as parquet, with a column Content storing various documents base64 encoded.
Now I need to read that column and write it to a file in HDFS, so that the base64 column is converted back into a document for each row.
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/");
profileDF.registerTempTable("profiles")
val contentsDF = sqlContext.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")
Now contentsDF holds the binary form of each document as a row, which I need to write out to a file. I tried different options but couldn't get the dataframe content written back out to a file.
Appreciate any help regarding this.
I would suggest saving as parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or convert to an RDD and save as an object file:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)

How to write csv file into one file by pyspark

I use this method to write a csv file, but it generates a directory containing multiple part files, which is not what I want; I need it in one file. I also found another post using Scala to force everything to be calculated on one partition and then get one file.
First question: how to achieve this in Python?
In the second post it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use,
df.coalesce(1).write.csv('result.csv')
Note: when you use the coalesce function you will lose your parallelism.
You can do this by using the cat command line function as below. This will concatenate all of the part files into 1 csv. There is no need to repartition down to 1 partition.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD in a single CSV file by bringing the RDD down to one executor, which means the RDD partitions present across executors will be shuffled to that one executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting csv file.
First we can define a utility function to make the data csv compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The RDD method saveAsPickleFile can be used to serialize the saved data in order to save space. Use the Spark context's pickleFile method to read the pickled file back.
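For reference, that pickle round trip looks like this, assuming the same unionHeaderRDD as above and a hypothetical output path:
# serialize the RDD to a pickle file, then read it back
unionHeaderRDD.saveAsPickleFile("MyPickleLocation")
restoredRDD = sc.pickleFile("MyPickleLocation")
restoredRDD.take(5)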
I needed my csv output in a single file with headers, saved to an s3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)