How to write csv file into one file by pyspark - pyspark

I use this method to write a CSV file, but it generates a directory with multiple part files. That is not what I want; I need a single file. I also found another post that uses Scala to force everything to be computed on one partition, which produces one file.
First question: how to achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?

You can use:
df.coalesce(1).write.csv('result.csv')
Note:
When you use coalesce(1), all of the data is written by a single task, so you lose parallelism for the write.
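The call above writes a directory named result.csv that holds one part-* file plus marker files such as _SUCCESS, not a plain file. A minimal sketch of turning that directory into a single file, assuming the output sits on a local filesystem (not HDFS or S3); promote_single_part is a hypothetical helper name, not part of Spark:

```python
import glob
import os
import shutil


def promote_single_part(output_dir, target_file):
    """Replace a single-partition Spark output directory with one plain file.

    Assumes output_dir is on the local filesystem and was written with a
    single partition, e.g. df.coalesce(1).write.csv(output_dir).
    target_file may be the same path as output_dir.
    """
    # Checksum files start with a dot, so "part-*" matches only data files.
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    tmp_file = target_file + ".tmp"
    shutil.move(parts[0], tmp_file)
    # Drop the directory together with its _SUCCESS and checksum files.
    shutil.rmtree(output_dir)
    os.replace(tmp_file, target_file)
    return target_file
```

For example, promote_single_part('result.csv', 'result.csv') would leave a regular file called result.csv in place of the directory.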

You can do this with the cat command, as below. This concatenates all of the part files into one CSV, so there is no need to repartition down to one partition. It only works when the output directory is on a local filesystem.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
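If the part files were written with a header, a plain cat repeats the header once per part. A minimal sketch of the same merge in Python that keeps only the first header, assuming local files and that every part starts with the same header line when has_header=True; merge_part_files is a hypothetical helper name:

```python
import glob
import os
import shutil


def merge_part_files(part_dir, target_file, has_header=False):
    """Concatenate Spark part files from part_dir into target_file.

    Assumes a local filesystem. With has_header=True, every part is assumed
    to start with the same header line, and only the first copy is kept.
    Returns the number of part files merged.
    """
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(target_file, "wb") as out:
        header_written = False
        for part in parts:
            with open(part, "rb") as src:
                if has_header:
                    header = src.readline()
                    if not header_written:
                        out.write(header)
                        # An empty part has no header; keep looking.
                        header_written = bool(header)
                # Copy the remaining bytes of this part unchanged.
                shutil.copyfileobj(src, out)
    return len(parts)
```

Sorting the file names keeps the parts in partition order, since Spark numbers them part-00000, part-00001, and so on.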

The requirement is to save an RDD in a single CSV file by bringing the RDD onto one executor. This means the RDD partitions spread across executors are moved to a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting CSV file.
First, define a utility function that makes the data CSV-compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let’s suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which in most cases keeps the header at the top of the CSV file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The RDD method saveAsPickleFile can be used instead to store the data in serialized form in order to save space. Use sc.pickleFile to read the pickled file back.
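One caveat with toCSVLine: joining with ',' produces a malformed row whenever a field itself contains a comma, a double quote, or a newline. A minimal sketch of a variant that quotes such fields using Python's csv module; to_csv_line is a hypothetical name, and it could be passed to map in the same way as toCSVLine:

```python
import csv
import io


def to_csv_line(data):
    """Format one record as a CSV line, quoting fields only where needed."""
    buf = io.StringIO()
    csv.writer(buf).writerow(data)
    # The writer appends a line terminator; saveAsTextFile adds its own.
    return buf.getvalue().rstrip("\r\n")
```

Fields without special characters come out exactly as before, so the header tuple is unaffected.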

I needed my CSV output in a single file, with headers, saved to an S3 bucket under the filename I provided. When I run the current accepted answer (Spark 3.3.1 on a Databricks cluster), it gives me a folder with the desired filename, and inside it there is one CSV file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected. Note that toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in driver memory.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)

Related

PySpark - Read CSV and ignore file header (not using pandas)

I have a problem that I hope you can help me with.
The text file looks like this:
Report Name :
column1,column2,column3
this is row 1,this is row 2, this is row 3
I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I try to read the csv file using spark.read.csv() it thinks that the column name is "Report Name : ", which is obviously incorrect.
I know that the pandas CSV reader has a skiprows parameter (e.g. skiprows=1), but unfortunately I cannot read the file directly with pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a pandas dataframe via: df.toPandas()
I'd like to be able to solve this with straight PySpark dataframes.
Surely someone else has encountered this issue! Help!
I have tried every variation of reading files, and drop, etc. but the schema has already been defined when the first dataframe was created, with 1 column (Report Name : ).
Not sure what to do now..
Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?
import csv
# Parse each partition's lines as CSV, then drop the preamble line
# (fewer than 2 fields) and the header row.
df = sc.textFile("test.csv") \
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
    .filter(lambda row: len(row) >= 2 and row[0] != 'column1') \
    .toDF(['column1', 'column2', 'column3'])
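The filtering logic in that answer can be tried without a cluster. A minimal pure-Python sketch of the same per-partition step, using the sample file from the question; parse_rows is a hypothetical helper name:

```python
import csv


def parse_rows(lines, header_first_field="column1", min_fields=2):
    """Yield parsed CSV rows, skipping preamble lines and the header row.

    A line counts as preamble when it parses to fewer than min_fields
    fields, e.g. "Report Name :" in the sample file.
    """
    for row in csv.reader(lines, delimiter=",", quotechar='"'):
        if len(row) >= min_fields and row[0] != header_first_field:
            yield row


sample = [
    "Report Name :",
    "column1,column2,column3",
    "this is row 1,this is row 2, this is row 3",
]
rows = list(parse_rows(sample))
```

Because parse_rows accepts any iterable of lines, it should also be usable as the function given to mapPartitions, in place of the lambda and the separate filter. Note that this approach drops any data row with fewer than two fields, and any row whose first field equals the header name.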
Microsoft got back to me with an answer that worked! When you pass the path of the source file to the pandas CSV reader, it requires an endpoint to Blob Storage (not ADLS Gen2). I only had an endpoint with dfs in the URI, not blob. After I added the Blob Storage endpoint, the pandas reader worked great! Thanks for looking at my thread.

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save off the dataframe (bucketed or partitioned) as parquet, csv, text table, etc... but I can't seem to find any straightforward method to perform a simple parallel write operation without a udf that initializes separate boto clients for each file... which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
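No answer is recorded for this question here. One common way to avoid a client per file is foreachPartition, which lets each partition create a single client and reuse it for all of its rows. A minimal sketch under these assumptions: save_path values look like s3://bucket/key, boto3 is installed on the executors, and the client factory is passed in so the logic can be tried without AWS. The names split_s3_path and make_partition_writer are hypothetical:

```python
from urllib.parse import urlparse


def split_s3_path(s3_path):
    """Split 's3://bucket/some/key.html' into ('bucket', 'some/key.html')."""
    parsed = urlparse(s3_path)
    return parsed.netloc, parsed.path.lstrip("/")


def make_partition_writer(client_factory):
    """Build a function suitable for DataFrame.foreachPartition.

    client_factory is called once per partition. In a real job it could be
    `lambda: boto3.client("s3")` (assumption: boto3 exists on executors).
    """
    def write_partition(rows):
        client = client_factory()  # one client per partition, not per row
        for row in rows:
            bucket, key = split_s3_path(row["save_path"])
            client.put_object(
                Bucket=bucket,
                Key=key,
                Body=row["plot_string"].encode("utf-8"),
                ContentType="text/html",
            )
    return write_partition
```

With the columns from the question, the call would look like df3.foreachPartition(make_partition_writer(lambda: boto3.client("s3"))). Repartitioning df3 beforehand controls how many clients are created and how many writes run in parallel.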

Spark Dataset - "edit" parquet file for each row

Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can have upwards of 100 parquet files that each are between 50mb and 500mb in size.
Inputs
We are given a spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path | ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet | Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet | Set("b")
Expected Behaviour
Given filesToModify, I want to take advantage of parallelism in Spark to do the following for each row:
Load the parquet file located at row.s3path
Filter so that we exclude any row whose id is in the set row.ids
Count the number of deleted/excluded rows per id in row.ids (optional)
Save the filtered data back to the same row.s3path to overwrite the file
Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import spark.implicits._
  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]
  val clean = data
    .filter(not(col("id").isInCollection(ids)))
  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)
  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However this leads to NullPointerException when executed within the map operation. If I execute it alone outside of the map block then it works but I can't understand why it doesn't inside it (something to do with lazy evaluation?).
You get a NullPointerException because you try to retrieve your spark session from an executor.
It is not explicit, but to perform a Spark action, your deleteIDs function needs to retrieve the active Spark session. To do so, it calls the getActiveSession method of the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
And thus NullPointerException is thrown when your code starts using this None spark session.
More generally, you can't recreate a dataset and use spark transformations/actions in transformations of another dataset.
So I see two solutions to your problem:
either rewrite the deleteIDs function without using Spark, modifying your parquet files with a library such as parquet4s,
or collect filesToModify into a Scala collection and use Scala's map instead of Spark's.
s3path and ids parameters that are passed to deleteIDs are not actually strings and sets respectively. They are instead columns.
In order to operate over these values you can instead create a UDF that accepts columns instead of intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you seek to take advantage of Spark's parallelism.
You can read about UDFs here

Adding an additional column containing file name to pyspark dataframe

I am iterating through CSV files in a folder using a for loop and performing some operations on each CSV (getting the count of rows for each unique id and storing all these outputs in a PySpark dataframe). Now my requirement is to add the name of the file to the dataframe as well for each iteration. Can anyone suggest a way to do this?
You can get the file name as a column using the function pyspark.sql.functions.input_file_name. If your files have the same schema and you want to apply the same processing pipeline, you don't need to loop over the files; you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name
df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
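input_file_name() returns the full location of the file (for example a file:// or s3:// URI), not just the base name. A minimal sketch of reducing such a value to the bare file name in plain Python, assuming the value may be percent-encoded; base_file_name is a hypothetical helper that could be wrapped in a UDF if only the name is wanted:

```python
import posixpath
from urllib.parse import unquote, urlparse


def base_file_name(uri):
    """Return the last path component of a file URI or plain path."""
    return posixpath.basename(unquote(urlparse(uri).path))
```

For large datasets a built-in column expression would avoid the UDF overhead; the Python function is shown only because it is easy to check locally.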

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark, and the requirement is to write the processed output from Spark in a specific format: Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files, using the key as the file name. The issue is that the sequence of the data is not maintained: files are written as Data->Header->Trailer or some other combination of the three. Is there anything I am missing in the RDD transformations?
OK, so after reading StackOverflow questions, blogs, and mailing-list archives found through Google, I worked out how .union() and other transformations work and how partitioning is managed. When we use .union(), the resulting RDD loses the partitioning information and also the ordering, and that is why my output sequence was not being maintained.
What I did to overcome the issue was to number the records like this:
Header = 1, Body = 2, and Footer = 3
Then I used sortBy on the RDD that is the union of all three, sorting by this order number into 1 partition. After that, to write to multiple files using the key as the filename, I used HashPartitioner so that the data for each key goes into a separate file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data: RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
val finalRDD: RDD[(String,String)] = header.union(data).union(footer).sortBy(x=>x._2._2,true,1).map(x => (x._1,x._2._1))
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // and using MultipleTextOutputFormat save to multiple file using key as filename
This might not be the final or most economical solution, but it worked. I am also trying to find other ways to maintain the output sequence as Header->Body->Footer. I also tried .coalesce(1) on all three RDDs before doing the union, but that just added three more transformations to the RDDs, and the .sortBy function also takes partition information, which I thought would be the same; still, coalescing the RDDs first also worked. If anyone has another approach, please let me know; additions would be really helpful, as I am new to Spark.
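The numbering idea can be illustrated outside Spark. A minimal pure-Python sketch, assuming records shaped like the (key, (line, order)) pairs above; it is a local stand-in for the sortBy step, not Spark code, and order_records is a hypothetical name:

```python
from collections import defaultdict

HEADER, BODY, FOOTER = 1, 2, 3


def order_records(records):
    """Group (key, (line, order)) records by key, header -> body -> footer.

    Python's sorted() is stable, so body lines keep their relative order.
    """
    by_key = defaultdict(list)
    for key, (line, _order) in sorted(records, key=lambda r: r[1][1]):
        by_key[key].append(line)
    return dict(by_key)
```

Each key's list corresponds to the content of one output file. This sketch does not show whether Spark's sortBy keeps the relative order of records that share an order number, so body rows that must stay in sequence may need their own sequence number as a secondary sort key.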
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot