Fetch distinct values from column in file to create RDD - pyspark

I am new to PySpark. I need to find the distinct values from a certain column in an RDD.
I have a comma-delimited .txt file with no column headers on S3.
rddDistinct = sc.textFile(fileLocation).map(lambda x: x[2])
print rddDistinct.take(10)
What am I doing wrong? Eventually, I would like to store the resulting RDD in S3 (haven't gotten there yet). If the file exists in S3, I would like to overwrite it.

Because sc.textFile gives you whole lines, x[2] is the third character of each line, not the third column. Split each line on the delimiter first, then add .distinct() after the map:
rddDistinct = sc.textFile(fileLocation).map(lambda x: x.split(",")[2]).distinct()
print(rddDistinct.take(10))
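For completeness, a minimal sketch of the full pipeline (the S3 paths are placeholders), including one way to get the overwrite behaviour you mention: saveAsTextFile fails if the target already exists, so this sketch wraps the values in a single-column DataFrame and uses mode("overwrite").

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

fileLocation = "s3a://my-bucket/input/data.txt"     # hypothetical input path
outputLocation = "s3a://my-bucket/output/distinct"  # hypothetical output path

# Split each comma-delimited line before indexing, then deduplicate.
rddDistinct = sc.textFile(fileLocation).map(lambda x: x.split(",")[2]).distinct()

# A single-column DataFrame gives access to mode("overwrite") on write.
spark.createDataFrame(rddDistinct.map(lambda v: (v,)), ["value"]) \
    .write.mode("overwrite").text(outputLocation)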

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file, one file per row, by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have Plotly generating HTML content stored as a string in an additional 'plot_string' column:
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving each 'plot_string' entry as an HTML file at the S3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save the dataframe (bucketed or partitioned) as parquet, csv, text, etc., but I can't seem to find any straightforward way to perform a simple parallel write without a udf that initializes a separate boto client for each file, which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
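For illustration only (not from the thread): one way to avoid a client per file is to write per partition, so each task creates a single boto3 client and reuses it for every row in its partition. A rough sketch, assuming save_path values look like s3://bucket/key.html and that boto3 and S3 credentials are available on the executors:

import boto3
from urllib.parse import urlparse

def write_partition(rows):
    # One boto3 client per partition, reused for every file in that partition.
    s3 = boto3.client("s3")
    for row in rows:
        parsed = urlparse(row["save_path"])  # e.g. s3://bucket/plots/foo.html
        s3.put_object(Bucket=parsed.netloc,
                      Key=parsed.path.lstrip("/"),
                      Body=row["plot_string"].encode("utf-8"),
                      ContentType="text/html")

df3.select("save_path", "plot_string").foreachPartition(write_partition)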

How to read file path values as columns in Spark?

I'm working in Azure Synapse Notebooks and reading files into a DataFrame from a well-formed folder path using a wildcard.
Given that there are many folders referenced by that wildcard, how do I capture the "State" value as a column in the resulting DataFrame?
Use the input_file_name function to get the full input path, then apply regexp_extract to pull out the part that you want.
Example:
df.withColumn("filepath", F.input_file_name())
df.withColum("filepath", F.regexp_extract("filepath", "State=(.+)\.snappy\.parquet", 1)
No need to use the wildcard *.
Try: df = spark.read.load("abfss://....dfs.core.windows.net/")
Spark can read partitioned folders directly, and df should then contain the State column with its different values.
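For illustration, a small sketch of what that partition discovery looks like; the account, container, and folder names below are placeholders:

# Hypothetical layout under the container:
#   data/State=CA/part-00000.snappy.parquet
#   data/State=NY/part-00000.snappy.parquet
df = spark.read.load("abfss://container@account.dfs.core.windows.net/data/")
df.printSchema()  # includes a "State" column inferred from the partition folder names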

Adding an additional column containing file name to pyspark dataframe

I am iterating through csv files in a folder using a for loop and performing some operations on each csv (getting the count of rows for each unique id and storing all these outputs in a pyspark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?
You can get the file name as a column using the function pyspark.sql.functions.input_file_name. And if your files have the same schema and you want to apply the same processing pipeline, you don't need to loop over them at all: you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())

How to assign column names available in a csv file as headers to an orc file

I have column names in one .csv file and want to assign these as column headers to a DataFrame in Scala. Since it is a generic script, I don't want to hard-code them in the script, but rather pass the values in from the csv file.
You can do it like this:
val columns = spark.read.option("header", "true").csv("path_to_csv").schema.fieldNames
val df: DataFrame = ???  // your data, read without a header row
df.toDF(columns: _*).write.format("orc").save("your_orc_dir")
in pyspark:
columns = spark.read.option("header", "true").csv("path_to_csv").columns
df.toDF(*columns).write.format("orc").save("your_orc_dir")
but storing the schema separately from the data is generally a bad idea.
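For completeness, a minimal end-to-end sketch of the PySpark version with placeholder paths, assuming the data file itself has no header row:

# Take the column names from the header-only csv.
columns = spark.read.option("header", "true").csv("path_to_csv").columns

# Read the data without a header, apply the names, and write as ORC.
data_df = spark.read.csv("path_to_data_csv", header=False)
data_df.toDF(*columns).write.format("orc").save("your_orc_dir")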

How to write csv file into one file by pyspark

I use this method to write a csv file, but it generates a folder containing multiple part files. That is not what I want; I need it in one file. I also found another post that uses Scala to force everything to be calculated on one partition, then get one file.
First question: how do I achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note: when you use the coalesce function you lose parallelism for the write, since everything is funnelled through a single partition, and 'result.csv' will still be a directory containing one part file.
You can do this with the cat command line utility, as below. It concatenates all of the part files into one csv, so there is no need to repartition down to a single partition. (Note that this assumes the output directory is on a filesystem the driver can see, since os.system runs on the driver.)
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD in a single CSV file by bringing the RDD down to one executor: the RDD partitions spread across executors are shuffled into a single partition. We can use coalesce(1) or repartition(1) for this purpose. In addition, we can add a column header to the resulting csv file.
First, we can keep a utility function to make the data CSV-compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The saveAsPickleFile RDD API method can be used to serialize the saved data in order to save space. Use sc.pickleFile to read the pickled file back.
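A tiny sketch of that pickle round trip (the path is a placeholder):

# Write the RDD with Python pickle serialization, then read it back.
unionHeaderRDD.saveAsPickleFile("MyPickleLocation")
restored = sc.pickleFile("MyPickleLocation")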
I needed my csv output in a single file, with headers, saved to an S3 bucket with the filename I provided. The current accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that going through pandas as an intermediate step produced just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)
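Note that toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in driver memory, and writing straight to an s3:// URI from pandas assumes s3fs/fsspec is installed there. A sketch with a placeholder bucket:

# Single csv file, with headers, written directly to S3 from the driver.
my_spark_df.toPandas().to_csv("s3://my-bucket/reports/output.csv", index=False)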