Read last 3 months of data, transform and output - pyspark

I am new to PySpark and need some help getting started.
I have data sitting in the partition below.
Each time I run the script, I want it to:
process only the past 3 months of data,
drop certain fields and select only a few,
rename the fields,
write the output to another S3 bucket, under the same partition name that was used to read.
How do I achieve the above?

The following snippet should work. Replace the base_path with your path.
import datetime
from dateutil.relativedelta import relativedelta

# Generate (year, month) for the current month and the months before it
def get_last_months(start_date, months):
    for i in range(months):
        yield (start_date.year, start_date.month)
        start_date += relativedelta(months=-1)

rollback = 3
months = list(get_last_months(datetime.datetime.today(), rollback))

# Build one path per month
base_path = "{y}/{m}/filename"
paths = []
for year, month in months:
    paths.append(base_path.format(y=year, m=month))

# Read all the monthly paths into a single DataFrame
df = spark.read.parquet(*paths)
The above snippet will help you in reading from multiple paths. The remaining logic is something you have to implement.
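The remaining steps from the question (select a few fields, rename them, write each month to the same partition under another bucket) could be sketched as follows. This is only a sketch under assumptions: the bucket names, the `{y}/{m}` layout and the column names in `RENAMES` are hypothetical placeholders, and `spark` is an existing SparkSession. It reads one month at a time so that each output lands under the same partition path it was read from.

```python
from datetime import date

# Hypothetical locations and columns; replace with your own.
SRC_TEMPLATE = "s3://source-bucket/{y}/{m}/"
DST_TEMPLATE = "s3://target-bucket/{y}/{m}/"
RENAMES = {"old_name_1": "new_name_1", "old_name_2": "new_name_2"}


def last_months(today, n):
    """Return (year, month) for the current month and the n - 1 before it."""
    index = today.year * 12 + (today.month - 1)
    result = []
    for i in range(n):
        year, month0 = divmod(index - i, 12)
        result.append((year, month0 + 1))
    return result


def month_paths(template, months):
    """Fill a path template for each (year, month) pair."""
    return [template.format(y=y, m=m) for y, m in months]


def process_month(spark, src, dst, renames):
    """Read one partition, keep and rename the wanted columns, write it out."""
    df = spark.read.parquet(src).select(*renames.keys())
    for old, new in renames.items():
        df = df.withColumnRenamed(old, new)
    df.write.mode("overwrite").parquet(dst)


def run(spark, today=None, rollback=3):
    """Process the last `rollback` months, one partition at a time."""
    months = last_months(today or date.today(), rollback)
    sources = month_paths(SRC_TEMPLATE, months)
    targets = month_paths(DST_TEMPLATE, months)
    for src, dst in zip(sources, targets):
        process_month(spark, src, dst, RENAMES)
```

If the month folders are zero-padded (01, 02, ...), use `{m:02d}` instead of `{m}` in both templates.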

Related

How to specify schema for the folder structure when reading parquet file into a dataframe [duplicate]

I have to read parquet files that are stored in the following folder structure
/yyyy/mm/dd/ (eg: 2021/01/31)
If I read the files like this, it works:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Unfortunately, the folder structure is not stored in the typical partitioned format /yyyy=2021/mm=01/dd=31/ and I don't have the luxury of converting it to that format.
I was wondering if there is a way I can provide Spark a hint as to the folder structure so that it would make "2021/01/31" available as yyyy, mm, dd in my dataframe.
I have another set of files, which are stored in the /yyyy=aaaa/mm=bb/dd=cc format and the following code works:
partitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
Things I have tried
I have specified the schema, but it just returned nulls
from pyspark.sql.types import StructType, StructField, LongType, TimestampType

customSchema = StructType([
    StructField("yyyy", LongType(), True),
    StructField("mm", LongType(), True),
    StructField("dd", LongType(), True),
    StructField("id", LongType(), True),
    StructField("a", LongType(), True),
    StructField("b", LongType(), True),
    StructField("c", TimestampType(), True)])
partitionDF = spark.read.option("mergeSchema", "true").schema(customSchema).parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
display(partitionDF)
The above returns no data. If I change the path to "abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet", then I get data, but the yyyy, mm and dd columns are empty.
Another option would be to load the folder path as a column, but I can't seem to find a way to do that.
TIA
Databricks N00B!
I suggest you load the data without the partitioned folders as you mentioned
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Then add a column with the input_file_name function value in:
import pyspark.sql.functions as F
unPartitionedDF = unPartitionedDF.withColumn('file_path', F.input_file_name())
Then you could split the values of the new file_path column into three separate columns.
df = unPartitionedDF.withColumn('year', F.split(F.col('file_path'), '/').getItem(3)) \
    .withColumn('month', F.split(F.col('file_path'), '/').getItem(4)) \
    .withColumn('day', F.split(F.col('file_path'), '/').getItem(5))
The index passed to getItem depends on the exact folder structure you have.
I hope this resolves your problem.
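To find the right getItem indices, it can help to split one sample path by hand. Below is a small pure-Python check, assuming a made-up file name (part-0000.parquet) under the folder layout from the question:

```python
# Sample path following the question's layout; the file name is made up.
sample = "abfss://xxx#abc.dfs.core.windows.net/Address/2021/01/31/part-0000.parquet"

parts = sample.split("/")
for index, part in enumerate(parts):
    print(index, repr(part))

# Counting from the end does not depend on how long the prefix is.
year, month, day = parts[-4], parts[-3], parts[-2]
print(year, month, day)
```

For this sample the year, month and day sit at indices 4, 5 and 6, because the empty string after `abfss:` counts as an element. getItem on an array column is zero-based like Python list indexing, so the same numbers should apply there. If the prefix length may vary, `F.element_at` with negative (1-based, from the end) indices is one way to count from the file name backwards.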

Long execution time when running PySpark-SQL on hadoop cluster?

I have a data set of weather data and I am trying to query it to get average lows and average highs for each year. I have no problem submitting the job and getting the desired result, but it is taking hours to run. I thought it would run much faster. Am I doing something wrong, or is it just not as fast as I think it should be?
The data is a CSV file with over 100,000,000 entries.
The columns are date, weather station, measurement (TMAX or TMIN), and value.
I am running the job on my university's Hadoop cluster; I don't have much more information than that about the cluster.
Thanks in advance!
import sys
from random import random
from operator import add
from pyspark.sql import SQLContext, Row
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    sqlContext = SQLContext(sc)
    file = sys.argv[1]
    lines = sc.textFile(file)
    parts = lines.map(lambda l: l.split(","))
    obs = parts.map(lambda p: Row(station=p[0], date=int(p[1]), measurement=p[2], value=p[3]))
    weather = sqlContext.createDataFrame(obs)
    weather.registerTempTable("weather")
    # AVERAGE TMAX/TMIN PER YEAR
    query2 = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
        from weather
        where value<130 AND value>-40
        group by measurement, SUBSTRING(date,1,4)
        order by SUBSTRING(date,1,4) """)
    query2.show()
    query2.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year.csv")
    sc.stop()
Make sure the Spark job actually started in cluster mode and not in local mode. For example, if you're using YARN, check that the job is launched in 'yarn-client' mode.
If it is, make sure you've provided enough executors, cores per executor, and executor and driver memory. You can get the actual cluster and job information from either the resource manager page (e.g. YARN) or from the Spark context (sqlContext.getAllConfs).
100 million records is not that small. Even if each record is only 30 bytes, the overall size is 3 GB, and that can take a while if you only have a handful of executors.
If the above suggestions do not help, try to find out which part of the query is taking long. A few speed-up tips:
Cache the weather dataframe.
Break the query into 2 parts: the 1st part does the group by and its output is cached; the 2nd part does the order by.
Instead of coalesce, write the RDD with the default number of partitions, then merge the part files from the shell (for example with hdfs dfs -getmerge) to get your CSV output.
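The tip about doing the group by first, caching its output, and sorting afterwards might look like the sketch below. It only assumes a `sqlContext` with the `weather` table registered as in the question; the function name `yearly_averages` is made up for illustration.

```python
# Same aggregation as in the question, but without the "order by" clause.
GROUP_SQL = """
select substring(date, 1, 4) as Year, avg(value) as Average, measurement
from weather
where value < 130 and value > -40
group by measurement, substring(date, 1, 4)
"""


def yearly_averages(sql_context):
    """Run the aggregation, cache its (small) result, then sort it."""
    grouped = sql_context.sql(GROUP_SQL).cache()
    grouped.count()  # an action, so the cached result is actually materialised
    return grouped.orderBy("Year")
```

Calling `yearly_averages(sqlContext).show()` would then sort only the aggregated rows, and timing the `count()` separately shows how much of the run time is the aggregation itself.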

write dataframe to csv file took too much time to write spark

I want to aggregate data based on intervals of a timestamp column.
I saw that the computation takes 53 seconds, but writing the result to the CSV file takes 5 minutes. It seems like df.csv() takes too long to write.
How can I optimize the code?
Here is my code snippet:
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\inputDataSet.csv")
// convert all columns to numeric values in order to apply the aggregation function
// note: the result of this map is discarded, so df itself is unchanged
df.columns.map { c =>df.withColumn(c, col(c).cast("int")) }
// add a new column holding the timestamp rounded down to a 5-minute interval
val result2=df.withColumn("new_time",((unix_timestamp(col("_c0"))/300).cast("long") * 300).cast("timestamp")).drop("_c0")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map(mean(_)).head,result2.drop("new_time").columns.map(mean(_)).tail: _*).sort("new_time")
finalresult.coalesce(1).write.option("header", "true").csv("C:/result_with_time.csv") // <= this write takes too long
Here are some thoughts on optimization based on your code.
inferSchema: it will be faster to supply a predefined schema than to use inferSchema, which needs an extra pass over the data.
Instead of writing to your local disk, you can try writing to HDFS and then copying the file to your local machine (e.g. with scp).
df.coalesce(1).write will take more time than just df.write, because all the data has to pass through a single task. With plain df.write you will get multiple files, which can be combined afterwards using various techniques, or you can just leave them in one directory as multiple part files.
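For reference, the new_time expression in the question rounds each timestamp down to the start of its 5-minute (300-second) window. The same arithmetic in plain Python, with a helper name (`floor_to_window`) made up for illustration:

```python
def floor_to_window(epoch_seconds, window=300):
    """Round an epoch timestamp (in seconds) down to the start of its window."""
    return (epoch_seconds // window) * window


# Seconds 0 and 299 fall in the same window; second 300 starts the next one.
print(floor_to_window(0), floor_to_window(299), floor_to_window(300))
```

All rows whose timestamps map to the same value end up in the same group, which is what the groupBy("new_time") in the question relies on.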

Split Spark DataFrame into parts

I have a table of distinct users, which has 400,000 users. I would like to split it into 4 parts, and I expect each user to be located in one part only.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), for i from 0 to 3, into parquet files. When I read the directory back, group by user id and count by part, there are some users that are located in two or more parts.
I still have no idea why.
I have found the solution: cache the DataFrame before you split it.
It should be
val data = userList.cache().randomSplit(nsizes)
I still have no idea why. My guess is that each time the randomSplit function "fills" one of the parts, it reads records from userList, which is re-evaluated from the parquet file(s) and may return the rows in a different order; that is why some users are lost and some users are duplicated.
That's what I thought. If someone has an answer or explanation, I will update.
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/
If your goal is to split it into different files, you can use functions.hash to calculate a hash of the user id, then take it mod 4 to get a number from 0 to 3 (use pmod rather than %, since the hash can be negative), and when you write the parquet use partitionBy, which will create a directory for each of the 4 values.
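The point of the hash approach is that the part number is a pure function of the user id, so re-reading or re-ordering the data cannot move a user to another part. Here is a pure-Python illustration of that property; it uses zlib.crc32 as a stand-in rather than Spark's own hash function, and the user ids are made up:

```python
import zlib


def part_of(user_id, num_parts=4):
    """Map a user id to a stable part number in the range [0, num_parts)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_parts


users = ["u%06d" % i for i in range(1000)]  # made-up ids
first = {u: part_of(u) for u in users}
second = {u: part_of(u) for u in reversed(users)}  # different order, same mapping
print(first == second)
```

In Spark itself, something like `F.expr("pmod(hash(user_id), 4)")` (with `user_id` standing in for your real column name) would give the equivalent column to pass to partitionBy.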

How to write csv file into one file by pyspark

I use this method to write a CSV file, but it generates a directory with multiple part files. That is not what I want; I need it in one file. I also found another post that uses Scala to force everything to be calculated on one partition and so get one file.
First question: how can I achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note:
When you use the coalesce function you will lose your parallelism.
You can do this by using the cat command line function as below. This will concatenate all of the part files into 1 csv. There is no need to repartition down to 1 partition.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
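If each part file was written with a header (for example with option('header', True)), a plain cat repeats the header once per part. Below is a sketch of a merge that keeps only the first header, assuming the part files sit on a local file system and follow Spark's usual part-* naming; the function name `merge_parts` is made up:

```python
import glob
import os


def merge_parts(part_dir, out_path, has_header=True):
    """Concatenate part-* files into one file, keeping a single header line."""
    part_files = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(part_files):
            with open(path, newline="") as part:
                if has_header and i > 0:
                    next(part, None)  # skip the repeated header line
                for line in part:
                    out.write(line)
    return len(part_files)
```

For the example above this would be called as `merge_parts('output/test', 'output/test.csv', has_header=False)`, since test.write.csv does not write headers by default.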
The requirement is to save an RDD in a single CSV file by bringing the whole RDD onto one executor. This means the RDD partitions spread across executors are moved to one executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting CSV file.
First, we can keep a utility function to make the data CSV compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let’s suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header on top of the CSV file.
unionHeaderRDD = sc.parallelize( [( 'ID','DT_KEY','Grade','Score','TRF_Age' )])\
.union( MyRDD )
unionHeaderRDD.coalesce( 1 ).map( toCSVLine ).saveAsTextFile("MyFileLocation" )
The RDD method saveAsPickleFile can be used to save the data in serialized form in order to save space. Use SparkContext.pickleFile to read the pickled file back.
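Note that joining fields with ',' breaks as soon as a field itself contains a comma, a quote or a newline. A variant of toCSVLine built on Python's csv module takes care of the quoting; this is a sketch, and the name `to_csv_line` is made up:

```python
import csv
import io


def to_csv_line(row):
    """Format one row as a single CSV line, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    # The writer appends a line terminator; saveAsTextFile adds its own.
    return buffer.getvalue().rstrip("\r\n")


print(to_csv_line(("ID", "DT_KEY", "Grade")))
print(to_csv_line((1, "Smith, John", 'say "hi"')))
```

It can be dropped into the same place as the original helper, e.g. `.map(to_csv_line)` before saveAsTextFile.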
I needed my csv output in a single file with headers saved to an s3 bucket with the filename I provided. The current accepted answer, when I run it (spark 3.3.1 on a databricks cluster) gives me a folder with the desired filename and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)