I have a CSV file stored in HDFS with the following format:
Business Line,Requisition (Job Title),Year,Month,Actual (# of Days)
Communications,1012_Com_Specialist,2017,February,150
Information Technology,5781_Programmer_Associate,2017,March,80
Information Technology,2497_Programmer_Senior,2017,March,120
Services,6871_Business_Analyst_Jr,2018,May,33
I would like to get the average of Actual (# of Days) by Year and Month. Could someone please show me how I can do this using PySpark and save the output to a Parquet file?
You can convert the CSV to a DataFrame and run Spark SQL on it as below (Scala):
import sqlContext.implicits._

// assuming the header line has already been filtered out of csvRDD
csvRDD.map(rec => {
  val i = rec.split(',')
  (i(0), i(1), i(2), i(3), i(4).toInt)
}).toDF("businessline", "jobtitle", "year", "month", "actual").registerTempTable("input")

val resDF = sqlContext.sql("Select year, month, avg(actual) as avgactual from input group by year, month")
resDF.write.parquet("/user/path/solution1")
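Since the question asks for PySpark, here is a rough equivalent sketch using the DataFrame API; the input and output paths are placeholders, and it assumes an active spark session and that the file keeps the header row shown above:

from pyspark.sql import functions as F

# read the CSV with its header row; column names come from the header
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/user/path/input.csv"))  # placeholder input path

result = (df.groupBy("Year", "Month")
            .agg(F.avg(df["Actual (# of Days)"]).alias("avgactual")))

result.write.parquet("/user/path/solution1")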
I have a bus_date column with multiple records holding different dates, e.g. 2021-03-15, 2021-05-12, 2021-01-15, etc.
I want to calculate the previous year end for all given dates. My expected output is 2020-12-31 for all three dates.
I know I can use the function date_sub(start_date, num_days), but I don't want to pass num_days manually, since there are millions of rows with different dates.
Can we write a view from a table, or create a DataFrame, which will calculate the previous year end?
You can use date_add and date_trunc to achieve this.
import pyspark.sql.functions as F
......
data = [
    ('2021-03-15',),
    ('2021-05-12',),
    ('2021-01-15',)
]
df = spark.createDataFrame(data, ['bus_date'])
# date_trunc('yyyy', ...) gives Jan 1 of the same year; date_add(..., -1) steps back to Dec 31 of the previous year
df = df.withColumn('pre_year_end', F.date_add(F.date_trunc('yyyy', 'bus_date'), -1))
df.show()
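For the three sample dates this puts 2020-12-31 in pre_year_end for every row, matching the expected output; the show() output should look roughly like this:

+----------+------------+
|  bus_date|pre_year_end|
+----------+------------+
|2021-03-15|  2020-12-31|
|2021-05-12|  2020-12-31|
|2021-01-15|  2020-12-31|
+----------+------------+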
I have a Scala / Spark DataFrame with one column named "utcstamp" whose values have the following format: 2018-12-12 21:15:00
I want to obtain a new column with the weekday, and, inspired by this question in the forum, used the following code:
import java.util.Calendar
import java.text.SimpleDateFormat
val dowText = new SimpleDateFormat("E")
df = df.withColumn("weekday" , dowText.format(df.select(col("utcstamp"))))
However, I get the following error:
<console>:58: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
When I apply this to a specific date (like in the link provided) it works; I just can't apply it to the whole column.
Can anyone help me with this? If you have an alternative way of converting a UTC column into a weekday, that will also work for me.
You can use the dayofweek function of Spark SQL, which gives you a number from 1-7 (1 = Sunday, 7 = Saturday):
val df2 = df.withColumn("weekday", dayofweek(col("utcstamp").cast("timestamp")))
Or, if you want day names (Sun-Sat) instead:
val df2 = df.withColumn("weekday", date_format(col("utcstamp").cast("timestamp"), "EEE"))
You can simply get the day of week with date_format using "E" or "EEEE" (e.g. Sun or Sunday):
df.withColumn("weekday", date_format(to_timestamp($"utcstamp"), "E"))
If you want the day of week as a numeric value, use the dayofweek function, which is available from Spark 2.3+.
I want to create a timestamp column, to build a line chart, from two columns containing the month and year respectively.
I know I can create a string concat and then convert it to a datetime column:
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

df.select(
    '*',
    F.concat(F.lit('01'), df['month'], df['year']).alias('date')
).withColumn('date', F.col('date').cast(TimestampType()))
But I wanted a cleaner approach using built-in PySpark functionality that can also help me create other date parts, like week number, quarter, etc. Any suggestions?
You will have to concatenate the string once, make the timestamp-type column, and then you can easily extract week, quarter, etc.
You can use this function (and edit it to create whatever other columns you need as well):
import pyspark.sql.functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation

    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column

    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))

    # Spark returns 'null' if the parsing fails, so first check the count of null values
    # If parse_fail_count = 0, return parsed column else raise error
    parse_fail_count = df.select(
        F.count(F.when(F.col(date_column).isNull(), date_column))
    ).collect()[0][0]

    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever your resultant date format is):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
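Once the column is a proper timestamp, the other date parts mentioned in the question come straight from built-in functions; a quick sketch (assuming the parsed column is named "date" as above):

import pyspark.sql.functions as F

df = (df
      .withColumn("week", F.weekofyear("date"))   # ISO week number
      .withColumn("quarter", F.quarter("date"))   # 1 to 4
      .withColumn("year", F.year("date"))
      .withColumn("month", F.month("date")))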
The HDFS blob stores the JSON data in the below format on a daily basis. I need to read the JSON data using spark.read.json() day-wise. For example, today I want to read day=01's files and tomorrow I want to read day=02's files. Is there logic I can write in Scala which auto-increments the date, taking month and year into account as well? Any help would be much appreciated.
/signals/year=2019/month=08/day=01
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=02
/signals/year=2019/month=08/day=02/*****_.json
/signals/year=2019/month=08/day=02/*****_.json
It looks like the data is stored in partitioned format, and to read only one date a function like this can be used:
def readForDate(year: Int, month: Int, day: Int): DataFrame = {
  spark.read.json("/signals")
    .where($"year" === year && $"month" === month && $"day" === day)
}
To use this function, take the current date and split it into parts with regular Scala code, not related to Spark.
If there is any relation between the current date and the date of the JSON files you want to process, you can get the current date (and add or subtract any number of days) using the Scala code below and use it in your Spark application as #pasha701 suggested.
scala> import java.time.format.DateTimeFormatter
scala> import java.time.LocalDateTime
scala> val dtf = DateTimeFormatter.ofPattern("dd") // day of month; use "MM" or "yyyy" the same way for month and year
scala> val now = LocalDateTime.now()
scala> println(dtf.format(now))
02
scala> println(dtf.format(now.plusDays(2))) // Added two days on the current date
04
Just a thought: if you are using Azure Databricks, you can run a shell command in a notebook with the "%sh" magic to get the current day (again, assuming there is a relation between the partition files you are trying to fetch and the current date).
Hope this may help someone in the future. The code below reads the data available in blob storage where the files are stored inside date folders that auto-increment every day. I wanted to read the previous day's data, hence now.minusDays(1).
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val now = LocalDateTime.now()
val date = dtf.format(now.minusDays(1)) // previous day's date as yyyy-MM-dd

val currentDateHold = date.split("-").toList
val year = currentDateHold(0)
val month = currentDateHold(1)
val day = currentDateHold(2)
val path = "/signals/year=" + year + "/month=" + month + "/day=" + day

// Read JSON data from the Azure Blob
val initialDF = spark.read.format("json").load(path)
I have one table with dates and another table with weekly data. My weeks start on Tuesday, and the second table's date is supposed to determine the week (basically, the Tuesday before that date is the start of the week; alternatively, that date is an example day in that week).
How can I join the dates to information about weeks?
Here is the setup:
from datetime import datetime as dt
import pandas as pd
df=pd.DataFrame([dt(2016,2,3), dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)])
df_week=pd.DataFrame([(dt(2016,2,4),"a"), (dt(2016,2,11),"b")], columns=["week", "val"])
# note the actual start of the weeks are the Tuesdays: 2.2., 9.2.
# I expect a new column df["val"]=["a", "a", "b", "b"]
I've seen pandas date_range, but I cannot see how to do that from there.
You're looking for DatetimeIndex.asof:
This will give you the closest index at or before the given day in df:
df_week.set_index('week', inplace=True)
df_week.index.asof(df['day'][1])
You can now use it to select the corresponding value:
df_week.loc[df_week.index.asof(df['day'][1])]
Finally, apply it to the entire dataframe:
df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df['val'] = df.apply(lambda row: df_week.loc[df_week.index.asof(row['day'])]['val'], axis=1)
I removed the first value from df because I didn't want to deal with edge cases.
Result:
day val
0 2016-02-08 a
1 2016-02-09 a
2 2016-02-15 b
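As a vectorized alternative (not part of the answer above), pandas.merge_asof does the same "most recent week at or before each day" lookup without an apply, assuming both frames are sorted on their date columns:

import pandas as pd
from datetime import datetime as dt

df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df_week = pd.DataFrame([(dt(2016,2,4),"a"), (dt(2016,2,11),"b")], columns=["week", "val"])

# for each 'day', take the df_week row with the latest 'week' that is <= that day
merged = pd.merge_asof(df.sort_values('day'), df_week.sort_values('week'),
                       left_on='day', right_on='week', direction='backward')
print(merged[['day', 'val']])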