pyspark substring and aggregation

I am new to Spark and I've got a csv file with such data:
date, accidents, injured
2015/20/03 18:00, 15, 5
2015/20/03 18:30, 25, 4
2015/20/03 21:10, 14, 7
2015/20/02 21:00, 15, 6
I would like to aggregate this data by the hour in which it happened. My idea is to substring the date down to 'year/month/day hh' (dropping the minutes) so I can use it as a key, and then compute the average number of accidents and injured for each hour. Maybe there is a different, smarter way to do this with PySpark?
Thanks guys!

Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
        ('2015/20/03 18:30', 25, 4),
        ('2015/20/03 21:10', 14, 7),
        ('2015/20/02 21:00', 15, 6)]

df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])

df.withColumn('date_hr', df['date'].substr(1, 13)) \
    .groupby('date_hr') \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()
If, however, you want to do some more computation later on, you can parse the data into a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime

parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())

df.withColumn('date_parsed', parseString(col('date'))) \
    .withColumn('date_only', getDate(col('date_parsed'))) \
    .withColumn('hour', getHour(col('date_parsed'))) \
    .groupby('date_only', 'hour') \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()
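On reasonably recent Spark versions (2.2+), the same aggregation can also be done with built-in functions instead of UDFs, which avoids the Python serialization overhead. A sketch, assuming the dates really are in 'year/day/month hour:minute' format as in the sample:

from pyspark.sql import functions as F

(df
 .withColumn('ts', F.to_timestamp('date', 'yyyy/dd/MM HH:mm'))  # sample dates look like year/day/month
 .withColumn('date_only', F.to_date('ts'))
 .withColumn('hour', F.hour('ts'))
 .groupby('date_only', 'hour')
 .agg(F.avg('accidents').alias('avg_accidents'),
      F.avg('injured').alias('avg_injured'))
 .show())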

Related

Pyspark - extract first monday of week

Given dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", 202245),
         ("Michael", 202133),
         ("Robert", 202152),
         ("Maria", 202252),
         ("Jen", 202201)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Week", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
The Week column denotes the week number and year, i.e. 202245 is the 45th week of 2022. I would like to extract the date on which the Monday of that week falls; for the 45th week of 2022 that is 7 Nov 2022.
What I tried:
Using datetime and a udf:
import datetime

def get_monday_from_week(x: int) -> datetime.date:
    """
    Converts a fiscal week to the datetime of the first Monday of that week.

    Args:
        x (int): fiscal week, e.g. 202245

    Returns:
        datetime.date: datetime of the first Monday of that week
    """
    x = str(x)
    r = datetime.datetime.strptime(x + "-1", "%Y%W-%w")
    return r
How can I implement this using Spark functions? I am trying to avoid the use of a udf here.
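One possible UDF-free approach, as a sketch: it assumes the week number follows the same %W convention as the attempt above (week 1 is the week starting on the first Monday of the year) and uses only built-in date functions.

from pyspark.sql import functions as F

result = (
    df
    .withColumn("year", (F.col("Week") / 100).cast("int"))
    .withColumn("wk", (F.col("Week") % 100).cast("int"))
    # Jan 1 of the encoded year
    .withColumn("jan1", F.to_date(F.concat(F.col("year").cast("string"), F.lit("-01-01"))))
    # first Monday of the year = the Monday strictly after Dec 31 of the previous year
    .withColumn("first_monday", F.next_day(F.date_sub(F.col("jan1"), 1), "Mon"))
    # week 1 starts on the first Monday, so shift by (wk - 1) weeks
    .withColumn("monday", F.expr("date_add(first_monday, (wk - 1) * 7)"))
    .select("firstname", "Week", "monday")
)
result.show()

For 202245 this yields 2022-11-07, matching the expected value; if your weeks follow a different convention (for example ISO weeks), the first-Monday anchor would need adjusting.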

'list' object has no attribute 'map' in pyspark error

""" df = sc.textFile("/content/Shakespeare.txt")
llist = df.collect()
for line in llist:
t= simple_tokenize(line)
rdd2 = t.map(lambda word: (word,1)) # error on this line
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
"""
I am facing an error on rdd2. Can someone please help?
The error happens because simple_tokenize(line) apparently returns a plain Python list, and a Python list has no .map method (map is an RDD transformation). I think what you want is a simple word count using RDDs. You can do it like this:
from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.getOrCreate()

# read the file from a url
url = "https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("shakespeare.txt"), header=True)
df.show(4)
+------------------------------------------+
|THE SONNETS |
+------------------------------------------+
|by William Shakespeare |
|From fairest creatures we desire increase |
|That thereby beauty's rose might never die|
|But as the riper should by time decease |
+------------------------------------------+
only showing top 4 rows
# convert to rdd taking only the strings of the row
rdd=df.rdd.map(lambda x: x["THE SONNETS"])
rdd.take(4)
['by William Shakespeare',
'From fairest creatures we desire increase',
"That thereby beauty's rose might never die",
'But as the riper should by time decease']
# you can also parallelize a Python list of strings
data=["From fairest creatures we desire increase",
"That thereby beauty's rose might never die",
"But as the riper should by time decease",
"His tender heir might bear his memory",
]
rdd=spark.sparkContext.parallelize(data)
Now run the basic three steps:
- split each line into words
- map each word to (word, 1)
- reduce by key to sum the counts
rdd1=rdd.flatMap(lambda x: x.split(" "))
rdd2=rdd1.map(lambda word: (word,1))
rdd3=rdd2.reduceByKey(lambda a,b: a+b)
rdd3.take(20)
[('by', 66),
('William', 1),
('Shakespeare', 1),
('From', 14),
('fairest', 5),
('creatures', 2),
('we', 11),
('desire', 6),
('increase', 3),
('That', 83),
('thereby', 1),
("beauty's", 16),
('rose', 5),
('might', 19),
('never', 10),
('die', 5),
('But', 89),
('as', 66),
('the', 311),
('riper', 2)]
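As a side note, your original error can be avoided by staying inside RDD transformations rather than collecting to the driver and looping. A sketch, using your sc and file path, and assuming simple_tokenize(line) returns a list of words:

rdd = sc.textFile("/content/Shakespeare.txt")

word_counts = (rdd
               .flatMap(simple_tokenize)          # tokenize each line into a list of words
               .map(lambda word: (word, 1))       # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))  # sum the counts per word
word_counts.take(20)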

Adding float column to TimestampType column (seconds + milliseconds)

I am trying to add a float column to a TimestampType column in PySpark, but there does not seem to be a way to do this while maintaining the fractional seconds. An example float_seconds value is 19.702300786972046, and an example timestamp is 2021-06-17 04:31:32.48761.
what I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
# method 1 adds a single fixed interval, but cannot be used to add an entire column to the timestamp column
calculated_df = beginning_df.withColumn("calculated_column", col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))

# method 2 converts the float column to unix time, but cuts off the decimals (which are important)
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

schema = StructType([
    StructField('dt', TimestampType(), nullable=True),
    StructField('sec', FloatType(), nullable=True),
])

item1 = {
    "dt": datetime.fromtimestamp(1611859271.516),
    "sec": 19.702300786,
}
item2 = {
    "dt": datetime.fromtimestamp(1611859271.517),
    "sec": 19.702300787,
}
item3 = {
    "dt": datetime.fromtimestamp(1611859271.518),
    "sec": 19.702300788,
}

df = spark.createDataFrame([item1, item2, item3], schema=schema)
df.printSchema()

@udf(returnType=TimestampType())
def add_time(dt, sec):
    # add the float seconds as a timedelta; Python datetimes keep microsecond precision
    return dt + timedelta(seconds=sec)

df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
df.printSchema()
df.show(truncate=False)
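Alternatively, if microsecond precision (which is what Spark's TimestampType stores) is enough, the UDF can be avoided by casting to a double and back. A sketch on the same df:

from pyspark.sql import functions as F

# casting a timestamp to double gives epoch seconds with a fractional part;
# add the float seconds, then cast back to a timestamp
df_no_udf = df.withColumn(
    "new_dt",
    (F.col("dt").cast("double") + F.col("sec")).cast("timestamp"),
)
df_no_udf.show(truncate=False)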
The Timestamp data type supports at most nanosecond precision (9 fractional digits). Your float_seconds_col has more than 9 fractional digits (15 in your example, i.e. femtoseconds), so that extra precision will be lost in any conversion to a timestamp anyway.
Plain vanilla Hive:
select
    timestamp(
        cast(concat(cast(unix_timestamp(TimestampType_col) as string),   -- seconds
                    '.',
                    regexp_extract(TimestampType_col, '\\.(\\d+)$'))     -- fractional part
             as decimal(30, 15))
        + float_seconds_col  -- round this value to nanos to get a better timestamp conversion: round(float_seconds_col, 9)
    ) as result              -- max precision is 9 (nanoseconds)
from
(
    select 19.702300786972046 float_seconds_col,
           timestamp('2021-06-17 04:31:32.48761') TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786

Get sum of a column in a DF1 based on date range from DF2 in Spark

I have two dataframes, and I want to compute the sum of the values in dataframe 1 for each date range (startDate, endDate) in dataframe 2, and sort the results from maximum to minimum, in Spark.
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  ("2019-01-01", 100), ("2019-01-02", 150),
  ("2019-01-03", 120), ("2019-01-04", 38),
  ("2019-01-05", 200), ("2019-01-06", 381),
  ("2019-01-07", 220), ("2019-01-08", 183),
  ("2019-01-09", 160), ("2019-01-10", 109),
  ("2019-01-11", 130), ("2019-01-12", 282),
  ("2019-01-13", 10), ("2019-01-14", 348),
  ("2019-01-15", 20), ("2019-01-16", 190)
)).toDF("date", "value").withColumn("date", to_date($"date"))

val df_dates = sc.parallelize(Seq(
  ("2019-01-01", "2019-01-04"),
  ("2019-01-05", "2019-01-08"),
  ("2019-01-09", "2019-01-12"),
  ("2019-01-13", "2019-01-16")
)).toDF("startDate", "endDate")
  .withColumn("startDate", to_date($"startDate"))
  .withColumn("endDate", to_date($"endDate"))
The desired output adds a sum_value column to the df_dates dataframe. I really do not know where to start; I searched the web and couldn't find a solution.
You first have to join date values to the date ranges, then aggregate:
df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  .orderBy($"sum_value".desc)
  .show()
+----------+----------+---------+
| startDate| endDate|sum_value|
+----------+----------+---------+
|2019-01-05|2019-01-08| 984|
|2019-01-09|2019-01-12| 681|
|2019-01-13|2019-01-16| 568|
|2019-01-01|2019-01-04| 408|
+----------+----------+---------+
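For reference, the same join-then-aggregate can also be written in PySpark (a sketch, assuming dataframes with the same column names as above):

from pyspark.sql import functions as F

result = (df_dates
          .join(df, F.col("date").between(F.col("startDate"), F.col("endDate")), "left")
          .groupBy("startDate", "endDate")
          .agg(F.sum("value").alias("sum_value"))
          .orderBy(F.col("sum_value").desc()))
result.show()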

Aggregate data in scala

I have data in a file like:
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70
The data is: year, month, day, temperature.
I want to read this data and output the temperatures grouped by year and month.
Example: 2005-08: 38, 50, 52, 70, with the temperatures sorted in ascending order.
What should the Spark Scala code for this be? An answer using RDD transformations would be much appreciated.
This is what I have done so far:
val conf= new SparkConf().setAppName("demo").setMaster("local[*]")
val spark = new SparkContext(conf)
val input = spark.textFile("src/main/resources/someFile.txt")
val fields = input.flatMap(_.split(","))
What I am thinking is to use year-month as the key and a list of temperatures as the value, but I am not able to get this into code.
val myData = sc.parallelize(Array((2005, 8, 20, 50), (2005, 8, 21, 52), (2005, 8, 22, 38), (2005, 8, 23, 70)))
myData.sortBy(_._4).collect
returns:
res1: Array[(Int, Int, Int, Int)] = Array((2005,8,22,38), (2005,8,20,50), (2005,8,21,52), (2005,8,23,70))
I leave the concatenation into the final output format to you.
From a file:
val filesRDD = sc.textFile("/FileStore/tables/Weather2.txt", 1)
val linesRDD = filesRDD
  .map(line => line.trim.split(","))
  .map(entries => (entries(0).toInt, entries(1).toInt, entries(2).toInt, entries(3).toInt))
linesRDD.sortBy(_._4).collect
returns:
res13: Array[(Int, Int, Int, Int)] = Array((2005,7,22,7), (2005,7,15,10), (2005,8,22,38), (2005,8,20,50), (2005,7,19,50), (2005,8,21,52), (2005,7,21,52), (2005,8,23,70))
You can work out the concatenation yourself; also think about what happens when sort values are tied (you may need a secondary sort). But I think this answers your original, somewhat loosely formed question.
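For what it's worth, the year-month-key idea you describe maps directly onto RDD transformations; here is a PySpark sketch of it (the Scala version is analogous):

raw = sc.parallelize([(2005, 8, 20, 50), (2005, 8, 21, 52),
                      (2005, 8, 22, 38), (2005, 8, 23, 70)])

result = (raw
          .map(lambda t: ("%d-%02d" % (t[0], t[1]), t[3]))  # key = "year-month", value = temperature
          .groupByKey()
          .mapValues(lambda temps: sorted(temps)))          # temperatures ascending within each key
result.collect()
# [('2005-08', [38, 50, 52, 70])]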