I am new to Spark and have a CSV file with data like this:
date, accidents, injured
2015/20/03 18:00, 15, 5
2015/20/03 18:30, 25, 4
2015/20/03 21:10, 14, 7
2015/20/02 21:00, 15, 6
I would like to aggregate this data by the hour in which each event happened. My idea is to take a substring of the date, 'year/month/day hh' without the minutes, and use it as a key. I want the average of accidents and injured for each hour. Is there a different, smarter way to do this with PySpark?
Thanks guys!

Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
        ('2015/20/03 18:30', 25, 4),
        ('2015/20/03 21:10', 14, 7),
        ('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])
# group on the first 13 characters, i.e. the date plus the hour
df.groupby(df['date'].substr(1, 13).alias('date_hour')) \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()
If, however, you want to do more computation later on, you can parse the date string into a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime

# parse 'year/day/month hour:minute' strings into timestamps
parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())

df.withColumn('date_parsed', parseString(col('date'))) \
    .withColumn('date_only', getDate(col('date_parsed'))) \
    .withColumn('hour', getHour(col('date_parsed'))) \
    .groupby('date_only', 'hour') \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()
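To make the grouping logic concrete without a Spark session, here is a minimal plain-Python sketch of the same idea: slice the date string down to the hour, use that as a key, and average the two columns per key. The function name average_by_hour is only illustrative and is not part of any Spark API.

```python
from collections import defaultdict

# Sample rows from the question: (date string, accidents, injured)
rows = [
    ("2015/20/03 18:00", 15, 5),
    ("2015/20/03 18:30", 25, 4),
    ("2015/20/03 21:10", 14, 7),
    ("2015/20/02 21:00", 15, 6),
]

def average_by_hour(rows):
    """Group rows by the first 13 characters of the date ('YYYY/DD/MM HH')
    and return {hour_key: (avg_accidents, avg_injured)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # accidents sum, injured sum, row count
    for date, accidents, injured in rows:
        key = date[:13]  # same slice as substr(1, 13), which is 1-based in Spark
        totals[key][0] += accidents
        totals[key][1] += injured
        totals[key][2] += 1
    return {k: (a / n, i / n) for k, (a, i, n) in totals.items()}

hourly = average_by_hour(rows)
for key, (avg_acc, avg_inj) in sorted(hourly.items()):
    print(key, avg_acc, avg_inj)
```

Only the two 18:xx rows on 2015/20/03 share a key here, so they are the only ones that actually get averaged together.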


Pyspark - extract first monday of week

Given this DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data2 = [("James", 202245)]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Week", IntegerType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
The Week column encodes the year and the week number, i.e., 202245 is the 45th week of 2022. I would like to extract the date of the Monday of that week, which for the 45th week of 2022 is 7 Nov 2022.
What I tried, using datetime and a UDF:
import datetime

def get_monday_from_week(x: int) -> datetime.datetime:
    """
    Converts fiscal week to datetime of first Monday from that week

    Args:
        x (int): fiscal week

    Returns:
        datetime: datetime of first Monday from that week
    """
    x = str(x)
    r = datetime.datetime.strptime(x + "-1", "%Y%W-%w")
    return r
How can I implement this with Spark functions? I am trying to avoid using a UDF here.
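As a starting point, the %W rule used in the UDF above can be written as plain date arithmetic: find the first Monday of the year, then add whole weeks. The sketch below is plain Python rather than Spark code, and the name monday_of_week is only illustrative. It assumes that "fiscal week" follows the %W convention (week 1 starts on the first Monday of the year), which can differ from ISO week numbering in some years.

```python
from datetime import date, timedelta

def monday_of_week(year_week: int) -> date:
    """Return the Monday of week `year_week` (encoded as YYYYWW), using the
    %W rule: week 1 starts on the first Monday of the year."""
    year, week = divmod(year_week, 100)
    jan1 = date(year, 1, 1)
    # weekday(): Monday == 0, so this is the number of days until the first Monday
    days_to_first_monday = (7 - jan1.weekday()) % 7
    first_monday = jan1 + timedelta(days=days_to_first_monday)
    return first_monday + timedelta(weeks=week - 1)

print(monday_of_week(202245))  # 2022-11-07
```

The calculation needs only the year, the weekday of 1 January, and a day offset, so it does not depend on string parsing.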

'list' object has no attribute 'map' in pyspark error

df = sc.textFile("/content/Shakespeare.txt")
llist = df.collect()
for line in llist:
    t = simple_tokenize(line)
rdd2 = t.map(lambda word: (word, 1))  # error on this line
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)
I am facing an error on rdd2. Can someone please help?
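I think you want a simple word count using an RDD. Note that collect() brings the data back to the driver as a plain Python list, and lists have no map method, so the transformations must be applied to the RDD itself. You can do it like this: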
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
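# read file from url (the url value itself is not shown here)
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("shakespeare.txt"), header=True)
df.show(4, truncate=False)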
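+------------------------------------------+
|THE SONNETS                               |
+------------------------------------------+
|by William Shakespeare                    |
|From fairest creatures we desire increase |
|That thereby beauty's rose might never die|
|But as the riper should by time decease   |
+------------------------------------------+
only showing top 4 rows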
# convert to rdd taking only the strings of the row
rdd = df.rdd.map(lambda x: x["THE SONNETS"])
rdd.take(4)
['by William Shakespeare',
'From fairest creatures we desire increase',
"That thereby beauty's rose might never die",
'But as the riper should by time decease']
# you can also parallelize a Python list of strings
data = ["From fairest creatures we desire increase",
        "That thereby beauty's rose might never die",
        "But as the riper should by time decease",
        "His tender heir might bear his memory"]
rdd = spark.sparkContext.parallelize(data)
Now run the three basic steps:
1. split by words
2. count the words
3. reduce by key
rdd1 = rdd.flatMap(lambda x: x.split(" "))
rdd2 = rdd1.map(lambda word: (word, 1))
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)
rdd3.take(20)
[('by', 66),
('William', 1),
('Shakespeare', 1),
('From', 14),
('fairest', 5),
('creatures', 2),
('we', 11),
('desire', 6),
('increase', 3),
('That', 83),
('thereby', 1),
("beauty's", 16),
('rose', 5),
('might', 19),
('never', 10),
('die', 5),
('But', 89),
('as', 66),
('the', 311),
('riper', 2)]
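For readers who want to see what the three steps do without a Spark session, here is a rough plain-Python sketch of the same pipeline on the four sample lines. It only mirrors the logic of flatMap, map, and reduceByKey and is not Spark code. The helper name add_pair is made up for the illustration.

```python
from functools import reduce
from itertools import chain

lines = [
    "From fairest creatures we desire increase",
    "That thereby beauty's rose might never die",
    "But as the riper should by time decease",
    "His tender heir might bear his memory",
]

# Step 1 (like flatMap): split every line and flatten into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# Step 2 (like map): pair every word with a count of 1
pairs = [(word, 1) for word in words]

# Step 3 (like reduceByKey): add up the counts that share the same word
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})
print(word_counts["might"])  # 2
```

The split is case-sensitive, so "His" and "his" are counted as different words, just as in the RDD version.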

Adding float column to TimestampType column (seconds+milliseconds)

I am trying to add a float column to a TimestampType column in PySpark, but there does not seem to be a way to do this while keeping the fractional seconds. An example value of float_seconds is 19.702300786972046, and an example timestamp is 2021-06-17 04:31:32.48761.
What I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
#method 1 adds a single time, but cannot be used to add an entire column to the timestamp column.
calculated_df = beginning_df.withColumn("calculated_column",col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))
#method 2 converts the float column to unixtime, but cuts off the decimals (which are important)
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType
spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

# note: FloatType is 32-bit, so sec keeps only about 7 significant digits
schema = StructType([
    StructField('dt', TimestampType(), nullable=True),
    StructField('sec', FloatType(), nullable=True),
])

item1 = {
    "dt": datetime.fromtimestamp(1611859271.516),
    "sec": 19.702300786,
}
item2 = {
    "dt": datetime.fromtimestamp(1611859271.517),
    "sec": 19.702300787,
}
item3 = {
    "dt": datetime.fromtimestamp(1611859271.518),
    "sec": 19.702300788,
}
df = spark.createDataFrame([item1, item2, item3], schema=schema)

# register the function as a UDF so it can be applied to columns
@udf(returnType=TimestampType())
def add_time(dt, sec):
    return dt + timedelta(seconds=sec)

df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
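The body of the UDF is ordinary Python, so its behavior can be checked on its own. The sketch below uses the example values from the question. The name add_seconds is only illustrative. Python's datetime has microsecond resolution, so the sum is rounded to six fractional digits.

```python
from datetime import datetime, timedelta

def add_seconds(ts: datetime, seconds: float) -> datetime:
    """Add fractional seconds to a timestamp; the result is rounded to microseconds."""
    return ts + timedelta(seconds=seconds)

start = datetime(2021, 6, 17, 4, 31, 32, 487610)
result = add_seconds(start, 19.702300786972046)
print(result)  # 2021-06-17 04:31:52.189911
```

The fractional part survives up to microseconds, while digits beyond that are dropped by the datetime type itself.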
The timestamp data type in Hive supports nanoseconds (at most 9 digits of fractional precision). Your float_seconds_col has more than 9 fractional digits (15 in your example, which is femtoseconds), so the extra digits will be lost when converting to a timestamp anyway.
Plain vanilla Hive:
select timestamp(cast(concat(cast(unix_timestamp(TimestampType_col) as string), --seconds
                             regexp_extract(TimestampType_col,'(\\.\\d+)$')) --fractional part, including the dot
                      as decimal (30, 15)
                     ) + float_seconds_col --round this value to nanos to get better timestamp conversion (round(float_seconds_col,9))
                ) as result --max precision is 9 (nanoseconds)
from
(
select 19.702300786972046 float_seconds_col,
       timestamp('2021-06-17 04:31:32.48761') TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786

Get sum of a column in a DF1 based on date range from DF2 in Spark

I have two DataFrames, and I want to sum the value column of DataFrame 1 over each date range in DataFrame 2 (startDate to endDate), then sort the results from maximum to minimum in Spark.
import org.apache.spark.sql.functions.to_date
val df = sc.parallelize(Seq(
("2019-01-01", 100), ("2019-01-02", 150),
("2019-01-03", 120), ("2019-01-04", 38),
("2019-01-05", 200), ("2019-01-06", 381),
("2019-01-07", 220), ("2019-01-08", 183),
("2019-01-09", 160), ("2019-01-10", 109),
("2019-01-11", 130), ("2019-01-12", 282),
("2019-01-13", 10), ("2019-01-14", 348),
("2019-01-15", 20), ("2019-01-16", 190)
)).toDF("date", "value").withColumn("date", to_date($"date"))
val df_dates = sc.parallelize(Seq(
("2019-01-01", "2019-01-04"),
("2019-01-05", "2019-01-08"),
("2019-01-09", "2019-01-12"),
("2019-01-13", "2019-01-16")
)).toDF("startDate", "endDate").withColumn("startDate", to_date($"startDate")).withColumn("endDate", to_date($"endDate"))
The resulting output should add a sum_value column to the df_dates DataFrame. I really do not know where to start. I searched the web and couldn't find a solution.
You first have to join date values to the date ranges, then aggregate:
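import org.apache.spark.sql.functions.sum

df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  .orderBy($"sum_value".desc)
  .show()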
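+----------+----------+---------+
| startDate|   endDate|sum_value|
+----------+----------+---------+
|2019-01-05|2019-01-08|      984|
|2019-01-09|2019-01-12|      681|
|2019-01-13|2019-01-16|      568|
|2019-01-01|2019-01-04|      408|
+----------+----------+---------+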
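The join-then-aggregate logic can be checked by hand with a small plain-Python sketch over the same sample data. It is an illustration of the range matching and sorting only, not Spark code, and the name sum_by_range is made up for the example. One difference to keep in mind: a range with no matching dates gets a sum of 0 here, whereas the left join above would give it a null sum.

```python
from datetime import date

# (date, value) rows, mirroring the first DataFrame in the question
values = [
    (date(2019, 1, day), value)
    for day, value in enumerate(
        [100, 150, 120, 38, 200, 381, 220, 183,
         160, 109, 130, 282, 10, 348, 20, 190],
        start=1,
    )
]

# (startDate, endDate) rows, mirroring the second DataFrame
ranges = [
    (date(2019, 1, 1), date(2019, 1, 4)),
    (date(2019, 1, 5), date(2019, 1, 8)),
    (date(2019, 1, 9), date(2019, 1, 12)),
    (date(2019, 1, 13), date(2019, 1, 16)),
]

def sum_by_range(values, ranges):
    """For each inclusive range, sum the values whose date falls inside it,
    then sort the ranges by that sum, largest first."""
    totals = [
        (start, end, sum(v for d, v in values if start <= d <= end))
        for start, end in ranges
    ]
    return sorted(totals, key=lambda row: row[2], reverse=True)

result = sum_by_range(values, ranges)
for start, end, total in result:
    print(start, end, total)
```

The comparison start <= d <= end is inclusive on both ends, which matches the behavior of between in the join condition.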

Aggregate data in scala

I have data in a file like :
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70
The columns are: year, month, day, temperature.
I want to read this data and output the temperatures grouped by year and month.
Example: 2005-08: 38, 50, 52, 70.
The temperatures are sorted in ascending order.
What would the Spark Scala code for this be? An answer using RDD transformations would be much appreciated.
This is what I have so far:
val conf= new SparkConf().setAppName("demo").setMaster("local[*]")
val spark = new SparkContext(conf)
val input = spark.textFile("src/main/resources/someFile.txt")
val fields = input.flatMap(_.split(","))
My idea is to use year-month as the key and the list of temperatures as the value, but I have not been able to turn this into code.
val myData = sc.parallelize(Array((2005, 8, 20, 50), (2005, 8, 21, 52), (2005, 8, 22, 38), (2005, 8, 23, 70)))
// sort by the fourth field, the temperature
myData.sortBy(_._4).collect()
res1: Array[(Int, Int, Int, Int)] = Array((2005,8,22,38), (2005,8,20,50), (2005,8,21,52), (2005,8,23,70))
I'll leave the concat function to you.
From file
val filesRDD = sc.textFile("/FileStore/tables/Weather2.txt",1)
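// trim each field so that values such as " 08" parse cleanly
val linesRDD = filesRDD.map(line => line.split(",").map(_.trim)).map(entries => (entries(0).toInt, entries(1).toInt, entries(2).toInt, entries(3).toInt))
linesRDD.sortBy(_._4).collect()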
res13: Array[(Int, Int, Int, Int)] = Array((2005,7,22,7), (2005,7,15,10), (2005,8,22,38), (2005,8,20,50), (2005,7,19,50), (2005,8,21,52), (2005,7,21,52), (2005,8,23,70))
You can work out the concat yourself. Also consider what happens when the sort values are not unique: you may then need to sort on more than one field. Still, I think this answers your first, less well-formed question.
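The remaining step, keying by year-month and collecting the sorted temperatures, can be sketched in plain Python to show the intended result. This is only an illustration of the grouping logic on the sample lines from the question, not Spark code. The function name temperatures_by_month is made up for the example.

```python
from collections import defaultdict

lines = [
    "2005, 08, 20, 50",
    "2005, 08, 21, 52",
    "2005, 08, 22, 38",
    "2005, 08, 23, 70",
]

def temperatures_by_month(lines):
    """Return {'YYYY-MM': [temperatures sorted ascending]}."""
    grouped = defaultdict(list)
    for line in lines:
        # strip whitespace around each field before converting to int
        year, month, _day, temp = (int(field.strip()) for field in line.split(","))
        grouped[f"{year}-{month:02d}"].append(temp)
    return {key: sorted(temps) for key, temps in grouped.items()}

result = temperatures_by_month(lines)
for key, temps in sorted(result.items()):
    print(f"{key}: {', '.join(map(str, temps))}")  # 2005-08: 38, 50, 52, 70
```

In RDD terms this corresponds roughly to mapping each row to a (year-month, temperature) pair, grouping by key, and sorting the values within each group.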