For loop to fill NA with dates - date

I have a DF with consecutive dates followed by some NAs; it is also worth mentioning that [1] will always have a Date.
I wish to fill the trailing NAs with consecutive dates.
I have : "2016-01-01", "2016-01-02", "2016-01-03", NA, NA
Expected outcome within the DF will be:
"2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"
I attempted to use a for loop but I am fairly amateur at this:
Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)
i <- which.max(is.na(DF$Date))
for (i in which.max(is.na(DF$Date)):max(length(DF$Date))) {
  if (is.na(DF$Date)) {
    DF$Date[i] = as.Date(max(DF$Date, na.rm = TRUE) + 1)
  }
}
It returns this:
Warning messages:
1: In if (is.na(DF$Date)) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(DF$Date)) { :
the condition has length > 1 and only the first element will be used

Found it out myself: the if condition wasn't a single value, it was testing the whole column of TRUE/FALSE NA flags at once instead of one element per iteration.
Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)
for (i in which.max(is.na(DF$Date)):length(DF$Date)) {
  if (is.na(DF$Date[i])) {
    DF$Date[i] <- max(DF$Date, na.rm = TRUE) + 1
  }
}
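For what it's worth, the loop can be avoided entirely. A vectorized sketch, assuming (as stated above) that the NAs form a single trailing block after the last known date:

Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)
# count the trailing NAs and extend from the last known date
n_missing <- sum(is.na(DF$Date))
if (n_missing > 0) {
  DF$Date[is.na(DF$Date)] <- max(DF$Date, na.rm = TRUE) + seq_len(n_missing)
}
DF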

Related

Is it bad to use `GroupBy` multiple times in pyspark?

This is an educational question.
I have a text file containing several records of the power consumption of factories, each identified by a unique id. The file contains the following columns
factory_id, city, country, date, consumption
where date is in the format mm/YYYY. I want to compute which countries have fewer than 20 cities (including those with 0) that experienced a decrease in factories' consumption over two consecutive years. A city's consumption here is simply the total yearly consumption of the factories located in it.
To do this, I used a groupBy + agg multiple times, as follows:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = df.withColumn("year", F.split("Date", "/")[1])

# compute the yearly consumption for each city
df_consump = df.groupBy("Country", "City", "year").agg(
    F.sum("consumption").alias("consumption")
)
@F.udf(returnType=T.IntegerType())
def had_a_decrease(structs):
    structs = sorted(structs, key=lambda s: s.year)
    # return 0 if the list is monotonically growing, 1 otherwise
    cur_cons = structs[0].consumption
    for struct in structs[1:]:
        cons = struct.consumption
        if cons <= cur_cons:
            return 1
        cur_cons = cons
    return 0
df_cons_decrease = df_consump.groupBy("Country", "City").agg(
    # here I collect a list of structs containing (year, consumption)
    # which is needed because collect_list doesn't guarantee the order
    # is respected so I keep the info on the year to sort this (small)
    # list first in the udf "had_a_decrease" defined above.
    # eventually this yields a column with a 1 if we had a decrease, 0 otherwise,
    # which I sum afterwards.
    had_a_decrease(F.collect_list(F.struct("year", "consumption"))).alias("had_decrease")
)

df_cons_decrease.groupBy("Country").agg(
    F.sum("had_decrease").alias("num_cities_with_decrease")
).filter("num_cities_with_decrease < 20") \
    .write.csv(outputFolder)
However, I was wondering:
is this bad practice (e.g. inefficient)?
are DataFrames better suited than RDDs for this?
would you recommend a better approach than grouping this many times?
Compare the consumption with the consumption 1 and 2 years ago by using Window and the lag function, without a UDF, and then group by.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

data = [
    [1, 1, 1, '01/2022', 100],
    [1, 1, 1, '01/2021', 90],
    [1, 1, 1, '01/2020', 80],
    [1, 1, 2, '01/2022', 100],
    [1, 1, 2, '01/2021', 110],
    [1, 1, 2, '01/2020', 120]
]
cols = ['factory_id', 'city', 'country', 'date', 'consumption']

df = spark.createDataFrame(data, cols) \
    .withColumn('year', f.split('date', '/')[1])

w = Window.partitionBy('country', 'city').orderBy('year')

df.groupBy('country', 'city', 'year') \
    .agg(f.sum('consumption').alias('consumption')) \
    .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
    .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
    .withColumn('is_decreased', f.expr('if(`consumption` < `consumption-1` and `consumption-1` < `consumption-2`, true, false)')) \
    .filter('is_decreased = true') \
    .select('country', 'city').distinct() \
    .groupBy('country').count() \
    .filter('count < 20') \
    .select('country') \
    .show()
+-------+
|country|
+-------+
| 2|
+-------+
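On the "is this inefficient?" part of the question, a quick way to judge it yourself is to compare the physical plans of the two pipelines. A sketch, assuming the DataFrames are built as in the snippets above:

# A plain Python UDF shows up in the plan as a BatchEvalPython step (rows are
# serialized to a Python worker and back), and every groupBy or Window that
# needs a different partitioning shows up as an Exchange (shuffle).
df_cons_decrease.explain()

Whichever plan shows fewer Exchange and Python evaluation steps will generally move less data and be cheaper.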

Apply groupBy in a UDF from an increase function in PySpark

I have the following function:
import copy

rn = 0

def check_vals(x, y):
    global rn
    if (y != None) & (int(x)+1) == int(y):
        return rn + 1
    else:
        # Using copy to deepcopy and not forming a shallow one.
        res = copy.copy(rn)
        # Increment so that the next value will start from +1
        rn += 1
        # Return the same value as we want to group using this
        return res + 1
    return 0

@pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
    return lambda x, y: check_vals(x, y)
I need to apply this function to the following df:

index  initial_range  final_range
1      1              299
1      300            499
1      500            699
1      800            1000
2      10             99
2      100            199

So I need the following output:

index  min_val  max_val
1      1        699
1      800      1000
2      10       199

Note that within each index the rows form new ranges: the min(initial_range) and max(final_range) of each run of rows, until the sequence is broken; that is what the groupBy should produce.
I tried:
w = Window.partitionBy('index').orderBy(sf.col('initial_range'))

df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
        .fillna(0, subset=['nextRange'])
        .groupBy('index')
        .agg(check_final("final_range", "nextRange").alias('check_1'))
        .withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
        .withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
     )
But it didn't work.
Can anyone help me?
I think the pure Spark SQL API can solve your question without any UDF, which could otherwise hurt Spark performance. Also, I think two window functions are enough here:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df.withColumn(
    'next_row_initial_diff',
    func.col('initial_range') - func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
    'group',
    func.sum(
        func.when(func.col('next_row_initial_diff').isNull() | (func.col('next_row_initial_diff') == 1), func.lit(0))
        .otherwise(func.lit(1))
    ).over(
        Window.partitionBy('index').orderBy('initial_range')
    )
).groupBy(
    'group', 'index'
).agg(
    func.min('initial_range').alias('min_val'),
    func.max('final_range').alias('max_val')
).drop(
    'group'
).show(100, False)
Column next_row_initial_diff: like the lead you used, it shifts (lags) the previous row's final_range so we can check whether the current row continues the sequence.
Column group: a running sum that gives each unbroken sequence within an index partition its own group id.
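The snippet above assumes an existing df; a minimal construction of the question's sample data for trying it out might look like this (spark being an already created SparkSession):

df = spark.createDataFrame(
    [(1, 1, 299), (1, 300, 499), (1, 500, 699), (1, 800, 1000),
     (2, 10, 99), (2, 100, 199)],
    ['index', 'initial_range', 'final_range'],
)

With that input, the pipeline should reproduce the expected output from the question: (1, 1, 699), (1, 800, 1000) and (2, 10, 199).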

Extracting Specific Field from String in Scala

My dataframe returns the below result as a String.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want
"cbcnt":0 <-- Numeric part of this
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00 --<--additional " is coming
Using the PySpark function regexp_extract:
from pyspark.sql import functions as F

df = <dataframe with a column "text" that contains the input data>
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d+)""", 1)).show()
Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
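Since the expected output also includes the date row, a slightly broader pattern can help; a sketch in the same spirit as the attempt above (assuming the column is named Col, the usual org.apache.spark.sql.functions.regexp_extract import, and spark implicits for $):

// Match the optional opening quote outside the capture group and stop the
// group at the next quote, comma or brace, so both 0 and the timestamp
// come out without the stray leading " character.
df.withColumn("CbRes", regexp_extract($"Col", """"cbcnt":"?([^",}]+)""", 1))

Note this still returns the full timestamp for the date row; trimming it down to 2021-07-30 would need an extra substring or to_date step.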

Group_by returns just one row while aggregate returns the expected outcome

I am currently stuck at the post-processing of some EddyData. Following an example (https://github.com/bgctw/REddyProc/blob/master/vignettes/aggUncertainty.md) I ran into an unexpected outcome of group_by which is reproducible, but I don't understand why.
Group_by returns just one row while aggregate gives the expected outcome.
Here is a minimal example:
library(tidyverse)

# create example data frame
date.time <- seq(from = as.POSIXct("2015-01-01 00:30:00"), to = as.POSIXct("2015-01-03 00:30:00"), by = "30 mins")
nee <- runif(length(date.time), -200, 200)
df <- data.frame(date.time, nee)

# calculate day of the year
df <- df %>% mutate(
  date.time = df$date.time
  , DoY = as.POSIXlt(date.time - 15*60)$yday # midnight belongs to the previous day
)

# trying to summarise nee for each day
aggDay <- df %>% group_by(DoY) %>% summarise(nee = sum(nee))
aggDay
nee
1 322.1195
aggDay just returns one row, while aggregate works as expected in this case:
aggregate(df$nee, by=list(df$DoY), sum)
Group.1 x
1 0 -25.15698
2 1 448.13960
3 2 -100.86310
Unfortunately, the original code involves some further calculations, which is why I'd like to stay with group_by.
# original code, not reproducible here
aggDay <- df %>% group_by(DoY) %>%
  summarise(
    DateTime = first(DateTime)
    , nRec = sum(NEE_uStar_fqc == 0, na.rm = TRUE)
    , nEff = computeEffectiveNumObs(
        resid, effAcf = !!autoCorr, na.rm = TRUE)
    , NEE = mean(NEE_uStar_f, na.rm = TRUE)
    , sdNEE = if (nEff <= 1) NA_real_ else sqrt(
        mean(NEE_uStar_fsd^2, na.rm = TRUE) / (nEff - 1))
    , sdNEEuncorr = if (nRec == 0) NA_real_ else sqrt(
        mean(NEE_uStar_fsd^2, na.rm = TRUE) / (nRec - 1))
  )
I restarted RStudio and now it works. Don't ask me. There must have been a problem with another loaded package.
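For anyone hitting the same symptom: a common culprit (an assumption here, since the offending package wasn't identified) is that plyr, or another package exporting summarise(), was loaded after dplyr and masked it; plyr::summarise ignores the grouping and returns a single row. Calling the dplyr versions explicitly side-steps the masking:

# assumed fix for the masking scenario described above
aggDay <- df %>%
  dplyr::group_by(DoY) %>%
  dplyr::summarise(nee = sum(nee))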

Aggregate in Julia like R or pandas

I want to aggregate a monthly series at the quarterly frequency, for which R has ts and aggregate() (see the first answer on this thread) and pandas has df.resample("Q").sum() (see this question). Does Julia offer something similar?
Appendix: my current solution uses a function to convert a date to the first month of its quarter, plus split-apply-combine:
"""
month_to_quarter(date)
Returns the date corresponding to the first day of the quarter enclosing date
# Examples
```jldoctest
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 1, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 25))
true
```
"""
function month_to_quarter(date::Date)
    new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
    return Date(Dates.year(date), new_month, 1)
end
"""
monthly_to_quarterly(monthly_df)
Aggregates a monthly data frame to the quarterly frequency. The data frame should have a :DATE column.
# Examples
```jldoctest
julia> monthly = convert(DataFrame, hcat(collect([Dates.Date(1990, m, 1) for m in 1:3]), [1; 2; 3]));
julia> rename!(monthly, :x1 => :DATE);
julia> rename!(monthly, :x2 => :value);
julia> quarterly = RED.monthly_to_quarterly(monthly);
julia> quarterly[:value][1]
2.0
julia> length(quarterly[:value])
1
```
"""
function monthly_to_quarterly(monthly::DataFrame)
    # quarter months: 1, 4, 7, 10
    quarter_months = collect(1:3:10)
    # Deep copy the data frame
    monthly_copy = deepcopy(monthly)
    # Drop initial rows until it starts on a quarter
    while !in(Dates.month(monthly_copy[:DATE][1]), quarter_months)
        # Verify that something is left to pop
        @assert 1 <= length(monthly_copy[:DATE])
        monthly_copy = monthly_copy[2:end, :]
    end
    # Drop end rows until it finishes on the last month of a quarter
    while !in(Dates.month(monthly_copy[:DATE][end]), 2 + quarter_months)
        monthly_copy = monthly_copy[1:end-1, :]
    end
    # Change the month of each date to its quarter's first month
    monthly_copy[:DATE] = month_to_quarter.(monthly_copy[:DATE])
    # Split-apply-combine
    quarterly = by(monthly_copy, :DATE, df -> mean(df[:value]))
    # Rename
    rename!(quarterly, :x1 => :value)
    return quarterly
end
I couldn't find such a function in the docs. Here's a more DataFrames.jl-ish and more succinct version of your own answer:
using DataFrames
# copy-pasted your own function
function month_to_quarter(date::Date)
    new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
    return Date(Dates.year(date), new_month, 1)
end
# the data
r = collect(1:6)
monthly = DataFrame(date = [Dates.Date(1990, m, 1) for m in r],
                    val = r);

# the functionality
monthly[:quarters] = month_to_quarter.(monthly[:date])
_aggregated = by(monthly, :quarters, df -> DataFrame(S = sum(df[:val])))
@show monthly
@show _aggregated
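On current DataFrames.jl releases (roughly 1.x, an assumption since the answer above predates them), the monthly[:col] indexing and by() are no longer available; a sketch of the same idea with today's API:

using DataFrames, Dates

# same helper: map a date to the first month of its quarter
month_to_quarter(date::Date) = Date(year(date), 1 + 3 * fld(month(date) - 1, 3), 1)

r = 1:6
monthly = DataFrame(date = [Date(1990, m, 1) for m in r], val = r)

monthly[!, :quarters] = month_to_quarter.(monthly.date)
aggregated = combine(groupby(monthly, :quarters), :val => sum => :S)
@show aggregated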