Nonsense result in a PySpark calculation using a lambda over a Row - pyspark

I have a df in PySpark representing the km driven by cars per day.
Explanation of the fields:
x[3] is the day of the week in Italian, i.e:
LUN (MON)
MAR (TUE)
MER (WED)
GIO (THU)
VEN (FRI)
SAB (SAT)
DOM (SUN)
x[22] are the kilometers driven in night hours
x[23] are the kilometers driven in daytime hours
The objective is to insert 3 new columns. In pseudo code:
km_totali (total km): x[22]+x[23]
km_festivo (week-end km): x[22]+x[23] if x[3] in [SAB, DOM] i.e. SAT or SUN; else 0
km_feriale (weekday km): x[22]+x[23] if x[3] in [LUN-VEN]. i.e. MON-FRI; else 0
In other words, km_totali should always be calculated.
km_festivo should be equal to km_totali on the weekend days, 0 otherwise
km_feriale should be equal to km_totali on the week-days, 0 otherwise.
This apparently easy task is driving me crazy.
This is my code:
new_df=df.rdd.map(lambda x: Row(
giorno_della_settimana=x[3],
km_totali=x[22]+x[23],
km_feriale=x[22]+x[23] if x[3] in ["LUN","MAR", "MER", "GIO", "VEN"] else 0,
km_festivo=x[22]+x[23] if x[3] in ["SAB","DOM"] else 0
)).toDF()
As you can see in the output below, km_totali is ALWAYS calculated (and it is correct, I checked). It also correctly calculates km_festivo and km_feriale during weekdays (MER and LUN in the example below). But it ALWAYS fails to calculate km_festivo and km_feriale during weekend days (the Nones in the DOM line below).
OUTPUT:
Row(giorno_della_settimana=u'DOM', km_feriale=None, km_festivo=None, km_totali=106.5),
Row(giorno_della_settimana=u'MER', km_feriale=8.2, km_festivo=0, km_totali=8.2),
Row(giorno_della_settimana=u'LUN', km_feriale=3.0, km_festivo=0, km_totali=3.0),
Notice that km_totali is correctly calculated also in the "DOM" case. In fact, the desired output for the DOM line would be:
Row(giorno_della_settimana=u'DOM', km_feriale=0, km_festivo=106.5, km_totali=106.5)
What drives me completely crazy is that if I extend the list of days in km_festivo's condition to the whole week (meaningless from a semantic point of view), the km_festivo field gets calculated correctly in the DOM line as well:
new_df=df.rdd.map(lambda x: Row(
giorno_della_settimana=x[3],
km_totali=x[22]+x[23],
km_feriale=x[22]+x[23] if x[3] in ["LUN","MAR", "MER","GIO", "VEN"] else 0,
km_festivo=x[22]+x[23] if x[3] in ["LUN","MAR", "MER","GIO", "VEN","SAB","DOM"] else 0  # <- extended list
)).toDF()
OUTPUT for the DOM line:
Row(giorno_della_settimana=u'DOM', km_feriale=None, km_festivo=106.5, km_totali=106.5),
As you can see, km_festivo is now calculated for the DOM line (i.e. a weekend day) ONLY because I have included weekdays in the condition list. This makes no sense!
I feel this is not a matter of coding, but I cannot figure out what it could possibly be due to.
HELP

SOLVED!
In the else clause, I have to return 0.0 instead of 0.
Otherwise I end up with a column of mixed types, sometimes floats and sometimes integers, and the mismatched values come out as None.
This is the right code:
new_df=df.rdd.map(lambda x: Row(
giorno_della_settimana=x[3],
km_totali=x[22]+x[23],
km_feriale=x[22]+x[23] if x[3] in ["LUN","MAR", "MER", "GIO", "VEN"] else 0.0,
km_festivo=x[22]+x[23] if x[3] in ["SAB","DOM"] else 0.0
)).toDF()
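An equivalent sketch of the same fix, shown only as an alternative: explicitly casting the sums with float() also keeps every field a single type, so the inferred schema never has to reconcile ints and floats (this assumes the same df and Row import as above):
from pyspark.sql import Row
new_df = df.rdd.map(lambda x: Row(
giorno_della_settimana=x[3],
km_totali=float(x[22]+x[23]),
km_feriale=float(x[22]+x[23]) if x[3] in ["LUN","MAR","MER","GIO","VEN"] else 0.0,
km_festivo=float(x[22]+x[23]) if x[3] in ["SAB","DOM"] else 0.0
)).toDF()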

Could you provide sample data?
I don't see anything wrong with your code, apart from the fact that you're doing redundant computations (x[22]+x[23] at least twice) instead of computing it once and reusing it afterwards.
Could you try:
import pyspark.sql.functions as psf
new_df = df.rdd.map(lambda x: Row(
giorno_della_settimana=x[3],
km_totali=x[22]+x[23]
)).toDF()
new_df2 = new_df.select(
"*",
psf.when(new_df.giorno_della_settimana.isin(["LUN","MAR", "MER","GIO", "VEN"]), new_df.km_totali).otherwise(0).alias("km_feriale"),
psf.when(new_df.giorno_della_settimana.isin(["SAB","DOM"]), new_df.km_totali).otherwise(0).alias("km_festivo"),
)
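A roughly equivalent sketch using withColumn instead of select, assuming the same new_df with km_totali already computed; the when/otherwise logic is identical, and lit(0.0) keeps the new columns as doubles:
import pyspark.sql.functions as psf
weekdays = ["LUN", "MAR", "MER", "GIO", "VEN"]
new_df2 = (new_df
    .withColumn("km_feriale",
                psf.when(psf.col("giorno_della_settimana").isin(weekdays),
                         psf.col("km_totali")).otherwise(psf.lit(0.0)))
    .withColumn("km_festivo",
                psf.when(psf.col("giorno_della_settimana").isin(["SAB", "DOM"]),
                         psf.col("km_totali")).otherwise(psf.lit(0.0))))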

CDO - Resample netcdf files from monthly to daily timesteps

I have a netcdf file that has monthly global data from 1991 to 2000 (10 years).
Using CDO, how can I modify the netcdf from monthly to daily timesteps by repeating the monthly values each day of each month?
for example,
convert from
Month 1, value = 0.25
to
Day 1, value = 0.25
Day 2, value = 0.25
Day 3, value = 0.25
....
Day 31, value = 0.25
convert from
Month 2, value = 0.87
to
Day 1, value = 0.87
Day 2, value = 0.87
Day 3, value = 0.87
....
Day 28, value = 0.87
Thanks
##############
Update
My monthly netcdf has the monthly values not on the first day of each month but on scattered days, e.g. on the 15th, 7th, 9th, etc.; there is, however, one value for each month.
The question is perhaps ambiguously worded. Adrian Tompkins' answer is correct for interpolation. However, you are actually asking to set the value for each day of the month to that for the first day of the month. You could do this by adding a second CDO call as follows:
cdo -inttime,1991-01-01,00:00:00,1day in.nc temp.nc
cdo -monadd -gtc,100000000000000000 temp.nc in.nc out.nc
Just set the value after gtc to something much higher than anything in your data: -gtc then produces a field of zeros on the daily time axis, and -monadd adds each month's value from in.nc back onto every day of that month.
You can use inttime which interpolates in time at the interval required, but this is not exactly what you asked for as it doesn't repeat the monthly values and your series will be smoothed by the interpolation.
If we assume your dataset starts on the 1st January at time 00:00 (you don't state in the question) then the command would be
cdo inttime,1991-01-01,00:00:00,1day in.nc out.nc
This performs a simple linear interpolation between steps.
Note: This is fine for fields like temperature and seems to be what you ask for, but readers should note that one has to be more careful with flux fields such as rainfall, where one might want to scale and/or change the units appropriately.
I could not find a solution with CDO but I solved the issue with R, as follows:
library(dplyr)
library(ncdf4)
library(reshape2)
## Read ncfile
ncpath="~/my/path/"
ncname="my_monthly_ncfile"
ncfname=paste(ncpath, ncname, ".nc", sep="")
ncin=nc_open(ncfname)
var=ncvar_get(ncin, "nc_var")
## melt ncfile
var=melt(var)
var=var[complete.cases(var), ] ## remove any NA
## split ncfile by gridpoint (lat and lon) into a list
var=split(var, list(var$lat, var$lon))
var=var[lapply(var,nrow)>0] ## remove any empty list element
## create new list and replicate, for each gridpoint, each monthly value n=30 times
var_rep=list()
for (i in 1:length(var)) {
var_rep[[i]]=data.frame(value=rep(var[[i]]$value, each=30))
}

Mean of values before and after a specific element

I have a 1 x 400 array where all element values should be above 1500. However, some elements have values < 50; these are wrong measurements, and I would like to replace each of them in the main array with the mean of the elements before and after it.
For instance, element number 17 is below 50 so I want to take the mean of elements 16 and 18 and replace element 17 with the new mean.
Can someone help me, please? many thanks in advance.
No language is specified in the question, but for Python you could use a list comprehension:
# array with 400 values, some of which are incorrect
arr = [...]
arr = [arr[i] if arr[i] >= 50 else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
That is, if arr[i] is less than 50, it'll be replaced by the average value of the element before and after it. There are two issues with this approach.
If i is the first or last index, then one of the two neighbours does not exist, so no mean can be obtained. This can be fixed by just using the value of the available neighbour, as shown below.
If two values in a row are very low, the leftmost one will use the rightmost one to calculate its value, which will result in a very low value. This is a problem that may not occur for you in practice, but it is an inherent result of the way you wish to recalculate values, and you might want to keep it in mind.
Improved version, keeping in mind the edge cases:
# don't alter the first and last item, even if they're low
arr = [arr[i] if arr[i] >= 50 or i == 0 or i+1 == len(arr) else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
# replace the first and last element if needed
if arr[0] < 50:
arr[0] = arr[1]
if arr[len(arr)-1] < 50:
arr[len(arr)-1] = arr[len(arr)-2]
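For example, with a small made-up array (five elements instead of 400), the bad reading in the third position gets replaced by the mean of its neighbours:
# hypothetical data: the third element (20) is a wrong measurement
arr = [1700, 1650, 20, 1600, 1580]
arr = [arr[i] if arr[i] >= 50 or i == 0 or i+1 == len(arr) else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
print(arr)  # [1700, 1650, 1625.0, 1600, 1580]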
I hope this answer was useful for you, even if you intend to use a language or framework other than Python.

truncatingRemainder(dividingBy:) returning non-zero remainder even if the number is evenly divisible

I am trying to get the remainder using Swift's truncatingRemainder(dividingBy:) method.
But I am getting a non-zero remainder even though the value I am using is evenly divisible by the divisor. I have tried a number of solutions available here but none worked.
P.S. The values I am using are Double (I tried Float as well).
Here is my code.
let price = 0.5
let tick = 0.05
let remainder = price.truncatingRemainder(dividingBy: tick)
if remainder != 0 {
return "Price should be in multiple of tick"
}
I am getting 0.049999999999999975 as remainder which is clearly not the expected result.
As usual (see https://floating-point-gui.de), this is caused by the way numbers are stored in a computer.
According to the docs, this is what we expect
let price = //
let tick = //
let r = price.truncatingRemainder(dividingBy: tick)
let q = (price/tick).rounded(.towardZero)
tick*q+r == price // should be true
In the case where it looks to your eye as if tick evenly divides price, everything depends on the inner storage system. For example, if price is 0.4 and tick is 0.04, then r is vanishingly close to zero (as you expect) and the last statement is true.
But when price is 0.5 and tick is 0.05, there is a tiny discrepancy due to the way the numbers are stored, and we end up with this odd situation where r, instead of being vanishingly close to zero, is vanishing close to tick! And of course the last statement is then false.
You'll just have to compensate in your code. Clearly the remainder cannot be the divisor, so if the remainder is vanishingly close to the divisor (within some epsilon), you'll just have to disregard it and call it zero.
You could file a bug on this but I doubt that much can be done about it.
Okay, I put in a query about this and got back that it behaves as intended, as I suspected. The reply (from Stephen Canon) was:
That's the correct behavior. 0.05 is a Double with the value 0.05000000000000000277555756156289135105907917022705078125. Dividing 0.5 by that value in exact arithmetic gives 9 with a remainder of 0.04999999999999997501998194593397784046828746795654296875, which is exactly the result you're seeing.
The only rounding error that occurs in your example is in the division price/tick, which rounds up to 10 before your .rounded(.towardZero) has a chance to take effect. We'll add an API to let you do something like price.divided(by: tick, rounding: .towardZero) at some point, which will eliminate this rounding, but the behavior of truncatingRemainder is precisely as intended.
You really want to have either a decimal type (also on the list of things to do) or to scale the problem by a power of ten so that your divisor becomes exact:
1> let price = 50.0
price: Double = 50
2> let tick = 5.0
tick: Double = 5
3> let r = price.truncatingRemainder(dividingBy: tick)
r: Double = 0

Finding equinox and solstice times with astropy

I would like to use the astropy package to compute the time of equinoxes and solstices. I have worked before with the pyephem package, and it provides easy functions exactly for this: one can, for example, say
>>> print(ephem.next_equinox(ephem.now()))
2019/9/23 07:50:14
and get the time of the next equinox. However, there are no such functions in astropy, so I thought I might try to compute the times by the definition: the vernal equinox is the moment when the ecliptic longitude of the Sun is zero; the summer solstice is the moment when the ecliptic longitude of the Sun is 90° etc.
So it seems that getting the ecliptic longitude of the Sun would be the essential step, and then I could somehow solve that function for time:
def sunEclipticLongitude(t):
sun = astropy.coordinates.get_body('sun', t)
eclipticOfDate = astropy.coordinates.GeocentricTrueEcliptic(equinox=t)
sunEcliptic = sun.transform_to(eclipticOfDate)
return sunEcliptic.lon.deg
My first thought was to use something from scipy.optimize to solve this function for time, but at this point I got stuck. The Sun's longitude is an angle, so there are obviously many solutions for lon=0 (this year's equinox, next year's equinox, ...). How do I find the next time (from a particular origin, for example now) when the Sun's longitude is zero? How do I find the previous time when it was zero? Also, the vernal equinox seems to be a particularly nasty case for solving, since the function has a discontinuity at that point – it jumps from 360 to 0. How do I handle that?
To answer your first question, the optimization strategy will still work for periodic functions, but the solution will depend on your optimization starting point. Just pick a point that's already pretty close. You know that equinoxes are Mar 21 and Sep 20 (or something like that), and solstices are June 21 and Dec 21. Pick an arbitrary time of day on those dates, and you'll find the right solution.
As for the discontinuity in the angle, it isn't really there... It's just that in this particular convention you elect to represent angles as numbers between 0 and 360 degrees. But mathematically, the angle 361 degrees means exactly the same thing as 1 degree and has just as much right to exist.
As a result, continuous periodic functions like sin(), cos() etc. will not show any discontinuity at that value of the angle (or any other value of the angle that you pick as minimum or maximum).
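A minimal sketch of that idea, reusing the sunEclipticLongitude(t) helper from the question: bracket the root a few days either side of a rough guess and solve for a zero of sin(longitude), which is smooth across the 360 -> 0 jump (the March date below is just an illustrative starting point):
import math
from astropy.time import Time
from scipy.optimize import brentq
def crossing(mjd):
    # sin of the longitude is smooth and changes sign at the equinox
    return math.sin(math.radians(sunEclipticLongitude(Time(mjd, format='mjd'))))
guess = Time('2020-03-21T00:00:00').mjd        # rough equinox date
equinox_mjd = brentq(crossing, guess - 10, guess + 10)
print(Time(equinox_mjd, format='mjd').isot)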
As you expected, the syzygies can be located using scipy.optimize, via a root-finding function, as in the code snippet below. But rather than looking for the next syzygy, it looks for the nearest one, by picking points 44 days before and after the given date and expecting that there will be one syzygy in between those dates. Since Earth doesn't have seasons shorter than 88 days within plus-or-minus 50,000 years, that's hopefully a pretty safe bet, and the astropy ephemerides aren't accurate for that long anyway. I employed sin(angle*2) in the (poorly named) linearize(angle) code below to convert ecliptic longitudes into a function that crosses zero at each quadrant.
The code is also in a gist, which might get refined: see find syzygy / equinox / solstice with astropy
# Find the nearest syzygy (solstice or equinox) to a given date.
# Solution for https://stackoverflow.com/questions/55838712/finding-equinox-and-solstice-times-with-astropy
# TODO: need to ensure we're using the right sun position functions for the syzygy definition....
import math
from astropy.time import Time, TimeDelta
import astropy.coordinates
from scipy.optimize import brentq
from astropy import units as u
# We'll usually find a zero crossing if we look this many days before and after
# given time, except when it is within a few days of a cross-quarter day.
# But going over 50,000 years back, season lengths can vary from 85 to 98 days!
# https://individual.utoronto.ca/kalendis/seasons.htm#seasons
delta = 44.0
def mjd_to_time(mjd):
"Return a Time object corresponding to the given Modified Julian Date."
return Time(mjd, format='mjd', scale='utc')
def sunEclipticLongitude(mjd):
"Return ecliptic longitude of the sun in degrees at given time (MJD)"
t = mjd_to_time(mjd)
sun = astropy.coordinates.get_body('sun', t)
# TODO: Are these the right functions to call for the definition of syzygy? Geocentric? True?
eclipticOfDate = astropy.coordinates.GeocentricTrueEcliptic(equinox=t)
sunEcliptic = sun.transform_to(eclipticOfDate)
return sunEcliptic.lon.deg
def linearize(angle):
"""Map angle values in degrees near the quadrants of the circle
into smooth functions crossing zero, for root-finding algorithms.
Note that for angles near 90 or 270, increasing angles yield decreasing results
>>> linearize(5) > 0 > linearize(355)
True
>>> linearize(95) > 0 > linearize(85)
False
"""
return math.sin(math.radians(angle * 2))
def map_syzygy(t):
"Map times into linear functions crossing zero at each syzygy"
return linearize(sunEclipticLongitude(t))
def find_nearest_syzygy(t):
"""Return the precise Time of the nearest syzygy to the given Time,
which must be within 43 days of one syzygy.
"""
syzygy_mjd = brentq(map_syzygy, t.mjd - delta, t.mjd + delta)
syzygy = mjd_to_time(syzygy_mjd)
syzygy.format = 'isot'
return syzygy
if __name__ == '__main__':
import doctest
doctest.testmod()
t0 = Time('2019-09-23T07:50:10', format='isot', scale='utc')
td = TimeDelta(1.0 * u.day)
seq = t0 + td * range(0, 365, 15)
for t in seq:
try:
syzygy = find_nearest_syzygy(t)
except ValueError as e:
print(f'{e=}, {t.value=}, {t.mjd-delta=}, {map_syzygy(t.mjd-delta)=}')
continue
print(f'{t.value=}, {syzygy.value=}, {sunEclipticLongitude(syzygy.mjd)=}')

Problem looking at data between 0 and -1

I'm trying to write a program that cleans data, using MATLAB. The program takes in the max and min that the data can be, and throws out data that is less than the min or greater than the max. There seems to be a small issue with the cleaning part. It ONLY happens when the minimum of the range being checked is 0: in that case, for one reason or another, the program won't throw away data points that are between 0 and -1. I've been trying to fix this for some time now, and I've noticed this is the only case where it happens. Also, if you run a SQL query selecting data that is < 0, it will leave out data between 0 and -1, so effectively the same error as what's happening to me. I'm wondering if anyone recognizes this and knows what it could be.
I would write such a function as:
function data = cleanseData(data, limits)
limits = sort(limits);
data = data( limits(1) <= data & data <= limits(2) );
end
an example usage:
a = rand(100,1)*10;
b = cleanseData(a, [-2 5]);
c = cleanseData(a, [0 -1]);
-1 is less than 0, so 0 should be the max value. And if this is the case it will keep points between -1 and 0 by your definition of the cleaning operation:
and throws out data that is less than the min or greater than the max.
If you want to throw away (using the above definition)
data points that are between 0 and -1
then you need to set 0 as the min value and -1 as the max value --- which does not make sense.
Also, I think you mean
and throws out data that is less than the min AND greater than the max.
It may be that the floats are getting cast to ints before the comparison. I don't know MATLAB, but in Python int(-0.5) == 0, which could explain the extra data points getting in. You can test this by setting the min to -1: if you then also get values from -1 to -2, you'll need to make sure casting isn't being done.
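A quick way to see the effect of that cast (shown in Python, since I don't know how your MATLAB code is written): int() truncates toward zero, so anything strictly between -1 and 0 becomes 0 and no longer tests as less than 0.
value = -0.5
print(int(value))        # 0  (truncation toward zero)
print(int(value) < 0)    # False -- the point would wrongly be kept
print(value < 0)         # True  -- comparing the float behaves as expected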
If I try to mimic your situation with SQL and run the following query against a table that has 1.00, 0.00, -0.20, -0.80, -1.00, -1.20 and -2.00 in the column SomeVal, it correctly returns -0.20 and -0.80, which is as expected.
SELECT SomeVal
FROM SomeTable
WHERE (SomeVal < 0) AND (SomeVal > - 1)
The same is true for MATLAB. Perhaps there's an error in your code. Check the above statement with your own SELECT statement to see if something's amiss.
I can imagine such a bug if you do something like
minimum = 0
if minimum and value < minimum
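Spelled out in Python, that pattern silently skips the whole range check whenever the minimum is 0, because 0 is falsy:
minimum = 0
value = -0.5
if minimum and value < minimum:
    print("discard", value)   # never reached: minimum is 0, i.e. falsy
else:
    print("keep", value)      # prints: keep -0.5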