R: Why does st_join give invalid times error? - left-join

I am trying to join two SpatialPointsDataFrames by nearest-neighbour analysis using sf::st_join(). Both files have been converted using st_as_sf(), but when I try the join I get the error
Error in rep(seq_len(nrow(x)), lengths(i)) : invalid 'times' argument
At this point I have tried swapping the x and y arguments and adjusting countless variations of the arguments, but nothing seems to work. I have checked the help file for sf::st_join() but don't see anything about a times argument, so I am unsure where this error comes from and why it keeps being thrown.
Below is a sample of my data set, which reproduces the error with the code further down.
> head(sf.eSPDF[[1]])
Simple feature collection with 6 features and 8 fields
geometry type: POINT
dimension: XY
bbox: xmin: 35.9699 ymin: -3.74514 xmax: 35.97065 ymax: -3.74474
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
# A tibble: 6 x 9
TIME ELEVATION LATITUDE LONGITUDE DATE V1 V2 Survey geometry
<dttm> <chr> <dbl> <dbl> <date> <dttm> <dttm> <dbl> <POINT [°]>
1 2012-01-20 07:26:05 1018 m -3.74 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.97047 -3.74474)
2 2012-01-20 07:27:35 1018 m -3.74 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.97057 -3.74486)
3 2012-01-20 07:27:39 1019 m -3.74 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.9706 -3.74489)
4 2012-01-20 07:27:47 1020 m -3.74 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.97065 -3.74489)
5 2012-01-20 07:28:05 1020 m -3.74 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.97035 -3.74498)
6 2012-01-20 07:28:26 1019 m -3.75 36.0 2012-01-20 2012-01-20 00:00:00 2012-01-31 00:00:00 1 (35.9699 -3.74514)
> head(sf.plt.centr)
Simple feature collection with 6 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: 35.75955 ymin: -3.91594 xmax: 36.0933 ymax: -3.401
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
PairID geometry
1 1 POINT (36.0933 -3.6731)
42 92 POINT (36.02593 -3.91594)
83 215 POINT (36.06496 -3.75837)
124 225 POINT (35.83156 -3.401)
165 251 POINT (35.75955 -3.54388)
206 2 POINT (36.08752 -3.69128)
Below is the code that I am using to check for working solutions
sf.eSPDF<-lapply(eSPDF, function(x){
st_as_sf(as(x, "SpatialPointsDataFrame"))
})
sf.plt.centr<-st_as_sf(as(plt.centr, "SpatialPointsDataFrame"))
x1<-head(sf.eSPDF[[1]])
x2<-head(sf.plt.centr)
check<-st_join(x1, x2, join=st_nn, maxdist = Inf, returnDist = T, progress = TRUE)
As you can see, the file I want to join to is an object within a list. All objects within that list have a structure identical to the example given. Eventually I want code that joins the sf.plt.centr file to each of the files within the list. Something like...
big.join<-lapply(sf.eSPDF, function(x){
  st_join(x, sf.plt.centr, join = st_nn, maxdist = Inf, returnDist = T, progress = TRUE)
})
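For what it's worth, the error message matches the call sf::st_join() makes internally on the result of the join function: rep(seq_len(nrow(x)), lengths(i)). A likely culprit is returnDist = TRUE, which (if I read nngeo correctly) makes st_nn return a list holding both indices and distances instead of the plain index list st_join expects. A minimal sketch of the per-element join without it, assuming the nngeo package:
library(sf)
library(nngeo)

# Join the single plot-centre layer to every sf object in the list;
# k = 1 asks st_nn for the single nearest neighbour of each point.
big.join <- lapply(sf.eSPDF, function(x) {
  st_join(x, sf.plt.centr, join = st_nn, k = 1, maxdist = Inf)
})
If the distances are needed as well, they would have to be computed separately (e.g. with st_distance) rather than passed through st_join.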

KDB: weighted median

How can one compute weighted median in KDB?
I can see that there is a function med for a simple median but I could not find something like wmed similar to wavg.
Thank you very much for your help!
For values v and weights w, med v where w gobbles space for larger values of w.
Instead, sort w into ascending order of v and look for where cumulative sums reach half their sum.
q)show v:10?100
17 23 12 66 36 37 44 28 20 30
q)show w:.001*10?1000
0.418 0.126 0.077 0.829 0.503 0.12 0.71 0.506 0.804 0.012
q)med v where "j"$w*1000
36f
q)w iasc v / sort w into ascending order of v
0.077 0.418 0.804 0.126 0.506 0.012 0.503 0.12 0.71 0.829
q)0.5 1*(sum;sums)#\:w iasc v / half the sum and cumulative sums of w
2.0525
0.077 0.495 1.299 1.425 1.931 1.943 2.446 2.566 3.276 4.105
q).[>]0.5 1*(sum;sums)#\:w iasc v / compared
1111110000b
q)v i sum .[>]0.5 1*(sum;sums)#\:w i:iasc v / weighted median
36
q)\ts:1000 med v where "j"$w*1000
18 132192
q)\ts:1000 v i sum .[>]0.5 1*(sum;sums)#\:w i:iasc v
2 2576
q)wmed:{x i sum .[>]0.5 1*(sum;sums)#\:y i:iasc x}
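As a quick check (my addition, reusing the v and w above), the lambda reproduces the result of the step-by-step version:
q)wmed[v;w]
36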
Some vector techniques worth noticing:
Applying two functions with Each Left ((sum;sums)#\:) and using Apply (.) with an operator on the result, rather than setting a variable, e.g. (0.5*sum yi)>sums yi:y i, or defining an inner lambda {sums[x]<0.5*sum x}y i
Grading one list with iasc to sort another
Multiple mappings through juxtaposition: v i sum ..
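For instance (a small illustration, not from the original answer), grading one list with iasc to sort another:
q)iasc 3 1 2            / positions that would sort the list
1 2 0
q)`a`b`c iasc 3 1 2     / index another list by that grade
`b`c`a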
You could effectively weight the median by duplicating (using where):
q)med 10 34 23 123 5 56 where 4 1 1 1 1 1
10f
q)med 10 34 23 123 5 56 where 1 1 1 1 1 4
56f
q)med 10 34 23 123 5 56 where 1 2 1 3 2 1
34f
If your weights are percentages (e.g. 0.15 0.10 0.20 0.30 0.25), convert them to equivalent whole/counting numbers:
q)med 1 2 3 4 5 where "i"$100*0.15 0.10 0.20 0.30 0.25
4f
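Note (my addition): the wmed lambda from the first answer gives the same result without the conversion to counts:
q)wmed[1 2 3 4 5;0.15 0.10 0.20 0.30 0.25]
4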

remove description lines and add time to the first column

AWK experts, I have a file as described below, and I wonder if it is possible to easily convert it to the form that I want.
The file contains multiple variables over one month (exactly one observation per day, but some days may be missing). The format for each day is the same except for the date/values. However, there are some description lines (containing words and numbers) at the end of each day, and the number of description lines varies between days.
KBO BTA Observations at 12Z 01 Feb 2020
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1000.0 92
925.0 765
850.0 1516
754.0 2546 13.0 9.3 78 9.85 150 2 310.2 340.6 312.0
752.0 2569 14.0 9.2 73 9.80 149 2 311.5 342.0 313.4
700.0 3173 -9.20 7.5 89 9.38 120 6 312.6 341.9 314.4
Station information and sounding indices
Station elevation: 2546.0
Lifted index: 1.83
Pres [hPa] of the Lifted Condensation Level: 693.42
1000 hPa to 500 hPa thickness: 5798.00
Precipitable water [mm] for entire sounding: 21.64
8022 KBO BTA Observations at 00Z 02 Feb 2020
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1000.0 97
925.0 758
850.0 1515
753.0 2546 10.8 6.8 76 8.30 190 3 307.9 333.4 309.5
750.0 2580 12.6 7.9 73 8.99 186 3 310.2 338.1 311.9
Here is what I want: remove all the description lines, read the date/time information, and put it in the first column.
Time PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
20200201t12Z 754.0 2546 13.0 9.3 78 9.85 150 2 310.2 340.6 312.0
20200201t12Z 752.0 2569 14.0 9.2 73 9.80 149 2 311.5 342.0 313.4
20200201t12Z 700.0 3173 -9.2 7.5 89 9.38 120 6 312.6 341.9 314.4
20200202t00Z 753.0 2546 10.8 6.8 76 8.30 190 3 307.9 333.4 309.5
20200202t00Z 750.0 2580 12.6 7.9 73 8.99 186 3 310.2 338.1 311.9
Any help is appreciated.
Kelly
something like this...
$ awk 'function m(x)
{return sprintf("%02d",int(index("JanFebMarAprMayJunJulAugSepOctNovDec",x)-1)/3+1)}
NR==1 {print "time PRES TEMP WDIR WSPD RELH"}
/^-+$/ {f=!f}
f {date=p[n] m(p[n-1]) p[n-2]}
!f {n=split($0,p)}
NF==11 && !/[^ 0-9.-]/ {print date,$0}' file | column -t
time PRES TEMP WDIR WSPD RELH
20200201 1000 10 230 5 90
20200201 900 9 200 6 85
20200201 800 9 100 6 87
20200202 1000 9.2 233 5 90
20200202 900 9.1 200 4 80
20200202 800 9 176 2 80
Explanation
the function m just returns the month number for a month string, by looking up its index in the month-name string and converting it to a zero-padded number
f keeps track of the dashed lines, so that the date can be parsed from the line preceding them
finally, data lines are found heuristically: the right number of fields and no characters other than digits, spaces, dots or minus signs.
$ cat tst.awk
/^-+$/ && ( ((++dashCnt) % 2) == 1 ) {
mthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",p[n-1])+2)/3
time = sprintf("%04d%02d%02d", p[n], mthNr, p[n-2])
}
/^[[:upper:][:space:]]+$/ && !doneHdr++ { print "Time", $0 }
/^[0-9.[:space:]]+$/ { print time, $0 }
{ n = split($0,p) }
$ awk -f tst.awk file | column -t
Time PRES TEMP WDIR WSPD RELH
20200001 1000 10 230 5 90
20200001 900 9 200 6 85
20200001 800 9 100 6 87
20200002 1000 9.2 233 5 90
20200002 900 9.1 200 4 80
20200002 800 9 176 2 80
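For the exact output format asked for (including the t12Z/t00Z timestamp), here is a sketch along the same lines; it assumes, as in the sample, that every header line ends "... at HHZ DD Mon YYYY" and that complete data rows have all 11 columns:
$ awk '
/Observations at/ {                  # header carries both time and date
    mth = sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", $(NF-1)) + 2) / 3)
    time = $NF mth $(NF-2) "t" $(NF-3)        # e.g. 20200201t12Z
}
/PRES/ && !hdr++ { print "Time", $0 }         # keep one copy of the column header
NF == 11 && /^[ 0-9.-]+$/ { print time, $0 }  # full 11-field numeric rows only
' file | column -t
Rows like "1000.0 92" drop out automatically because they have fewer than 11 fields, and the trailing "Station information ..." lines fail the numeric test.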

Plot histogram where x axis of the plot is a date

I know how to make a plot, but the data would be better represented as a histogram. Is there any way I can easily convert this to a histogram?
figure();
plot(two_weeks,xAxis);
This is a datetime data type
disp(two_weeks)
Columns 1 through 8
21-Nov-2018 00:00:00 22-Nov-2018 00:00:00 23-Nov-2018 00:00:00 24-Nov-2018 00:00:00 25-Nov-2018 00:00:00 26-Nov-2018 00:00:00 27-Nov-2018 00:00:00 28-Nov-2018 00:00:00
Columns 9 through 14
29-Nov-2018 00:00:00 30-Nov-2018 00:00:00 01-Dec-2018 00:00:00 02-Dec-2018 00:00:00 03-Dec-2018 00:00:00 04-Dec-2018 00:00:00
disp(xAxis)
5
12
1
7
13
24
2
27
62
0
3
17
74
4
Again, I want something that looks like this plot, except as a histogram. I've tried looking through the histogram documentation and the MATLAB help forum, but nothing answers my question or helps me make the histogram in the desired way.
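Since xAxis already holds one count per day, a histogram-style picture is just a bar chart over the datetime values; a minimal sketch (bar has accepted datetime x-data since R2016b, so treat that as an assumption about your MATLAB version):
figure();
bar(two_weeks, xAxis);            % one bar per day, datetime ruler on the x-axis
xlabel('Date'); ylabel('Count');
If an actual histogram object is wanted, histogram can take pre-binned counts via 'BinEdges'/'BinCounts'; with n days you need n+1 edges:
edges = [two_weeks, two_weeks(end) + days(1)];  % 15 edges for the 14 daily bins
histogram('BinEdges', edges, 'BinCounts', xAxis(:).');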

PostgreSQL: Summing info from Two Aggregated Tables

There is something wrong with my method or my logic here.
I am trying to sum all the data from both tables: if the two correspond, add them up; if either doesn't correspond, still show the individual query total, ending up with estimates per year in sequence.
I have tried LEFT JOINs, FULL JOINs, (UNIONs). Nothing comes close to just summing where possible and supplying the data otherwise.
The key point here is that pb and th_year hold the years for which the results are needed.
The error must be obvious in my code. The separate aggregate queries produce the correct results; it's the combining of the two queries where I am going wrong.
Would appreciate advice on this. I thought it would be simple, and it probably is; just stupidity on my side.
CREATE VIEW public.cf_th_data_totals_by_year_by_wc_2
AS SELECT
a.owner,
a.region,
a.district,
a.plantation,
b.th_year,
a.pb,
a.wc,
sum(a.tcf_calcarea + b.tth_calcarea) AS area,
sum(a.tcf_total + b.tth_total) AS total,
sum(a.tcf_ws + b.tth_ws) AS ws,
sum(a.tcf_util + b.tth_util) AS util,
sum(a.tcf_s + b.tth_s) AS s,
sum(a.tcf_a + b.tth_a) AS a,
sum(a.tcf_b + b.tth_b) AS b,
sum(a.tcf_c + b.tth_c) AS c,
sum(a.tcf_d + b.tth_d) AS d
FROM
(SELECT
cfdata.owner,
cfdata.region,
cfdata.district,
cfdata.plantation,
cfdata.pb,
cfdata.wc,
sum(cfdata.calcarea) AS tcf_calcarea,
sum(cfdata._ba) AS tcf_ba,
sum(cfdata._total) AS tcf_total,
sum(cfdata._ws) AS tcf_ws,
sum(cfdata._util) AS tcf_util,
sum(cfdata._s) AS tcf_s,
sum(cfdata._a) AS tcf_a,
sum(cfdata._b) AS tcf_b,
sum(cfdata._c) AS tcf_c,
sum(cfdata._d) AS tcf_d
FROM cfdata
GROUP BY cfdata.owner, cfdata.region, cfdata.district, cfdata.plantation, cfdata.pb, cfdata.wc
ORDER BY cfdata.owner, cfdata.region, cfdata.district, cfdata.plantation, cfdata.pb, cfdata.wc) a
JOIN
(SELECT
thdata.owner,
thdata.region,
thdata.district,
thdata.plantation,
thdata.th_year,
thdata.wc,
sum(thdata.calcarea) AS tth_calcarea,
sum(thdata.th_ba) AS tth_ba,
sum(thdata.th_total) AS tth_total,
sum(thdata.th_ws) AS tth_ws,
sum(thdata.th_util) AS tth_util,
sum(thdata.th_s) AS tth_s,
sum(thdata.th_a) AS tth_a,
sum(thdata.th_b) AS tth_b,
sum(thdata.th_c) AS tth_c,
sum(thdata.th_d) AS tth_d
FROM thdata
GROUP BY thdata.owner, thdata.region, thdata.district, thdata.plantation, thdata.th_year, thdata.wc
ORDER BY thdata.owner, thdata.region, thdata.district, thdata.plantation, thdata.th_year, thdata.wc) b
ON a.owner = b.owner AND a.region = b.region AND a.district = b.district and a.plantation = b.plantation AND a.pb = b.th_year AND a.wc = b.wc
GROUP BY a.owner, a.region, a.district, a.plantation, a.pb, b.th_year, a.wc
ORDER BY a.owner, a.region, a.district, a.plantation, a.pb, b.th_year, a.wc
thdata sample:
owner region district plantation compartment calcarea wc plantdate th_year th_age th_dbh th_ht th_vtree th_sph th_ba th_total th_ws th_util th_s th_a th_b th_c th_d thdata_id
KeyProjects Northern Marshlands River Glen A27 14.02 PFN 01/08/2009 2017 8 12.3 7.3 0.0289 179 28 70 14 56 42 14 0 0 0 1
KeyProjects Northern Marshlands River Glen A28 2.1 ESN 01/12/2010 2012 2 4.5 4.2 0 479 2 0 0 0 0 0 0 0 0 2
KeyProjects Northern Marshlands River Glen A28 2.1 ESN 01/12/2010 2014 4 10.2 9.6 0.0188 250 4 11 0 8 4 6 0 0 0 3
KeyProjects Northern Marshlands River Glen A29 2.71 ESN 01/08/2009 2011 2 4.5 4.2 0 479 3 0 0 0 0 0 0 0 0 4
KeyProjects Northern Marshlands River Glen A29 2.71 ESN 01/08/2009 2013 4 10.2 9.6 0.0188 250 5 14 0 11 5 8 0 0 0 5
cfdata sample:
owner region district plantation compartment wc pb calcarea cfage dbh ht vtree sph _ba _total _ws _util _s _a _b _c _d tmai umai smai cfdata_id
KeyProjects Northern Marshlands River Glen A01 EF1 2021 5.27 10 14.5 20.4 0.1109 1004 90 585 21 564 84 401 79 0 0 11.1 10.7 1.5 1
KeyProjects Northern Marshlands River Glen A02 EF1 2021 36.1 10 14.5 20.4 0.1109 1004 614 4007 144 3863 578 2744 542 0 0 11.1 10.7 1.5 2
KeyProjects Northern Marshlands River Glen A03 EF1 2021 5.5 10 14.5 20.4 0.1109 1004 94 611 22 589 88 418 83 0 0 11.1 10.7 1.5 3
KeyProjects Northern Marshlands River Glen A04 EF1 2021 11.91 10 14.5 20.4 0.1109 1004 202 1322 48 1274 191 905 179 0 0 11.1 10.7 1.5 4
KeyProjects Northern Marshlands River Glen A05 EF1 2022 39.17 11 14.9 21.8 0.1286 1000 705 5053 157 4857 666 3486 744 0 0 11.7 11.3 1.7 5
expected result:
owner region district plantation th_year pb wc area total ws util s a b c d
KeyProjects Northern Marshlands River Glen 2008 2008 EF1 620.49 44176 1788 42389 7562 31953 2852 0 0
KeyProjects Northern Marshlands River Glen 2009 2009 EF1 635.65 44319 1778 42476 7634 31993 2852 0 0
KeyProjects Northern Marshlands River Glen 2010 2010 EF1 1202.31 87980 3453 84487 14906 63883 5704 0 0
KeyProjects Northern Marshlands River Glen 2011 2011 EF1 1948.37 132378 5275 127104 22662 95895 8556 0 0
KeyProjects Northern Marshlands River Glen 2012 2012 EF1 1378.61 87928 3429 84477 14878 63922 5704 0 0
Ok, you have a few issues with your query:
In the main query, do not use sum(a.tcf_calcarea + b.tth_calcarea) AS area. You can simply add, but you should make sure to substitute any NULL values with 0 first: write coalesce(a.tcf_calcarea, 0) + coalesce(b.tth_calcarea, 0) AS area instead, for all the summed columns. This also means you are no longer aggregating at this level, so you should drop the final GROUP BY clause.
Now make a FULL OUTER JOIN between the two sub-queries. This means you get all rows from both sub-queries joined, and where a corresponding row does not exist on one side, its column values are NULL (see the toy example after these points).
It makes no sense to ORDER BY in a sub-query; the planner will process the row set in the way it sees best. Order at the outer level only.
Since the join matches a.pb = b.th_year, the two columns are redundant wherever both sides exist, so output coalesce(a.pb, b.th_year) as a single year column. Note that with a FULL JOIN this condition must live in the join condition itself, not in a WHERE clause: WHERE a.pb = b.th_year would discard every row where one side is missing (the comparison is NULL there), silently turning the full join back into an inner join.
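To see the FULL JOIN behaviour in isolation, here is a toy query (hypothetical inline tables, not your schema):
SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t1 (id, v)
FULL JOIN  (VALUES (2, 'x'), (3, 'y')) AS t2 (id, w)
USING (id)
ORDER BY id;
-- id | v | w
--  1 | a |        <- only in t1, w is NULL
--  2 | b | x      <- matched
--  3 |   | y      <- only in t2, v is NULL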
Some syntactical pointers:
Your sub-queries use only one table, so there is no need to work with table aliases; that saves you a lot of typing.
More savings: use positional parameters in your GROUP BY clause, so you can write GROUP BY 1, 2, 3, 4, 5, 6. Same with ORDER BY.
USING (owner, region, district, plantation, wc) would be shorter still and spare you the sub-query aliases for those columns, but since the a.pb = b.th_year condition has to be part of the join as well (see above) and USING cannot be combined with ON, spelling out the full ON list is the safer option here.
All in all, this is what you get:
CREATE VIEW public.cf_th_data_totals_by_year_by_wc_2 AS
SELECT coalesce(a.owner, b.owner) AS owner,
coalesce(a.region, b.region) AS region,
coalesce(a.district, b.district) AS district,
coalesce(a.plantation, b.plantation) AS plantation,
coalesce(a.pb, b.th_year) AS th_year,
coalesce(a.wc, b.wc) AS wc,
coalesce(a.tcf_calcarea, 0) + coalesce(b.tth_calcarea, 0) AS area,
...
FROM (
SELECT owner, region, district, plantation, pb, wc,
sum(calcarea) AS tcf_calcarea,
...
FROM cfdata
GROUP BY 1, 2, 3, 4, 5, 6) a
FULL JOIN (
SELECT owner, region, district, plantation, th_year, wc,
sum(calcarea) AS tth_calcarea,
...
FROM thdata
GROUP BY 1, 2, 3, 4, 5, 6) b
ON a.owner = b.owner AND a.region = b.region AND a.district = b.district
AND a.plantation = b.plantation AND a.wc = b.wc AND a.pb = b.th_year
ORDER BY 1, 2, 3, 4, 5, 6;

Matlab beginner median , mode and binning

I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First off, take a look at the distribution of your data. You can see that the majority of your data has double digits; the outliers are those with single digits, or those that are way larger than double digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. Also, the missing data are the blank spots between the commas. Let's write some MATLAB code to change these to NaN (not-a-number): if you try copying and pasting the vector above directly into MATLAB, it will give you a syntax error, because when you define numbers explicitly this way, all of them have to be there.
To do this, use regexprep to insert a NaN wherever the string has a comma, a space, then another comma. We need the statement as a string first, and then use eval to convert that string into an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
For the next question, we simply extract the values that are not missing, calculate the mean and median of what remains, and fill in the NaN values with the mean and the median respectively. For the bin boundaries, this amounts to filling each missing value with the value to its left (or right... depends on your definition, but let's use left). As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries - fill each gap with the value to its left
%// (assumes the first element is not missing, or find(yMissing)-1 would hit index 0)
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
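The assignment also mentions the mode; the same pattern works with MATLAB's mode function (my addition, not part of the original walkthrough):
%// Fill in missing values with the most frequent non-missing value.
%// Every value in this data occurs exactly once, so mode falls back to
%// the smallest value (0), as documented.
yMode = y;
yMode(yMissing) = mode(y_noNaN);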
The mean and median for the data of the non-missing values is:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values looks like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.