Remove description lines and add date/time as the first column

AWK experts, I have a file as described below, and I wonder if it is possible to easily convert it to the form that I want.
The file contains multiple variables over one month (only ONE observation per day, but some days may be missing). The format for each day is the same except for the date/values. However, there are some description lines (containing words and numbers) at the end of each day, and the number of description lines varies from day to day.
KBO BTA Observations at 12Z 01 Feb 2020
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1000.0 92
925.0 765
850.0 1516
754.0 2546 13.0 9.3 78 9.85 150 2 310.2 340.6 312.0
752.0 2569 14.0 9.2 73 9.80 149 2 311.5 342.0 313.4
700.0 3173 -9.20 7.5 89 9.38 120 6 312.6 341.9 314.4
Station information and sounding indices
Station elevation: 2546.0
Lifted index: 1.83
Pres [hPa] of the Lifted Condensation Level: 693.42
1000 hPa to 500 hPa thickness: 5798.00
Precipitable water [mm] for entire sounding: 21.64
8022 KBO BTA Observations at 00Z 02 Feb 2020
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1000.0 97
925.0 758
850.0 1515
753.0 2546 10.8 6.8 76 8.30 190 3 307.9 333.4 309.5
750.0 2580 12.6 7.9 73 8.99 186 3 310.2 338.1 311.9
Here is what I want: remove all the description lines, read the date/time information, and put it in the first column.
Time PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
20200201t12Z 754.0 2546 13.0 9.3 78 9.85 150 2 310.2 340.6 312.0
20200201t12Z 752.0 2569 14.0 9.2 73 9.80 149 2 311.5 342.0 313.4
20200201t12Z 700.0 3173 -9.2 7.5 89 9.38 120 6 312.6 341.9 314.4
20200202t00Z 753.0 2546 10.8 6.8 76 8.30 190 3 307.9 333.4 309.5
20200202t00Z 750.0 2580 12.6 7.9 73 8.99 186 3 310.2 338.1 311.9
Any help is appreciated.
Kelly

something like this...
$ awk 'function m(x)
{return sprintf("%02d",int(index("JanFebMarAprMayJunJulAugSepOctNovDec",x)-1)/3+1)}
NR==1 {print "time PRES TEMP WDIR WSPD RELH"}
/^-+$/ {f=!f}
f {date=p[n] m(p[n-1]) p[n-2]}
!f {n=split($0,p)}
NF==11 && !/[^ 0-9.-]/ {print date,$0}' file | column -t
time PRES TEMP WDIR WSPD RELH
20200201 1000 10 230 5 90
20200201 900 9 200 6 85
20200201 800 9 100 6 87
20200202 1000 9.2 233 5 90
20200202 900 9.1 200 4 80
20200202 800 9 176 2 80
Explanation
The function m() returns the month number from the month abbreviation by looking up its position in the month-name string and formatting it as a zero-padded number.
The flag f tracks the dashed lines, so the date can be parsed from the fields of the line that precedes each dashed block.
Finally, the heuristic for finding the data lines is the field count combined with the absence of any non-numeric characters (only digits, spaces, dots or minus signs are allowed).
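If you also want the observation hour in the timestamp, as in the requested 20200201t12Z format, here is a minimal sketch along the same lines. It assumes the date always appears as the last four fields of a header line containing "Observations at" (HHZ DD Mon YYYY) and that the full data rows have exactly 11 fields; the file name withtime.awk is just illustrative.
$ cat withtime.awk
# month(): map a three-letter month abbreviation to a zero-padded month number
function month(m) { return sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", m) + 2) / 3) }
# header lines end in e.g. "... at 12Z 01 Feb 2020": build YYYYMMDDtHHZ from the last four fields
/Observations at/ { time = $NF month($(NF-1)) sprintf("%02d", $(NF-2)) "t" $(NF-3) }
# print the desired column header before the first data row
NR == 1 { print "Time PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV" }
# data rows: 11 fields and nothing but digits, spaces, dots and minus signs
NF == 11 && !/[^ 0-9.-]/ { print time, $0 }
$ awk -f withtime.awk file | column -t
For the sample above, the Time column comes out as 20200201t12Z for the first block and 20200202t00Z for the second.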

$ cat tst.awk
# odd-numbered dashed lines follow the date header: build YYYYMMDD from that header's fields, saved in p[]
/^-+$/ && ( ((++dashCnt) % 2) == 1 ) {
    mthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",p[n-1])+2)/3
    time = sprintf("%04d%02d%02d", p[n], mthNr, p[n-2])
}
# print the all-uppercase column-name line once, prefixed with "Time"
/^[[:upper:][:space:]]+$/ && !doneHdr++ { print "Time", $0 }
# rows containing only digits, dots and spaces get the date prepended
/^[0-9.[:space:]]+$/ { print time, $0 }
# remember every line's fields for when the next dashed line appears
{ n = split($0,p) }
$ awk -f tst.awk file | column -t
Time PRES TEMP WDIR WSPD RELH
20200001 1000 10 230 5 90
20200001 900 9 200 6 85
20200001 800 9 100 6 87
20200002 1000 9.2 233 5 90
20200002 900 9.1 200 4 80
20200002 800 9 176 2 80

Related

How can I get the cumulative sum between a predefined number of days in SAS EG

I would like to find a way to calculate the cumulative sum between a predefined number of days.
Just to clarify, the days can be consecutive (e.g. 03AUG and 04AUG) or not (e.g. 04AUG and 06AUG).
Below is the data I have:
data have;
input ID $ DT:date9. Amount;
format DT date9.;
datalines;
A 09JUL2021 3600
A 03AUG2021 456
A 04AUG2021 33
A 06AUG2021 235
A 07AUG2021 100
A 09AUG2021 86
A 12AUG2021 456
A 24AUG2021 22
A 25AUG2021 987
A 26AUG2021 916
A 27AUG2021 81
;
run;
I want to be able to create a new variable that shows the cumulative amount between 2 or more days.
I should be able to choose each time whether I want the cumulative amount between 2 days, 3 days, and so on.
Below is the data I want when I choose to calculate the sum between two days:
data what_I_want;
input ID $ DT:date9. Amount Sum_Between_Days;
format DT date9.;
datalines;
A 09JUL2021 3600 0
A 03AUG2021 456 0
A 04AUG2021 33 489
A 06AUG2021 235 268
A 07AUG2021 100 335
A 09AUG2021 86 186
A 12AUG2021 456 0
A 24AUG2021 22 0
A 25AUG2021 987 0
A 26AUG2021 916 1925
A 27AUG2021 81 1984
;
run;
Below is the data I want when I choose to calculate the sum between 3 days:
data what_I_want;
input ID $ DT:date9. Amount Sum_Between_Days;
format DT date9.;
datalines;
A 09JUL2021 3600 0
A 03AUG2021 456 0
A 04AUG2021 33 0
A 06AUG2021 235 724
A 07AUG2021 100 0
A 09AUG2021 86 186
A 12AUG2021 456 0
A 24AUG2021 22 0
A 25AUG2021 987 0
A 26AUG2021 916 0
A 27AUG2021 81 2006
;
run;
Hopefully this makes sense, but please let me know if not.
Thanks in advance.

Error in post hoc test for lmer(): both multcomp() and emmeans()

I have a dataset of measurements of "Y" at different locations, and I am trying to determine how variable Y is influenced by variables A, B, and D by running a lmer() model and analyzing the results. However, when I reach the post hoc step, I receive an error when trying to analyze.
Here is an example of my data:
table <- " ID location A B C D Y
1 1 AA 0 0.6181587 -29.67 14.14 168.041
2 2 AA 1 0.5816176 -29.42 14.21 200.991
3 3 AA 2 0.4289670 -28.57 13.55 200.343
4 4 AA 3 0.4158891 -28.59 12.68 215.638
5 5 AA 4 0.3172721 -28.74 12.28 173.299
6 6 AA 5 0.1540603 -27.86 14.01 104.246
7 7 AA 6 0.1219355 -27.18 14.43 128.141
8 8 AA 7 0.1016643 -26.86 13.75 179.330
9 9 BB 0 0.6831649 -28.93 17.03 210.066
10 10 BB 1 0.6796935 -28.54 18.31 280.249
11 11 BB 2 0.5497743 -27.88 17.33 134.023
12 12 BB 3 0.3631052 -27.48 16.79 142.383
13 13 BB 4 0.3875498 -26.98 17.81 136.647
14 14 BB 5 0.3883785 -26.71 17.56 142.179
15 15 BB 6 0.4058061 -26.72 17.71 109.826
16 16 CC 0 0.8647298 -28.53 11.93 220.464
17 17 CC 1 0.8664036 -28.39 11.59 326.868
18 18 CC 2 0.7480748 -27.61 11.75 322.745
19 19 CC 3 0.5959143 -26.81 13.27 170.064
20 20 CC 4 0.4849077 -26.77 14.68 118.092
21 21 CC 5 0.3584687 -26.65 15.65 95.512
22 22 CC 6 0.3018285 -26.33 16.11 71.717
23 23 CC 7 0.2629121 -26.39 16.16 60.052
24 24 DD 0 0.8673077 -27.93 12.09 234.244
25 25 DD 1 0.8226558 -27.96 12.13 244.903
26 26 DD 2 0.7826429 -27.44 12.38 252.485
27 27 DD 3 0.6620447 -27.23 13.84 150.886
28 28 DD 4 0.4453213 -27.03 15.73 102.787
29 29 DD 5 0.3720257 -27.13 16.27 109.201
30 30 DD 6 0.6040217 -27.79 16.41 101.509
31 31 EE 0 0.8770987 -28.62 12.72 239.036
32 32 EE 1 0.8504547 -28.47 12.92 220.600
33 33 EE 2 0.8329484 -28.45 12.94 174.979
34 34 EE 3 0.8181102 -28.37 13.17 138.412
35 35 EE 4 0.7942685 -28.32 13.69 121.330
36 36 EE 5 0.7319724 -28.22 14.62 111.851
37 37 EE 6 0.7014828 -28.24 15.04 110.447
38 38 EE 7 0.7286984 -28.15 15.18 121.831"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
# Make sure location is a factor
df$location<-as.factor(df$location)
Here is my model:
# Load libraries
library(ggplot2)
library(pscl)
library(lmtest)
library(lme4)
library(car)
mod = lmer(Y ~ A * B * poly(D, 2) * (1|location), data = df)
summary(mod)
plot(mod)
I now need to determine what variables significantly influence Y. Thus I ran Anova() from the package car (output pasted here).
Anova(mod)
# Analysis of Deviance Table (Type II Wald chisquare tests)
#
# Response: Y
# Chisq Df Pr(>Chisq)
# A 8.2754 1 0.004019 **
# B 0.0053 1 0.941974
# poly(D, 2) 40.4618 2 1.636e-09 ***
# A:B 0.1709 1 0.679348
# A:poly(D, 2) 1.6460 2 0.439117
# B:poly(D, 2) 5.2601 2 0.072076 .
# A:B:poly(D, 2) 0.6372 2 0.727175
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This suggests that:
A significantly influences Y
B does not significantly influence Y
D significantly influences Y
So next I would run a post hoc test for each of these variables, but this is where I run into issues. I have tried both the emmeans and multcomp packages, as shown below:
library(emmeans)
emmeans(mod, list(pairwise ~ A), adjust = "tukey")
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
pairs(emmeans(mod, "A"))
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
library(multcomp)
summary(glht(mod, linfct = mcp(A = "Tukey")), test = adjusted("fdr"))
# Error in h(simpleError(msg, call)) :
# error in evaluating the argument 'object' in selecting a method for function 'summary': Variable(s) ‘depth’ of class ‘integer’ is/are not contained as a factor in ‘model’.
This is the first time I've run an ANOVA/post hoc test on a lmer() model, and though I've read a few introductory sites for this model, I'm not sure I am testing it correctly. Any help would be appreciated.
If I am looking at the data correctly, A is the variable that has values of 0, 1, ..., 7. Now look at your anova table, where you see that A has only 1 d.f., not 7 as it should for a factor having 8 levels. That means your model is taking A to be a numerical predictor -- which is rather meaningless. Make A into a factor and re-fit the model. You'll have better luck.
I also think you meant to have + (1|location) at the end of the model formula, rather than having the random effects interacting with some of the polynomial effects.

kdb - KDB Apply logic where column exists - data validation

I'm trying to perform some simple logic on a table, but I'd like to verify that the columns exist beforehand as a validation step. My data uses standard column names, though they are not always present in each data source.
While the following seems to work (just validating AAA at present), I need to expand it to ensure that PRI_AAA (and eventually many other variables) is present as well.
t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]
Two part question
This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to check whether a number of variables exist, or to insert a placeholder column of zeros if they do not?
Similarly, can we store a formula or instructions to apply within a dictionary (or table) to validate and return a calculation (e.g. BBB_VAL: BBB*BBB_PRICE)? Some calculations would depend on others (e.g. BBB_Tax_Basis = BBB_VAL - BBB_COSTS), so there could be iterative issues.
Thanks in advance!
A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.
func:{[t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
t;
];
};
This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.
q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0 0
39 17 97 50 90 40 10 1950 1530 3880 0 0
76 11 11 0 0 50 10 0 0 550 0 0
26 55 99 20 60 80 90 520 3300 7920 0 0
91 51 3 30 20 0 60 2730 1020 0 0 0
83 81 7 70 60 40 90 5810 4860 280 0 0
76 68 98 40 80 90 70 3040 5440 8820 0 0
88 96 30 70 0 80 80 6160 0 2400 0 0
4 61 2 70 90 0 40 280 5490 0 0 0
56 70 15 0 50 30 30 0 3500 450 0 0
As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.
q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE
And then a slightly modified function...
func:{[dict;t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
t;
];
};
yields a similar result.
q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0
39 17 97 50 90 40 10 1950 1530 0 0
76 11 11 0 0 50 10 0 0 0 0
26 55 99 20 60 80 90 520 3300 0 0
91 51 3 30 20 0 60 2730 1020 0 0
83 81 7 70 60 40 90 5810 4860 0 0
76 68 98 40 80 90 70 3040 5440 0 0
88 96 30 70 0 80 80 6160 0 0 0
4 61 2 70 90 0 40 280 5490 0 0
56 70 15 0 50 30 30 0 3500 0 0
Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.
q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL | (*;`AAA;`AAA_PRICE)
BBB_VAL | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other | (%:;`BBB_VAL)
func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1 1 10 4 1 11 1.1e-05
2 2 20 5 4 24 2.4e-05
3 3 30 6 9 39 3.9e-05
/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
--------------------------------------------------------------------------------------------
1 1 10 4 4 1 16 11 1.1e-05 0.0625
2 2 20 5 5 4 25 24 2.4e-05 0.04
3 3 30 6 6 9 36 39 3.9e-05 0.02777778
The only caveat is that your map can't have the same column name as both a key and in the value of your map, i.e. you cannot re-use column names. It's also assumed that all symbols in your map are column names (not global variables), though it could be extended to cover that.
EDIT: if you have a large number of column maps then it will be easier to define it in a more vertical fashion like so:
map:(!). flip(
(`AAA_VAL; (*;`AAA;`AAA_PRICE));
(`BBB_VAL; (*;`BBB;`BBB_PRICE));
(`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
(`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
(`BBB_Other; (reciprocal;`BBB_VAL))
);

Print all entries after a pattern without including the pattern - awk

Consider the following input file
* Z1 Z2 A1pre A2pre A1post A2post I1pre I2pre I1gs I2gs Eexc1 Eexc2 n1 n2 TKEpre TKEpost
* Z1: Atomic number of first fragment
* Z2: Atomic number of second fragment
* A1pre: Pre-neutron mass number of first fragment
* A2pre: Pre-neutron mass number of second fragment
* A1post: Post-neutron mass number of first fragment
* A2post: Post-neutron mass number of second fragment
* I1pre: Spin of first fragment after scission
* I2pre: Spin of second fragment after scission
* I1gs: Ground-state spin of first fragment
* I2gs: Ground-state spin of second fragment
* Eexc1: Excitation energy of first fragment [MeV]
* Eexc2: Excitation energy of second fragment [MeV]
* n1: Prompt neutrons emitted from first fragment
* n2: Primpt neutrons emitted from second fragment
* TKEpre: Pre-neutron total kinetic energy [MeV]
* TKEpost: Post-neutron total kinetic energy [MeV]
* Calculation with nominal model parameters
38 54 97 138 94 136 5.5 9.0 0.0 0.0 21.72 14.78 3 2 159.43 156.00
39 53 101 134 99 133 2.5 4.0 2.5 3.5 17.12 12.93 2 1 166.45 161.75
36 56 92 143 91 143 3.0 11.5 2.5 2.5 12.27 7.81 1 0 170.00 168.44
38 54 93 142 92 141 3.5 7.0 0.0 2.5 13.94 9.81 1 1 168.95 163.96
40 52 99 136 98 135 2.5 6.0 0.0 3.5 9.28 13.10 1 1 177.04 172.75
40 52 100 135 98 134 0.0 6.5 0.0 0.0 14.74 10.13 2 1 176.61 173.55
What I want to do is to print specific columns following the Calculation with nominal model parameters pattern.
So far I've tried
awk '{/Calculation with nominal model parameters/;line=FNR}{for(i=0;i<=NR-19;++i){getline;print $5 "\t" $6 "\t" $16 "\t" $16*$6/($5+$6) "\t" $16*$5/($5+$6)}}{print line}' input
and my output is more or less what I want
88 144 159.10 98.7517 60.3483
87 146 164.87 103.309 61.5609
92 141 163.96 99.2204 64.7396
98 135 172.75 100.091 72.6588
98 134 173.55 100.24 73.3099
98 134 173.55 100.24 73.3099
19
As you can see, the last two lines are identical. My goal is to avoid calculating the same thing twice. To achieve that, I thought I should finish the loop after NR - FNR_pattern; that's why I used the line variable.
The weird thing is that even if I hardcode the pattern's FNR, I don't get the desired output.
The weirdest thing is that if I replace the hardcoded FNR with the variable that holds the FNR information, i.e.
awk '{/Calculation with nominal model parameters/;line=FNR}{for(i=0;i<=NR-line;++i){getline;print $5 "\t" $6 "\t" $16 "\t" $16*$6/($5+$6) "\t" $16*$5/($5+$6)}}' input
I get the error
awk: (FILENAME=input FNR=2) fatal: division by zero attempted
Any idea how to solve this?
What you are missing, I think, is that getline changes NR:
getline Set $0 from next input record; set NF, NR, FNR, RT.
I wouldn't use getline at all.
awk 'isdata{print $5 "\t" $6 "\t" $16 "\t" $16*$6/($5+$6) "\t" $16*$5/($5+$6)}
/Calculation with nominal model parameters/{isdata="yes"}' input
I get:
94 136 156.00 92.2435 63.7565
99 133 161.75 92.7274 69.0226
91 143 168.44 102.936 65.5044
92 141 163.96 99.2204 64.7396
98 135 172.75 100.091 72.6588
98 134 173.55 100.24 73.3099
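For completeness, here is a small variation on the same idea (my own sketch, not part of the answer above) that skips blank lines and also prints a header row; the column names come from the legend at the top of the input file ($5 = A1post, $6 = A2post, $16 = TKEpost), and the two derived columns are labelled with the expressions that produce them:
awk '
  found && NF { print $5 "\t" $6 "\t" $16 "\t" $16*$6/($5+$6) "\t" $16*$5/($5+$6) }
  /Calculation with nominal model parameters/ {        # the marker line itself is not printed
      found = 1
      print "A1post\tA2post\tTKEpost\tTKEpost*A2post/(A1post+A2post)\tTKEpost*A1post/(A1post+A2post)"
  }
' input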

MATLAB beginner: median, mode and binning

I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using the median and mode, and the noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First off, take a look at the distribution of your data. You can see that the majority of your data has double digits. The outliers are those with single digits, or those that are way larger than double digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. Also, the missing data are the entries that are blank spaces between the commas. Let's write some MATLAB code to change these to NaN (not-a-number), because if you try copying and pasting this data directly into MATLAB, it will give you a syntax error: when you define an array explicitly this way, all of the values have to be there.
To do this, use regexprep so that wherever the string has a comma, a space, and then another comma, a NaN is inserted in between. We first need to express the statement as a string, and then use eval to convert that string into an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
For the next question, we simply extract the values that are not missing, calculate the mean and median of those, and fill in the NaN values with the mean and median. For the bin boundaries, this is the same as taking the value to the left (or right... it depends on your definition, but let's use left) of each missing value and filling it in. As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
The mean and median for the data of the non-missing values is:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values looks like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.