Trying to understand missing values - trend

When I add a trend smooth line to a multiple time series plot I keep getting the warning
Removed 2 row(s) containing missing values (geom_path)
although the data look fine to me:
date cost inc month
1 2019-07-11 50.00 0.00 2019-07-01
2 2019-07-11 50.00 0.00 2019-07-01
3 2019-07-15 1743.48 0.00 2019-07-01
4 2019-07-26 1000.00 0.00 2019-07-01
5 2019-07-01 0.00 2000.00 2019-07-01
6 2019-09-01 0.00 2500.00 2019-09-01
7 2019-10-01 0.00 1973.96 2019-10-01
I "gather" the variables with,
df <- a %>% select(date, cost, inc) %>% gather(key = "variable", value = "value", -date)
and make an area plot with,
ggplot(df, aes(x = date, y = value)) +
  geom_area(aes(color = variable, fill = variable),
            alpha = 0.5, position = position_dodge(0.8)) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))
but when I add the trend smooth line I keep getting "Removed 2 row(s) containing missing values (geom_path)" (along with a bunch of other singularity and reciprocal-condition warnings) no matter what data I delete or change.
p + stat_smooth(color = "#FC4E07", fill = "#FC4E07", method = "loess")
graph with trend smooth line

Okay, so I just forgot to define "p" first as the multiple time series plot.
p <- ggplot(df, aes(x = date, y = value)) +
geom_area(aes(color = variable, fill = variable),
alpha = 0.5, position = position_dodge(0.8)) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
values are okay, and graph looks like this:
multiple times series area plot with smooth trend line
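As an aside for readers coming from Python: the wide-to-long reshape that gather performs can be sketched with plain dictionaries. This is a hypothetical illustration of the reshape only, not part of the original R pipeline; the sample rows are taken from the data shown above.

```python
# Sketch of tidyr's gather(): reshape wide rows (date, cost, inc)
# into long rows (date, variable, value). Illustrative only.

wide = [
    {"date": "2019-07-11", "cost": 50.00, "inc": 0.00},
    {"date": "2019-07-15", "cost": 1743.48, "inc": 0.00},
    {"date": "2019-07-01", "cost": 0.00, "inc": 2000.00},
]

# One long row per (variable, original row) pair.
long_rows = [
    {"date": row["date"], "variable": var, "value": row[var]}
    for var in ("cost", "inc")
    for row in wide
]

for r in long_rows:
    print(r)
```

Each wide row contributes one long row per gathered column, so 3 wide rows become 6 long rows.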


Make a timeline from year and day numbers

I want to make a timeline.
The code below extracts information from columns A and B of some Excel workbooks. Column A contains years; column B contains the day number (within that year) on which an event happened.
My question is: how can I plot this with Station1, Station2, etc. on the y-axis and year on the x-axis? I want the graph to place a point on the day (in the right year) where my Excel sheet has data.
num = xlsread('station1.xlsx', 1, 'A:B');
num3 = xlsread('station2.xlsx', 1, 'A:B');
num4 = xlsread('station3.xlsx', 1, 'A:B');
num5 = xlsread('station5.xlsx', 1, 'A:B');
Example data:
num = 2000 193
2000 199
2000 220
2000 228
2000 241
2000 244
2000 250
2000 257
2016 287
2016 292
2016 294
2016 300
Use datetime and caldays to convert your year / day of year data into actual dates:
dnum = datetime(num(:,1),1,1) + caldays(num(:,2));
% dnum = '12-Jul-2000'
% '18-Jul-2000'
% '08-Aug-2000'
% ...
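The same conversion can be cross-checked with Python's standard library. This sketch mirrors the answer's convention of adding the day count to 1 January; note that if day-of-year 1 should map to 1 January itself, you would subtract one day.

```python
from datetime import date, timedelta

def doy_to_date(year, doy):
    # Mirrors datetime(year,1,1) + caldays(doy) from the answer above.
    # With this convention, day 193 of 2000 lands on 12-Jul-2000;
    # use timedelta(days=doy - 1) if day 1 should be 1 January itself.
    return date(year, 1, 1) + timedelta(days=doy)

print(doy_to_date(2000, 193))  # 2000-07-12, matching '12-Jul-2000'
```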
Plot a line with marks on every date:
hold on % to plot multiple lines
plot(dnum, 1*ones(size(dnum)), 'x-') % Change the 1 to the y-axis value
plot(dnum2, 2*ones(size(dnum2)), 'x-') % Line at y=2 with other dates dnum2
hold off
Output (zoomed in on x-axis to show year 2000 dates):
If your files are named as in your example, you can replace your whole code with a loop, avoiding the need to declare lots of num variables and call plot on many separate lines:
figure; hold on;
for ii = 1:5
num = xlsread(['station', num2str(ii), '.xlsx'], 1, 'A:B');
dnum = datetime(num(:,1),1,1) + caldays(num(:,2));
plot(dnum, ii*ones(size(dnum)), 'x-'); % Each station on its own y value
end
hold off

What is the issue in my calculation of Multivariate Kernel Estimation?

My intention is to find the class of a test data point using the Bayes classifier algorithm.
Suppose, the following training data describes heights, weights, and feet-lengths of various sexes
SEX HEIGHT(feet) WEIGHT (lbs) FOOT-SIZE (inches)
male 6 180 12
male 5.92 (5'11") 190 11
male 5.58 (5'7") 170 12
male 5.92 (5'11") 165 10
female 5 100 6
female 5.5 (5'6") 150 8
female 5.42 (5'5") 130 7
female 5.75 (5'9") 150 9
trans 4 200 5
trans 4.10 150 8
trans 5.42 190 7
trans 5.50 150 9
Now, I want to test a person with the following properties (test data) to find his/her sex,
HEIGHT(feet) WEIGHT (lbs) FOOT-SIZE (inches)
4 150 12
This may also be a multi-row matrix.
Suppose I am able to isolate only the male portion of the data and arrange it in a matrix,
and I want to evaluate its Parzen density function at the following row matrix, which holds the same measurements for another person (male/female/transgender)
(dataPoint may have multiple rows),
so that we can find how closely this data matches those males.
My attempted solution:
(1) I am unable to calculate the secondPart because of the dimensional mismatch of the matrices. How can I fix this?
(2) Is this approach correct?
MATLAB Code
male = [6.0000 180 12
5.9200 190 11
5.5800 170 12
5.9200 165 10];
dataPoint = [4 150 2]
variance = var(male);
parzen.m
function [retval] = parzen (male, dataPoint, variance)
clc
%male
%dataPoint
%variance
sub = male - dataPoint
up = sub.^2
dw = 2 * variance;
sqr = sqrt(variance*2*pi);
firstPart = sqr.^(-1);
e = dw.^(-1)
secPart = exp((-1)*e*up);
pdf = firstPart.* secPart;
retval = mean(pdf);
bayes.m
function retval = bayes (train, test, aprori)
clc
classCounts = rows(unique(train(:,1)));
%pdfmx = ones(rows(test), classCounts);
%%Parzen density.
%pdf = parzen(train(:,2:end), test(:,2:end), variance);
maxScore = 0;
pdfProduct = 1;
for type = 1 : classCounts
%if(type == 1)
clidxTrain = train(:,1) == type;
%clidxTest = test(:,1) == type;
trainMatrix = train(clidxTrain,2:end);
variance = var(trainMatrix);
pdf = parzen(trainMatrix, test, variance);
%dictionary{type, 1} = type;
%dictionary{type, 2} = prod(pdf);
%pdfProduct = pdfProduct .* pdf;
%end
end
for type=1:classCounts
end
retval = 0;
endfunction
First, your example person has a tiny foot!
Second, it seems you are mixing together kernel density estimation and naive Bayes. In a KDE, you estimate a pdf as a sum of kernels, one kernel per data point in your sample. So if you wanted to do a KDE of the height of males, you would add together four Gaussians, each one centered at the height of a different male.
In naive Bayes, you assume that the features (height, foot size, etc.) are independent and that each one is normally distributed. You estimate the parameters of a single Gaussian per feature from your training data, then use their product to get the joint probability of a new example belonging to a certain class. The first page that you link explains this fairly well.
In code:
clear
human = [6.0000 180 12
5.9200 190 11
5.5800 170 12
5.9200 165 10];
tiger = [
2 2000 17
3 1980 16
3.5 2100 18
3 2020 18
4.1 1800 20
];
dataPoints = [
4 150 12
3 2500 20
];
sigSqH = var(human);
muH = mean(human);
sigSqT = var(tiger);
muT = mean(tiger);
for i = 1:size(dataPoints, 1)
i
probHuman = prod( 1./sqrt(2*pi*sigSqH) .* exp( -(dataPoints(i,:) - muH).^2 ./ (2*sigSqH) ) )
probTiger = prod( 1./sqrt(2*pi*sigSqT) .* exp( -(dataPoints(i,:) - muT).^2 ./ (2*sigSqT) ) )
end
Comparing the probability of tiger vs. human lets us conclude that dataPoints(1,:) is a person while dataPoints(2,:) is a tiger. You can make this model more complicated by, e.g., adding prior probabilities of being one class or the other, which would then multiply probHuman or probTiger.
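For cross-checking the MATLAB above, the same per-feature Gaussian product can be sketched in standard-library Python. The male rows come from the question; the test points here are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    # Univariate normal density N(x; mu, var).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_likelihood(train_rows, point):
    # One Gaussian per feature, estimated from the training rows;
    # the class-conditional likelihood is the product over features.
    n = len(train_rows)
    likelihood = 1.0
    for j in range(len(point)):
        col = [row[j] for row in train_rows]
        mu = sum(col) / n
        # Sample variance (n - 1 denominator), matching MATLAB's var().
        var = sum((v - mu) ** 2 for v in col) / (n - 1)
        likelihood *= gaussian_pdf(point[j], mu, var)
    return likelihood

male = [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)]
# A typical male point scores much higher than an atypical one.
print(naive_bayes_likelihood(male, (6.0, 180, 12)))
print(naive_bayes_likelihood(male, (4.0, 150, 2)))
```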

Line chart with means and CIs in Stata

I have a data set with effect estimates for different points in time (for 1 month, 2 months, 6 months, 12 months and 18 months) and its standard errors. Now I want to plot the means for each period and the corresponding CIs around the means.
My sample looks like:
effect horizon se
0.03 1 0.2
0.02 6 0.01
0.01 6 0.3
0.00 1 0.4
0.04 18 0.2
0.02 2 0.05
0.01 2 0.02
... ...
The means of the effects for each horizon lead to 5 data points that I want to plot in a line chart together with the confidence intervals. I tried this:
egen means = mean(effect), by(horizon)
line means horizon
But how can I add the symmetric confidence bands? Such that I get something that looks like this:
Not entirely certain that this makes sense statistically, but here's how I might do this:
gen variance = se^2
collapse (mean) effect (sum) SV = variance (count) noobs = effect, by(horizon)
gen se_mean = sqrt(SV*(1/noobs)^2)
gen LB = effect - 1.96*se_mean
gen UB = effect + 1.96*se_mean
twoway (rline LB UB horizon, lpattern(dash dash)) (line effect horizon, lpattern(solid)), yline(0, lcolor(gray))
Which yields:
To get the SE of the mean effects T̅, I am using the formula
V(T̅) = (1/n^2) * Σ_{i=1}^{n} V(T_i)
(which assumes the covariances of the effects are all zero). I then take the square root to get the SE of T̅.
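The formula can also be checked numerically with a short Python sketch; the se values below are made up for illustration, not the question's data.

```python
import math

def se_of_mean(se_list):
    # V(mean) = (1/n^2) * sum of V(T_i), with V(T_i) = se_i^2,
    # assuming zero covariances; the SE is the square root.
    n = len(se_list)
    variance = sum(se ** 2 for se in se_list) / n ** 2
    return math.sqrt(variance)

print(se_of_mean([0.2, 0.01]))  # a horizon with two effect estimates
```

With a single estimate per horizon the SE of the mean reduces to that estimate's own SE, as expected.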

Setting a matrix to one on a list of incomplete rectangular grid points

I have a set of data in a text file which shows the coordinates of a rectangular grid of size 81x61. Each row shows the longitude and latitude of a grid point. Longitudes change from 50.00 to 80.00 (61 values), and latitudes change from 30.00 to 70.00 (81 values), in an order like this:
50.00 30.00
50.50 30.00
51.00 30.00
.
.
.
79.50 30.00
80.00 30.00
50.00 31.00
50.50 31.00
51.00 31.00
.
.
.
79.00 70.00
79.50 70.00
80.00 70.00
I also have another text file consisting of some random coordinates taken from the rectangular grid mentioned above.
I want to create a matrix of size 81x61 with elements 0 and 1, such that the 1s correspond to the coordinates in the second text file.
How can I write the code which does that in Matlab?
Example of what I need in a small scale:
Text File:
1 1
1 2
1 3
.
.
.
4 3
4 4
4 5
Corresponding Rectangular Grid of the above text file:
1,1 1,2 1,3 1,4 1,5
2,1 2,2 2,3 2,4 2,5
3,1 3,2 3,3 3,4 3,5
4,1 4,2 4,3 4,4 4,5
2nd Text File:
1 1
1 3
2 4
2 5
3 4
4 1
4 5
Corresponding Matrix of the above text file:
1 0 1 0 0
0 0 0 1 1
0 0 0 1 0
1 0 0 0 1
I myself found a way.
Assumptions: the minimum value among the longitude values is 50.00, and the minimum value among the latitude values is 30.00.
%// First column of the files is Longitude and the second column is Latitude
fr = fopen('second_file.txt','r');
lon = textscan(fr,'%f %*[^\n]');
lon = lon{:};
fclose(fr);
fr = fopen('second_file.txt','r');
lat = textscan(fr,'%*f %f %*[^\n]');
lat = lat{:};
fclose(fr);
%// We know that the overall size of the target matrix is 81x61
overall = zeros(81,61);
%// Loop over every line read from the second file (numel gives the count)
for k = 1:numel(lat)
i = int16(( lat(k) - 30.00 ) / 0.5 + 1);
j = int16(( lon(k) - 50.00 ) / 0.5 + 1);
overall(i,j) = 1;
end
I want to create a matrix of size 81x61 with elements of 0 and 1 in a
way that 1s correspond to coordinates from second text file.
That's the relevant information in the question. The answer is the Matlab function sub2ind (documentation). It converts a list of row/column coordinates to a list of linear array indices, which you can then conveniently set to one.
Suppose you have read the content of the second file in a Nx2 matrix called second_file and the size of the result matrix you have given in variable matrix_size (81x61). You then do:
x = second_file(:, 1);
y = second_file(:, 2);
result = zeros(matrix_size);
index = sub2ind(matrix_size, x, y);
result(index) = 1;
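For readers without MATLAB, the effect of sub2ind plus assignment can be sketched in plain Python using the small example from the question (Python lists are 0-based, so we subtract 1 from the 1-based coordinates):

```python
def set_ones(matrix_size, coords):
    # Build an all-zero grid and set the cells listed in coords
    # (1-based (row, col) pairs, as in the small example) to 1.
    n_rows, n_cols = matrix_size
    grid = [[0] * n_cols for _ in range(n_rows)]
    for r, c in coords:
        grid[r - 1][c - 1] = 1
    return grid

# The "2nd Text File" coordinates from the question.
second_file = [(1, 1), (1, 3), (2, 4), (2, 5), (3, 4), (4, 1), (4, 5)]
for row in set_ones((4, 5), second_file):
    print(row)
```

The printed grid reproduces the 4x5 0/1 matrix shown in the question.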
I am assuming that the minimum and maximum values in the respective columns of your first file give the ranges of longitude and latitude. Also, in both files the first column holds the longitude values.
%// calculating the min and max of latitude and longitude
id1 = fopen('first_file.txt');
A = fscanf(id1,'%f %f',[2 Inf]);
long_min = min(A(1,:));
long_max = max(A(1,:));
lat_min = min(A(2,:));
lat_max = max(A(2,:));
%// calculating output matrix size
no_row = (lat_max-lat_min)*2+1;
no_col = (long_max-long_min)*2+1;
output = zeros(no_row,no_col);
id2 = fopen('second_file.txt');
A = fscanf(id2,'%f %f',[2 Inf]);
% // converting the values into corresponding indices
A(1,:) = A(1,:) - long_min;
A(2,:) = A(2,:) - lat_min;
A = A*2 +1;
linear_ind = sub2ind([no_row no_col],A(2,:),A(1,:));
output(linear_ind) = 1;
2nd approach
I am assuming that in your second text file first column entries are latitude and second column entries are longitude. You will have to hard code the following variables:
long_min : the minimum value among longitude values
lat_min : the minimum value among latitude values
long_max : maximum value among longitude values
lat_max : maximum value among latitude values
And here is the code (only considering the second text file)
no_row = (lat_max-lat_min)*2+1;
no_col = (long_max-long_min)*2+1;
output = zeros(no_row,no_col);
id2 = fopen('second_file.txt');
A = fscanf(id2,'%f %f',[2 Inf]);
% // converting the values into corresponding indices
A(1,:) = A(1,:) - long_min;
A(2,:) = A(2,:) - lat_min;
A = A*2 +1;
linear_ind = sub2ind([no_row no_col],A(2,:),A(1,:));
output(linear_ind) = 1;
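The index mapping used above (shift by the minimum, divide by the 0.5 grid spacing, add 1 for 1-based indexing) can be verified with a tiny Python sketch:

```python
def coord_to_index(value, minimum, spacing=0.5):
    # Map a grid coordinate to a 1-based index: min -> 1, min + 0.5 -> 2, ...
    return round((value - minimum) / spacing) + 1

print(coord_to_index(50.0, 50.0))   # first longitude column
print(coord_to_index(80.0, 50.0))   # 61st (last) longitude column
print(coord_to_index(70.0, 30.0))   # 81st (last) latitude row
```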

Matlab: count till sum equals 360 > insert event 1, next 360 > insert event 2, etc.

I have been trying to solve this problem for a while now and I would appreciate a push in the right direction.
I have a matrix called Turn. This matrix contains 1 column of data, somewhere between 10000 and 15000 rows (is variable). What I like to do is as follows:
Start at row 1 and add the values of row 2, row 3, etc. till sum == 360. When sum == 360, insert 'event 1' in column 2 at that specific row.
Start counting at the next row (after 'event 1') till sum == 360 again. When sum == 360, insert 'event 2' in column 2 at that specific row. Etc.
So I basically want to group my data in partitions of sum==360
these will be called events.
The row number at which sum==360 is important to me as well (every row is a time point so it will tells me the duration of an event). I want to put those row numbers in a new matrix in which on row 1: rownr event 1 happened, row 2: rownr event 2 happened etc.
You can find the row indices where events occur using the following code. Basically you're going to use the modulo operator to find where the sum of the first column of Turn is a multiple of 360.
mod360 = mod(cumsum(Turn(:,1)),360);
eventInds = find(mod360 == 0);
You could then loop over eventInds to place whatever values you'd like in the appropriate rows in the second column of Turn.
I don't think you'll be able to place the string 'event 1' in the column though, as a string acts like a vector of characters and will result in a dimension mismatch. You could just store the numerical value 1 for the first event, 2 for the second event, and so on.
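The cumulative-sum-modulo idea can be mirrored in Python, where itertools.accumulate plays the role of cumsum (the data here are hypothetical):

```python
from itertools import accumulate

turn = [90, 90, 90, 90, 120, 120, 120, 60, 300]

# 1-based rows where the running total is an exact multiple of 360
# (the analogue of find(mod(cumsum(Turn), 360) == 0)).
event_rows = [i + 1 for i, s in enumerate(accumulate(turn)) if s % 360 == 0]
print(event_rows)
```

Here the running total hits 360 at row 4, 720 at row 7, and 1080 at row 9.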
Ryan's answer looks like the way to go. But if your condition is such that you need to find row numbers where the cumulative sum is not exactly 360, then a little more work is required. For that case, try this vectorized (no loops) code to get the row IDs where the 360 grouping occurs -
threshold = 360;
cumsum_val = cumsum(Turn);
ind1 = find(cumsum_val>=threshold,1)
num_events = floor(cumsum_val(end)/threshold);
[x1,y1] = find(bsxfun(@gt,cumsum_val,threshold.*(1:num_events)));
[~,b,~] = unique(y1,'first');
row_nums = x1(b)
After that you can get the event data, like this -
event1 = Turn(1:row_nums(1));
event2 = Turn(row_nums(1)+1:row_nums(2));
event3 = Turn(row_nums(2)+1:row_nums(3));
...
event21 = Turn(row_nums(20)+1:row_nums(21));
...
eventN = Turn(row_nums(N-1)+1:row_nums(N));
Edit 1
Sample case:
We create a small data set of 20 random integers instead of the 15000 rows of the original problem. Also, we use a threshold of 30 instead of 360 to account for the small data size.
Code
Turn = randi(10,[20 1]);
threshold = 30;
cumsum_val = cumsum(Turn);
ind1 = find(cumsum_val>=threshold,1)
num_events = floor(cumsum_val(end)/threshold);
[x1,y1] = find(bsxfun(@gt,cumsum_val,threshold.*(1:num_events)));
[~,b,~] = unique(y1,'first');
row_nums = x1(b);
Run
Turn =
7
6
3
4
5
3
9
2
3
2
3
5
4
10
5
2
10
10
5
2
threshold =
30
row_nums =
7
14
18
The run results show row_nums as 7, 14, 18, which means the first grouping ends at the 7th index in Turn (where the running total first exceeds the threshold), the second at the 14th index, and so on, consistent with the event slicing shown earlier. Of course, you can append 1 at the beginning of row_nums if you prefer to mark where the first grouping starts.
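The vectorized search above - for each multiple k*threshold, find the first row whose running total exceeds it - can be mirrored in Python on the same 20-value sample:

```python
from itertools import accumulate

turn = [7, 6, 3, 4, 5, 3, 9, 2, 3, 2, 3, 5, 4, 10, 5, 2, 10, 10, 5, 2]
threshold = 30

csum = list(accumulate(turn))
num_events = csum[-1] // threshold  # how many full groups fit
# For each multiple of the threshold, the first 1-based row exceeding it.
row_nums = [next(i + 1 for i, s in enumerate(csum) if s > k * threshold)
            for k in range(1, num_events + 1)]
print(row_nums)  # [7, 14, 18], matching the MATLAB run
```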
Given a column vector x, say,
x = randi(100,10,1)
the following would give you the index of the first row where the cumulative sum off all the items above that row adds up to 360:
i = max( find( cumsum(x) <= 360) )
Then, you would have to use that index to find the next set of cumulative sums that add up to 360, something like
offset = max( find( cumsum(x(i+1:end)) <= 360 ) )
i_new = i + offset
You might need to add +1/-1 to the offset and the index.
>> x = randi(100,10,1)'
x =
90 47 47 44 8 79 45 9 91 6
>> cumsum(x)
ans =
90 137 184 228 236 315 360 369 460 466
>> i = max(find(cumsum(x)<=360))
i =
7
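The incremental approach above can be sketched in Python as a helper that returns the last row whose running total stays within the limit, using the sample vector from the run above:

```python
from itertools import accumulate

def last_within(values, limit):
    # 1-based index of the last row whose running total is <= limit,
    # i.e. max(find(cumsum(x) <= limit)) in MATLAB; 0 if none qualify.
    best = 0
    for i, s in enumerate(accumulate(values), start=1):
        if s <= limit:
            best = i
    return best

x = [90, 47, 47, 44, 8, 79, 45, 9, 91, 6]  # sample run from the answer
print(last_within(x, 360))  # 7, matching i = 7 above
```

To find the next cut, the same helper would be applied to the remaining tail x[7:], offsetting the result as the answer describes.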