r: boxplot names="" with single vector - boxplot

I cannot get names to work in box plot. Is it not working because I only have a single vector? Heres what I have tried.
tmp = c(1,1,1,1,2,2,2,2,5,5,5,5,5,6,5,4,7)
boxplot(tmp)
boxplot(tmp, names=c("today"))
boxplot(tmp, names="today")
boxplot(tmp, labels="today")
boxplot(tmp, labels=c("today")

boxplot(tmp, names=c("today"),show.names=TRUE)

I was having the same issue and found an alternative at the below link
https://stackoverflow.com/a/25590238/4092304
So could be coded as ...
boxplot(tmp, xlab = "today")

Related

How to overcome indefinite matrix error (NbClust)?

I'm getting the following error when calling NbClust():
Error in NbClust(data = ds[, sapply(ds, is.numeric)], diss = NULL, distance = "euclidean", : The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
I've called ds <- ds[complete.cases(ds),] just before running NbClust so there's no missing values.
Any idea what's behind this error?
Thanks
I had same issue in my research.
So, I had mailed to Nadia Ghazzali, who is the package maintainer, and got an answer.
I'll attached my mail and her reply.
my e-mail:
Dear Nadia Ghazzali. Hello Nadia. I have some questions about
NbClust function in R library. I have tried googling but could not
find satisfying answers. First, I’m so grateful for you to making
this awsome R library. It is very helpful for my reasearch. I tested
NbClust function in NbClust library with my own data like below.
> clust <- NbClust(data, distance = “euclidean”,
min.nc = 2, max.nc = 10, method = ‘kmeans’, index =”all”)
But soon, an error has occurred. Error: division by zero! Error in
Indices.WBT(x = jeu, cl = cl1, P = TT, s = ss, vv = vv) : object
'scott' not found So, I tried NbClust function line by line and
found that some indices, like CCC, Scott, marriot, tracecovw,
tracew, friedman, and rubin, were not calculated because of object
vv = 0. I’m not very familiar with argebra so I don’t know meaning
of eigen value. But it seems to me that object ss(which is squart of
eigenValues) should not be 0 after prodected.
So, here is my questions.
I assume that my data is so sparse(a lot of zero values) that sqrt(eigenValues) becomes too small, is that right? I’m sorry I
can’t attach my data but I can attach some part of eigenValues and
squarted eigenValues.
> head(eigenValues)
[1] 0.039769880 0.017179826 0.007011972 0.005698736 0.005164871 0.004567238
> head(sqrt(eigenValues))
[1] 0.19942387 0.13107184 0.08373752 0.07548997 0.07186704 0.06758134
And if my assume is right, what can I do for this problems? Only one
way to drop out 7 indices?
Thank you for reading and I’ll waiting your reply. Best regards!
and her reply:
Dear Hansol,
Thank you for your interest. Yes, your understanding is good.
Unfortunately, the seven indices could not be applied.
Best regards,
Nadia Ghazzali
#seni The cause of this error is data related. If you look at the source code of this function,
NbClust <- function(data, diss="NULL", distance = "euclidean", min.nc=2, max.nc=15, method = "ward", index = "all", alphaBeale = 0.1)
{
x<-0
min_nc <- min.nc
max_nc <- max.nc
jeu1 <- as.matrix(data)
numberObsBefore <- dim(jeu1)[1]
jeu <- na.omit(jeu1) # returns the object with incomplete cases removed
nn <- numberObsAfter <- dim(jeu)[1]
pp <- dim(jeu)[2]
TT <- t(jeu)%*%jeu
sizeEigenTT <- length(eigen(TT)$value)
eigenValues <- eigen(TT/(nn-1))$value
for (i in 1:sizeEigenTT)
{
if (eigenValues[i] < 0) {
print(paste("There are only", numberObsAfter,"nonmissing observations out of a possible", numberObsBefore ,"observations."))
stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
}
}
And I think the root cause of this error is the negative eigenvalues that seep in when the number of clusters is very high, i.e. the max.nc is high. So to solve the problem, you must look at your data. See if it got more columns then rows. Remove missing values, check for issues like collinearity & multicollinearity, variance, covariance etc.
For the other error, invalid clustering method, look at the source code of the method here. Look at line number 168, 169 in the given link. You are getting this error message because the clustering method is empty. if (is.na(method))
stop("invalid clustering method")

Issues fitting an exponential function

I'm having some serious issues fitting an exponential function (Beer-Lambert law) to my data. The optimization toolset function that I'm using produces terrible fits:
function [ Coefficients ] = fitting_new( Modified_Spectrum_Data,trajectory )
x_axis = trajectory;
fun = #(x,x_axis) (x(1)*exp((-x(2))*x_axis));
start = [Modified_Spectrum_Data(1) 0.05];
nlm = nlinfit(x_axis,Modified_Spectrum_Data,fun,start,opts);
Coefficients = nlm;
end
Data:
Modified_Spectrum_Data = [1.11111111111111, 1.08784976353957, 1.06352170731165, 1.04099672033640, 1.02649723285838, 1.00423806910703, 0.994116452961827, 0.975928861361604, 0.963081773802984, 0.953191520906905, 0.940636278551651, 0.930360007604054, 0.922259178548511, 0.916659345499171, 0.909149956799775, 0.901241601559703, 0.895375741449218, 0.893308346234150, 0.887985459843162, 0.884657500398024, 0.883852990694089, 0.877158499678129, 0.874817832833850, 0.875428444059047, 0.873170360623947, 0.871461252768665, 0.867913776631497, 0.866459074988087, 0.863819528471106, 0.863228815347816 ,0.864369045426273 ,0.860602502500599, 0.862653463581049, 0.861169231463016, 0.858658616425390, 0.864588421841755, 0.858668693409622, 0.857993365648639]
trajectory = [0.0043, 0.9996, 2.0007, 2.9994, 3.9996, 4.9994, 5.9981, 6.9978, 7.9997, 8.9992, 10.0007, 10.9993, 11.9994, 12.9992, 14.0001, 14.9968, 15.9972, 16.9996, 17.9996, 18.999, 19.9992, 20.9996, 21.9994, 23.0003, 23.9992, 24.999, 25.9987, 26.9986, 27.999, 28.9991, 29.999, 30.9987, 31.9976, 32.9979, 33.9983, 34.9988, 35.999, 36.9991]
I've tried using multiple different fitting functions and messing around with the options, but they don't seem to make too big of a difference. Additionally, I've tried changing the initial guess, but again that doesn't really make a difference.
Excel seems to be able to fit the data perfectly fine, but I have 900 rows of data I want to fit so doing it in Excel is not possible.
Any help would be greatly appreciated, thank you.
You'll want to use the cftool. Your data looks to follow a power law. Then choose 'Modified Spectrum Data' as your x axis and 'Trajectory' as your y. Select 'Power' from the drop down menu towards the top of the GUI.
Modified_Spectrum_Data = [1.11111111111111, 1.08784976353957, 1.06352170731165, 1.04099672033640, 1.02649723285838, 1.00423806910703, 0.994116452961827, 0.975928861361604, 0.963081773802984, 0.953191520906905, 0.940636278551651, 0.930360007604054, 0.922259178548511, 0.916659345499171, 0.909149956799775, 0.901241601559703, 0.895375741449218, 0.893308346234150, 0.887985459843162, 0.884657500398024, 0.883852990694089, 0.877158499678129, 0.874817832833850, 0.875428444059047, 0.873170360623947, 0.871461252768665, 0.867913776631497, 0.866459074988087, 0.863819528471106, 0.863228815347816 ,0.864369045426273 ,0.860602502500599, 0.862653463581049, 0.861169231463016, 0.858658616425390, 0.864588421841755, 0.858668693409622, 0.857993365648639]
trajectory = [0.0043, 0.9996, 2.0007, 2.9994, 3.9996, 4.9994, 5.9981, 6.9978, 7.9997, 8.9992, 10.0007, 10.9993, 11.9994, 12.9992, 14.0001, 14.9968, 15.9972, 16.9996, 17.9996, 18.999, 19.9992, 20.9996, 21.9994, 23.0003, 23.9992, 24.999, 25.9987, 26.9986, 27.999, 28.9991, 29.999, 30.9987, 31.9976, 32.9979, 33.9983, 34.9988, 35.999, 36.9991]
cftool
Screenshot:
For more information on the curve fitting (cftool), see: https://www.mathworks.com/help/curvefit/curvefitting-app.html

triple sum in matlab

I want to compute the following sum :
I have tried using the following code :
I = imread('C:\Users\Billal\Desktop\image.png');
[x,y,z]=size(I);
x=(1:x) ;
y=(1:y) ;
z=(1:z) ;
Fx=ones(size(x));
Fy=ones(size(y));
Fz=ones(size(z));
X=x*Fy';
Y=Fx*y';
Z=z*Fz';
f=I(X,Y,Z);
sum1 = sum(f(:));
[x1,y1,z1]=size(I);
total = sum1/(x1*y1*z1);
But the result is 0 . I could not figure out where is the problem ? I am following this tutorial .
https://www.mathworks.com/matlabcentral/newsreader/view_thread/126366
Please help me to solve this question .
You can do this in a single step:
result=1/prod(size(I))* sum(I(:));
In the end, the equation just adds up values of the whole image.
The question you link to needs to sum over values of x and y. You don't, you just need to sum over indexes, thus there is no need of all those Fx,Fy things

How can I zip two RDDs in PySpark?

I have been trying to merge the two Rdds below averagePoints1 and kpoints2 . It keeps throwing this error
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
and I tried many things but I can't the two Rdds are identical, have the same number of partitions . my next to step is to apply euclidean distance function on the two lists to measure the difference so if any one knows how to solve this error or has a different approach I can follow I would really appreciate it.
Thanks in advance
averagePoints1 = averagePoints.map(lambda x: x[1])
averagePoints1.collect()
Out[15]:
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints,4)
In [17]:
kpoints2.collect()
Out[17]:
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
a= [[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
b= [[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
c = rdda.zip(rddb)
print(c.collect())
check this answer
Combine two RDDs in pyspark
newSample=newCenters.collect() #new centers as a list
samples=zip(newSample,sample) #sample=> old centers
samples1=sc.parallelize(samples)
totalDistance=samples1.map(lambda (x,y):distanceSquared(x[1],y))
for future searchers this is the solution I followed at the end

How to extract year from a dates cell array in MATLAB?

i have a cell array as below, which are dates. I am wondering how can i extract the year at the last 4 digits? Could anyone teach me how to locate the year in the string? Thank you!
'31.12.2001'
'31.12.2000'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.1997'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2001'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2005'
Example cell array:
A = {'31.12.2001'; '31.12.2002'; '31.12.2003'};
Apply some regular expressions:
B = regexp(A, '\d\d\d\d', 'match')
B = [B{:}];
EDIT: I never realized that matlab will "nest" an extra layer of cells until I tested this. I don't like this solution as much now that I know the second line is necessary. Here is an alternative approach that gets you the years in numeric form:
C = datevec(A, 'dd.mm.yyyy');
C = C(:, 1);
SECOND EDIT: Suprisingly, if your cell array has less than 10000 elements, the regexp approach is faster on my machine. But the output of it is another cell array (which takes up much more memory than a numeric matrix). You can use B = cell2mat(B) to get a character array instead, but this brings the two approaches to approximately equal efficiency.
Just to add a fun answer, designed to take the OP to the stranger regions of Matlab:
C = char(C);
y = (D(:,7:end)-'0') * 10.^(3:-1:0).'
which is an order of magnitude faster than anything posted in the other answers :)
Or, to stay a bit closer to home,
y = cellfun(#(x)str2double(x(7:end)),C);
or, yet another regexp variation:
y = str2num(char(regexprep(C, '\d+\.\d+\.','')));
Assuming your matrix with dates is M or a cell array C:
In case your data is in a cell array start with
M = cell2mat(C)
Then get the relevant part
Y=M(:,end-4:end)
If required you can even make the year a number
Year = str2num(Y)
Using regexp this will works also with dates with slightly different formats, like 1.1.2000, which can mess with you offsets
res = regexp(dates, '(?<=\d+\.\d+\.)\d+', 'match')