Dissimilarity of a community matrix where many samples equal zero (vegan package)

I have a community dataset for macrofauna associated with corals that I am struggling to analyze in the vegan package.
Coral colonies were imaged in 2015 (two coral species at five sites) and we counted the macrofauna species found on each colony. In 2016 and 2017, we revisited the same coral colonies to count the associated fauna. So far, this is a repeated-measures design (Year/colonyID), but I have two problems:
1- Some of the revisited colonies in 2016 and 2017 had no fauna (143 out of 686 colonies), meaning we have all-zero samples (n = 143). This causes a problem in the adonis function when testing dissimilarity.
adonis(F_Mat ~ Species+Site+Year, data = F_Meta, permutations = 9999)
you have empty rows: their dissimilarities may be meaningless in method "bray"
missing values in results
I understand this message, but I must account for the zero samples, as they represent the dynamics of the fauna community over time. I tried the "bray" and "jaccard" methods, but both give the same message as above.
I used a log1p transformation (log(1 + F_Mat)) to deal with the zeros. It worked for the adonis function, but not for calculating alpha diversity (the Chao1 index). To cope with that, I used the dist.zeroes function in the BiodiversityR package for adonis, and the raw abundance matrix for alpha diversity. I am not sure whether that is the right approach, though.
2- Some colonies from 2015 could not be found in later years (2016 and 2017); instead, we imaged new colonies in 2016 and 2017 that had not been visited previously. So it is not really a repeated-measures design, and I think we should account for colony ID as a random effect instead, but to my knowledge this is not doable in vegan.
Any advice on how to analyze this dataset and troubleshoot my experimental problems? Your help is really appreciated.

From https://rdrr.io/cran/vegan/man/vegdist.html:
"In principle, you cannot study species composition without species and you should remove empty sites from community data."
Since adonis uses vegdist internally, I believe this statement carries through.
A potential solution might be adding a dummy species that has a constant value for each site.
If you end up needing to remove rows or columns that sum to 0, you can reference this question: How to remove columns and rows that sum to 0 while preserving non-numeric columns
df <- df[rowSums(df[-(1:7)]) != 0, ]
(in that example the first seven columns are non-numeric; adjust the column offset to match your data)

Related

How to get a new solution with high probability from previously found, incomplete solutions with different probabilities?

I am working on an AI algorithm. When the program first runs, a random solution is generated, and in the first iteration 10 solution vectors are created from it. By analyzing these solutions we can assign each of them a probability (highest, second highest, third highest, and so on) of being close to the optimal solution. For the second input of the program, I want a vector (possible solution) obtained from those 10 previously found vectors. But I need that vector to take all the previous solutions into account, with a different impact depending on their probability.
e.g. A = [4.7, 5.6, 3.5, 9], B = [-7.9, 8, -2.8, 4.6], C = [7, 9.7, 4.6, 3.9], ...
I used mean in my program:
NextPossibleSolution = mean([A; B; C]);
But do you think mean is the right move? I don't think so, because every solution contributes equally to the next possible solution (the next input) regardless of its likelihood. If there is a method or formula for this, please let me know. A billion thanks.
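A hedged sketch of one common alternative, assuming you can assign each candidate a numeric probability (the weights below are placeholders, not values from the question): take a probability-weighted average instead of a plain mean, so better-rated solutions pull the next input towards themselves.
% Candidate solutions, one per row (values from the question)
solutions = [ 4.7  5.6  3.5  9.0;
             -7.9  8.0 -2.8  4.6;
              7.0  9.7  4.6  3.9];
% Assumed probabilities/scores for each candidate - replace with your own
probs = [0.5; 0.3; 0.2];
% Normalize the weights to sum to 1, then form the weighted combination:
% each solution contributes in proportion to its probability.
w = probs / sum(probs);
NextPossibleSolution = w' * solutions;   % 1x4 weighted average
With equal weights this reduces to the plain mean you already use; with unequal weights the most likely solutions dominate.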

More efficient way to search a large matrix in MATLAB?

I have code that does what I want, but it is too slow: I have a very large .mat file containing a matrix (33 GB) that I need to search for particular values and extract them.
The file that I'm searching has the following structure:
reporter sector partner year ave
USA all CAN 2007 0.060026126
USA all CAN 2011 0.0637898418
...
This goes on for millions of rows. I want to extract the last (5th) column value for particular reporter and partner values (sector and year are fixed). In actuality there are more fixed values that I have taken out for the sake of simplicity but this might slow down my code even more. The country_codes and partner values need to vary and are looped for that reason.
The crucial part of my code is the following:
for i = 1:length(country_codes)
    for g = 1:length(partner)
        matrix(i,g) = big_file(...
            ismember(GTAP_data(:,1), country_codes(i)) & ... reporter
            ismember(GTAP_data(:,2), 'all') & ...            sector
            ismember(GTAP_data(:,3), partner(g)) & ...       partner
            ismember([GTAP_data{:,4}]', 2011), ...           year
            5); % ave column
    end
end
In other words, the code goes through millions of rows and finds just the right value by applying ismember with logical & on everything.
Is there a faster way to do this than using ismember? Can someone assist?
So what I see is that you build a big table out of the data in different files.
It seems your values are text-based, which takes up more memory: "USA" already takes three bytes. If you have fewer than 255 countries to consider, you could store each one as a single byte in uint8 format.
If you can store all columns as a value between 0 and 255 you can make a uint8 matrix that can be indexed very fast.
As an example:
%demo
GTAP_regions={'USA','NL','USA','USA','NL','GB','NL','USA','Korea Republic of','GB','NL','USA','Korea Republic of'};
S=whos('GTAP_regions');
S.bytes
GTAP_regions requires 1580 bytes. Now we convert it.
GTAP_regions_list = GTAP_regions(1);   % lookup table of unique region names
GTAP_regions_uint = uint8(1);          % encoded column, one code per entry
for ct = 2:length(GTAP_regions)
    I = ismember(GTAP_regions_list, GTAP_regions(ct));
    if ~any(I)
        % new region: add it to the lookup table and store its new code
        GTAP_regions_list(end+1) = GTAP_regions(ct);
        GTAP_regions_uint(end+1) = uint8(length(GTAP_regions_list));
    else
        % known region: store the code of the existing entry
        GTAP_regions_uint(end+1) = uint8(find(I));
    end
end
S=whos('GTAP_regions_list');
S.bytes
S=whos('GTAP_regions_uint');
S.bytes
GTAP_regions_uint is the column we use for indexing; with one code per entry it is now only 13 bytes and will be very fast to analyse.
GTAP_regions_list is what we use to look up which index value belongs to which country; it is only 496 bytes.
You can also do this for sector, partner and year, depending on the range of years: if there are no more than 255 different years it will work, and otherwise you could store them as uint16 and have 65535 possible values.
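A hedged sketch of how the encoded columns could then replace the ismember calls in the original loop; the variable names (reporter_code, sector_code, partner_code, year_val, ave, sector_list, partner_list) are assumptions about how the converted columns would be stored, not part of the original answer.
% Assumed: each text column has been converted to an integer code vector as above,
% and ave holds the numeric 5th column.
sector_all = uint8(find(ismember(sector_list, 'all')));   % code of the fixed sector
% Filter the fixed sector and year once, outside the loops
keep  = (sector_code == sector_all) & (year_val == 2011);
rep_k = reporter_code(keep);
par_k = partner_code(keep);
ave_k = ave(keep);
matrix = zeros(length(country_codes), length(partner));
for i = 1:length(country_codes)
    ci = uint8(find(ismember(GTAP_regions_list, country_codes(i))));
    rows_i = (rep_k == ci);                           % fast integer comparison
    for g = 1:length(partner)
        pg = uint8(find(ismember(partner_list, partner(g))));
        matrix(i,g) = ave_k(rows_i & (par_k == pg));  % assumes exactly one matching row
    end
end
The string lookups now happen only once per country or partner instead of once per row, and the row filtering itself is done on small integer vectors.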

Matlab non-linear binary Minimisation

I have to set up a phoneme table with a specific probability distribution for encoding things.
Now there are 22 base elements (each with an assigned probability, summing to 100%), which shall be mapped onto a 12-element table that has desired element probabilities (also summing to 100%).
So part of the minimisation is to merge several base elements to get 12 table elements. Each base element must occur exactly once.
In addition, the table has 3 rows. So the same 12 element composition of the 22 base elements must minimise the error for 3 target vectors. Let's say the given target vectors are b1,b2,b3 (dimension 12x1), the given base vector is x (dimension 22x1) and they are connected by the unknown matrix A (12x22) by:
b1+err1=Ax
b2+err2=Ax
b3+err3=Ax
To sum it up: A is to be found so that dot_prod(err1+err2+err3, err1+err2+err3)=min (least squares). And - according to the above explanation - A must contain only 1's and 0's, while having exactly one 1 per column.
Unfortunately I have no idea how to approach this problem. Can it be expressed in a way different from the matrix-vector form?
Which tools in matlab could do it?
I think I found the answer while parsing some sections of the Matlab documentation.
First of all, the problem can be rewritten as:
errSum = err1 + err2 + err3 = 3Ax - b1 - b2 - b3
=> minimize dot_prod(errSum, errSum) over A
Applying the dot product (least squares) yields a quadratic scalar expression in the entries of A.
Syntax-wise, the fmincon tool in the Optimization Toolbox could do the job. It has constraint parameters that allow forcing each Aij to be binary and each column to sum to 1.
But apparently fmincon is not well suited to binary problems algorithm-wise, and the ga tool should be used instead, which can be called in a similar way.
Since the equation would be very long in my case and would need to be written out, I haven't tried it yet. Please correct me if I'm wrong, or add further solution methods if available.
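One way to avoid writing the quadratic expression out by hand is to encode A implicitly: since every column of A has exactly one 1, A is fully described by an assignment vector asg of length 22, where asg(k) = j means base element k is merged into table element j. A hedged sketch using ga with integer constraints; the helper assignmentError and all variable names are illustrative, and x, b1, b2, b3 are assumed given as in the question (put assignmentError at the end of the script or in its own file).
% x: 22x1 base probabilities, b1,b2,b3: 12x1 target vectors
nBase  = 22;
nTable = 12;
bSum   = b1 + b2 + b3;                      % target for errSum = 3Ax - b1 - b2 - b3
objfun = @(asg) assignmentError(asg, x, bSum, nTable);
lb = ones(1, nBase);                        % each entry of asg is an integer in 1..12
ub = nTable * ones(1, nBase);
intcon = 1:nBase;
asgBest = ga(objfun, nBase, [], [], [], [], lb, ub, [], intcon);
% rebuild the binary matrix A from the assignment vector
A = zeros(nTable, nBase);
A(sub2ind(size(A), asgBest, 1:nBase)) = 1;
function err = assignmentError(asg, x, bSum, nTable)
    % Ax with exactly one 1 per column = sum of the x entries assigned to each table row
    Ax  = accumarray(asg(:), x(:), [nTable 1]);
    e   = 3*Ax - bSum;
    err = e' * e;                           % least-squares objective on errSum
end
Note that this only enforces "exactly one 1 per column"; it does not force every table element to receive at least one base element, so empty rows of A are possible unless you add a penalty for them.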

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs-trading opportunities among stocks; more precisely, I am using the Johansen trace test on only two stocks at a time.
I have several securities, but each test covers only two of them.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried Eviews for comparison; the two programs yield the same results.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
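For concreteness, a minimal sketch of the normalization and spread described above (the numbers are the ones from this example, and p_lag stands for the 2x1 vector of previous prices):
beta = [12.7290; -35.9655];    % raw cointegrating vector from the Johansen test
c    = 121.3422;               % constant in the cointegration relationship
betaN = beta / beta(1);        % normalize so the first coefficient is 1
cN    = c / beta(1);
spread = betaN' * p_lag - cN;  % beta' * p(t-1) - c with the normalized coefficients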
~~~~~~~~~~~~~~~~ The problem ~~~~~~~~~~~~~~~~~~~~~~~
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = 2.786
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegration relationships and r<=1 cointegration relationships. I find this very mysterious, as I am only considering two variables at a time, and there can, at maximum, only be r=1 cointegration relationship. Can anyone tell me what this means?

Ignoring vectors containing NaN entries in Matlab calculations

This code prices bonds using the fitSvensson function. How do I get Matlab to ignore NaN values in the CleanPrice data when a date is selected for which some bonds have a NaN entry (a missing price)? How can I get it to ignore that bond altogether when deriving the zero curve? Many solutions to NaNs resort to interpolation or setting them to zero, but this would lead to an erroneous curve.
Maturity=gcm3.data.CouponandMaturity(1:36,2);
[r,c]=find(gcm3.data.CleanPrice==datenum('11-May-2012'));
row=r
SettleDate=gcm3.data.CouponandMaturity(row,3);
Settle = repmat(SettleDate,[length(Maturity) 1]);
CleanPrices =transpose(gcm3.data.CleanPrice(row,2:end));
CouponRate = gcm3.data.CouponandMaturity(1:36,1);
Instruments = [Settle Maturity CleanPrices CouponRate];
PlottingPoints = gcm3.data.CouponandMaturity(1,2):gcm3.data.CouponandMaturity(36,2);
Yield = bndyield(CleanPrices,CouponRate,Settle,Maturity);
SvenssonModel = IRFunctionCurve.fitSvensson('Zero',SettleDate,Instruments)
ParYield=SvenssonModel.getParYields(Maturity);
The data looks like this: each column is a bond, column 1 holds the dates, and the elements are the clean prices. As you can see, the first part of the data contains lots of NaNs for bonds that do not yet have prices. After a point all bonds have prices, but unfortunately there are instances where one or two days' prices are missing.
Ideally, if a NaN is present I would like it to ignore that bond on that date if possible as the more curves generated (irrespective of number of bonds used) the better. If this is not possible then ignoring that date is an option but will result in many curves not generating.
This is a general solution to your problem. I don't have that toolbox on my work computer, so I can't test whether it works with the IRFunctionCurve.fitSvensson command.
[row,~] = find(gcm3.data.CleanPrice(:,1)==datenum('11-May-2012'));
col_set = find(~isnan(gcm3.data.CleanPrice(row,2:end)));        % bonds with a price on that date
CleanPrices = transpose(gcm3.data.CleanPrice(row, col_set+1));  % +1 skips the date column
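A hedged continuation of that idea: the other per-bond inputs have to be subset with the same col_set so the Instruments matrix stays aligned (this assumes the bond order in CouponandMaturity matches the price columns 2:end, as the original code already implies):
Maturity_kept   = Maturity(col_set);                            % keep only bonds with prices
CouponRate_kept = CouponRate(col_set);
Settle_kept     = repmat(SettleDate, [length(Maturity_kept) 1]);
Instruments     = [Settle_kept Maturity_kept CleanPrices CouponRate_kept];
SvenssonModel   = IRFunctionCurve.fitSvensson('Zero', SettleDate, Instruments);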