Incorrect count in venndiagram - radix

I am comparing the microbiome generated from MiSeq and PacBio at different taxonomic levels using VennDiagram.
I already extracted taxa at each taxonomic level from each technology as columns (MiSeq, PacBio) and created excel sheets for each level to draw venndiagrams of intersections.
For example, at the phylum level, I used this code:
ven <- venn.diagram(
x = list(miseq_vector, pacbio_vector), col=c("#0000FF", '#00FF00'),
category.names = c("MiSeq" , "PacBio " ), fill = c(alpha("#0000FF",0.5), alpha('#00FF00',0.5)), main ="Phylum level", main.fontface = "bold", main.fonfamily = "serif", main.cex = 3, main.just = c(0.5, 1), fontfamily = "serif", fontface = "bold", lwd = c(1,1), cat.cex = c(2,2) , cex= c(1,1,1), cat.col = c("black", "black"),compression =
"lzw",
filename = '#pollen_phylum.tiff',
output=TRUE, cat.pos = c(0, 30))
ven
The figure is generated as below,
[1]: https://i.stack.imgur.com/YWmZH.png
However, number of phyla for miseq is 20 and for pacbio is 8. The count is correct for miseq and the intersection, but I have an additional value for pacbio. I tried using zeros and the word "NotAvailable" for empty cells in pacbio columns since I am comparing 2 vectors of the same length, and in both cases, these values were counted as a field (as shown in the figure 2 taxa for pacbio only, however, it is only one). I do not know how to compare 2 vectors of different lengths without using zeros or NA since each is counted as a value. I need to match both columns and draw venndiagram.
Thanks!

Related

gtsummary::tbl_regression() - Obtain Random Effects from GLMM Zero-Inflated Model

When trying to create a table with the conditional random effects in r using the gtsummary function tbl_regression from a glmmTMB mixed effects negative-binomial zero-inflated model, I get duplicate random effects rows.
Example (using Mollie Brooks' Zero-Inflated GLMMs on Salamanders Dataset):
data(Salamanders)
head(Salamanders)
library(glmmTMB)
zinbm2 = glmmTMB(count~spp + mined +(1|site), zi=~spp + mined + (1|site), Salamanders, family=nbinom2)
zinbm2_table_cond <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "cond"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_cond
Output:
Random Effects Output (cond)
When extracting the random effects from de zero-inflated part of the model I get the same problem.
Example:
zinbm2_table_zi <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "zi"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_zi
Output:
Random Effects Output (zi)
The problem persists if I specify the effects argument in broom.mixed.
tidy_fun = function(...) broom.mixed::tidy(..., effects = "ran_pars", component = "cond"),
Looking at confidence intervals in both outputs it seems that somehow it is extracting random effects from both parts of the model and changing the estimate of the zero-inflated random effects (in 1st image; opposite in the 2nd image) to match the conditional part estimate while keeping the CI.
I am not knowledgeable enough to understand why this is happening. Since both rows have the same label I am having difficulty removing the wrong one.
Any tips on how to avoid this problem or a workaround to remove the undesired rows?
If you need more info, let me know.
Thank you in advance.
PS: Output images were changed to link due to insufficient reputation.

How to use caffe to classify text?

I'm using the Rotten Tomatoes dataset to train my net. It's divided in two groups, positive and negative examples. How can I configure my cnn in caffe to predict if a given text is a positive or a negative example?
I already formatted the data, each sentence has a size of 56 words. But using the following config does not give me even a satisfactory result.
n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB,
source=db_path,
transform_param=dict(scale= 1 / mean),
ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, pad=1,
param=dict(lr_mult=1), num_output=10,
weight_filler=dict(type='xavier'))
n.pool1 = L.Pooling(n.conv1, kernel_size=n_classes,
stride=2, pool=P.Pooling.MAX)
n.ip1 = L.InnerProduct(n.pool1, num_output=100,
weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.ip1, in_place=True)
n.ip2 = L.InnerProduct(n.relu1, num_output=n_classes,
weight_filler=dict(type='xavier'))
n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
My dataset is divided in two text files. One containing the positives examples and other containing negatives examples. Polarity dataset v1.1. To organize my data I get the length of the biggest sentence (59 words) so if a sentence is smaller than 59 words I add some text to it. I adapted from this code. For example, lets pretend that the biggest sentence has 3 words:
data = 'abc def ghijkl. mnopqrst uvwxyz. abcd.'
##
#In this data I have 3 sentences:
##
sentence_one = 'abc def ghijkl
sentence_two = 'mnopqrst uvwxyz'
sentence_three = 'abcd'
The sentence_one is the biggest (3 words), so to format the others two sentence I did the following:
sentence_two = 'mnopqrst uvwxyz <PAD>'
sentence_three = 'abcd <PAD> <PAD>'
Saved each positive and negative sentence to a caffe datum and saved in lmdb:
datum = caffe.proto.caffe_pb2.Datum()
datum.channels = 1
datum.height = 59 #biggest sentence
datum.width = 1
datum.label = label # 0 or 1
datum.data = sentence.tobytes()
Using my datum database and the above caffe's configuration I get a poor accuracy (less than 3 percent). What am I doing wrong?

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Find and replace in matlab?

I want do find and replace all in matlab (As we do in MS office).
https://www.dropbox.com/s/hxfqunjwhnvkl1f/matlab.mat?dl=0
I have a cell array LUT_HS_complete (contains identifier in column 1 and protein name in column 2 and summary in column 3) this is my look up table. on the other hand, I have my protein-protein interaction data (named Second_layer with identifiers in first two columns and the score in column 3).
I want to replace the first two columns in my Second_layer with the corresponding protein name from my look up table.
I tried strmatch, but that didn't help me.
Source_gene = Second_layer(:,1); Source_gene = regexprep(Source_gene,'[-/\s]','');
Target_gene = Second_layer(:,2); Target_gene = regexprep(Target_gene,'[-/\s]','');
Inter_score = Second_layer(:,3);
%%
for i=1:length(Source_gene(1:end,1));
SG = strmatch(Source_gene(i),LUT_HS_complete(1:end,1),'exact');
renamed_Source_gene(SG,1) = LUT_HS_complete(SG,2);
end
for j=1:length(Target_gene(1:end,1));
TG = strmatch(Target_gene(j),LUT_HS_complete(1:end,1),'exact');
renamed_Target_gene(TG,1) = LUT_HS_complete(TG,2);
end
If you could find a solution. It would be a great help.
Might this work for you?
renamed_Second_layer(:,1)=LUT_HS_complete(cellfun(#(x) find(strcmp(x,LUT_HS_complete(:,1))),Second_layer(:,1)),2);
renamed_Second_layer(:,2)=LUT_HS_complete(cellfun(#(x) find(strcmp(x,LUT_HS_complete(:,1))),Second_layer(:,2)),2);
renamed_Second_layer(:,3)=Second_layer(:,3);

Is there a quick way to assign unique text entries in an array a number?

In MatLab, I have several data vectors that are in text. For example:
speciesname = [species1 species2 species3];
genomelength = [8 10 5];
gonometype = [RNA DNA RNA];
I realise that to make a plot, arrays must be numerical. Is there a quick and easy way to assign unique entries in an array a number, for example so that RNA = 1 and DNA = 2? Note that some arrays might not be binary (i.e. have more than two options).
Thanks!
So there is a quick way to do it, but im not sure that your plots will be very intelligible if you use numbers instead of words.
You can make a unique array like this:
u = unique(gonometype);
and make a corresponding number array is just 1:length(u)
then when you go through your data the number of the current word will be:
find(u == current_name);
For your particular case you will need to utilize cells:
gonometype = {'RNA', 'DNA', 'RNA'};
u = unique(gonometype)
u =
'DNA' 'RNA'
current = 'RNA';
find(strcmp(u, current))
ans =
2