Multiple Factor Analysis (MFA) with R using FactoMineR - multivariate-testing

I have encountered a problem with MFA in FactoMineR. I am working with a data set of physical, chemical and microbiological continuous variables measured in tomato plants, taken from 2 different treatments at 3 time points. I have arranged my data like this:
structure(list(row.names = structure(c(1L, 4L, 7L, 10L, 13L,
16L), .Label = c("GBA1", "GBA2", "GBA3", "GBB1", "GBB2", "GBB3",
"GBC1", "GBC2", "GBC3", "GBD1", "GBD2", "GBD3", "GBE1", "GBE2",
"GBE3", "RWA1", "RWA2", "RWA3", "RWB1", "RWB2", "RWB3", "RWC1",
"RWC2", "RWC3", "RWD1", "RWD2", "RWD3", "RWE1", "RWE2", "RWE3",
"RWF1", "RWF2", "RWF3", "RWG1", "RWG2", "RWG3", "RWH1", "RWH2",
"RWH3", "RWI1", "RWI2", "RWI3", "RWJ1", "RWJ2", "RWJ3"), class = "factor"),
Trt = structure(c(2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Mineral",
"Organic"), class = "factor"), Status = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("H", "S"), class = "factor"),
Humidity = c(87.21704394, 80.29885961, 65.68047337, 85.9775641,
83.33333333, 85.98568282), pH = c(5.44, 5.94, 6.64, 6.19,
6.13, 5.45), Conductivity = c(837L, 867L, 752L, 871L, 699L,
406L), Nit.N = c(436.18, 433.92, 418.1, 458.78, 411.32, 167.24
), Ammonia.N = c(3.8122, 2.6452, 1.945, 1.7116, 2.4896, 7.16
), P = c(30.95, 15.2, 20.15, 16.1, 18.35, 48.2), K = c(135,
35, 95, 40, 145, 275), Ca = c(1287.5, 1427.5, 1610, 1570,
1640, 130), Mg = c(367.5, 575, 537.5, 532.5, 590, 42.5),
S = c(705L, 924L, 603L, 962L, 626L, 111L), Sodium = c(92.5,
170, 135, 127.5, 137.5, 35), Chlorides = c(15.1, 11.1, 15.4,
13.2, 13.8, 10.8), Fe = c(1.5, 2.2, 1.7, 2, 2.1, 3.1), Mn = c(1.1,
0.55, 0.7, 0.4, 0.65, 1.9), Rhizobium = c(0, 0, 0, 0, 0,
0), Total.bacteria = c(7207207.207, 5454545.455, 22727272.73,
18918918.92, 30630630.63, 64864864.86)), .Names = c("row.names",
"Trt", "Status", "Humidity", "pH", "Conductivity", "Nit.N", "Ammonia.N",
"P", "K", "Ca", "Mg", "S", "Sodium", "Chlorides", "Fe", "Mn",
"Rhizobium", "Total.bacteria"), row.names = c(NA, 6L), class = "data.frame")
I divided the variables into categorical (the first 2) and continuous (the other 16). However, I want to treat the 2 categorical variables as separate groups. So I wrote the following code:
>res <- MFA(Oliver, group=c(1,1,3,11,2), type=c("n", "n","s", "s","s"),ncp=5,name.group=c("Sub","Stat", "Phys", "Chem", "Microbial"))
However, it doesn't seem to work. Hence, I tried the following:
>res=MFA(Oliver,group=c(2,16),type=c(rep("n",1),rep("s",1)),ncp=5,name.group=c("cat","cont"))
and this other:
>res=MFA(Oliver,group=c(2, 3, 11,2),type=c(rep("n",1),rep("s",3)), ncp=5,name.group=c("type","Phys", "Chem", "Microbial"))
But I kept getting the same error ("not convenient group definition"). Is there anything I can do to keep the first 2 categorical groups separate? I would really appreciate your advice on how to properly run the model!
Best wishes,
Emma

I think the problem comes from your variable Status, which is not really a variable, since all its values are equal to "H". No analysis can be done with such a "variable".
You can remove it, as there is no information in this column, and then it should work.
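For instance, here is a minimal sketch of what that could look like (untested on your full data; it assumes your data frame is called Oliver and that the ID column is the first one, as in your dput output):

```r
library(FactoMineR)

# Move the IDs to row names and drop the constant Status column,
# then redefine the groups without Status.
Oliver2 <- Oliver
rownames(Oliver2) <- Oliver2$row.names
Oliver2$row.names <- NULL
Oliver2$Status <- NULL

# 1 categorical (Trt) + 3 physical + 11 chemical + 2 microbiological = 17 columns
res <- MFA(Oliver2,
           group = c(1, 3, 11, 2),
           type = c("n", "s", "s", "s"),
           ncp = 5,
           name.group = c("Sub", "Phys", "Chem", "Microbial"))
```

The group sizes must sum to the number of columns actually passed to MFA, which is why the ID column has to be moved out of the data first.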
Francois


summary row with gtsummary

I am trying to create a table of events with gtsummary, and I would like a final row counting the events of the previous rows. add_overall() and add_n() do add totals, but as a column, counting each event across groups rather than the overall number of events.
I created this example.
x1 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.85, 0.15))
x2 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.9, 0.1))
x3 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.75, 0.25))
y <- sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
df <- data.frame(factor(x1), factor(x2), factor(x3), factor(y))
colnames(df) <- c("event_1", "event_2", "event_3", "group")
tbl_summary(df, by = group, statistic = all_categorical() ~ "{n}")
I tried using summary_rows() function from gt package after converting the table to a gt object but there is an error when summarising because these variables are factors.
Any other ideas?
You can do this by adding a new variable to your data frame that is the row sum of each of the events. Then you can display that variable's sum in the summary table. Example below!
library(gtsummary)
library(tidyverse)
df <-
  data.frame(
    event_1 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.85, 0.15)),
    event_2 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.9, 0.1)),
    event_3 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.75, 0.25)),
    group = sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
  ) |>
  rowwise() |>
  mutate(Total = sum(event_1, event_2, event_3))

tbl_summary(
  df,
  by = group,
  type = Total ~ "continuous",
  statistic =
    list(all_categorical() ~ "{n}",
         all_continuous() ~ "{sum}")
) |>
  as_kable() # convert to kable to display on stack overflow
| Characteristic | A, N = 16 | B, N = 14 |
|----------------|-----------|-----------|
| event_1        | 4         | 4         |
| event_2        | 1         | 2         |
| event_3        | 7         | 6         |
| Total          | 12        | 12        |
Created on 2023-01-12 with reprex v2.0.2
Thank you so much (great package, gtsummary). That works! I had some trouble summing over factors; if the variables are factors, the code
mutate(Total = sum(event_1 == "Yes", event_2 == "Yes", event_3 == "Yes"))
does it.

Google OR-Tools doesn't find a solution to a VRPTW problem

I'm tackling a VRPTW problem and struggling because the solver finds no solution with any data except an artificially small set.
The setting is as below.
There are several depots and locations to visit. Each location has a time window. Each vehicle has a break time and working hours. Some locations also have constraints, so that only the vehicles satisfying their demand can visit them.
Based on this experimental setting, I wrote the code below.
As I wrote, it seems to work with small artificial data, but with real data it never finds a solution. I tried 5 different data sets.
Although I set a 7200-second time limit, I previously ran it for longer than 10 hours with the same result.
The data's scale is 40~50 vehicles and 200~300 locations.
Does this code have a problem? If not, in what order should I change the approach (initialization, search method, and so on)?
(Edited to use integer for time matrix)
from dataclasses import dataclass
from typing import List, Tuple

from ortools.constraint_solver import pywrapcp
from ortools.constraint_solver import routing_enums_pb2

# TODO: Refactor
BIG_ENOUGH = 100000000
TIME_DIMENSION = 'Time'
TIME_LIMIT = 7200


@dataclass
class DataSet:
    time_matrix: List[List[int]]
    locations_num: int
    vehicles_num: int
    vehicles_break_time_window: List[Tuple[int, int, int]]
    vehicles_work_time_windows: List[Tuple[int, int]]
    location_time_windows: List[Tuple[int, int]]
    vehicles_depots_indices: List[int]
    possible_vehicles: List[List[int]]


def execute(data: DataSet):
    manager = pywrapcp.RoutingIndexManager(data.locations_num,
                                           data.vehicles_num,
                                           data.vehicles_depots_indices,
                                           data.vehicles_depots_indices)
    routing_parameters = pywrapcp.DefaultRoutingModelParameters()
    routing_parameters.solver_parameters.trace_propagation = True
    routing_parameters.solver_parameters.trace_search = True
    routing = pywrapcp.RoutingModel(manager, routing_parameters)

    def time_callback(source_index, dest_index):
        from_node = manager.IndexToNode(source_index)
        to_node = manager.IndexToNode(dest_index)
        return data.time_matrix[from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(time_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
    routing.AddDimension(
        transit_callback_index,
        BIG_ENOUGH,
        BIG_ENOUGH,
        False,
        TIME_DIMENSION)
    time_dimension = routing.GetDimensionOrDie(TIME_DIMENSION)

    # set time window for locations start time
    # set condition restrictions
    possible_vehicles = data.possible_vehicles
    for location_idx, time_window in enumerate(data.location_time_windows):
        index = manager.NodeToIndex(location_idx + data.vehicles_num)
        time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
        routing.SetAllowedVehiclesForIndex(possible_vehicles[location_idx], index)

    solver = routing.solver()
    for i in range(data.vehicles_num):
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.Start(i)))
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.End(i)))

    # set work time window for vehicles
    for vehicle_index, work_time_window in enumerate(data.vehicles_work_time_windows):
        start_index = routing.Start(vehicle_index)
        time_dimension.CumulVar(start_index).SetRange(work_time_window[0],
                                                      work_time_window[0])
        end_index = routing.End(vehicle_index)
        time_dimension.CumulVar(end_index).SetRange(work_time_window[1],
                                                    work_time_window[1])

    # set break time for vehicles
    node_visit_transit = {}
    for n in range(routing.Size()):
        if n >= data.locations_num:
            node_visit_transit[n] = 0
        else:
            node_visit_transit[n] = 1

    break_intervals = {}
    for v in range(data.vehicles_num):
        vehicle_break = data.vehicles_break_time_window[v]
        break_intervals[v] = [
            solver.FixedDurationIntervalVar(vehicle_break[0],
                                            vehicle_break[1],
                                            vehicle_break[2],
                                            True,
                                            'Break for vehicle {}'.format(v))
        ]
        time_dimension.SetBreakIntervalsOfVehicle(
            break_intervals[v], v, node_visit_transit)

    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    search_parameters.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GREEDY_DESCENT)
    search_parameters.time_limit.seconds = TIME_LIMIT
    search_parameters.log_search = True
    solution = routing.SolveWithParameters(search_parameters)
    return solution


if __name__ == '__main__':
    data = DataSet(
        time_matrix=[[0, 0, 4, 5, 5, 6],
                     [0, 0, 6, 4, 5, 5],
                     [1, 3, 0, 6, 5, 4],
                     [2, 1, 6, 0, 5, 4],
                     [2, 2, 5, 5, 0, 6],
                     [3, 2, 4, 4, 6, 0]],
        locations_num=6,
        vehicles_num=2,
        vehicles_depots_indices=[0, 1],
        vehicles_work_time_windows=[(720, 1080), (720, 1080)],
        vehicles_break_time_window=[(720, 720, 15), (720, 720, 15)],
        location_time_windows=[(735, 750), (915, 930), (915, 930), (975, 990)],
        possible_vehicles=[[0], [1], [0], [1]])
    solution = execute(data)
    if solution is not None:
        print("solution is found")
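Not an answer from the original thread, but one hypothetical way to narrow this down on the real data is a pure-Python pre-check, independent of OR-Tools, that flags locations whose time windows cannot be reached by any of their allowed vehicles; infeasible windows are a common cause of "no solution" in VRPTW models. The function below (names are illustrative) follows the same node numbering as the code above, where locations come after the depot nodes:

```python
def find_unreachable(time_matrix, depots, work_windows,
                     location_windows, possible_vehicles, vehicles_num):
    """Return indices of locations that no allowed vehicle can reach in time."""
    unreachable = []
    for loc, (open_t, close_t) in enumerate(location_windows):
        node = loc + vehicles_num  # locations are numbered after the depots
        reachable = any(
            # earliest possible arrival: leave the depot at the work-window
            # start and travel directly to the location
            work_windows[v][0] + time_matrix[depots[v]][node] <= close_t
            for v in possible_vehicles[loc])
        if not reachable:
            unreachable.append(loc)
    return unreachable


# With the small artificial data set above, every location is reachable:
time_matrix = [[0, 0, 4, 5, 5, 6],
               [0, 0, 6, 4, 5, 5],
               [1, 3, 0, 6, 5, 4],
               [2, 1, 6, 0, 5, 4],
               [2, 2, 5, 5, 0, 6],
               [3, 2, 4, 4, 6, 0]]
print(find_unreachable(time_matrix, [0, 1], [(720, 1080), (720, 1080)],
                       [(735, 750), (915, 930), (915, 930), (975, 990)],
                       [[0], [1], [0], [1]], 2))  # -> []
```

Running this on the real 200~300-location instances would show quickly whether the "no solution" outcome comes from the data rather than from the search strategy.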

SHAP explanation for the inputs with different types in CNN models

I have a question about using SHAP to explain the results of my CNN model. The model takes 2 inputs of different types: one is an image, and the other is a feature vector. I trained and tested the model taking both into account; there was no problem establishing the model.
However, when I tried to use SHAP to explain the result for those two inputs simultaneously, it didn't work. I have tried both DeepExplainer and GradientExplainer. The error I got is below:
File "<stdin>", line 1, in <module>
  shap_values = explainer.shap_values([x_test[:3], feature_test[:3]])
File "C:\Users\kaz10003\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\deep\__init__.py", line 119, in shap_values
  return self.explainer.shap_values(X, ranked_outputs, output_rank_order)
File "C:\Users\kaz10003\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\deep\deep_tf.py", line 284, in shap_values
  diffs = model_output[:, l] - self.expected_value[l] - output_phis[l].sum(axis=tuple(range(1, output_phis[l].ndim)))
AttributeError: 'list' object has no attribute 'sum'
Does anybody know whether SHAP supports such an implementation? Here is my code:
import shap
from keras.layers import (Input, Conv2D, MaxPooling2D, RepeatVector,
                          Reshape, Flatten, Dense, concatenate)
from keras.models import Model

n_features = 10
input_feat = Input((n_features,))
input_tensor = Input(shape=(50, 60, 1))
c3 = Conv2D(32, (3, 3), activation='relu', padding='same')(input_tensor)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same')(c3)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same')(c3)
p3 = MaxPooling2D((2, 2))(c3)
f_repeat = RepeatVector(6*7)(input_feat)
f_conv = Reshape((6, 7, n_features))(f_repeat)
p3_feat = concatenate([p3, f_conv], -1)
c3 = Flatten()(p3_feat)
c3 = Dense(512)(c3)
outputs = Dense(2, activation='softmax')(c3)
model = Model(inputs=[input_tensor, input_feat], outputs=[outputs])
model.summary()
explainer = shap.GradientExplainer(model, [x_train, feature_train])
shap_values = explainer.shap_values([x_test[:3], feature_test[:3]])

"Impossible" error message in R for VennDiagram

I keep getting an error message when trying to use VennDiagram in R. Below is my posted code:
draw.quintuple.venn(area1 = 578, area2 = 519, area3 = 212, area4 = 402, area5 = 172, n12 = 366, n15 = 97, n13 =149, n14 = 284, n23 = 103, n24 = 202, n25 = 125, n35 = 31, n34= 12, n45 = 27, n123 = 80, n124 = 161, n125 = 84, n134 = 8, n135 = 25, n145 = 20, n234 = 5, n235 = 24, n245 = 21, n345 = 1, n1234 = 5, n1345 = 21, n1245 = 16, n1235 = 0, n2345 = 0, n12345 = 0, category = c("1", "2", "3", "4", "5"), lty = "blank", fill = c("skyblue", "pink1", "mediumorchid", "yellow", "orange"))
Error:
ERROR [2018-07-09 13:37:19] Impossible: a11 <- n23 - a21 - a22 - a24 - a26 - a29 - a30 - a31 produces negative area
Error in draw.quintuple.venn(area1 = 578, area2 = 519, area3 = 212, area4 = 402, : Impossible: a11 <- n23 - a21 - a22 - a24 - a26 - a29 - a30 - a31 produces negative area
What am I doing wrong?
I double checked and made sure the values are all correct.
I do not think this package is well documented. A look at the source code shows that the meaning of nxxxxx is not the obvious one. For instance, n135 means "how many elements belong to at least groups 1, 3 and 5". When drawing the diagram, the package calculates how many of those n135 elements also belong to other groups (i.e., n1235, n1345 and n12345) and subtracts them.
What seems to be happening here is that you interpret n135 as "how many elements belong only to sets 1, 3 and 5" (that would also have been my guess). If you want to use those numbers directly, you should write:
draw.quintuple.venn(area.vector = c(578, 519, 212, 402, 172, 31, 97, 284, 366, 125, 103, 149, 12, 202, 27, 1, 25, 20, 161, 84, 24, 80, 8, 5, 21, 0, 21, 16, 0, 5, 0), category = c("1", "2", "3", "4", "5"), lty = "blank", fill = c("skyblue", "pink1", "mediumorchid", "yellow", "orange"), direct.area = T)
The order of the numbers is taken directly from the source code; I have not seen it documented. Let us call a135 "how many elements belong only to sets 1, 3 and 5". With this in mind, the order would be:
a1, a2, a3, a4, a5, a35, a15, a14, a12, a25, a23, a13, a34, a24, a45, a345, a135, a145, a124, a125, a235, a123, a134, a234, a245, a2345, a1345, a1245, a1235, a1234, a12345
If you prefer to use the n135 notation, you would need to transform your data, so that n135 = a135 + a1235 + a1345 + a12345, and so forth. In your case, n135 = 25 + 0 + 21 + 0 = 36.
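The conversion between the two notations is just a sum over supersets. As a quick sanity check on the numbers from the question:

```r
# a-notation: elements belonging ONLY to the named sets
a135 <- 25; a1235 <- 0; a1345 <- 21; a12345 <- 0

# n-notation: elements belonging to AT LEAST the named sets
n135 <- a135 + a1235 + a1345 + a12345
n135
#> [1] 36
```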
Although not part of the question, you can also use my nVennR package for a proportional representation. The order of the numbers is documented in the help and the vignette, and you can also enter raw sets rather than numbers:
library(nVennR)
myV <- createVennObj(nSets = 5, sNames = c('1', '2', '3', '4', '5'), sSizes = c(0, 172, 402, 27, 212, 31, 12, 1, 519, 125, 202, 21, 103, 24, 5, 0, 578, 97, 284, 20, 149, 25, 8, 21, 366, 84, 161, 16, 80, 0, 5, 0))
myV <- plotVenn(nVennObj = myV, setColors = c("skyblue", "pink1", "mediumorchid", "yellow", "orange"), borderWidth = 0)

Use a crossbar in ggplot2

I am plotting bar charts for different climate models and would like to make a plot comparing the models to observations. The climate models will be plotted as bars (geom_bar()), but I would like the observations shown as crossbars.
The script below makes a plot, but there is something (an upside-down triangle) plotted above the graph. What is wrong with this script? Am I missing something?
ch<-structure(list(Month = structure(c(4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 1L, 2L, 3L), .Label = c("Oct", "Nov", "Dec", "Jan",
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), class = c("ordered",
"factor")), GCM1 = c(169.5, 157.19, 90.07, 42.97, 13.24, 1.56,
2.53, 5.99, 14.92, 46.35, 88.23, 138.02), GCM2 = c(215.01, 193.37,
131.14, 41.48, 7.63, 0.94, 0.81, 0.78, 1.88, 15.95, 99.58, 188.16
), GCM3 = c(164.83, 158.82, 97.5, 29.27, 5.47, 2.14, 3.34, 0.85,
9.94, 16.9, 57.21, 117.05), OBS = c(142.25, 138.59, 59.95, 26.48,
2.61, 0.2, 0.1, 0.4, 0.72, 11.64, 38.75, 119.82)), .Names = c("Month",
"GCM1", "GCM2", "GCM3", "OBS"), row.names = c(NA, -12L), class = "data.frame")
ch$Month<-month.abb
ch$Month<-factor(ch$Month, levels=c(month.abb[10:12],month.abb[1:9]), ordered=TRUE)
library(reshape2)  # for melt()
library(ggplot2)
library(plyr)      # for .() in the layer subsets
chm <- melt(ch, id = "Month")
cbPalette1 <- c("#D55E00", "#56B4E9", "#009E73", "#0072B2", "#CC79A7", "#000000")
p <- ggplot(data = chm, aes(x = factor(Month), y = value, group = variable, fill = variable)) +
  geom_bar(subset = .(variable != "OBS"), stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = cbPalette1) +
  geom_crossbar(subset = .(variable == "OBS"), aes(ymin = min(value), ymax = max(value)), col = "gray30", fatten = 3)
.......
Many thanks in advance
BHH
Two things:
1. You are overriding the group aesthetic to just be variable, so in the crossbar it is ignoring the different x values (treating it as continuous), and that is giving a weird crossbar.
2. I think you just want the bar itself, not any extent around it. If so, you want to set ymin and ymax to the central value, not to the range of all central values.
Making both those changes:
p <- ggplot(data = chm,
            aes(x = Month,
                y = value,
                fill = variable)) +
  geom_bar(subset = .(variable != "OBS"),
           stat = "identity",
           position = position_dodge()) +
  scale_fill_manual(values = cbPalette1) +
  geom_crossbar(subset = .(variable == "OBS"),
                aes(ymin = value, ymax = value),
                col = "gray30", fatten = 3)
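As a side note, the layer-level subset argument used above comes from old plyr-era ggplot2 and has since been removed. In current ggplot2 releases the same plot can be written by passing filtered data to each layer (a sketch, assuming chm and cbPalette1 as defined in the question):

```r
library(ggplot2)

ggplot(chm[chm$variable != "OBS", ],
       aes(x = Month, y = value, fill = variable)) +
  geom_col(position = position_dodge()) +   # geom_col() is geom_bar(stat = "identity")
  scale_fill_manual(values = cbPalette1) +
  geom_crossbar(data = chm[chm$variable == "OBS", ],
                aes(ymin = value, ymax = value),
                col = "gray30", fatten = 3)
```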