knitr - strange behaviour: summary(numeric) prints different values than str()

I am trying to develop a reproducible research report that includes printed output of the variable distributions of input datasets.
I am confused by the result of summary() in the small example below.
When I evaluate the code directly at the console, 'b' is all 2012 as expected; however, when I run it through knit2html(), it appears as 2010.
dat <- data.frame(a = letters, b = rep(2012, length(letters)))
str(dat)
## 'data.frame': 26 obs. of 2 variables:
## $ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ b: num 2012 2012 2012 2012 2012 ...
dd <- lapply(dat, summary)
dd
## $a
## a b c d e f g h i j k l m n o p q r s t u v w x y z
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $b
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2010 2010 2010 2010 2010 2010
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
## [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
## [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.9 stringr_0.6.2 tools_3.1.0

In knitr code chunks, options("digits") defaults to 4. The summary function has a digits argument that defaults to max(3, getOption("digits")-3) (see ?summary). This causes summary to round 2012 to three significant digits, resulting in 2010.
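A quick way to see the rounding at work (for numeric vectors, summary() passes its results through signif()):
getOption("digits")              # 7 at the console, but 4 inside a default knitr chunk
max(3, getOption("digits") - 3)  # the 'digits' value summary() actually uses: 3 when digits is 4
signif(2012, 3)
## [1] 2010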
You can either increase the digits option in your code chunk:
options(digits=7)
Or specify the digits argument in your call to summary:
dd <- lapply(dat, summary, digits=4)
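Either way the year is no longer rounded away; for example, re-running the summary with more significant digits gives:
dd <- lapply(dat, summary, digits = 7)
dd$b
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2012    2012    2012    2012    2012    2012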


Error in post hoc test for lmer(): both multcomp() and emmeans()

I have a dataset of measurements of "Y" at different locations, and I am trying to determine how Y is influenced by variables A, B, and D by fitting an lmer() model and analysing the results. However, when I reach the post hoc step, I receive an error.
Here is an example of my data:
table <- " ID location A B C D Y
1 1 AA 0 0.6181587 -29.67 14.14 168.041
2 2 AA 1 0.5816176 -29.42 14.21 200.991
3 3 AA 2 0.4289670 -28.57 13.55 200.343
4 4 AA 3 0.4158891 -28.59 12.68 215.638
5 5 AA 4 0.3172721 -28.74 12.28 173.299
6 6 AA 5 0.1540603 -27.86 14.01 104.246
7 7 AA 6 0.1219355 -27.18 14.43 128.141
8 8 AA 7 0.1016643 -26.86 13.75 179.330
9 9 BB 0 0.6831649 -28.93 17.03 210.066
10 10 BB 1 0.6796935 -28.54 18.31 280.249
11 11 BB 2 0.5497743 -27.88 17.33 134.023
12 12 BB 3 0.3631052 -27.48 16.79 142.383
13 13 BB 4 0.3875498 -26.98 17.81 136.647
14 14 BB 5 0.3883785 -26.71 17.56 142.179
15 15 BB 6 0.4058061 -26.72 17.71 109.826
16 16 CC 0 0.8647298 -28.53 11.93 220.464
17 17 CC 1 0.8664036 -28.39 11.59 326.868
18 18 CC 2 0.7480748 -27.61 11.75 322.745
19 19 CC 3 0.5959143 -26.81 13.27 170.064
20 20 CC 4 0.4849077 -26.77 14.68 118.092
21 21 CC 5 0.3584687 -26.65 15.65 95.512
22 22 CC 6 0.3018285 -26.33 16.11 71.717
23 23 CC 7 0.2629121 -26.39 16.16 60.052
24 24 DD 0 0.8673077 -27.93 12.09 234.244
25 25 DD 1 0.8226558 -27.96 12.13 244.903
26 26 DD 2 0.7826429 -27.44 12.38 252.485
27 27 DD 3 0.6620447 -27.23 13.84 150.886
28 28 DD 4 0.4453213 -27.03 15.73 102.787
29 29 DD 5 0.3720257 -27.13 16.27 109.201
30 30 DD 6 0.6040217 -27.79 16.41 101.509
31 31 EE 0 0.8770987 -28.62 12.72 239.036
32 32 EE 1 0.8504547 -28.47 12.92 220.600
33 33 EE 2 0.8329484 -28.45 12.94 174.979
34 34 EE 3 0.8181102 -28.37 13.17 138.412
35 35 EE 4 0.7942685 -28.32 13.69 121.330
36 36 EE 5 0.7319724 -28.22 14.62 111.851
37 37 EE 6 0.7014828 -28.24 15.04 110.447
38 38 EE 7 0.7286984 -28.15 15.18 121.831"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
# Make sure location is a factor
df$location<-as.factor(df$location)
Here is my model:
# Load libraries
library(ggplot2)
library(pscl)
library(lmtest)
library(lme4)
library(car)
mod = lmer(Y ~ A * B * poly(D, 2) * (1|location), data = df)
summary(mod)
plot(mod)
I now need to determine which variables significantly influence Y, so I ran Anova() from the car package (output below).
Anova(mod)
# Analysis of Deviance Table (Type II Wald chisquare tests)
#
# Response: Y
# Chisq Df Pr(>Chisq)
# A 8.2754 1 0.004019 **
# B 0.0053 1 0.941974
# poly(D, 2) 40.4618 2 1.636e-09 ***
# A:B 0.1709 1 0.679348
# A:poly(D, 2) 1.6460 2 0.439117
# B:poly(D, 2) 5.2601 2 0.072076 .
# A:B:poly(D, 2) 0.6372 2 0.727175
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This suggests that:
A significantly influences Y
B does not significantly influence Y
D significantly influences Y
So next I would run a post hoc test for each of these variables, but this is where I run into issues. I have tried both the emmeans and multcomp packages, as shown below:
library(emmeans)
emmeans(mod, list(pairwise ~ A), adjust = "tukey")
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
pairs(emmeans(mod, "A"))
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
library(multcomp)
summary(glht(mod, linfct = mcp(A = "Tukey")), test = adjusted("fdr"))
# Error in h(simpleError(msg, call)) :
# error in evaluating the argument 'object' in selecting a method for function 'summary': Variable(s) ‘depth’ of class ‘integer’ is/are not contained as a factor in ‘model’.
This is the first time I've run an ANOVA/post hoc test on an lmer() model, and though I've read a few introductory sites about this kind of model, I'm not sure I am testing it correctly. Any help would be appreciated.
If I am looking at the data correctly, A is the variable that has values of 0, 1, ..., 7. Now look at your ANOVA table, where you see that A has only 1 d.f., not the 7 it should have for a factor with 8 levels. That means your model is treating A as a numerical predictor -- which is rather meaningless. Make A into a factor and re-fit the model. You'll have better luck.
I also think you meant to have + (1|location) at the end of the model formula, rather than having the random effects interacting with some of the polynomial effects.
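A minimal sketch of those two changes, using the example data above. (Note this is only an illustration: with just 38 rows, the full three-way interaction with an 8-level factor would be heavily over-parameterised, so the sketch uses a reduced fixed-effects structure.)
df$A <- factor(df$A)   # treat A as categorical: 8 levels -> 7 df
# random intercept for location added with '+', not crossed with the fixed effects
mod2 <- lmer(Y ~ A + B + poly(D, 2) + (1 | location), data = df)
Anova(mod2)            # A now appears with 7 df
emmeans(mod2, pairwise ~ A, adjust = "tukey")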

Converting integer to binary for SPI transfer

I am trying to convert an integer into 3 bytes to send via the SPI protocol to control a MAX6921 serial chip. I am using spi.xfer2() but cannot get it to work as expected.
In my code, calling spi.xfer2([0b00000000, 0b11101100, 0b10000000]) displays the letter "H" on my display, but when I try to convert the corresponding int value, 79616, it doesn't give the correct output:
val = 79616
spi.xfer2(val.to_bytes(3, 'little'))
My full code so far is on GitHub, and for comparison, this is my working code for Arduino.
More details
I have an IV-18 VFD tube driver module, which came with some example code for an ESP32 board. The driver module has a 20 output MAX6921 serial chip to drive the vacuum tube.
To send "H" to the second grid position (as the first grid only displays a dot or a dash), the bits are sent to the MAX6921 in order OUT19 --> OUT0, so using the LSB in my table below. The letter "H" has the int value 79616.
I can successfully send this, manually, via SPI using:
spi.xfer2([0b00000000, 0b11101100, 0b10000000])
The problem I have is trying to convert other letters in a string to the correct bits. I can retrieve the integer value for any character (0-9, A-Z) in a string, but can't then work out how to convert it to the right format for spi.xfer() / spi.xfer2()
My Code
def display_write(val):
    spi.xfer2(val.to_bytes(3, 'little'))

# Loops over the grid positions
def update_display():
    global grid_index
    val = grids[grid_index] | char_to_code(message[grid_index:grid_index+1])
    display_write(val)
    grid_index += 1
    if grid_index >= 9:
        grid_index = 0
The full source for my code so far is on GitHub
Map of MAX6921 data-out pins to the IV-18 tube pins:
BIT:     24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
IV-18:   G9 G8 G7 G6 G5 G4 G3 G2 G1 A B C D E F G DP
MAX6921: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
IV-18 Pinout diagram

Why do I always get 1 for df when running adonis function (permanova)?

I ran adonis on community data and an environmental matrix (containing a factor with two levels and 6 continuous variables) using Bray-Curtis dissimilarity, and I always get 1 df, which should not be the case. Probably there is a bug here.
See also the example in the adonis documentation:
data(dune)
data(dune.env)
str(dune.env)
adonis(dune ~ Management*A1, data=dune.env, permutations=99)
Although A1 is a numeric variable the result provides 1 df.
In the model:
> adonis(dune ~ Management*A1, data=dune.env, permutations=99)
Call:
adonis(formula = dune ~ Management * A1, data = dune.env, permutations = 99)
Permutation: free
Number of permutations: 99
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Management 3 1.4686 0.48953 3.2629 0.34161 0.01 **
A1 1 0.4409 0.44089 2.9387 0.10256 0.02 *
Management:A1 3 0.5892 0.19639 1.3090 0.13705 0.21
Residuals 12 1.8004 0.15003 0.41878
Total 19 4.2990 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The main effect of A1 uses a single degree of freedom because it is a continuous variable. The interaction between Management and A1 uses 3 additional degrees of freedom because the model estimates a separate A1 slope for each of the 4 levels of Management (3 more than the single common slope).
This is all expected and there is certainly no bug illustrated in adonis() from this model.
Importantly, you must ensure that factor variables are coded as factors; otherwise, for example if the categories are coded as integers, R will interpret those variables as continuous/numeric. It will only treat them as factors if they are coerced to the "factor" class. Check the output of str(df), where df is your data frame containing the predictor variables (covariates; the things on the right-hand side of ~), and make sure that each factor variable has the appropriate class. For example, the dune.env data are:
> str(dune.env)
'data.frame': 20 obs. of 5 variables:
$ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
$ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
$ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
$ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
$ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
which indicates that Management is a factor, A1 is numeric (it is the thickness of the A1 soil horizon), and the remaining variables are ordered factors (but still factors; they work correctly in R's model formula infrastructure).
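To see the difference this makes, here is a small sketch using the same dune data, with Management deliberately recoded as integers to mimic a factor stored as numbers:
library(vegan)
data(dune)
data(dune.env)
env <- dune.env
env$Management <- as.integer(env$Management)              # integer codes 1..4
adonis(dune ~ Management, data = env, permutations = 99)  # Management: 1 Df (treated as numeric)
env$Management <- factor(env$Management)                  # coerced back to a factor
adonis(dune ~ Management, data = env, permutations = 99)  # Management: 3 Df (4 levels)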

charToDate(x) error when using "seqformat" in TraMineR

I'm using TraMineR to inspect work trajectories.
When using the seqformat function (from SPELL data) with process = TRUE and an external data frame for pdata, as follows:
situations <- seqformat(data[,1:4], id = 1, from = "SPELL", to = "STS",
begin = 3, end = 4, status = 2, right = NA,
process = TRUE, limit = 7644, pdata = pdata,
pvar = c("id","birth"))
I get an error message:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I read many threads about that issue, but could not find any helpful solution.
Here are the structures of my data frames data and pdata :
str(data)
'data.frame': 2428 obs. of 9 variables:
 $ ID_SQ           : Factor w/ 798 levels "1","2","3","5",..: 1 1 1 1 1 2 2 ...
 $ SITUATION       : chr "En poste" "En poste" "En poste" "En poste" ...
 $ DATE_DE         : Date, format: "1997-09-01" "1999-05-03" "2003-01-01" ...
 $ DATE_A          : Date, format: "1999-04-26" "2002-12-31" "2006-04-28" ...
 $ SEXE            : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 1 1 1 ...
 $ PROMO           : Factor w/ 6 levels "1997","1998",..: 1 1 1 1 1 2 2 ...
 $ DEPARTEMENT     : Factor w/ 10 levels "BC","GCU","GE",..: 1 1 1 1 1 4 4 4 4 4 ...
 $ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA NA NA NA NA 1 1 1 1 1 ...
 $ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA NA NA NA NA 4 4 4 4 4 ...
str(pdata)
'data.frame': 798 obs. of 2 variables:
$ id : Factor w/ 798 levels "1","2","3","5",..: 1 2 3 4 5 6 7 8 9 10 ...
$ birth: Date, format: "1997-01-01" "1998-01-01" "1998-01-01" "2000-01-01" ...
It seems to me that all date formats are OK.
But, clearly, something's wrong.
What am I doing wrong?
Thank you in advance for your help,
Best,
Arnaud.
The seqformat function expects integer values for the begin and end dates of the spells. Actually, these integers should be the (time-)position in the state sequence and will correspond in your example to column numbers in the resulting STS format.
So you need to transform your dates into integer values.
=============
The error
Error in charToDate(x) : character string is not in a standard unambiguous format
occurs while the function tests whether pdata is the string "auto" with if(pdata == "auto"). This is because, when pdata contains dates, the test attempts to coerce "auto" into a date for the sake of comparison. The workaround is to input the dates as integers.
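For instance, a minimal sketch that positions the spells by calendar year (the column names follow the str() output above; the limit argument would then also need to be expressed in years):
data$DATE_DE <- as.integer(format(data$DATE_DE, "%Y"))
data$DATE_A  <- as.integer(format(data$DATE_A,  "%Y"))
pdata$birth  <- as.integer(format(pdata$birth,  "%Y"))
situations <- seqformat(data[, 1:4], id = 1, from = "SPELL", to = "STS",
                        begin = 3, end = 4, status = 2, right = NA,
                        process = TRUE, pdata = pdata, pvar = c("id", "birth"))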

Tokenizing with Perl and Unstructured data

I have the following data (from a text file). I would like to split out every element, including those that are blank (some grades, as you can see, are not listed, which means they are 0, so I want to capture them as well).
CRN SUB CRSE SECT COURSE TITLE INSTRUCTOR A A- B+ B B- C+ C C- D+ D D- F I CR NC W WN INV TOTAL
----- -- ---- ---- ----------------- ----------------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -----
33450 XX 9950 AIP OVERSEAS-AIP SPAI NOT FOUND 1 1 2
33092 XX 9950 ALB ddddddd, SPN. vi NOT FOUND 1 1
33494 XX 9950 W16 OVERSEAS Univ.Wes NOT FOUND 1 1
INSTRUCTOR TOTALS NOT FOUND 2 1 18 1 2 24
PERCENTAGE DISTRI NOT FOUND 8 4 75 4 8 ******
33271 PE 3600 001 Global Geography sfnfbg,dsdassaas 2 2 1 1 2 3 6 5 3 3 1 29
INSTRUCTOR TOTALS snakdi,plid 2 2 1 1 2 3 6 5 3 3 1 29
PERCENTAGE DISTRI krapsta,lalalal 7 7 3 3 7 10 21 17 10 10 3 ***
The problem, as you can see, is that I don't have a specific delimiter, because some grades are missing. If they weren't, I could have taken everything from the start of the line up to the first grade ('A'), then all the grades, and split them by /\s+/, but that's not the case.
Any suggestions (if there are any....) would be awesome.
Thanks,
There are irregularities at places in some columns (note that first total values 18 and 75 are partially in next column), but if you don't need them, you can try something like this:
my @data;

# skip the header and separator lines
my $hdr = <DATA>;
my $sep = <DATA>;

while (<DATA>) {
    chomp;
    # skip empty and total lines
    next if /^\s*$/ || /^[ ]{5}/;
    push @data, [
        map { s/^\s+//; s/\s+$//; $_ }    # trim each column
        unpack 'A6A7A7A7 A18A20 A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4 A10', $_
    ];
}

use Data::Dump;
dd \@data;
__DATA__
CRN SUB CRSE ...
----- -- ---- ...
You might need to tweak the column boundaries in the unpack template for real data, but this should get you started.
This looks like a job best suited to a column-based (fixed-width) text parser. I have found DataExtract-FixedWidth on CPAN, but have no personal experience with it. The format looks pretty messy, especially with the numbers sitting on the column borders, so you would probably have to do some kind of pre-processing or apply heuristics anyway.