How do I get a zoo object with numeric data and a Date index?

I want to transform my Excel data (bank returns and the dates) into a zoo object, with the data in the zoo object being numeric and the index being dates. I used the following data:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1455 obs. of 2 variables:
$ date : POSIXct, format: "1925-01-02" "1925-01-03" "1925-01-05" "1925-01-06" ...
$ Deutsche Bank: num 0.181 0.191 0.191 0.184 0.186 ...
I used the following code:
db.xts <- na.omit(as.data.frame(db.kurs))
db.xts2 <- db.xts %>%
  mutate(date = as.Date(date, format = "%d.%m.%Y")) %>%
  mutate(`Deutsche Bank` = as.numeric(`Deutsche Bank`))
db.xts3 <- as.xts(db.xts2, db.kurs$date)
db.zoo <- as.zoo(db.xts3)
db.zoo <- db.zoo[, colnames(db.zoo) != "date"]
which leaves me with the following:
‘zoo’ series from 1925-01-02 to 1929-12-31
Data: chr [1:1455] "0.1807194" "0.1911455" "0.1911455" "0.1841948" "0.1859325" "0.1841948" "0.1807194" "0.1789817" ...
Index: POSIXct[1:1455], format: "1925-01-02" "1925-01-03" "1925-01-05" "1925-01-06" "1925-01-07" "1925-01-08" "1925-01-09" "1925-01-10" ...
If I try to run it without the as.xts command, R drops all the dates and uses an integer index from 1 to 1455.
Does anybody have an idea how to solve it?
Thanks for the help,
Nick
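For reference, a minimal sketch of the usual fix, assuming db.kurs has the two columns shown in str() above: the date column must not be part of the data itself (otherwise everything is coerced to character) and should only be supplied as the index.
library(zoo)

# sketch only: build the series from the numeric column and index it by Date
db.zoo <- zoo(
  as.numeric(db.kurs$`Deutsche Bank`),  # data stays numeric
  order.by = as.Date(db.kurs$date)      # dates go into the index, not the data
)
str(db.zoo)  # should now report a numeric zoo series with a Date index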


Unable to get Basic tbl_regression()

I'm planning to run the following code to get a simple univariate table:
lang2 <- glm(
  langr52 ~ r9racehisp,
  data = test,
  family = binomial(link = "logit")
)
summary(lang2)
lang_tbl_1 <-
  tbl_regression(
    lang2
  )
Then I get the following message:
There were 18 warnings (use warnings() to see them)
I'd appreciate any guidance.
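As a first diagnostic step (a sketch only; the actual cause of the 18 warnings is not established here), the accumulated warnings can be printed and the predictor coding checked before calling tbl_regression():
# object and variable names taken from the question
warnings()                               # print the 18 warnings in full
str(test$r9racehisp)                     # is the predictor coded as a factor?
table(test$r9racehisp, useNA = "ifany")  # look for sparse or unexpected levels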

Analyze weather data stored in a csv file

I have some weather data stored in a csv file in the form "id, date, temperature, rainfall", with id being the weather station and, obviously, date being the date of measurement. The file contains data from 3 different stations over a period of 10 years.
What I'd like to do is analyze the data for each station and each year. For example, I'd like to calculate day-to-day differences in temperature, abs(T(n+1) - T(n)), for each station and each year.
I thought while-loops could be a possibility, with the loop calculating something as long as the id value is equal to the one in the next row.
But I’ve no idea how to do it.
Best regards
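For reference, a compact sketch of that grouped day-to-day difference in R with dplyr (column names id, date and temperature as described above; the file name is hypothetical):
library(dplyr)
library(lubridate)

weather <- read.csv("weather.csv")    # hypothetical file name
weather$date <- as.Date(weather$date)

daily_diff <- weather %>%
  arrange(id, date) %>%
  group_by(id, year = year(date)) %>%                          # per station and per year
  mutate(temp_diff = abs(temperature - lag(temperature))) %>%  # |T(n) - T(n-1)|
  ungroup()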
If you still need assistance, I would consider importing the .csv file data using "readtable". So long as only the first row is text, MATLAB will create a 'table' variable (this shouldn't be an issue for a .csv file). The individual columns can be accessed via "tablename.header" and can be re-established as double data type (e.g. variable_1 = tablename.header). You can then concatenate your dataset as you like. As for sorting by date and station id, I would advocate using "sortrows". For example, if the station id is the first column, sortrows(data, 1) will sort "data" by the station id, and sortrows(data, [1 2]) will sort "data" by the first column, then by the second column. From there, you can write an if statement to compare the station ids and perform the required calculations. I hope my brief answer is somewhat helpful.
A basic code structure would be:
path = ['copy and paste file path here'];                          % show MATLAB where to look
data = readtable([path '\filename.csv'], 'ReadVariableNames', 1);  % read the csv file into a table
variable1 = data.header1;  % general example of making a double variable from a table column
variable2 = data.header2;
variable3 = data.header3;
double_data = [variable1 variable2 variable3];  % concatenate the three columns together
sorted_data = sortrows(double_data, [1 2]);     % sort double_data by column 1, then column 2
It always helps to have actual data to work on and specifics as to what kind of output format is expected. Basically, ins and outs :) With the little info provided, I figured I would generate random data for you in the first section and then calculate some stats in the second. I include the loop as an example since that's what you asked for, but I highly recommend using vectorized calculations whenever available, such as the one done in the summary stats.
%% example for weather stations
% generation of random data to correspond to what your csv file looks like
rng(1); % keeps the random seed for testing purposes
nbDates = 1000; % number of days of data
nbStations = 3; % number of weather stations
measureDates = repmat((now()-(nbDates-1):now())',nbStations,1); % nbDates days of data ending today
stationIds = kron((1:nbStations)',ones(nbDates,1)); % assuming 3 weather stations with IDs [1,2,3]
temp = rand(nbStations*nbDates,1)*70+30; % temperatures are in F and vary between 30 and 100 degrees
rain = max(rand(nbStations*nbDates,1)*40-20,0); % rain fall is 0 approximately half the time, and between 0mm and 20mm the rest of the time
csv = table(measureDates, stationIds, temp, rain);
clear measureDates stationIds temp rain;
% augment the original dataset as needed
years = year(csv.measureDates);
data = [csv,array2table(years)];
sorted = sortrows( data, {'stationIds', 'measureDates'}, {'ascend', 'ascend'} );
% example looping through your data
for i = 1 : size( sorted, 1 )
    fprintf( 'Id=%d, year=%d, temp=%g, rain=%g', sorted.stationIds( i ), sorted.years( i ), sorted.temp( i ), sorted.rain( i ) );
    if( i > 1 && sorted.stationIds( i )==sorted.stationIds( i-1 ) && sorted.years( i )==sorted.years( i-1 ) )
        fprintf( ' => absolute difference with day before: %g', abs( sorted.temp( i ) - sorted.temp( i-1 ) ) );
    end
    fprintf( '\n' ); % new line
end
% depending on the statistics you wish to compute, more efficient ways of
% getting summary stats are available, for example:
grpstats( data ...
    , {'stationIds','years'} ...        % group by categories
    , {'mean','min','max','meanci'} ... % statistics we want
    , 'dataVars', {'temp','rain'} ...   % variables on which to calculate stats
    ) % doesn't require the data to be sorted or any looping
This prints one line for each row of data (and only calculates the difference in temperature when there is no year or station change). It also produces some summary stats at the end; here's what I get:
stationIds years GroupCount mean_temp min_temp max_temp meanci_temp mean_rain min_rain max_rain meanci_rain
__________ _____ __________ _________ ________ ________ ________________ _________ ________ ________ ________________
1_2016 1 2016 82 63.13 30.008 99.22 58.543 67.717 6.1181 0 19.729 4.6284 7.6078
1_2017 1 2017 365 65.914 30.028 99.813 63.783 68.045 5.0075 0 19.933 4.3441 5.6708
1_2018 1 2018 365 65.322 30.218 99.773 63.275 67.369 4.7039 0 19.884 4.0615 5.3462
1_2019 1 2019 188 63.642 31.16 99.654 60.835 66.449 5.9186 0 19.864 4.9834 6.8538
2_2016 2 2016 82 65.821 31.078 98.144 61.179 70.463 4.7633 0 19.688 3.4369 6.0898
2_2017 2 2017 365 66.002 30.054 99.896 63.902 68.102 4.5902 0 19.902 3.9267 5.2537
2_2018 2 2018 365 66.524 30.072 99.852 64.359 68.69 4.9649 0 19.812 4.2967 5.6331
2_2019 2 2019 188 66.481 30.249 99.889 63.647 69.315 5.2711 0 19.811 4.3234 6.2189
3_2016 3 2016 82 61.996 32.067 98.802 57.831 66.161 4.5445 0 19.898 3.1523 5.9366
3_2017 3 2017 365 63.914 30.176 99.902 61.932 65.896 4.8879 0 19.934 4.246 5.5298
3_2018 3 2018 365 63.653 30.137 99.991 61.595 65.712 5.3728 0 19.909 4.6943 6.0514
3_2019 3 2019 188 64.201 30.078 99.8 61.319 67.082 5.3926 0 19.874 4.4541 6.3312

Read and merge large tables on computer cluster

I need to merge several large tables (up to 10 GB each) into a single one. To do so I am using a computer cluster with 50+ cores and 10+ GB of RAM that runs on Linux.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted, I cannot find a way around this to merge my large tables.
Any suggestion welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here is where the error occurs:
myfiles = parLapply(cl, temp, function(x)
  read.csv(x,
           header = TRUE,
           sep = ";",
           stringsAsFactors = FALSE,
           encoding = "UTF-8",
           na.strings = c("NA", "99", "")))
myfiles.final = do.call(rbind, myfiles)
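For reference, a leaner sketch of the same read-and-bind step using data.table, which is usually more memory-efficient than read.csv plus do.call(rbind, ...) (read options copied from the question; truly huge data may still not fit in RAM):
library(data.table)

temp <- list.files(pattern = "*.txt$")
# fread reads faster and leaner than read.csv; rbindlist binds the tables
# without building large intermediate copies
myfiles <- lapply(temp, fread,
                  sep = ";", header = TRUE,
                  encoding = "UTF-8",
                  na.strings = c("NA", "99", ""))
myfiles.final <- rbindlist(myfiles)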
You could just use merge, for example:
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
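As a sketch of the renaming point above (assuming the two tables share sample column names), merge's suffixes argument can label the overlapping columns at merge time instead of renaming them afterwards:
# hypothetical: both tables contain a column literally named "Sample1"
mergedTable <- merge(table1, table2, by = "dbSNP_RSID",
                     suffixes = c(".table1", ".table2"))
# overlapping columns now come out as Sample1.table1 and Sample1.table2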
One way to approach this is with Python and dask. A dask dataframe is stored mostly on disk rather than in RAM, allowing you to work with larger-than-RAM data, and it can help you do computations with clever parallelization. A nice tutorial of ways to work on big data can be found in this kaggle post, which might also be helpful for you. I also suggest checking out the docs on dask performance here. To be clear, if your data can fit in RAM, using a regular R data.frame or pandas DataFrame will be faster.
Here's a dask solution which assumes you have named columns in the tables to align the concat operation. Please add to your question if there are any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read the large tables
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)
# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)
# save the data
# provide a list with one name if you don't want the partition number appended
dd_all.to_csv(['all_big_files.tsv'], sep='\t')
If you just wanted to cat all the tables together, you can do something like this in plain Python (you could also use Linux cat/paste).
with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'rU') as F:
            if file_number == 0:
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line
                l = F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
        file_number += 1

Aligning and italicising table column headings using Rmarkdown and pander

I am writing an R Markdown document, knitted to PDF, with tables taken from portions of the lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy R Markdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
``` {r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[, 2] <- format(round(anOb[, 2], digits = 1), nsmall = 1)
anOb[, 3] <- format(round(anOb[, 3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems:
a) For the last three columns, I would like to have the column headings in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like to have the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics as well, but the rest in normal font, so they read as an italic F, an italic p, and p<.05 with an italic p.
I tried renaming the column headings using plyr::rename like so
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work
In markdown, you have to use the markdown syntax for italics, which is wrapping text between stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect *F* *p* *p<.05*
--------------- ------ -------- ---------
(Intercept) 52.3 0.0019 *
group1 1.3 0.3180
group2 2.0 0.2261
group1:group2 3.7 0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
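For example, a minimal sketch along those lines (assuming anOb as built in the chunk above; pandoc.emphasis.return returns the emphasised string rather than printing it):
library(pander)
# wrap the last three column names in markdown emphasis programmatically
names(anOb)[2:4] <- sapply(names(anOb)[2:4], pandoc.emphasis.return)
pander(anOb)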
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH; I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.

Standardized coefficients for lmer model

I used to use the code below to calculate standardized coefficients of an lmer model. However, with the new version of lme4 the structure of the returned object has changed.
How can I adapt the function stdCoef.lmer to make it work with the new lme4 version?
# Install old version of lme4
install.packages("lme4.0", type = "both",
                 repos = c("http://lme4.r-forge.r-project.org/repos",
                           getOption("repos")[["CRAN"]]))
# Load package
detach("package:lme4", unload=TRUE)
library(lme4.0)
# Define function to get standardized coefficients from an lmer
# See: https://github.com/jebyrnes/ext-meta/blob/master/r/lmerMetaPrep.R
stdCoef.lmer <- function(object) {
  sdy <- sd(attr(object, "y"))
  sdx <- apply(attr(object, "X"), 2, sd)
  sc <- fixef(object)*sdx/sdy
  # mimic se.ranef from package "arm"
  se.fixef <- function(obj) attr(summary(obj), "coefs")[, 2]
  se <- se.fixef(object)*sdx/sdy
  return(list(stdcoef = sc, stdse = se))
}
# Run model
fm0 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# Get standardized coefficients
stdCoef.lmer(fm0)
# Comparison model with prescaled variables
fm0.comparison <- lmer(scale(Reaction) ~ scale(Days) + (scale(Days) | Subject), sleepstudy)
The answer by @LeonardoBergamini works, but this one is more compact and understandable and only uses standard accessors, so it is less likely to break in the future if/when the structure of the summary() output, or the internal structure of the fitted model, changes.
stdCoef.merMod <- function(object) {
  sdy <- sd(getME(object, "y"))
  sdx <- apply(getME(object, "X"), 2, sd)
  sc <- fixef(object)*sdx/sdy
  se.fixef <- coef(summary(object))[, "Std. Error"]
  se <- se.fixef*sdx/sdy
  return(data.frame(stdcoef = sc, stdse = se))
}
library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fixef(fm1)
## (Intercept) Days
## 251.40510 10.46729
stdCoef.merMod(fm1)
## stdcoef stdse
## (Intercept) 0.0000000 0.00000000
## Days 0.5352302 0.07904178
(This does give the same results as stdCoef.lmer in @LeonardoBergamini's answer ...)
You can get partially scaled coefficients (scaled by 2 times the SD of x, but not scaled by SD(y), and not centred) using broom.mixed::tidy + dotwhisker::by_2sd:
library(broom.mixed)
library(dotwhisker)
(fm1
|> tidy(effect="fixed")
|> by_2sd(data=sleepstudy)
|> dplyr::select(term, estimate, std.error)
)
This should work:
stdCoef.lmer <- function(object) {
  sdy <- sd(attr(object, "resp")$y)          # the y values are now in the 'y' slot of the resp attribute
  sdx <- apply(attr(object, "pp")$X, 2, sd)  # and the X matrix is in the 'X' slot of the pp attribute
  sc <- fixef(object)*sdx/sdy
  # mimic se.ranef from package "arm"
  se.fixef <- function(obj) as.data.frame(summary(obj)[10])[, 2]  # last change: extracting the standard errors from the summary
  se <- se.fixef(object)*sdx/sdy
  return(data.frame(stdcoef = sc, stdse = se))
}