TXR: collecting data from a table with ill-defined separator

I have data output that looks like the following:
Item Time    Type   Width   Area       Area     Name
#    [min]          [m]     [m^2]      %
----|-------|------|-------|----------|--------|---------------------
1    0.323   A B    0.0000  0.00000    0.00000  ABC
2    1.581   C      0.0000  0.00000    0.00000  DEF
3    2.898   D2     0.0000  0.00000    0.00000  GHI
Totals : 0.00000
The challenge with this data is that there is no obvious column separator other than the position of the | from the line before the data. So, I'm trying to figure out how I can use those character positions to correctly capture the column variables.
It seems to me like the following should work:
#(define os)#/[ ]*/#(end)
Item Time Type Width Area Area Name
#(skip 1 1)
#(coll)#{field /(-)+/}#(chr sep)#(until)#(eol)#(end)
#(collect :gap 0 :vars (item time type width area area_pct name))
#item#(os )#(chr (toint [sep 0]))#(os)#\
#time#(os )#(chr (toint [sep 1]))#(os)#\
#type#(os )#(chr (toint [sep 2]))#(os)#\
#width#(os )#(chr (toint [sep 3]))#(os)#\
#area#(os )#(chr (toint [sep 4]))#(os)#\
#area_pct#(os)#(chr (toint [sep 5]))#(os)#\
#name#(os )#(chr (toint [sep 6]))#(os)#(eol)
#(end)
Totals : #total
#(skip)
#(output)
Item,Time,Type,Width,Area,Area(%),Name
# (repeat)
#item,#time,#type,#width,#area,#area_pct,#name
# (end)
#(end)
But none of the rows of data are matching. What am I missing?
The desired output (as a CSV table) is:
Item,Time,Type,Width,Area,Area(%),Name
1,0.323,A B,0.0000,0.00000,0.00000,ABC
2,1.581,C,0.0000,0.00000,0.00000,DEF
3,2.898,D2,0.0000,0.00000,0.00000,GHI
The following code is a "hack" which produces the desired output, but it does so by leaning mostly on TXR Lisp instead of the TXR pattern language. The closer I can make the code reflect the data file, the happier my future self will be.
#(define os)#/[ ]*/#(end)
Item Time Type Width Area Area Name
#(skip 1 1)
#(coll)#{field /(-)+/}#(chr sep)#(until)#(eol)#(end)
#(collect :gap 0 :vars (item time type width area area_pct name))
# (cases)
Totals : #(skip)
# (accept)
# (or)
#line
# (set line #(progn (mapdo (lambda (s) (chr-str-set line s #\|)) (rest (reverse sep))) line))
# (set line #(mapcar 'trim-str (split-str line "|")))
# (bind item #[line 0])
# (bind time #[line 1])
# (bind type #[line 2])
# (bind width #[line 3])
# (bind area #[line 4])
# (bind area_pct #[line 5])
# (bind name #[line 6])
# (end)
#(end)
#(skip)
#(output)
Item,Time,Type,Width,Area,Area(%),Name
# (repeat)
#item,#time,#type,#width,#area,#area_pct,#name
# (end)
#(end)

With gtsummary, is it possible to have N on a separate row to the column name?

gtsummary by default puts the number of observations in a by group beside the label for that group. This increases the width of the table... with many groups or large N, the table would quickly become very wide.
Is it possible to get gtsummary to report N on a separate row beneath the label? E.g.
> data(mtcars)
> mtcars %>%
+ select(mpg, cyl, vs, am) %>%
+ tbl_summary(by = am) %>%
+ as_tibble()
# A tibble: 6 x 3
  `**Characteristic**` `**0**, N = 19`   `**1**, N = 13`
  <chr>                <chr>             <chr>
1 mpg                  17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
2 cyl                  NA                NA
3 4                    3 (16%)           8 (62%)
4 6                    4 (21%)           3 (23%)
5 8                    12 (63%)          2 (15%)
6 vs                   7 (37%)           7 (54%)
would become
# A tibble: 7 x 3
  `**Characteristic**` `**0**`           `**1**`
  <chr>                <chr>             <chr>
1                      N = 19            N = 13
2 mpg                  17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
3 cyl                  NA                NA
4 4                    3 (16%)           8 (62%)
5 6                    4 (21%)           3 (23%)
6 8                    12 (63%)          2 (15%)
7 vs                   7 (37%)           7 (54%)
(I only used as_tibble so that it was easy to show what I mean by editing it manually...)
Any idea?
Thanks!
Here is one way you could do this:
library(tidyverse)
library(gtsummary)

mtcars %>%
  select(mpg, cyl, vs, am) %>%
  # create a new variable to display N in table
  mutate(total = 1) %>%
  # this is just to reorder variables for table
  select(total, everything()) %>%
  tbl_summary(
    by = am,
    # this is to specify you only want N (and no percentage) for new total variable
    statistic = total ~ "N = {N}"
  ) %>%
  # this is a gtsummary function that allows you to edit the header
  modify_header(all_stat_cols() ~ "**{level}**")
First, I am making a new variable that is just the total observations (called total).
Then I am customizing the way I want that variable's statistic to be displayed.
Finally, I am using gtsummary::modify_header() to remove N from the header.
Additionally, if you use the flextable print engine, you can add a line break in the header itself:
mtcars %>%
  select(mpg, cyl, vs, am) %>%
  tbl_summary(by = am) %>%
  # modify_header lets you edit the header; flextable renders "\n" as a line break
  modify_header(all_stat_cols() ~ "**{level}**\nN = {n}") %>%
  as_flex_table()
Good luck!
@kittykatstat already posted two fantastic solutions! I'll just add a slight variation :)
If you want to use the {gt} package to print the table and you're outputting to HTML, you can use the HTML tag <br> to add a line break in the header row (very similar to the \n solution already posted).
library(gtsummary)

mtcars %>%
  select(mpg, cyl, vs, am) %>%
  # label the levels of am so they display nicely in the header
  dplyr::mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  tbl_summary(by = am) %>%
  # this is a gtsummary function that allows you to edit the header
  modify_header(stat_by = "**{level}**<br>N = {N}")

Option to cut values below a threshold in papaja::apa_table

I can't figure out how to selectively print values in a table above or below some value. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut"
derp_loadings[derp_loadings < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
  derp_loadings
  , caption = "Factor loadings"
  , added_stub_head = "Item"
  , format = "pandoc" # Omit this in your R Markdown document
  , format.args = list(na_string = "") # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63

Compare contrasts in linear model in Python (like R's contrast library?)

In R I can do the following to compare two contrasts from a linear model:
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
install.packages("downloader", repos="http://cran.us.r-project.org")
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
head(spider, 5)
# leg type friction
# 1 L1 pull 0.90
# 2 L1 pull 0.91
# 3 L1 pull 0.86
# 4 L1 pull 0.85
# 5 L1 pull 0.80
fit = lm(friction ~ type + leg, data=spider)
fit
# Call:
# lm(formula = friction ~ type + leg, data = spider)
#
# Coefficients:
# (Intercept)     typepush        legL2        legL3        legL4
#      1.0539      -0.7790       0.1719       0.1605       0.2813
install.packages("contrast", repos="http://cran.us.r-project.org")
library(contrast)
l4vsl2 = contrast(fit, list(leg="L4", type="pull"), list(leg="L2",type="pull"))
l4vsl2
# lm model parameter contrast
#
#   Contrast       S.E.      Lower     Upper    t  df Pr(>|t|)
#  0.1094167 0.04462392 0.02157158 0.1972618 2.45 277   0.0148
I have found out how to do much of the above in Python:
import pandas as pd
df = pd.read_table("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv", sep=",", skiprows=1)
df.head(2)
import statsmodels.formula.api as sm
model1 = sm.ols(formula='friction ~ type + leg', data=df)
fitted1 = model1.fit()
print(fitted1.summary())
Now all that remains is finding the t-statistic for the contrast of leg pair L4 vs. leg pair L2. Is this possible in Python?
statsmodels is still missing some predefined contrasts, but the t_test and wald_test or f_test methods of the model results classes can be used to test linear (or affine) restrictions. The restrictions can be given either as arrays or as strings using the parameter names.
Details on how to specify contrasts/restrictions should be in the documentation.
For example:
>>> tt = fitted1.t_test("leg[T.L4] - leg[T.L2]")
>>> print(tt.summary())
Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.1094      0.045      2.452      0.015       0.022       0.197
==============================================================================
The results are available as attributes or methods of the instance returned by t_test. For example, the confidence interval can be obtained with conf_int:
>>> tt.conf_int()
array([[ 0.02157158, 0.19726175]])
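As mentioned above, the restrictions can also be given in array form. A minimal sketch of that variant (the layout assumes one row per restriction and one column per parameter, in the order of fitted1.params from the formula used above):
>>> import numpy as np
>>> names = list(fitted1.params.index)  # e.g. ['Intercept', 'type[T.push]', 'leg[T.L2]', ...]
>>> R = np.zeros((1, len(names)))       # one restriction row, one column per parameter
>>> R[0, names.index('leg[T.L4]')] = 1
>>> R[0, names.index('leg[T.L2]')] = -1
>>> print(fitted1.t_test(R).summary())  # should reproduce the string-form result above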
t_test is vectorized and treats each restriction or contrast as a separate hypothesis, while wald_test treats a list of restrictions as a joint hypothesis:
>>> tt = fitted1.t_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -0.0114      0.043     -0.265      0.792      -0.096       0.074
c1             0.1094      0.045      2.452      0.015       0.022       0.197
==============================================================================
>>> tt = fitted1.wald_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
<F test: F=array([[ 8.10128575]]), p=0.00038081249480917173, df_denom=277, df_num=2>
Aside: this also works for robust covariance matrices if cov_type was specified as an argument to fit.
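For instance, a minimal sketch (cov_type="HC1" is just one of the robust options statsmodels accepts; any supported cov_type works the same way):
>>> fitted_robust = model1.fit(cov_type="HC1")  # heteroscedasticity-robust standard errors
>>> print(fitted_robust.t_test("leg[T.L4] - leg[T.L2]").summary())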

Read txt files in Matlab that contain symbols in the first two rows

I have data files like the following:
MTtmax6000_N1000000_k+0.1_k-T0.001_k-D0.1_kh1.txt
# nMT=1000000 tmax=60000 trelax=10000 k+=0.1 k-T=0.001 k-D=0.1 kh=1
#t (L-L0) L varL NGTP varNGTP Cap varCap
0 0 50090.2 2089.48 0.100257 0.100158 0.104798 0.114295
100 0.897735 50091.1 2109.92 0.099841 0.0998968 0.104373 0.114029
200 1.80163 50092 2130.83 0.099736 0.0995947 0.104204 0.113554
300 2.70513 50092.9 2151.79 0.099775 0.0997319 0.104323 0.113928
400 3.60867 50093.9 2172.17 0.099982 0.0999776 0.104546 0.114294
500 4.50984 50094.8 2192.49 0.100229 0.100263 0.104795 0.114473
600 5.40802 50095.6 2213.72 0.100149 0.100159 0.10463 0.114101
700 6.3161 50096.6 2234.2 0.099856 0.100117 0.10433 0.114139
800 7.21386 50097.5 2254.76 0.099624 0.0997151 0.104171 0.113879
900 8.11601 50098.4 2275.18 0.100183 0.100386 0.104615 0.114237
1000 9.01724 50099.3 2296.13 0.100504 0.100423 0.105058 0.114745
1100 9.92572 50100.2 2317.11 0.100368 0.10056 0.105023 0.115089
1200 10.8262 50101.1 2338.26 0.099476 0.0998665 0.103951 0.113913
1300 11.7243 50102 2359.96 0.099775 0.0997559 0.104246 0.113753
1400 12.6273 50102.9 2381.2 0.100081 0.100099 0.104571 0.11406
1500 13.5297 50103.8 2401.8 0.099702 0.0997495 0.104267 0.114045
1600 14.4281 50104.7 2422.56 0.099792 0.0999496 0.104292 0.113975
1700 15.3369 50105.6 2443.44 0.099912 0.0999296 0.104452 0.114242
I tried to read these data using dlmread, textscan, or textread. When I run the code I receive this message:
Error using dlmread (line 139) Mismatch between file and format string. Trouble reading number from file (row 1u, field 1u) ==> # nMT=1000000 tmax=4000 trelax=1000 k+=1 k-T=0.01 k-D=0.1 kh=1\n
I want a command that reads the txt files and ignores the first two rows. Any help would be greatly appreciated.
clc;
clear all;
close all;
%%
tic
Values11 = zeros(225,6);
K_minus_t = [0.01];
K_minus_d = [0.1];
%k_plus = [0.1 0.2 0.4 0.7 1 1.1 1.2 1.5 1.7 2 2.5 3 3.5 4 5];
m = length(K_minus_t);
r = length(K_minus_d);
kk = 0;
ll = 1;
for l = 1:r
    for j = 1:m
        h = [1];
        k_plus = [1];
        K_minus_T = K_minus_t(j);
        K_minus_D = K_minus_d(l);
        sets = {k_plus, K_minus_T, K_minus_D, h};
        [x, y, z, r] = ndgrid(sets{:});
        cartProd = [x(:) y(:) z(:) r(:)];
        nFiles = size(cartProd,1);
        filename{nFiles,j} = [];
        for i = 1:nFiles
            %% MT_Sym_N1000000_k+1_k-T0.01_k-D0.1_kh1.txt
            filename{i,j} = ['MT_Sym_N1000000_' ...
                'k+' num2str(cartProd(i,1)) '_' ...
                'k-T' num2str(cartProd(i,2),'%6.3g') '_' ...
                'k-D' num2str(cartProd(i,3)) '_' ...
                'kh' num2str(cartProd(i,4)) '' ...
                '.txt'];
            file1 = dlmread(filename{i,j})
            %% line (length)
            t = file1(:,1);
            dline = file1(:,2);
            [coef_line1,s] = polyfit(t, dline, 1);
            coef_line(i,:) = coef_line1;
            v1{i} = s.R;
            v2{i} = s.df;
            v3{i} = s.normr;
            Dl(i) = sqrt(v3{i}/length(t));
        end
    end
end
Use importdata:
x = importdata('file.txt',' ',2); %// ' ': col separator; 2: number of header lines
data = x.data; %// x.data is what you want
The first line gives a struct x with data, textdata and colheaders fields. The numeric data is in field data, so x.data is what you want.
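Alternatively, since you were already calling dlmread, note that it also accepts row and column offsets, so you can skip the two header lines directly (a minimal sketch, assuming space-delimited numeric data as in your files):
x = dlmread('file.txt', ' ', 2, 0); %// start reading at row offset 2, column offset 0 (skips the 2 header lines)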

Merge columns of data from multiple text files by row from each separate file using PowerShell

I have output from a numerical modelling code. I needed to extract a specific value from a series of files. I used the following code to get it (I derived this from an example that would extract IP addresses from logfiles):
$input_path = 'C:\_TEST\Input_PC\out5.txt'
$output_file = 'C:\_TEST\Output_PC_All\out5.txt'
$regex = '\bHEAD(.+)\s+[\-]*\d{1,3}\.\d{6,6}\s?\b'
Select-String -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
So I now have a number of text files containing measurements (the number of files may vary; currently there are 50). Each file holds one column of numeric data, which may be positive or negative (currently 7302 rows per file, though this may vary with the length of the time series modelled), as per the example data below.
Note: a semicolon preceding text indicates a comment I am using to explain the order of the dataset; the comments do not appear in the data to be processed...
out1.txt
-1.000000 ; 1st line of out1.txt
2.000000 ; 2nd line of out1.txt
-3.000000 ; 3rd line of out1.txt
...
5.000000 ; nth line of out1.txt
out2.txt
-1.200000 ; 1st line of out2.txt
-2.200000 ; 2nd line of out2.txt
3.200000 ; 3rd line of out2.txt
...
-5.20000 ; nth line of out2.txt
outn.txt
1.300000 ; 1st line of outn.txt
-2.300000 ; 2nd line of outn.txt
-3.300000 ; 3rd line of outn.txt
...
10.300000 ; nth line of outn.txt
I need to merge them into a single text file (for this example let's call it "Combined_Output.txt") using PowerShell, with the data ordered so that the first row of values from each of the output files appears first, then the same for row 2, and so on, as below:
Combined_Output.txt
-1.000000 ; 1st line of out1.txt
-1.200000 ; 1st line of out2.txt
1.300000 ; 1st line of outn.txt
2.000000 ; 2nd line of out1.txt
-2.200000 ; 2nd line of out2.txt
-2.300000 ; 2nd line of outn.txt
-3.000000 ; 3rd line of out1.txt
3.200000 ; 3rd line of out2.txt
-3.300000 ; 3rd line of outn.txt
...
5.000000 ; nth line of out1.txt
-5.200000 ; nth line of out2.txt
10.300000 ; nth line of outN.txt
Just to say that I'm very new to this sort of thing, so I hope the explanation above makes sense, and any help you can provide would be much appreciated.
EDIT
Having now run the models, there seems to be an issue with the ordering of the imported data when using this code on the large data files created. It occurs primarily when there are repeated values; for example, the second row of data from each out file has been combined in the following order by the script. It looks like there is some sorting based on the value of the data and not just on the out file name:
Value ; out file text number
-1.215809 ; 1
-0.480543 ; 18
-0.480541 ; 19
-0.48054 ; 2
-0.480539 ; 20
-0.480538 ; 21
-0.480537 ; 22
-0.480536 ; 23
-0.480535 ; 24
-0.480534 ; 25
-0.480534 ; 26
-0.480688 ; 10
-0.480533 ; 27
-0.480532 ; 3
-0.480776 ; 4
-0.48051 ; 5
-0.48051 ; 6
-0.48051 ; 7
-0.48051 ; 8
-0.48051 ; 9
-0.48051 ; 11
-0.48051 ; 12
-0.48051 ; 13
I feel like I might have overcomplicated this answer, but let's see how we do. Consider the following dummy data, similar to your samples:
Out1.txt    Out2.txt    Out3.txt
-0.40000    0.800000    4.100000
3.500000    0.300000    -0.90000
-2.60000    0.800000    2.200000
0.500000    1.800000    -1.40000
3.600000    1.800000    1.400000
40000000    -0.70000    1.500000
The file contents are arranged side by side for answer brevity and to help understand the output. The code is as follows:
$allTheFiles = @()
Get-ChildItem c:\temp\out*.txt | ForEach-Object{
    $allTheFiles += ,(Get-Content $_.FullName)
}

$(For ($lineIndex=0; $lineIndex -lt $allTheFiles[0].Count; $lineIndex++){
    For($fileIndex=0; $fileIndex -lt $allTheFiles.Count; $fileIndex++){
        $allTheFiles[$fileIndex][$lineIndex]
    }
}) | Out-File -FilePath c:\temp\file.txt -Encoding ascii
Gathering all the out*.txt files, the code creates an array of arrays containing the file contents themselves. The nested For loops then cycle through the files, outputting one line from each file at a time. If you compare the sample data to the output, you should see that the first line of every file is output together, followed by the next line, and so on.
This code will produce the following output
-0.40000
0.800000
4.100000
3.500000
0.300000
-0.90000
-2.60000
0.800000
2.200000
0.500000
1.800000
-1.40000
3.600000
1.800000
1.400000
40000000
-0.70000
1.500000
Caveats
The code assumes that all files are the same size. The number of lines is determined by the first file; if other files contain more lines, the extra data would be lost in this model.
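Regarding the ordering issue in the question's EDIT: one likely contributor is that Get-ChildItem returns names in lexical order (out1, out10, out11, ..., out19, out2, out20, ...), not numeric order. A possible fix, assuming file names of the form out<number>.txt, is to sort on the numeric part of the base name when gathering the files:
$allTheFiles = @()
Get-ChildItem c:\temp\out*.txt |
    Sort-Object { [int]($_.BaseName -replace '\D','') } |
    ForEach-Object{
        $allTheFiles += ,(Get-Content $_.FullName)
    }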