Hello, I am new to scripting and searching for a solution. I have two text files with different names and I want to merge them into a new, third text file. Both text files have exactly the same format, which is as follows: each file has some rows (the same number in both files) starting with a # sign followed by some text. After those # rows come rows starting with numbers. These rows contain numbers in three columns separated by spaces. The numbers in the first two columns are the same in both files, while the numbers in the third column are different. After a few hundred rows there may again be rows starting with a # sign, followed by rows of numbers in three columns as before, and this can repeat many times. Now here is what I want to do.
I want to create a new text file whose rows starting with a # sign are copied exactly from the first text file. I want to copy the first two columns of numbers exactly as they are; these two columns can be taken from either the first or the second text file, since they are identical in both. For the third column of the new text file, I want the sum of the numbers in the third columns of the two input files:
number in third column of new file = (number in third column of first file + number in third column of second file)
After some rows there may again be rows with a # sign, followed by rows of numbers in three columns, and this can repeat.
A sample of the format of one text file is given below; the second text file has exactly the same format.
#
#
#
#
#
#
#
0.0 0.0 4.4226
0.0 5.0 4.4246
0.0 10.0 4.4456
0.0 15.0 4.4876
0.0 20.0 4.4453
0.0 25.0 5.6585
.
.
.
.
#
#
#
#
#
0.0 0.0 0.410135
0.0 5.0 0.745745
0.0 10.0 0.574555
0.0 15.0 0.415675
0.0 20.0 0.575454
0.0 25.0 0.410135
0.0 30.0 0.678768
0.0 35.0 0.410135
0.0 40.0 0.976876
0.0 45.0 0.678678
0.0 50.0 0.410135
0.0 55.0 0.678976
0.0 60.0 0.410135
0.0 65.0 0.687876
0.0 70.0 0.768677
.
.
.
.
.
.
This pattern of # rows followed by rows of numbers in three columns can repeat. The rows of numbers have their columns separated by spaces, and each of these rows also begins with a single space.
I hope I did a good job explaining. I would prefer a batch script, as it will be easy for me to run; however, Perl will also work. Thank you very much for your help, it is highly appreciated.
The new file's format will be exactly the same as the other two files, with the third column being the sum of the numbers in the third columns of the first and second text files. A sample of the third file's format is given below.
#
#
#
#
#
#
#
0.0 0.0 8.4355
0.0 5.0 6.3553
0.0 10.0 6.4327
.
.
.
.
#
#
#
#
#
0.0 0.0 4.832735
0.0 5.0 7.436343
0.0 10.0 0.323325
0.0 15.0 4.876656
.
.
.
.
.
.
Once again, thank you very much. I am getting headaches because I have a lot of these files; your help is much appreciated.
File 1 is here:
#
#
#
#
#
#
#
0.0 0.0 5.30562
0.0 5.0 5.30562
0.0 10.0 1.4852
90.0 355.0 1.99511
#
#
#
#
#
0.0 0.0 0.948027
0.0 5.0 0.948027
90.0 355.0 1.54
File 2 is here:
#
#
#
#
#
#
#
0.0 0.0 1.4621
0.0 5.0 1.4621
0.0 10.0 1.4621
90.0 355.0 3.3359
#
#
#
#
#
0.0 0.0 0.747458
0.0 5.0 0.747458
90.0 355.0 0.550766
Now you can check; I think the problem is with the spaces at the beginning of the lines and in between the columns.
Try the following using awk:
awk 'NR==FNR {if($3~/[0-9]+\.[0-9]+/){a[i++]=$3}; next} \
$3~/[0-9]+\.[0-9]+/ {$3=$3+a[j++]} \
1' file1 file2 > file3
Testing on the sample input specified in the comments:
$ cat file1
# comment here
90.0 355.0 1.54
$ cat file2
# comment here
90.0 355.0 0.550766
$ awk 'NR==FNR {if($3~/[0-9]+\.[0-9]+/){a[i++]=$3}; next} \
? $3~/[0-9]+\.[0-9]+/ {$3=$3+a[j++]} \
? 1' file1 file2 > file3
$ cat file3
# comment here
90.0 355.0 2.09077
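If the single leading space and the original spacing between the columns matter (see the note about spaces above), here is a variant sketch that overwrites only the text of the last field instead of rebuilding the whole record; it uses the same file1/file2/file3 names and any POSIX awk should do:
awk 'NR==FNR { if ($3 ~ /^[0-9.]+$/) a[i++] = $3; next }              # remember file1 third column
     $3 ~ /^[0-9.]+$/ { sub(/[0-9.]+[[:space:]]*$/, $3 + a[j++]) }    # rewrite only the last field text
     1' file1 file2 > file3
Either way, the sums are written with awk's default number-to-string conversion (%.6g), so long decimals may be rounded, as in the 2.09077 above.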
Related
How can I remove useless ".0" strings in a txt file of numbers?
If I have this file:
43.65 24.0 889.5 5.0
32.14 32.0 900.0 6.0
38.27 43.0 899.4 5.0
I want to get:
43.65 24 889.5 5
32.14 32 900 6
38.27 43 899.4 5
I tried: sed 's|\.0 | |g' but that obviously does not work when .0 is at the end of a line or at EOF.
Any suggestions without getting into Python, etc.?
This might work for you (GNU sed):
sed 's/\.0\b//g' file
Or
sed 's/\.0\>//g' file
Remove any period followed by a zero followed by a word boundary.
You can use
sed -E 's,([0-9])\.0($| ),\1\2,g' file
Details:
-E - enables POSIX ERE syntax
([0-9])\.0($| ) - finds and captures into Group 1 a digit, then matches .0, and then matches and captures into Group 2 a space or end of string
\1\2 - replaces with Group 1 + Group 2 concatenated values
g - replaces all occurrences.
See the online demo:
s='43.65 24.0 889.5 5.0
32.14 32.0 900.0 6.0
38.27 43.0 899.4 5.0'
sed -E 's,([0-9])\.0($| ),\1\2,g' <<< "$s"
Output:
43.65 24 889.5 5
32.14 32 900 6
38.27 43 899.4 5
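For comparison, the same clean-up can be sketched in awk by stripping a trailing ".0" from every field; note that awk's field handling will squeeze any run of spaces down to a single space:
awk '{ for (i = 1; i <= NF; i++) sub(/\.0$/, "", $i); print }' file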
I can't figure out how to selectively print values in a table above or below some value. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut"
derp_loadings[derp_loadings < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
derp_loadings
, caption = "Factor loadings"
, added_stub_head = "Item"
, format = "pandoc" # Omit this in your R Markdown document
, format.args = list(na_string = "") # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63
I am quite new to the Unix field and I am currently trying to extract a data set from a text file. I tried with sed, grep, and awk, but they only seem to work for extracting lines, whereas I want to extract an entire data set... Here is an example of a file from which I'd like to extract the two data sets (the figures after the "R.Time (min) Intensity" lines):
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
Brief explanation:
Set d as a flag that determines whether a line is printed.
/^[^0-9]/&&d{d=0}: if the regex ^[^0-9] matches and d==1, disable d.
/R.Time/{d=1}: if the string "R.Time" is found, enable d.
The bare d at the end prints the current line whenever the flag is set.
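Building on the same flag idea, if you also want each data set written to its own file (and without the R.Time header line itself), a sketch could be (the dataset*.txt names are just made up):
awk '/R.Time/ { d = 1; n++; next }          # a header line starts a new data set
     /^[^0-9]/ { d = 0 }                    # any non-numeric line ends it
     d { print > ("dataset" n ".txt") }' file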
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
The grep part removes the R.Time and LC lines that come as part of the output from awk.
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile
I have a file with 5 lines in it. I'd like to copy and paste these lines until a condition is satisfied, e.g.:
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
I'd like to repeat these lines 20 times, each time incrementing the number '123' (on the first line) by 1, i.e.:
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
124 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
.....
.....
172 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
Is this possible?
One way to do it (but without any elegance) is this:
1) Get a file with the desired final number of lines (in your case, 20x5). Let's name it "100lines.txt".
2) Do:
awk '{if(NR%5==1) {print (NR+4)/5+122, "89.98 34.56"} if(NR%5==2) {print "AbcDef"} if(NR%5==3) {print "0.0 10.567"} if(NR%5==4) {print "Hijkl"} if(NR%5==0) {print "XYZ 345"}}' 100lines.txt
I think it could work.
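A somewhat more self-contained sketch, which does not need the pre-made 100lines.txt, reads the 5-line block once and prints it in a loop, bumping the number on the first line each pass (template.txt and reps are assumptions; set reps to however many copies you need):
awk -v reps=20 '
    { block[NR] = $0 }                     # slurp the whole template
    END {
        for (i = 0; i < reps; i++) {
            split(block[1], f, " ")        # first line: "123 89.98 34.56"
            print f[1] + i, f[2], f[3]     # increment the leading number
            for (j = 2; j <= NR; j++)
                print block[j]
        }
    }' template.txt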
I have a list of SNPs for example (let's call it file1):
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999853 4 148436871
rs999986 14 95803856
rs9999883 4 870669
rs9999929 4 73470754
rs9999931 4 31676985
rs9999944 4 148376995
rs999995 10 78735498
rs9999963 4 84072737
rs9999966 4 5927355
rs9999979 4 135733891
I have another list of SNPs with the corresponding P-value (P) and BETA (as shown below) for different phenotypes; here I have shown only one (let's call it file2):
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs12562034 758311 A ADD 1119 -0.2096 0.2128 -0.6267 0.2075 -0.9848 0.3249
1 rs4475691 836671 A ADD 1111 -0.006033 0.2314 -0.4595 0.4474 -0.02608 0.9792
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
1 rs9999853 908247 C ADD 1110 -0.02015 0.2073 -0.4265 0.3862 -0.09718 0.9226
1 rs999986 918699 G ADD 1111 -1.248 0.7892 -2.795 0.2984 -1.582 0.114
Now I want to make two files named file3 and file4 such that:
file3 should contain:
SNPID Pvalue_for_phenotype1 Pvalue_for_phenotype2 Pvalue_for_phenotype3 and so on....
rs9999847 0.9263 0.00005 0.002 ..............
The first column (SNP IDs) in file3 will be fixed (all the SNPs on my chip will be listed here), and I want to write a program that will match the SNP ID between file3 and file2, fetch the P-value for that SNP ID from file2, and put it in file3.
file4 should contain:
SNPID BETAvale_for_phenotype1 BETAvale_for_phenotype2 BETAvale_for_phenotype3 .........
rs9999847 0.01812 -0.011 0.22
The first column (SNP IDs) in file4 will be fixed (all the SNPs on my chip will be listed here), and I want to write a program that will match the SNP ID between file4 and file2, fetch the BETA for that SNP ID from file2, and put it in file4.
It's a simple exercise in how to transfer data from columns to rows (with awk).
file2 to file3.
I assume that you have a machine with a lot of RAM, because I think you have millions of lines in file2.
You could save this code into a file called column2row.awk:
#!/usr/bin/awk -f
BEGIN {
    snp = 2      # column holding the SNP ID
    val = 12     # column holding the value to collect (12 = P)
}
{
    # append this line's value to the comma-separated list for its SNP
    if ( vector[$snp] )
        vector[$snp] = vector[$snp] "," $val
    else
        vector[$snp] = $val
}
END {
    # one output row per SNP: the ID followed by all collected values
    for (snp in vector)
        print snp "," vector[snp]
}
where snp is column 2 and val is column 12 (the P-value).
Now you can run the script:
/usr/bin/awk -f column2row.awk file2 > file3
If you have little RAM, then you could divide the load:
cat file1 | while read l; do s=$(echo $l|awk '{print $1}'); grep -w $s file2 > $s.snp; /usr/bin/awk -f column2row.awk $s.snp >> file3; done
It extracts from each line $l the first field ($s, the SNP name), searches for $s in file2, and creates a small file for each SNP name.
It then runs the awk script on each small file to generate file3.
file2 to file4.
You could modify the val variable in the awk script from column 12 to 7 (the BETA column).
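If you would rather not edit the script for every phenotype, one sketch is to pass the column numbers on the command line; this assumes you change the BEGIN block to set defaults only, e.g. BEGIN { if (!snp) snp = 2; if (!val) val = 12 }:
/usr/bin/awk -v snp=2 -v val=12 -f column2row.awk file2 > file3   # P-values (column 12)
/usr/bin/awk -v snp=2 -v val=7  -f column2row.awk file2 > file4   # BETA values (column 7)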