Merge columns of data from multiple text files by row from each separate file using PowerShell

I have output from a numerical modelling code. I needed to extract a specific value from a series of files. I used the following code to get it (I derived this from an example that would extract IP addresses from logfiles):
$input_path = 'C:\_TEST\Input_PC\out5.txt'
$output_file = 'C:\_TEST\Output_PC_All\out5.txt'
$regex = '\bHEAD(.+)\s+[\-]*\d{1,3}\.\d{6,6}\s?\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
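Since there are around 50 such output files, a loop along these lines could apply the same extraction to each file in turn (a rough sketch only, assuming the files all sit in the Input_PC folder, follow the out*.txt naming, and that each result should go to a same-named file in the output folder):
$regex = '\bHEAD(.+)\s+[\-]*\d{1,3}\.\d{6,6}\s?\b'
Get-ChildItem 'C:\_TEST\Input_PC\out*.txt' | ForEach-Object {
    # write the matches for this input file to a same-named file in the output folder (assumed layout)
    $output_file = Join-Path 'C:\_TEST\Output_PC_All' $_.Name
    Select-String -Path $_.FullName -Pattern $regex -AllMatches |
        ForEach-Object { $_.Matches } |
        ForEach-Object { $_.Value } > $output_file
}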
So I now have a number of text files containing measurements (the number of files may vary; currently there are 50), each with a single column of numeric data (currently 7302 rows, but this may vary with the length of the modelled time series) whose values may be positive or negative, as in the example data below.
Note: a semicolon indicates that what follows is a comment explaining the order of the dataset; these comments do not appear in the data to be processed.
out1.txt
-1.000000 ; 1st line of out1.txt
2.000000 ; 2nd line of out1.txt
-3.000000 ; 3rd line of out1.txt
...
5.000000 ; nth line of out1.txt
out2.txt
-1.200000 ; 1st line of out2.txt
-2.200000 ; 2nd line of out2.txt
3.200000 ; 3rd line of out2.txt
...
-5.20000 ; nth line of out2.txt
outn.txt
1.300000 ; 1st line of outn.txt
-2.300000 ; 2nd line of outn.txt
-3.300000 ; 3rd line of outn.txt
...
10.300000 ; nth line of outn.txt
I need to merge them into a single text file (for this example let's call it "Combined_Output.txt") using PowerShell, with the data ordered so that the first row of values from each of the output files appears first, then the same for row 2, and so on, as below:
Combined_Output.txt
-1.000000 ; 1st line of out1.txt
-1.200000 ; 1st line of out2.txt
1.300000 ; 1st line of outn.txt
2.000000 ; 2nd line of out1.txt
-2.200000 ; 2nd line of out2.txt
-2.300000 ; 2nd line of outn.txt
-3.000000 ; 3rd line of out1.txt
3.200000 ; 3rd line of out2.txt
-3.300000 ; 3rd line of outn.txt
...
5.000000 ; nth line of out1.txt
-5.200000 ; nth line of out2.txt
10.300000 ; nth line of outN.txt
Just to say that I'm very new to this sort of thing, so I hope the explanation above makes sense; any help you can provide would be much appreciated.
EDIT
Having now run the models and used this code on the large data files created, there seems to be an issue with the ordering of the imported data. This occurs primarily when there are repeated values; for example, the second row of data from each out file has been combined by the script in the following order. It looks as though there is some sorting based on the value of the data and not just on the out file name:
Value ; out file text number
-1.215809 ; 1
-0.480543 ; 18
-0.480541 ; 19
-0.48054 ; 2
-0.480539 ; 20
-0.480538 ; 21
-0.480537 ; 22
-0.480536 ; 23
-0.480535 ; 24
-0.480534 ; 25
-0.480534 ; 26
-0.480688 ; 10
-0.480533 ; 27
-0.480532 ; 3
-0.480776 ; 4
-0.48051 ; 5
-0.48051 ; 6
-0.48051 ; 7
-0.48051 ; 8
-0.48051 ; 9
-0.48051 ; 11
-0.48051 ; 12
-0.48051 ; 13

I feel like I might have overcomplicated this answer, but let's see how we do. Consider the following dummy data, similar to your samples:
Out1.txt Out2.txt Out3.txt
-0.40000 0.800000 4.100000
3.500000 0.300000 -0.90000
-2.60000 0.800000 2.200000
0.500000 1.800000 -1.40000
3.600000 1.800000 1.400000
40000000 -0.70000 1.500000
The file contents are arranged side by side for brevity and to make the output easier to follow. The code is as follows:
$allTheFiles = @()
Get-ChildItem c:\temp\out*.txt | ForEach-Object{
    # each element of $allTheFiles is the array of lines from one file
    $allTheFiles += ,(Get-Content $_.FullName)
}
$(
    For ($lineIndex=0; $lineIndex -lt $allTheFiles[0].Count; $lineIndex++){
        For($fileIndex=0; $fileIndex -lt $allTheFiles.Count; $fileIndex++){
            # emit line $lineIndex of file $fileIndex
            $allTheFiles[$fileIndex][$lineIndex]
        }
    }
) | Out-File -FilePath c:\temp\file.txt -Encoding ascii
Gathering all the out*.txt files, the code builds an array of arrays, each element being the contents of one file. The nested For loops then cycle through the files, outputting one line from each file at a time. If you compare the sample data to the output you should see that the first line of every file is output together, followed by the second line, and so on.
This code will produce the following output:
-0.40000
0.800000
4.100000
3.500000
0.300000
-0.90000
-2.60000
0.800000
2.200000
0.500000
1.800000
-1.40000
3.600000
1.800000
1.400000
40000000
-0.70000
1.500000
Caveats
The code assumes that all files are the same size. The number of lines is determined by the first file; if other files contain more data, it would be lost under this approach.
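On the ordering issue described in the EDIT above: if Get-ChildItem returns the files in lexical name order (out1, out10, out11, ... before out2), the merged rows will not follow the out1, out2, ..., out50 sequence. One possible fix, sketched below on the assumption that every file name follows the out<N>.txt pattern, is to sort on the numeric part of the base name before reading the files; the rest of the code can stay the same, since only the order in which $allTheFiles is filled changes.
$allTheFiles = @()
# sort on the numeric part of the file name so out2 is read before out10
Get-ChildItem c:\temp\out*.txt |
    Sort-Object { [int]($_.BaseName -replace '\D', '') } |
    ForEach-Object { $allTheFiles += ,(Get-Content $_.FullName) }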

Related

How to count the number of elements in parts of a text file using a loop in Perl?

I'm looking for a way to create a script in Perl to count the elements in my text file, doing it in parts. For example, my text file has this form:
ID Position Potential Jury agreement NGlyc result
(PART 1)
NP_073551.1_HCoV229Egp2 23 NTSY 0.5990 (8/9) +
NP_073551.1_HCoV229Egp2 62 NTSS 0.7076 (9/9) ++
NP_073551.1_HCoV229Egp2 171 NTTI 0.5743 (5/9) +
...
(PART 2)
QJY77946.1_NA 20 NGTN 0.7514 (9/9) +++
QJY77946.1_NA 23 NTSH 0.5368 (5/9) +
QJY77946.1_NA 51 NFSF 0.7120 (9/9) ++
QJY77946.1_NA 62 NTSS 0.6947 (9/9) ++
...
(PART 3)
QJY77954.1_NA 20 NGTN 0.7694 (9/9) +++
QJY77954.1_NA 23 NTSH 0.5398 (5/9) +
QJY77954.1_NA 51 NFSF 0.7121 (9/9) ++
...
(PART N°...)
As you can see, the ID is the same within each part (one ID for PART 1, another for PART 2, and so on). Only the columns Position // Potential // Jury agreement // NGlyc result change. My main goal is to count the lines with Potential >= 0.7.
With this in mind, I'm looking for output like this:
Part 1:
1 (one value 0.7 >=)
Part 2:
2 (two values 0.7 >=)
Part 3:
2 (two values 0.7 >=)
Part N°:
X numbers of values 0.7 >=
This output tells me the number of positive values (>= 0.7) for each ID.
The pseudocode, I believe, would be something like this:
foreach ID in LIST
foreach LINE in FILE
if (ID is in LINE)
... count the line ...
end foreach LINE
end foreach ID
I'm looking for any suggestion (a package or script idea) or comment to help create a better script.
Thanks! Best!
To count the number of lines, for each part, that match some condition on a certain column, you can just loop over the lines, skip the header, parse the part number, and use an array to count the number of lines matching for each part.
After this you can just loop over the counts recorded in the array and print them out in your specific format.
#!/usr/bin/perl
use strict;
use warnings;
my $part = 0;
my @cnt_part;
while (my $line = <STDIN>) {
    if ($. == 1) {
        next;
    } elsif ($line =~ m{^\(PART (\d+)\)}) {
        $part = $1;
    } else {
        my @cols = split(m{\s+}, $line);
        if (@cols == 6) {
            my $potential = $cols[3];
            if (0.7 <= $potential) {
                $cnt_part[$part]++;
            }
        }
    }
}
for (my $i = 1; $i <= $#cnt_part; $i++) {
    print "Part $i:\n";
    print "$cnt_part[$i] (values 0.7 <=)\n";
}
To run it, just pipe the entire file through the Perl script:
cat in.txt | perl count.pl
and you get an output like this:
Part 1:
1 (values 0.7 <=)
Part 2:
2 (values 0.7 <=)
Part 3:
2 (values 0.7 <=)
If you also want to spell the counts out as words, you can use Lingua::EN::Numbers (see this program) and you get output very similar to the one in your post:
Part 1:
1 (one values 0.7 <=)
Part 2:
2 (two values 0.7 <=)
Part 3:
2 (two values 0.7 <=)
All the code in this post is also available here.

How to extract data set from a text file?

I am quite new to the Unix field and I am currently trying to extract a data set from a text file. I tried sed, grep, and awk, but they only seem to work for extracting single lines, whereas I want to extract an entire data set... Here is an example of the file from which I'd like to extract the 2 data sets (the figures after the lines "R.Time Intensity"):
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
Brief explanation,
Set d as a flag that determines whether to print the line
/^[^0-9]/&&d{d=0}: if the regex ^[^0-9] matches and d==1, disable d
/R.Time/{d=1}: if the string "R.Time" is found, enable d
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
The grep part removes the R.Time and LC lines that come as part of the output from awk.
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile

Matlab - read unstructured file

I'm quite new to Matlab and I've been searching, unsuccessfully, for a solution to the following issue: I have an unstructured txt file, with several rows I don't need, but there are a number of rows inside that file that have a structured format. I've been researching how to "load" the file in order to edit it, but cannot find anything.
In case I wasn't clear, let me show you the content of the file:
8782 PROJCS["UTM-39",GEOGC.......
1 676135.67755473056 2673731.9365976951 -15 0
2 663999.99999999302 2717629.9999999981 -14.00231124135486 3
3 709999.99999999162 2707679.2185399458 -10 2
4 679972.20003752434 2674637.5679516452 0.070000000000000007 1
5 676124.87132483651 2674327.3183533219 -18.94794942571912 0
6 682614.20527054626 2671000.0000000549 -1.6383425512446661 0
...........
8780 682247.4593014461 2676571.1515358146 0.1541080392180566 0
8781 695426.98657108378 2698111.6168302582 -8.5039945992245904 0
8782 674723.80100125563 2675133.5486935056 -19.920312922947179 0
16997 3 21
1 2147 658 590
2 1855 2529 5623
.........
I'd appreciate it if someone could tell me whether it is possible to open the file and then load only the rows from the one starting with 1 up to the one starting with 8782. The first row and all the others are not important.
I know that manually copying and pasting into a new file would be a solution, but I'd like to know whether it is possible to read the file and edit it, for other ideas I have.
Thanks!
% Now lines{i} is the string of the i'th line.
lines = strsplit(fileread('filename'), '\n')
% Now elements{i}{j} is the j'th field of the i'th line.
elements = arrayfun(@(x){strsplit(x{1}, ' ')}, lines)
% Remove the first row:
elements(1) = []
% Take the first several rows:
n_rows = 8782
elements = elements(1:n_rows)
Or if the number of rows you need to take is not fixed, you can replace the last two statements above by:
firsts = arrayfun(@(x)str2num(x{1}{1}), elements)
n_rows = find((firsts(2:end) - firsts(1:end-1)) ~= 1, 1, 'first')
elements = elements(1:n_rows)

Gnuplot reading not locale encoding file

I want to plot data from an ISO_8859_1-encoded file (two columns of numbers). These are the first 10 data points of the file:
#Pe2
1 0.8000
2 0.8000
3 0.8000
4 0.8000
5 0.8000
6 0.8000
7 0.8000
8 0.8000
9 0.8000
10 0.8000
The original file has 15000 data points. I create this data with MATLAB, specifically setting ISO_8859_1 encoding, so I am sure that that is the encoding. This is a snippet of the MATLAB code:
slCharacterEncoding('ISO-8859-1'); %Instruction before writing anything to the file.
fprintf(fileID,' %7d %7.4f',Tempo(i),y(i)); %For loop in this instruction
fprintf(fileID,'\r'); %Closing the file
fclose(fileID);
This is the script that I run. This file is encoded with the default Windows text-file encoding:
set encoding iso_8859_1
set terminal wxt size 1000,551
# Line width of the axes
set border linewidth 1.5
# Line styles
set style line 1 lc rgb '#dd181f' lt 1 lw 1 pt 0 # red
# Axes label
set xlabel 'tiempo'
set ylabel 'valor'
plot 'Pe2.txt' with lines ls 1
This is the output of the gnuplot console when I run the script. After that I input "show encoding":
G N U P L O T
Version 4.6 patchlevel 5 last modified February 2014
Build System: MS-Windows 32 bit
Copyright (C) 1986-1993, 1998, 2004, 2007-2014
Thomas Williams, Colin Kelley and many others
gnuplot home: http://www.gnuplot.info
faq, bugs, etc: type "help FAQ"
immediate help: type "help" (plot window: hit 'h')
Terminal type set to 'wxt'
gnuplot> cd 'C:\Example'
gnuplot> load 'script.txt'
"script.txt", line 10: warning: Skipping data file with no valid points
gnuplot> plot 'Pe2.txt' with lines ls 1
^
"script.txt", line 10: x range is invalid
gnuplot> show encoding
nominal character encoding is iso_8859_1
however LC_CTYPE in current locale is Spanish_Spain.1252
gnuplot>
If I open the file, make some change, undo the change, and save the file, gnuplot plots it. I guess that is because it gets saved with the local encoding, which is the one gnuplot uses to read files.
How do I plot files with gnuplot that are not in the local encoding?
I also have what seems to be a similar problem when I output a file with VS2010 C#. If I don't specifically set the culture with:
Thread.CurrentThread.CurrentUICulture = CultureInfo.GetCultureInfo("en-US");
Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("en-US");
I am not able to save a file which gnuplot is able to plot. I believe that this last problem is because of the "," and the "."
In C# I save the files with this:
StreamWriter Writer = new StreamWriter(dir + @"\" + + (k+1) + "_" + nombre + extension);
Writer.WriteLine("#" + (k+1) + "_" + nombre);
Writer.WriteLine();
Writer.WriteLine("{0,32} {1,32}", "#tiempo", "#valor");
for (int i = 0; i < tiempo.GetLength(0); i++)
{
Writer.WriteLine("{0,32} {1,32}", tiempo[i].ToString(), valor[i, k]);
}
Thank you.
Your file has only carriage returns (\r, 0x0d) as line breaks, which doesn't work with gnuplot. You must use only line feeds (\n, 0x0a); \r\n also works.

How to sum values in a column grouped by values in the other

I have a large file consisting of data in 2 columns
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in the 2nd column for matching values in column 1. For this example, the output I'm expecting is:
100 25
101 6
102 12
I'm trying to do this preferably using a bash script. Can someone explain how I can do this?
Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12
As a Perl one-liner:
perl -lane "$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s}" file.txt
You can use an associative array. The first column is the index and the second becomes what you add to it.
#!/bin/bash
declare -A columns=()
while read -r -a line ; do
    columns[${line[0]}]=$(( ${columns[${line[0]}]} + ${line[1]} ))
done < "${1}"
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done
Using awk while maintaining the input order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12
Python is my choice:
d = {}
# assuming the two-column input is in a file called file.txt
with open('file.txt') as f:
    for line in f:
        key, value = line.split()
        if key not in d:
            d[key] = 0
        d[key] += int(value)
print(d)
Why would you want a bash script?