Get column list using sed/awk/perl

I have several files in the format below.
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the column list of the file that has more columns, provided the columns the files share are in the same order.
Expected output for scenario 1 :
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected Results :
A message: Both file headers and positions are different
I am interested in any short solution using bash / perl / sed / awk.

Perl solution:
perl -lne 'push @lines, $_;
           close ARGV;
           next if @lines < 2;
           @lines = sort { length $a <=> length $b } @lines;
           if (0 == index "$lines[1],", "$lines[0],") {
               print $lines[1];
           } else {
               print "Both file headers and positions are different";
           }' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code; execution continues past it only once more than one input line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)
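For comparison, the same prefix check can be sketched in plain POSIX shell. This is a rough equivalent rather than the answer's own method: the function name longer_header is made up for illustration, and it assumes that the first line of each file is its header.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): print the longer of two CSV
# headers if the shorter one is a column-wise prefix of it, else a message.
longer_header() {
    a=$(head -n 1 "$1")
    b=$(head -n 1 "$2")
    # Make $a the shorter header and $b the longer one.
    if [ "${#a}" -gt "${#b}" ]; then
        tmp=$a; a=$b; b=$tmp
    fi
    # Append a comma to both so that "no,name" is not taken as a
    # prefix of "no,names".
    case "$b," in
        "$a,"*) printf '%s\n' "$b" ;;
        *)      printf '%s\n' "Both file headers and positions are different" ;;
    esac
}
```

It would be called as longer_header File1 File2; the order of the two arguments does not matter.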

Related

sed/awk/perl remove the first two lines of a 3 line pattern

I have a huge text file. I need to replace all occurrences of this three line
pattern:
|pattern|some data|
|giberish|,,
|pattern|some other data|
by the last line of the pattern:
|pattern|some other data|
remove the first two lines of the pattern, keep only the last one.
The second line of the pattern ends with two commas and does not start with |pattern|
The first line of the pattern line starts with |pattern| and does not end with two commas.
The third line of the pattern line starts with |pattern| and does not end with two commas.
I tried this:
sed 'N;N;/^|pattern|.*\n.*,,\n|pattern|.*/I,+1 d' trial.txt
without much luck.
Edit: Here is a more substantial example
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
EOL
and it should become:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
@zdim:
the first three lines of the file:
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
satisfy the pattern. So they are replaced by
|pattern|sdk;sd|
so the top of the file now becomes:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
...
the first three lines of which are:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
which satisfy the pattern, so they are replaced by:
|pattern|aslkaa|
so the top of the file now is:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
....
@JosephQuinsey:
consider this file:
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|blabla|
|||4|||-0.97|0|1429037262.8271||20160229||1025||1000.0|0.01|,,
|pattern|blable|
|||5|||-1.27|0|1429037262.854||20160229||1025||1000.0|0.01|,,
|pattern|blasbla|
|||493|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,,
|||11|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,|T|347||1429043438.1962|-0.22|5|0||-0.22|1429043438.1962|,|Q|346||1429043437.713|-0.24|26|-0.22|5|||1429043437.713|
|pattern|jksds|
|||232|||-5.66|0|1429037262.817||20150415||1025||1000.0|0.01|,,
|pattern|bdjkds|
|||123q|||-7.15|0|1429037262.8271||20150415||1025||1000.0|0.01|,,
|pattern|blabla|
|||239ps|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,,
|||-92opa|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|1||1428969600.5019|-0.99|1|11||||,
|||kj2w|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|2||1428969600.5019|-1|1|11||||,
|||0293|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|3||1428969600.5019|-1.01|1|11||||,
|||2;;w32|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|4||1428969600.5019|-1.11|1|11||||,
EOL
Here is a simple take on it, using a buffer to collect and manage the pattern-lines
use warnings;
use strict;
use feature 'say';
my $file = shift or die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my @buf;
while (<$fh>) {
    chomp;
    if (/^\|pattern\|/ and not /,,$/) {
        @buf = $_;    # start the buffer (first line) or overwrite (third)
    }
    elsif (/,,$/ and not /^\|pattern\|/) {
        if (@buf) { push @buf, $_ }  # add to buffer with first line in it
        else      { say }            # not part of 3-line-pattern; print
    }
    else {
        say for @buf;  # time to print out buffer
        @buf = ();     # ... empty it ...
        say            # and print the current line
    }
}
say for @buf;  # flush lines still buffered when the input ends
This prints the expected output.
Explanation.
Pattern-lines go in a buffer, and when we get the "third line" the first two need to be removed. So we assign to the array whenever we see ^|pattern|, which either starts the buffer (if it's the first line) or re-initializes it, discarding what was in it (if it's the third line).
A line ending with ,, is added to the buffer if there is a line there already. Nothing prohibits lines ending with ,, from appearing outside of a pattern; in that case just print it.
So each |pattern| line sets the buffer straight: it either starts it or resets it. Thus once we run into a line with neither ^|pattern| nor ,,$ we can print out our buffer, and then that line.
Please test more comprehensively, which I haven't yet had a chance to do.
In order to run this either in a pipeline or on a file use the "magical" <> filehandle. So it becomes
use warnings;
use strict;
use feature 'say';
my @buf;
while (<>) { # reads lines from files given on command line, or from STDIN
...
}
Now you can run it either as data | script.pl or as script.pl datafile. (Make the script executable for this, or use as perl script.pl.)
The script's output goes to STDOUT which can be piped into other programs or redirected to a file.
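For readers who prefer awk, the buffering idea described above can be sketched as follows. This is a loose translation, not the answer's own code: the names collapse_pattern and dump are made up for illustration, [|] stands in for an escaped pipe for portability, and the sketch flushes whatever is still buffered when the input ends.

```shell
#!/bin/sh
# Hypothetical awk rendering of the buffer approach (names are illustrative).
collapse_pattern() {
    awk '
        # A pattern line not ending in two commas starts or restarts the buffer.
        /^[|]pattern[|]/ && !/,,$/ { n = 1; buf[1] = $0; next }

        # A two-comma line joins a started buffer, otherwise it is printed as is.
        /,,$/ && !/^[|]pattern[|]/ {
            if (n > 0) { buf[++n] = $0 } else { print }
            next
        }

        # Any other line: print the buffered lines, then the line itself.
        { dump(); print }

        # Assumption of this sketch: flush a pending buffer at end of input.
        END { dump() }

        function dump(   i) {
            for (i = 1; i <= n; i++) print buf[i]
            n = 0
        }
    ' "$@"
}
```

It reads from the files given as arguments or from standard input, so both collapse_pattern trial.txt and data | collapse_pattern should work.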
This depends on how large your file is, but if it fits in memory, how about:
perl -0777 -pe '
1 while s/^\|pattern\|.+?\|\n(?!\|pattern\|).+?,,\n(\|pattern\|.+?\|)$/$1/m;
' trial.txt
Output:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
An awk solution:
awk -v pa=pattern '
$0 ~ pa {
do {
hold=$0;
getline;
hold=hold "\n" $0;
getline;
} while(match($0, pa));
print hold
}
1' trial.txt
The idea is to buffer the line that matched the pattern, then the line after. If the next line also matches the pattern, loop, this time buffering the most recent matching line and the one following it. This has the effect of removing the lines that need to be replaced.
When the loop stops, the first line the buffer contains is either the line to replace the removed lines or simply a first pattern match that is not to be removed. Either way the contents of the buffer get printed.
The final 1 statement is needed to print the line that ended the while loop and all other lines that aren't the first or second after one matching the pattern.
Updated answer: The following sed solution should work:
sed '/\n/!N;/\n.*\n/!N;/^|pattern|.*\n.*,,\n|pattern|/!{P;D;};s/[^\n]*\n//;D;'
Explanation:
/\n/!N if the P-space has only one line, read the next
/\n.*\n/!N if the P-space has only two lines, read in a third
/^|pattern|.*\n.*,,\n|pattern|/ test if the first and third lines start with |pattern|, and the middle line ends with two commas
!{P;D;} if the match fails, then print the first line and start over
s/[^\n]*\n//;D; otherwise, when the match succeeds, delete the first two lines, and start over.
This might work for you (GNU sed):
sed ':a;N;s/[^\n]*/&/3;Ta;/^|pattern|.*\n.*,,\n|pattern|/{/,,\n.*\n\|,,$/!{s/.*\n//;ba}};P;D' file
Populate the pattern space with the next three lines of the file. If the first pattern matches the current three lines and neither the first nor the third line ends with ,,, then delete the first two lines and repeat. Otherwise print and delete the first line of the three-line window and repeat.

Generating 2 files based on two columns in a third file

I am trying to prepare two input files based on information in a third file. File 1 is for sample1 and File 2 is for sample2. Both these files have lines with tab delimited columns. The first column contains unique identifier and the second column contains information.
File 1
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
..so on. Similarly, File 2 contains
>ENG012 ggggggggggggg
>ENG098 ksksksksksks
>ENG234 wewewewewew
I have a File 3 that contains two columns each corresponding to the identifier from File 1 and File 2
>ENT01 >ENG78
>ENT02 >ENG098
>ENT02 >ENG012
>ENT02 >ENG234
>ENT03 >ENG012
and so on. I want to prepare input files for File 1 and File 2 by following the order in file 3. If an entry is repeated in file 3 (ex ENT02) I want to repeat the information for that entry. The expected output is
For File 1:
>ENT01 xxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyx
>ENT02 xyxyxyxyxyx
>ENT03 ththththththth
And for file 2
>ENG78 some info
>ENG098 ksksksksks
>ENG012 gggggggg
>ENG234 wewewewewew
>ENG012 gggggggg
All the entries in file 1 and file 2 are unique, but those in file 3 are not. Also, some entries in either column of file 3 are not present in file 1 or file 2. My current approach is to find the intersection of the identifiers in column 1 of files 1 and 2 with the respective columns of file 3, store it as a list, and then compare that list with File 1 and File 2 separately to generate the output for each. I am working with the following lines
awk 'FNR==NR{a[$1]=$0;next};{print a[$1]}' file1 intersectlist
grep -v -x -f idsnotfoundinfile1 file3
I am not able to get the right output; I think at some point it is getting sorted and only unique values are printed. Can someone please help me work this out?
You need to read and remember the first 2 files into some data structure, and then for the third file, output to 2 new files:
$ awk -F'\t' -v OFS='\t' '
FNR == 1 {file_num++}
file_num == 1 || file_num == 2 {data[file_num,$1] = $2; next}
function value(str) {
return str ? str : "some info"
}
{
for (i=1; i<=2; i++) {
print $i, value(data[i,$i]) > (ARGV[i] ".new")
}
}
' file1 file2 file3
$ cat file1.new
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
$ cat file2.new
>ENG78 some info
>ENG098 ksksksksksks
>ENG012 ggggggggggggg
>ENG234 wewewewewew
>ENG012 ggggggggggggg
Files 1 and 2 need to be read first so that you can find their lines with identifiers from file 3. Since the identifiers in these files are unique you can build a hash for each file, with identifiers as keys.
Then process file 3 line by line, where for each identifier on the line retrieve its value from the hash for the appropriate file and write the corresponding lines to new files 1 and 2.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
my ($file1, $file2, $file3) = qw(File1.txt File2.txt File3.txt);
my ($fileout1, $fileout2) = map { $_ . '.new' } ($file1, $file2);
my %file1 = map { split } path($file1)->lines;
my %file2 = map { split } path($file2)->lines;
my ($ofh1, $ofh2) = map { path($_)->openw } ($fileout1, $fileout2);
open my $fh, '<', $file3 or die "Can't open $file3: $!";
while (<$fh>) {
my ($f1, $f2) = split;
say $ofh1 "$f1\t", $file1{$f1} // 'some info'; #/ see text
say $ofh2 "$f2\t", $file2{$f2} // 'some info';
}
close $_ for $ofh1, $ofh2, $fh;
This produces the correct output based on fragments of input files that are provided.
I use Path::Tiny here for its conciseness. Its lines method returns all lines, and in map's block each is split by default space. The list of such pairs returned by map is assigned to a hash, whereby each pair of successive strings forms a key-value pair.
Multiple files can be opened in one statement, and Path::Tiny again makes it clean with openw. Its methods throw the exception (die) on errors, so we get error checking as well.
If an identifier in File 3 is not found in File 1/2 I bluntly use 'some info' as stated in the question,† but I expect that there is a more rounded solution for such a case. Then the laconic // should be changed to accommodate extra processing (or call a sub in place of 'some info' string).
It is assumed that files 1 and 2 always have two entries on a line.
Some shortcuts are taken, like reading each file into a hash in one line. Please expand the code as appropriate, with whatever checks may be needed.
† In such a case $file1{$f1} is undef so the // (defined-or) operator returns its right-hand-side argument. A "proper" way is to test if (exists $file1{$f1}) but // works as well.

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
    my @fields = split /,/;
    if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print "Row#$.,$_";
    }
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to 1 at the start of each file. Without it, the counter just keeps counting up across all the files.
You would run this like
rob@Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
@ARGV = map { glob qq{"$_"} } @ARGV; # For Windows
while (<>) {
    if ($. == 1) {
        print "\n\nFile: $ARGV\n\n";
        next;
    }
    my @fields = split /,/;
    unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        printf "Row#%d,%s", $., $_;
    }
    close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones
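The per-file row numbering that close ARGV provides has a close analogue in awk, where FNR restarts at 1 for every input file. A minimal sketch, assuming (as in the answer) that no field contains a comma and that the date is the fifth column; the function name find_bad_dates is made up for illustration.

```shell
#!/bin/sh
# Hypothetical awk counterpart of the multi-file check (name is illustrative).
find_bad_dates() {
    awk -F, '
        # FNR is the line number within the current file, so this skips
        # the header row of every file.
        FNR == 1 { next }

        # Report rows whose fifth field is not of the form YYYY-MM-DD.
        $5 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
            print "Row#" FNR "," $0
        }
    ' "$@"
}
```

It would be run as find_bad_dates *.csv > report.txt to collect the rows from several files into one report. The digit classes are written out instead of using {4} and {2} because some older awk implementations lack interval expressions.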

Multiple text parsing and writing using the while statement, the diamond operator <> and $ARGV variable in Perl

I have some text files inside a directory, and I want to parse their content and write it to a file. So far the code I am using is this:
#!/usr/bin/perl
#The while loop repeats the execution of a block as long as a certain condition is evaluated true
use strict; # Always!
use warnings; # Always!
my $header = 1; # Flag to tell us to print the header
while (<*.txt>) { # read a line from a file
if ($header) {
# This is the first line, print the name of the file
print "========= $ARGV ========\n";
# reset the flag to a false value
$header = undef;
}
# Print out what we just read in
print;
}
continue { # This happens before the next iteration of the loop
# Check if we finished the previous file
$header = 1 if eof;
}
When I run this script I only get the headers of the files, plus a compiled.txt entry.
I also receive the following message in cmd: Use of uninitialized value $ARGV in concatenation (.) or string at concat.pl line 12
So I guess I am doing something wrong and $ARGV isn't used at all. Plus, instead of $header I should use something else in order to retrieve the text.
Need some assistance!
<*.txt> does not read a line from a file, even if you say so in a comment. It runs
glob '*.txt'
i.e. the while loop iterates over the file names, not over their contents. Use empty <> to iterate over all the files.
BTW, instead of $header = undef, you can use undef $header.
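The difference between iterating over file names and iterating over file contents is the same one a shell loop makes explicit. As a rough sketch of the intended result (not part of the answer; the function name concat_with_headers is made up), the loop below walks the names and prints each file's contents itself:

```shell
#!/bin/sh
# Hypothetical helper: print a header line with each file name, followed
# by the contents of that file.
concat_with_headers() {
    for f in "$@"; do
        # The loop variable holds a file NAME, just as <*.txt> yields names.
        printf '========= %s ========\n' "$f"
        # Reading the contents is a separate, explicit step.
        cat -- "$f"
    done
}
```

It would be called as concat_with_headers *.txt > compiled.out. Writing the result to a name that does not match *.txt keeps the output file out of the glob, which is presumably how a compiled.txt entry ended up in the listing.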
As I understand it, you want to print a header with the filename just before the first line of each file, and concatenate them all into a new one. A one-liner could be enough for the task.
It checks for the first line with the variable $. and closes the filehandle to reset its value between different input files:
perl -pe 'printf qq|=== %s ===\n|, $ARGV if $. == 1; close ARGV if eof' *.txt
An example on my machine yields:
=== file1.txt ===
one
=== file2.txt ===
one
two

How to merge files with line-skipping

I have two files:
File f1 has the following structure (the text after each # is a comment that is not in the file)
SomeText1 #Section name - one word [a-zA-Z]
acd:some text #code:text - the code contains only [a-z]
opo:some another text #variable number of code:text pairs
wed:text too #in the SomeText1 section are 3 pairs
SomeText2
xxx:textttt #here only 1 code:text pair
SomeText3
zzz:texxxxxxx #here only 1 code:text pair too
and file f2, which contains the following lines, in the same order as the sections of the file above:
1000:acd:opo:wed:123.44:4545.23:1233.23 #3 codes - like in the above segment 1
304:xxx:10:11:12.12 #1 code - these lines contains only
4654:zzz:0 #codes and numbers
the desired output is
SomeText1:1000:acd:opo:wed:123.44:4545.23:1233.23
acd:some text:
opo:some another text:
wed:text too:
SomeText2:304:xxx:10:11:12
xxx:textttt:
SomeText3:4654:zzz:0
zzz:texxxxxxx:
So I need to append the lines from f2 to the "section name" lines. The codes in each line of f2 are the same as the codes in the code:text pairs of f1.
I have no idea how to start, because
I can't use the paste command, since the two files don't have the same line count, and
I can't use join, because there are no common keys in both files.
So I would be really happy if someone could tell me SOME ALGORITHM to start with, and I will program it myself.
I'm offering you a different approach: I provide the code, and you figure out how it works ;) :)
paste -d':' f1 <(perl -pe '$\="\n"x($c=()=/[a-z]+/g)' <f2)
produces exactly what you want from your inputs.
EDIT - Explanation:
The solution comes from your comment that the lines contain only codes and numbers. Therefore the codes are easy to extract from each line.
So it is enough to insert as many empty lines after each line as it has codes.
the /[a-z]+/g matches every code and returns them
the $c =()= is the "Rolex operator", which allows counting the list of matches
the count of matched codes gives the number of empty lines needed
the $\ = "\n" x NUMBER means: repeat the string before the x NUMBER times, e.g. with 3 codes, the "\n" (newline) character is repeated 3 times.
the newlines are assigned to the variable $\, the output record separator.
and because the -p switch processes the file line by line and prints every line in the form "print $_$\;", the output record separator, which contains a number of newlines, is printed after every line.
therefore we get the empty lines
I hope my English was good enough for the explanation.
Or wholly in Perl:
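use strict;
use warnings;

# File names f1 and f2 are the ones used in the question.
open my $f1, '<', 'f1' or die "Can't open f1: $!";
open my $f2, '<', 'f2' or die "Can't open f2: $!";

my $skip = 0;    # number of code:text lines still to pass through
while (<$f1>) {
    chomp;
    my $suffix;
    if ($skip--) {
        $suffix = "\n";                      # code:text line: just add the colon
    } else {
        $suffix = <$f2>;                     # section line: append the next f2 line
        $skip = () = $suffix =~ /[a-z]+/g;   # count the codes on that line
    }
    print "$_:$suffix";
}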
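A third variant, sketched in awk. It uses a different technique from the two solutions above: instead of counting codes, it relies on the format stated in the question, namely that a section name is a single word with no colon while every code:text line contains one. The function name merge_sections is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: $1 is f1 (sections and code:text pairs),
# $2 is f2 (one line per section, in the same order).
merge_sections() {
    awk '
        # First file on the command line is f2: remember its lines in order.
        NR == FNR { extra[NR] = $0; next }

        # Assumption: a line without a colon is a section name, so the
        # next remembered f2 line is appended to it.
        !/:/ { print $0 ":" extra[++s]; next }

        # Every other line is a code:text pair and only gets a bare colon.
        { print $0 ":" }
    ' "$2" "$1"
}
```

It would be called as merge_sections f1 f2. If a section name could itself contain a colon, the test for section lines would need to change.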