How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator's a tab, not a comma, so they should really be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.:
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column (here, it's a date).
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on,
each containing the same header on the 1st line plus the matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how I could generate (dynamically) the "filename" expected by the w command, nor how to capture the date in the search pattern (because apparently this is just an address, not a search-replace "field" as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder: is this actually doable with sed? Or should I look at other tools, e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...

Perl to the rescue!
perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    my @c = split /\t/;
    open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
    print {$out} $h unless tell $out;
    print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h; see $. and eof.
split populates the @c array with the column values for each line.
$c[2] contains the date; using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.
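Reopening the output file for every input line works, but it can be slow on large inputs. Below is a variant sketch that caches one filehandle per date and truncates on first open, so stale output from earlier runs is overwritten automatically; it assumes the dated output names never collide with the input Company_*.csv names.
perl -e '
    my %fh;
    while (<>) {
        $h = $_, next if $. == 1;    # remember each file header line
        $. = 0 if eof;
        my @c = split /\t/;
        my $name = "Company_" . $c[2] =~ tr/-/_/r . ".csv";
        unless ($fh{$name}) {
            open $fh{$name}, ">", $name or die $!;  # ">" truncates old output
            print { $fh{$name} } $h;                # header goes in first
        }
        print { $fh{$name} } $_;
    }
' -- Company_*.csv
Keep in mind this holds one handle open per distinct date, so with a huge number of dates you could hit the per-process open-file limit.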

Related

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated by a blank line (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0];  # read first argument from the command line as the file name
    open INPUTFILE, "+<", $inputFile or die $!;  # open file
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) {  # loop through the input file line by line
        $singleLine = $_;
        push(@singleRecord, $singleLine);  # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) {  # check for carriage return and newline
            $recordCounter += 1;
            createHashTable(@singleRecord);  # send record off to make a hash table
            @singleRecord = ();  # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
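You can also try paragraph mode straight from the command line: -00 sets $/ to the empty string, so $. ends up counting records instead of lines. A quick sketch (the file name is a placeholder, and it assumes line endings have already been normalized to \n):
perl -00 -ne 'END { print "total records : $.\n" }' books.txt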
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';  # paragraph mode: read one blank-line-separated record at a time
    my $record_counter = 0;
    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;
    print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
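If you are unsure which separator a file really uses, one quick check is to hex-dump the first line and look for 0d (CR) and 0a (LF). A sketch (the file name is a placeholder):
perl -ne 'printf "%v02x\n", $_; last' yourfile.txt
Each byte is printed as a dot-separated hex pair, so a Windows line ends in 0d.0a and a Unix line in just 0a.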
If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

How to print without duplicates with perl?

My assignment is a little more in depth than the title but in the title is my main question. Here is the assignment:
Write a perl script that will grep for all occurrences of the regular expression in all regular files in the file/directory list as well as all regular files under the directories in the file/directory list. If a file is not a TEXT file then the file should first be operated on by the unix command strings (no switches) and the resulting lines searched. If the -l switch is given only the file name of the files containing the regular expression should be printed, one per line. A file name should occur a maximum of one time in this case. If the -l switch is not given then all matching lines should be printed, each preceded on the same line by the file name and a colon. An example invocation from the command line:
plgrep 'ba+d' file1 dir1 dir2 file2 file3 dir3
Here is my code:
#!/usr/bin/perl -w
use Getopt::Long;

my $fname = 0;
GetOptions('l' => \$fname);
$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $fname ? print "$ARGV\n" : print "$ARGV:$_";
    }
}
So far that code does everything it's supposed to, except that it doesn't handle non-text files and it prints duplicate file names when using the -l switch. Here is an example of my output after entering the following on the command line: plgrep 'ba+d' file1 file2
file1:My dog is bad.
file1:My dog is very baaaaaad.
file2:I am bad at the guitar.
file2:Even though I am bad at the guitar, it is still fun to play!
Which is PERFECT!
But when I use the -l switch to print out only the file names this is what I get after entering the following on the command line: plgrep -l 'ba+d' file1 file2
file1
file1
file2
file2
How do I get rid of those duplicates so it only prints:
file1
file2
I have tried:
$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $seen{$ARGV}++;
        $fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
    }
}
But when I try to run it without the -l switch I only get:
file1:My dog is bad.
file2:I am bad at the guitar.
I also tried:
$fname ? print "$ARGV\n" unless ($ARGV > 1) : print "$ARGV:$_";
But I keep getting syntax error at plgrep line 17, near ""$ARGV\n" unless"
If someone could help me out with my duplicates issue as well as the italicized part of the assignment (running non-text files through strings) I would truly appreciate it. I don't even know where to start on that part.
If you're printing only file names, you can exit the loop (using the last command) after the first match, since you already know the file matches. By not scanning the rest of the file, this will also prevent the name from being printed repeatedly.
Edited to add: In order to do it this way, you'll also need to switch from using <> to read the files to instead getting the names from #ARGV and opening them normally.
If you want to continue using <>, you'll instead need to watch $ARGV to see when it changes (indicating that you've started reading a new file) and keep a flag to indicate whether the current file has found any matches yet or not. However, this approach would require you to read every file in its entirety, which will be less efficient than only reading enough of each file to know whether it contains at least one match or not (i.e., skipping to the next file after the first match), so I would recommend switching to open instead.
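A minimal sketch of that open-per-file structure, reusing $pat and $fname from your script (error handling kept deliberately simple):
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "can't open $file: $!"; next };
    while (<$fh>) {
        next unless /$pat/;
        if ($fname) {
            print "$file\n";
            last;    # one match is enough in -l mode
        }
        print "$file:$_";
    }
    close $fh;
}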
The first syntax problem is simply an extra semicolon.
The second is that you may only use if/unless as a statement modifier at the end of a statement - you can't embed it in the middle of a conditional that way.
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
Becomes:
next if $fname && $seen{$ARGV} > 1;  # skip repeats only when printing names
print $fname ? "$ARGV\n" : "$ARGV:$_";
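As for the italicized strings requirement: one possible starting point is Perl's -T text-file heuristic combined with a piped open. This is only a sketch; it assumes a strings binary is on your PATH and that $file holds the current file name as in the loop above.
my $fh;
if (-T $file) {
    open $fh, '<', $file or die "can't open $file: $!";
} else {
    # not a text file: search the output of strings(1) instead
    open $fh, '-|', 'strings', $file or die "can't run strings on $file: $!";
}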

removing double quotes from a file

I have a tab delimited file, that looks like this.
"""chr1" "38045559" "38046059" "C1orf122"""
"""" "" "" "C1orf122"""
"""" "" "" "YRDC"""
"""chr1" "205291045" "205291545" "YOD1"""
"""chr1" "1499717" "1500625" "SSU72"""
I got this file after converting a .csv to a tab-separated file with this command:
perl -lpe 's/"/""/g; s/^|$/"/g; s/","/\t/g' <test.csv>test_tab
Now, I want my file to remain tab-separated, but all the extra quotes should be removed. At the same time, when I print column 4 I should get all the names, and for columns 1, 2 and 3 the coordinates (I still get these, but with quotes).
What manipulation should I make to the above command to do this? Kindly guide.
The desired output is (since I was asked to be clear):
chr1 38045559 38046059 C1orf122
C1orf122
YRDC
chr1 205291045 205291545 YOD1
chr1 1499717 1500625 SSU72
so that when I extract Column 4 I should get
C1orf122
C1orf122
YRDC
YOD1
SSU72
Thank you
It appears that most of those quotes are being inserted by your command to bring in the file. Instead open the file normally:
use strict;
use warnings;
open CSV, 'test.csv' or die "can't open input file.";
open TAB, '>test.tab' or die "can't open output file.";
my @row_array;
while (<CSV>)
{
    # Remove any quotes that exist on the line (it is in the default variable $_).
    s/"//g;
    # Split the current row into an array.
    my @fields = split /,/;
    # Write the output, tab-delimited file.
    print TAB join("\t", @fields) . "\n";
    # Put the row into a multidimensional array.
    push @row_array, \@fields;
}
print "Column 4:\n";
print $_->[3] . "\n" foreach (@row_array);
print "\nColumns 1-3:\n";
print "@{$_}[0..2]\n" foreach (@row_array);
Any quotes that still do exist will be removed by s/"//g; in the above code. This will remove all quotes; it doesn't check whether they are at the beginning and end of a field. If you might have some quotes within the data that you need to preserve, you would need a more sophisticated matching pattern.
Update: I added code to create a tab-separated output file, since you seem to want that. I don't understand exactly what your requirement related to getting "all the names...and the coordinates" is. However, you should be able to use the above code for that: just add what you need inside the while loop. You can reference, for example, column 1 with $fields[0].
Update 2: Added code to extract column 4, then columns 1-3. The syntax for using multidimensional arrays is tricky. See perldsc and perlref for more information.
Update 3: Added code to remove the quotes that still exist in your file.
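If the data could ever contain commas or quotes inside fields, a real CSV parser is safer than a global substitution. A sketch using the CPAN module Text::CSV (assumed installed; file names as in the question):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;

open my $in,  '<', 'test.csv' or die "can't open input file: $!";
open my $out, '>', 'test_tab' or die "can't open output file: $!";
while (my $row = $csv->getline($in)) {
    # fields arrive already unquoted; re-join them with tabs
    print {$out} join("\t", @$row), "\n";
}
close $in;
close $out;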

How to insert a line into the middle of an existing file

Consider an example where I want to insert a few lines of text when a
particular pattern matches (if $line =~ m/few lines in here/ then
insert the lines on the next line):
*current file:*
"This is my file and i wanna insert few lines in here and other
text of the file will continue."
*After insertion:*
"This is my file and i wanna insert few lines in here this is my
new text which i wanted to insert and other text of the file will
continue."
This is my code:
use Spreadsheet::ParseExcel;

my $sourcename = $ARGV[1];
my $destname = $ARGV[0];
print $sourcename, "\n";
print $destname, "\n";
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename) or die "Could not open source Excel file $sourcename: $!";
my $source_cell;
# Sheet 1 - source sheet page having testnumber and worksheet number
my $source_sheet = $source_book->{Worksheet}[0];  # used to access the worksheet
$source_cell = $source_sheet->{Cells}[1][0];      # reads the content of the cell
my $seleniumHost = $source_cell->Value;
print $seleniumHost, "\n";
open (F, '+>>', "$destname") or die "Couldn't open `$destname': $!";
my $line;
while ($line = <F>) {
    print $line;
    if ($line =~ m/FTP/) {
        #next if /FTP/;
        print $line;
        print F $seleniumHost;
    }
}
The perlfaq covers this. How do I change, delete, or insert a line in a file, or append to the beginning of a file?
Files are fixed blocks of data. They behave much like a piece of paper. How do you insert a line into the middle of a piece of paper? You can't, not unless you left space. You must recopy the whole thing, inserting your line into the new copy.
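In script form, that copy-everything approach looks roughly like this (file names are placeholders; the pattern and inserted text come from the question):
use strict;
use warnings;

open my $in,  '<', 'old.txt' or die "can't read old.txt: $!";
open my $out, '>', 'new.txt' or die "can't write new.txt: $!";
while (my $line = <$in>) {
    print {$out} $line;
    # insert the new text right after a matching line
    print {$out} "this is my new text which i wanted to insert\n"
        if $line =~ /few lines in here/;
}
close $out;
close $in;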
In a perl one-liner :
perl -ane 's/few lines in here and other\n/this is my\nnew text which i wanted to insert and other /; s/continue./\ncontinue./; print ' FILE
If you don't want a one-liner, it's easy to use the same substitutions in any script ;)
As long as you know the line:
perl -ne 'if ($. == 8) {s//THIS IS NEW!!!\n/}; print;'
Obviously you'd have to use -i to make the actual changes
OR:
perl -i -pe 'if($. == 8) {s//THIS IS NEW!!!\n/}' file
Someone mentioned Tie::File, which is a solution I'll have to look at for editing a file, but I generally use File::Slurp, which has relatively recently added edit_file and edit_file_lines subs.
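For instance, a sketch with edit_file (assuming File::Slurp is installed; the file name is a placeholder):
use File::Slurp qw(edit_file);

# edit_file slurps the file into $_, runs the block, and writes $_ back
edit_file {
    s/few lines in here/few lines in here this is my\nnew text which i wanted to insert /;
} 'myfile.txt';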
Using perl's in-place edit flag (-i), it's easy to add lines to an existing file using Perl, as long as you can key off a text string, such as (in your case) "wanna insert few lines in here":
perl -pi -e 's{wanna insert few lines in here}{wanna insert few lines in here this is my\nnew text which i wanted to insert }' filename
It overwrites your old sentence (don't be scared) with a copy of your old sentence (nothing lost) plus the new stuff you want injected. You can even create a backup of the original file if you wish by passing a ".backup" extension to the -i flag:
perl -p -i'.backup' -e 's{wanna insert few lines in here}{wanna insert few lines in here this is my\nnew text which i wanted to insert }' filename
More info on Perl's search & replace capabilities can be found here:
http://www.atrixnet.com/in-line-search-and-replace-in-files-with-real-perl-regular-expressions/
You can avoid having to repeat the "markup" text using variable substitution.
echo -e "first line\nthird line" | perl -pe 's/(^first line$)/\1\nsecond line/'

Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
edit: The code I've done thus far is like this.
#!/usr/bin/perl
use warnings;
use strict;

my @hhfilelist = glob "*.hh3";
for my $f (@hhfilelist) {
    open F, $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @line = split /\t/;
        print "field is $line[24]\n";
    }
    close(F);
}
The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?
Some tips on how to handle the problem:
Find files
For finding files within a directory, use glob: glob '.* *'
For finding files within a directory tree, use File::Find's find function
Open each file, use Text::CSV with \t character as the delimiter, extract wanted values and write to file
For a Perl solution, please use the Text::CSV module to parse flat (X-separated) files - the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated by either glob() for files in a given directory or File::Find for subdirectories as well.
Then, to get the unique values, for each row, store the column #25 in a hash.
E.g. after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;
Then, iterate over hash keys and print to a file.
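Putting those pieces together, a sketch (assuming tab-separated *.hh3 files as in your script; the output file name is a placeholder):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 })
    or die Text::CSV->error_diag;

my %unique_values_hash;
for my $f (glob '*.hh3') {
    open my $io, '<', $f or die "Cannot open $f: $!";
    while (my $colref = $csv->getline($io)) {
        $unique_values_hash{ $colref->[24] } = 1;  # column 25 is index 24
    }
    close $io;
}

open my $out, '>', 'unique_values.txt' or die $!;
print {$out} "$_\n" for sort keys %unique_values_hash;
close $out;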
For a non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' '{print $25}' | sort -u > MyUniqueValuesFile
You can replace awk with cut.
Please note that the non-Perl solution only works if the files don't contain TABs in the fields themselves and the columns aren't quoted.
perl -F/\\t/ -ane 'print "$F[24]\n" unless $seen{$F[24]}++' inputs > output
perl -F/\\t/ -ane 'print "$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
Command-line switches -F/\\t/ -an mean: iterate through every line in every input file and split each line on the tab character into the array @F.
$F[24] refers to the value in the 25th field of each line (between the 24th and the 25th tab characters).
$seen{...} is a hashtable to keep track of which values have already been observed.
The first time a value is observed, $seen{VALUE} is 0, so Perl will execute the statement print "$F[24]\n". Every other time the value is observed, $seen{VALUE} will be non-zero and the statement won't be executed. This way each unique value gets printed out exactly once.
In a similar context to your larger script:
my @hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
    open F, $f or die "Cannot open $f: $!";
    while (<F>) {
        my @F = split /\t/;
        $values_in_field_25{$F[24]} = 1;
    }
    close(F);
}
my #unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...
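To also write them to an output file for the further processing you mentioned (the output name is a placeholder):
open my $out, '>', 'unique_col25.txt' or die "Cannot open output: $!";
print {$out} "$_\n" for sort @unique_values_in_field_25;
close $out;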