How to isolate a word that corresponds with a letter from a different column of a CSV file? - perl

I have a CSV file, like this:
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
Each character in the first column corresponds to a word in the second column, e.g., in the first column, A corresponds with "this", C with "is", D with "a", and B, with sentence.
Given the variable character, which can be set to any of the characters appearing in the first column, I need to isolate the word which corresponds to the selected letter, e.g., if I set character="B", then the output of the above would be:
sentence
this
this another
If I set `character="C", then the output of the above would be:
is
another
is
How can I output only those words which correspond to the position of the selected letter?
The file contains many UTF-8 characters.
For every character in column 1, there is always an equal number of words in column 2.
The words in column 2 are separated by spaces.
Here is the code I have so far:
while read line
do
characters="$(echo $line | awk -F, '{print $1}')"
words="$(echo $line | awk -F, '{print $2}')"
character="B"
done < ./file.csv

This might work for you:
x=B # set wanted key variable
sed '
:a;s/^\([^,]\)\(.*,\)\([^ \n]*\) *\(.*\)/\2\4\n\1 \3/;ta # pair keys with values
s/,// # delete ,
s/\n[^'$x'] [^\n]*//g # delete unwanted keys/values
s/\n.//g # delete wanted keys
s/ // # delete first space
/^$/d # delete empty lines
' file
sentence
this
this another
or in awk:
awk -F, -vx=B '{i=split($1,a,"");split($2,b," ");c=s="";for(n=1;n<=i;n++)if(a[n]==x){c=c s b[n];s=" "} if(length(c))print c}' file
sentence
this
this another

This seems to do the trick. It reads data from within the source file using the DATA file handle, whereas you will have to obtain it from your own source. You may also have to cater for there being no word corresponding to a given letter (as for 'A' in the second data line here).
use strict;
use warnings;
my #data;
while (<DATA>) {
my ($keys, $words) = split /,/;
my #keys = split //, $keys;
my #words = split ' ', $words;
my %index;
push #{ $index{shift #keys} }, shift #words while #keys;
push #data, \%index;
}
for my $character (qw/ B C /) {
print "character = $character\n";
print join(' ', #{$_->{$character}}), "\n" for #data;
print "\n";
}
__DATA__
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
output
character = B
sentence
this
this another
character = C
is
another
is

Here's a mostly - done rump answer.
Since SO is not a "Do my work for me" site, you will need to fill in some trivial blanks.
sub get_index_of_char {
my ($character, $charset) = #_;
# Homework: read about index() function
#http://perldoc.perl.org/functions/index.html
}
sub split_line {
my ($line) = #_;
# Separate the line into a charset (before comma),
# and whitespace separated word list.
# You can use a regex for that
my ($charset, #words) = ($line =~ /^([^,]+),(?(\S+)\s+)+(\S+)$/g); # Not tested
return ($charset, \#words);
}
sub process_line {
my ($line, $character) = #_;
chomp($line);
my ($charset, $words) = split_line($line);
my $index = get_index_of_char($character, $charset);
print $words->[$index] . "\n"; # Could contain a off-by-one bug
}
# Here be the main loop calling process_line() for every line from input

Related

How to remove the word using array index in perl?

How to remove the certain words using array index for the following input using Perl?
file.txt
BOCK:top:blk1
BOCK:block2:blk2
BOCK:test:blk3
After join:
/BOCK/top/blk1
/BOCK/block2/blk2
/BOCK/test/blk3
Expected output:
/BOCK/blk1
/BOCK/blk2
/BOCK/blk3
Code which I had tried:
use warnings;
use strict;
my #words;
open(my $infile,'<','file.txt') or die $!;
while(<$infile>)
{
push(#words,split /\:/);
}
my $word=join("/",#words);
print $word;
close ($infile);
foreach my $word(#words)
{
if($word=~ /(\w+\/\w+\/\w+)/)
{
print $word;
}
}
The easiest way to get rid of the middle element is to use splice.
while ( my $line = <DATA> ) {
my #words;
push( #words, split( /:/, $line ) ); # colon has no special meaning
splice( #words, 1, 1 );
print '/', join( '/', #words );
}
__DATA__
BOCK:top:blk1
BOCK:block2:blk2
BOCK:test:blk3
I assumed that you want to do that for every line. The code that you had did something else. Because your #words is declared outside of the while loop it gets bigger withe every iteration, and every third element contains a newline \n character because you never chomp. Then you build create one long $word that has all the words from all lines joined with a slash /. Afterwards you try to match that for three words joined with slashes, which works. But you only have one capture group, so your $3 is never defined.
The code can be simplified and cleaned up, even to the point of
my #paths = map { '/' . join '/', (split ':')[0,-1] } <$infile>;
print "$_\n" for #paths;
The map imposes the list context on the filehandle read, which thus returns a list of all lines from the file. The code in map's block is applied to each element: it splits the line and takes the first and last element of that list, joins them, and then prepends the leading /. Inside the block the line is in the variable $_, what split uses as default. The resulting list is returned and assigned to #path.
A number of errors in the posted code have been explained clearly in simbabque's answer.
Thanks to jm666 in a comment for catching the requirement for the leading /.
The above can also be used for a one-liner
perl -F: -lane'print "/" . join "/", #F[0,-1]' < file.txt > out.txt
The -a turns on autosplit mode (with -n or -p), whereby each line is split and available in #F. The -F switch allows to specify the pattern to split on, here :, instead of the default space.
See switches in perlrun.

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file #server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my #uniqmatchedline = split(/ /, #allmatchedlines);
my $lastfield = $uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my #uniqmatchedline = split(/ /, #allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my #uniqmatchedline = map { split(/ /, $_) } #allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my %existing;
my #uniqmatchedline = grep !$existing{$_}++, #allmatchedlines; #this will return the unique lines
my #lastfields = map { ((split / /, $_)[-1]) . "\n" } #uniqmatchedline ; #this maps the last field in each line into an array
print for #lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq)
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\#server1 cat $file";
open my $fh, '-|', $cmd // die "Error opening $cmd: $!"; # /
while (<$fh>) {
chomp;
push #allmatchedlines, $_ if /mysearch/;
}
close(output1);
my #unique_matched_lines = uniq #allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns pid, so an "error" doesn't at all relate to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that # under "..." indicates an array and thus need be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement #allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push #allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of #allmatchedlines – which is 1 as explained above. You should have
my #words_in_matched_lines = map { split } #allmatchedlines;
When all this is straightened out, you'll have words in the array #uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my #unique_elems = uniq #whole_array;

Regular expression to print a string from a command outpout

I have written a function that uses regex and prints the required string from a command output.
The script works as expected. But it's does not support a dynamic output. currently, I use regex for "icmp" and "ok" and print the values. Now, type , destination and return code could change. There is a high chance that command doesn't return an output at all. How do I handle such scenarios ?
sub check_summary{
my ($self) = #_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
command Ouptut :
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago
Updated to some clarifications -- we need only the last line
As if often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array with lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop
my #lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my #lines = #$ipsla; # if $ipsla is an arrayref
pop #lines while $line[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop #lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop #lines }
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;
foreach (#lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.
I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = #_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (#$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

Replace comma with space in just one field - from a .CSV file

I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line has comments that users have put in, and sometimes they use commas within their comments. If there are commas input, that field will be contained in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available in the server I am using.
Here is part of my code in parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in #fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -
my #data;
# Read the lines one by one.
while ( $line = <$FH> ) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my #fields = split(/,/, $line);
unshift #fields, join '_', #fields[0..2];
# remove user input commas in fields[7,19]
$_ = for fields[7,19];
# format DATE and DATETIME fields for MySQL/sqlbatch60
$_ = join '-', (split /\//)[2,0,1] for #fields[14,20,23];
$_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for #fields[38,39];
# write the parsed record back to the file
push #data, \#fields;
}
If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way
Suppose the total number of fields is always N
Split the line on commas ,
Separate and store the first six fields
Separate and store the last n fields, where n is N-8
Rejoin what remains with commas ,. This now forms field 8
and then do what ever you like to do with it. For example, write it to a proper CSV file
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case when the file need be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;
my $num_fields = 39; # what should be, using the example
my $ibad = 6; # index of the malformed field-to-be
my #last = (-($num_fields-$ibad-1)..-1); # index-range, rest of fields
my $file = "file.csv";
open my $fh, '<', $file;
while (my $line = <$fh>) { # chomp it if needed
my #fields = split ',', $line;
if (#fields != $num_fields) {
# join extra elements by space
my $fixed = join ' ', #fields[$ibad..$ibad+#fields-$num_fields];
# overwrite array by good fields
#fields = (#fields[0..$ibad-1], $fixed, #fields[#last]);
}
# Process #fields normally
print "#fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.
while (my $line = <$fh>) {
# Number of fields: commas + 1 (tr|,|| counts number of ",")
my $have_fields = $line =~ tr|,|| + 1;
if ($have_fields != $num_fields) {
# Get indices of commas delimiting the bad field
my ($beg, $end) = map {
my $p = '[^,]*,' x $_;
$line =~ /^$p/ and $+[0]-1;
} ($ibad, $ibad+$have_fields-$num_fields);
# Replace extra commas and overwrite that part of the string
my $bad_field = substr($line, $beg+1, $end-$beg-1);
(my $fixed = $bad_field) =~ tr/,/ /;
substr($line, $beg+1, $end-$beg-1) = $fixed;
}
# Perhaps write the line out, for a corrected .csv file
print $line;
}
In the last line the bad part of $line is overwritten by assigning to substr, what this function allows. The new substring $fixed is constructed with commas changed (or removed, if desired), and used to overwrite the bad part of the $line. See docs.
If quotes are known to be there a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
$line =~ s/."([^"]+)"/join ' ', split(',', $1)/eg; # "
# process the line. note that double quotes are removed
}
If the quotes are to be kept move them inside parenthesis, to be captured as well.
This one line is all that need be done after while (...) { to clean up data.
The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.
All code has been tested with multiple lines and multiple commas in the bad field.

Unpack and x option in TEMPLATE

I have a row which looks like (no new lines, this is one line and I replaces the spaces with _ as otherwise they are trimmed):
46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P
The the use of unpack perl method returns as follows:
unpack("x49 A4", $line); # where $line is the above example line
returns: CALL
unpack("x68 A4", $line);
returns: 0122
unpack("x238 A4", $line);
return: 2015
Apparently, the column numbers do not match with the number given after 'x' in the TEMPLATE, as x238 is not equal to column 238 ('0000'), I have '2015' on column 251, not 238. The same for the other.
Please, explain how exactly the numbers given after 'x' in TEMPLATE work.
Thank you
First of all, your data isn't what you said it is. The data you provided produces the desired result.
$ perl -E'
my $line = "46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P";
$line =~ s/_/ /g;
say unpack("x238 A4", $line);
'
0000
Maybe your actual data contained non-printable characters. Or maybe some of the spaces were actually tabs.
$ perl -E'
$_ = "46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P";
s/_/ /g;
s/LTD\K\s+/\t\t/; # If the spaces after LTD were tabs
say unpack("x238 A4", $_);
'
2015
If you want to extract the characters at based on how your viewer expands tabs, you will need to expand tabs to spaces in the same fashion before passing the string to unpack.