Perl - "/" causing issues for splitting by comma - perl

I'm trying to split a file by ",". It is a CSV file.
However, one "column" has values that includes "/" and spaces. And it seems to freak out with that column and does not print anything after that column but moves on to the next row.
My code is simply:
perl -lane '@values = split(",",$F[0]); print $values[0]."\t".$values[3];' basefile.txt > newfile.txt
The basefile.txt looks like:
"1","text","abc // 123 /// some more text // text","filename1"
"2","text","abc // 123 /// some more text // text","filename2"
"3","text","abc // 123 /// some more text // text","filename3"
My newfile.txt should have an output of:
"1","filename1"
"2","filename2"
"3","filename3"
Instead I get:
"1",
"2",
"3",
Thanks!

It's not the / that is confusing perl here, it's the spaces combined with the -a flag. Try:
perl -lne '@values = split(",",$_); print $values[0]."\t".$values[3]' basefile
Or, better yet, use Text::CSV_XS to do the splitting.
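For reference, a minimal Text::CSV_XS sketch might look like the following (file names are taken from the question; the comma-separated output matches the question's expected newfile.txt, and since getline() strips the surrounding quotes they are added back when printing):
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $in,  '<', 'basefile.txt' or die "basefile.txt: $!";
open my $out, '>', 'newfile.txt'  or die "newfile.txt: $!";
while (my $row = $csv->getline($in)) {
    # fields come back unquoted, so re-quote the 1st and 4th columns
    printf {$out} "\"%s\",\"%s\"\n", $row->[0], $row->[3];
}
close $out;
close $in;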

It's not the '/', it's the spaces.
The -a flag causes perl to split each line of input and put the fields into the array @F. The delimiter for this split operation is whitespace, unless you override it with the -Fdelimiter option on the command line.
So for the input
"1","text","abc // 123 /// some more text // text","filename"
with the -lan flags specified, perl sets
$F[0] = '"1","text","abc';
$F[1] = '//';
$F[2] = '123';
$F[3] = '///';
$F[4] = 'some';
etc.
It seems like you just want to do your split operation on the whole line. In which case you should stop using the -a flag and just say
#values = split(",",$_); ...
or leverage the -a and -F... options and say
perl -F/,/ -lane '@values=@F; ...'
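Putting it together, either of these complete one-liners should produce the tab-separated id/filename pairs from the sample input (tab output, as in the original attempt):
perl -lne '@values = split(",", $_); print $values[0]."\t".$values[3]' basefile.txt > newfile.txt
perl -F',' -lane 'print $F[0]."\t".$F[3]' basefile.txt > newfile.txt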

Related

Replacing all occurrences after the nth occurrence in a line in Perl

I need to replace all occurrences of a string after the nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected; instead, it replaces all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, using gensub, please try the following. This is based entirely on your shown samples, where you want to remove : from the 3rd occurrence onwards. gensub is used to separate the matched value into parts and then all colons are removed from the second part (from the 3rd colon onwards), as per your requirement.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ( undef, $first, @rest ) = split /:/;
    print ":$first:", join( "", @rest ), "\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
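To run the same logic against a file such as test.txt (the file name your sed attempt used) instead of the inline __DATA__ block, a one-liner version might be:
perl -ne 'chomp; my (undef, $first, @rest) = split /:/; print ":$first:", join("", @rest), "\n"' test.txt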
You can use a Perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (see the if /^:account_id:/ condition).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result - iterates over the fields: if the field number is greater than 2, append just the current field value to result, otherwise append the value plus the output field separator; then print the result.
I would use GNU AWK the following way if n is fixed and equal to 2. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator; this by itself removes all :, so I add back the colons that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested this solely against this data, so if you want to use it you should first test it with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from the newline in the copy (the replaced third :) through the newline in the amended line.
N.B. The newline is the best delimiter to use in the case of sed, as the lines presented to sed's commands are initially devoid of newlines. However, the important property of the delimiter is that it is unique, so it can be any character as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

get column list using sed/awk/perl

I have different files like below format
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the column list that has more columns, provided the columns are in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected Results :
Both file headers and positions are different as a message
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
           close ARGV;
           next if @lines < 2;
           @lines = sort { length $a <=> length $b } @lines;
           if (0 == index "$lines[1],", "$lines[0],") {
               print $lines[1];
           } else {
               print "Both file headers and positions are different";
           }' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code; it can only continue once more than one input line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)
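For readers who prefer it spelled out, the same comparison as a small stand-alone script might look like this (file names are the question's; only the header line of each file is read):
use strict;
use warnings;

my @headers;
for my $file ('File1', 'File2') {
    open my $fh, '<', $file or die "$file: $!";
    chomp(my $header = <$fh>);    # only the first line matters
    push @headers, $header;
    close $fh;
}
@headers = sort { length $a <=> length $b } @headers;
if (0 == index "$headers[1],", "$headers[0],") {
    print "$headers[1]\n";
} else {
    print "Both file headers and positions are different\n";
}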

Can "perl -a" somehow re-join #F using the original whitespace?

My input has a mix of tabs and spaces for readability. I want to modify a field using perl -a, then print out the line in its original form. (The data is from findup, showing me a count of duplicate files and the space they waste.) Input is:
2 * 4096 backup/photos/photo.jpg photos/photo.jpg
2 * 111276032 backup/books/book.pdf book.pdf
The output would convert field 3 to kilobytes, like this:
2 * 4 KB backup/photos/photo.jpg photos/photo.jpg
2 * 108668 KB backup/books/book.pdf book.pdf
In my dream world, this would be my code, since I could just will perl to automatically recombine @F and preserve the original whitespace:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print;'
In real life, joining with a single space seems like my only option:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print join(" ", @F);'
Is there any automatic variable which remembers the delimiters? If I had a magic array like that, the code would be:
perl -lanE 'BEGIN{use List::Util "reduce";} $F[2]=int($F[2]/1024)." KB"; print reduce { $a . shift(@magic) . $b } @F;'
No, there is no such magic object. You can do it by hand though
perl -wnE'@p = split /(\s+)/; $p[4] = int($p[4]/1024); print @p' input.txt
The capturing parens in split's pattern mean that the separator is also returned, so you capture the exact spaces. Since the spaces are in the array, the third field is now the fifth element.
As it turns out, -F has this same property. Thanks to Сухой27. Then
perl -F'(\s+)' -lanE'$F[4] = int($F[4]/1024); say @F' input.txt
Note: with 5.20.0 "-F now implies -a and -a implies -n". Thanks to ysth.
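To also get the " KB" label from the desired output, the same -F approach should work, e.g.:
perl -F'(\s+)' -lanE'$F[4] = int($F[4]/1024) . " KB"; say @F' input.txt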
You could just find the correct part of the line and modify it:
perl -wpE's/^\s*+(?>\S+\s+){2}\K(\S+)/int($1\/1024) . " KB"/e'

Sed - replace words

I have a problem with replacing a string.
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
I want to find the occurrence of Svc up to the next | and swap its place with Stm up to the next |.
My attempts ended up replacing characters, which is not my goal.
awk -F'|' -v OFS='|' '{a=b=0;
  for(i=1;i<=NF;i++){a=$i~/^Stm=/?i:a;b=$i~/^Svc=/?i:b}
  t=$a;$a=$b;$b=t}7' file
outputs:
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
The code exchanges the Stm=... and Svc=... columns, no matter which one comes first.
If a Perl solution is okay, this assumes only one column matches each of the search terms:
$ cat ip.txt
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
$ perl -F'\|' -lane '
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F;
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t;
print join "|", @F;
' ip.txt
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
-F'\|' -lane splits the input line on |, see also Perl flags -pe, -pi, -p, -w, -d, -i, -t?
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F gets the indexes of the columns matching Svc and Stm
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t swaps the two columns
Or use ($F[$i[0]], $F[$i[1]]) = ($F[$i[1]], $F[$i[0]]); courtesy of How can I swap two Perl variables
print join "|", @F prints the modified array
You need to use capture groups and backreferences in a string substitution.
The below will swap the 2:
echo '|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631' | sed 's/\(Stm.*|\)\(.*\)\(Svc.*|\)/\3\2\1/'
As pointed out in the comment from @Kent, this will not work if the strings are not in that order.

Perl To Parse Whitespace Separated Columns

I have a large text file with three columns, each column separated by four spaces. I need a Perl script to read this text file and output columns #1 and #2 to a new text file, with each of these columns wrapped in quotation marks and separated by a comma in the output file.
The text file with the three columns has data which looks like this:
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
What I would like the output to look like is
"9a2ba3c0580b5f3799ad9d6f487b2d38","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
Easy as a one-liner:
perl -lane 'print join ",", map qq("$_"), @F[0, 1]'
-l handles newlines in print
-n reads the input line by line
-a splits each line on whitespace into the @F array
@F[0, 1] is an array slice; it extracts the first two elements of the @F array
map wraps each element in double quotes
join inserts the comma in between
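For example, to run it against a file and capture the result (file names assumed):
perl -lane 'print join ",", map qq("$_"), @F[0, 1]' input.txt > output.txt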
The code below is for your reference:
#!/usr/bin/perl
use strict;
use warnings;

my $defaultFileName = defined $ARGV[0] ? $ARGV[0] : "filename.txt";
die "Could not find file: $defaultFileName" unless (-f $defaultFileName);
open my $fh, '<', $defaultFileName or die "Could not open $defaultFileName: $!";
foreach my $line (<$fh>) {
    my @tmpData = split(/\s+/, $line);
    # first two whitespace-separated columns, each wrapped in quotes
    printf "\"%s\",\"%s\"\n", $tmpData[0], $tmpData[1];
}
close $fh;
This can also be done with awk
>>cat test
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
OUTPUT:
>>awk '{FS=" "}{print "\""$1"\",""\""$2"\",""\""$3"\"" }' test
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
>>awk '{FS=" "}{print "\""$1"\",""\""$2"\",""\""$3"\"" }' test > output.txt
then output.txt will have the desired output.