perl -a: How to change column separator? - perl

I want to read the columns in a file where the separator is :.
I tried it like this (because according to http://www.asciitable.com, the octal representation of the colon is 072):
$ echo "a:b:c" | perl -a -072 -ne 'print "$F[1]\n";'
I want it to print b, but it doesn't work.

Look at -F in perlrun:
% echo "a:b:c" | perl -a -F: -ne 'print "$F[1]\n";'
b
Note that the value is taken as regular expression, so some delimiters may need some extra escaping:
% echo "a.b.c" | perl -a -F. -ne 'print "$F[1]\n";'
% echo "a.b.c" | perl -a -F\\. -ne 'print "$F[1]\n";'
b

-0 specifies the record (line) separator. It was cause Perl to receive three lines:
>echo a:b:c | perl -072 -nE"say"
a:
b:
c
Since there's no whitespace on any of those lines, $F[1] would be empty if -a were to be used.
-F specifies the input field separator. This is what you want.
perl -F: -lanE'say $F[1];'
Or if you're stuck with an older Perl:
perl -F: -lane'print $F[1];'
Command line options are documented in perlrun.

Related

Get version of Podspec via command line (bash, zsh) [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d\ -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by space (-d\ - d = delimiter, \ = escaped space character, something like -d" " would have also worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to omit the match
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of ": " in $line from the front.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of ": " in $line from the end.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at the linux documentation project.
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
grep potato file | grep -o "[0-9].*"

Data transformation using sed

I have a file like:
A
B
C
D
E
F
G
H
I
J
K
L
and I want it to come out like
A,B,C,D
E,F,G,H
I'm assuming I'd use sed, but actually I'm not even sure if that's the best tool. I'm open to using anything commonly available on a Linux system.
In perl, I did it like this ... it works, but it's dirty and has a trailing comma. Was hoping for something simpler:
$ perl -ne 'if (/^(\w)\R/) {print "$1,";} else {print "\n";}' test
A,B,C,D,
E,F,G,H,
I,J,K,L,
Set the input record separator to paragraph mode (-00) and then split each record on any remaining whitespace:
$ perl -00 -ne 'print join("," => split), "\n"' test
Add -l to enable automatic newlines (but make sure it comes before -00, because we want $\ to be set to the value of $/ before modification):
$ perl -l -00 -ne 'print join("," => split)' test
Add -a to enable autosplit mode and implicitly split to #F:
$ perl -l -00 -ane 'print join("," => #F)' test
Swap out -n for -p for automatic printing:
$ perl -l -00 -ape '$_ = join("," => #F)' test
You could use
awk 'BEGIN {RS=""; FS="\n"; ORS="\n"; OFS=","} {$1=$1} 1' file
I see the gawk manual says this:
If RS
is set to the null string, then records are separated by blank lines. When RS is set to the null string, the newline character always acts as a field separator, in addition to whatever value FS may have.
So we don't actually need to specify FS to get the desired output:
awk 'BEGIN {RS=""; ORS="\n"; OFS=","} {$1=$1} 1' file
xargs could do it,
$ xargs -n4 < file | tr ' ' ','
A,B,C,D
E,F,G,H
I,J,K,L
Replacing newlines with sed is a bit complicated (see this question). It is easier to use tr for the newlines. The rest can be done by sed.
The following command assumes that yourFile does not contain any ,.
tr '\n' , < yourFile | sed 's/,*$/\n/;s/,,/\n/g'
The tr part converts all newlines to ,. The resulting string will have no newlines.
s/,*$/\n/ removes trailing commas and appends a newline (text files usually end with a newline).
s/,,/\n/g replaces ,, by a newline. Two consecutive commas appear only where your original file contained two consecutive newlines, that is where the sections are separated by an empty line.

sed does not recognize -r flag on AIX

thanks in advance for the help.
I have the following line that does work on linux.
myfile (extract)
active_instance_count=
aq_tm_processes=1
archive_lag_target=0
audit_file_dest=?/rdbms/audit
audit_sys_operations=FALSE
audit_trail=NONE
background_core_dump=partial
background_dump_dest=/home1/oracle/app/oracle/admin/iopecom/bdump
...
cat myfile |sed -r 's/ {1,}//g'|sed -r 's/\t*//g' |grep -v "^#"|sed -s "/^$/d" |sed =|sed 'N;s/\n/\t/'|sed -r "s/#.*//g" | sed "s/\t/;/g"|sed "s/\t/;/g"|sed -e "s,',\o042,g"
The result will be:
1;O7_DICTIONARY_ACCESSIBILITY=TRUE
2;active_instance_count=
3;aq_tm_processes=1
4;archive_lag_target=0
5;audit_file_dest=?/rdbms/audit
6;audit_sys_operations=FALSE
7;audit_trail=NONE
8;background_core_dump=partial
9;background_dump_dest=/home1/oracle/app/oracle/admin/iopecom/bdump
But, I can't figure out, how to perform the same command on AIX server.
Help is very welcome.
Regards.
Antonio.
Unless you have a compelling reason to use sed, you could use alternate tools:
awk -v OFS=';' '{print NR,$0}' filename
would produce the desired output.
You could also use perl:
perl -ne 'print "$.;$_"' filename
It appears that your sed expression would skip lines beginning with a #. As such, you could say:
perl -ne '$,=";"; !/^#/ && print ++$i,$_' filename
or something like:
grep -v '^#' filename | awk ...
reformatting your pipeline:
cat myfile |
sed -r 's/ {1,}//g' | # strip all spaces (1)
sed -r 's/\t*//g' | # strip all tabs (2)
grep -v "^#" | # delete all lines beginning `#` (3)
sed -s "/^$/d" | # delete all empty lines (4)
sed = | # interleave with line numbers (5)
sed 'N;s/\n/\t/' | # join line number and line with `\t` (6)
sed -r "s/#.*//g" | # strip all `#` comments (7)
sed "s/\t/;/g" | # replace all tabs with `;` (8)
sed "s/\t/;/g" | # do it again (9)
sed -e "s,',\o042,g" # replace all ' with " (10)
Boiling that down and using cat -n to provide the line numbers up front gets:
cat -n myfile |
sed "$(print 's/\t/;/')
$(print 's/[ \t]*//g')
s/#.*//g
/^$/d
s/'/\"/g"
which behaves identically unless I'm misreading the aix docs. The $(...) construction is command substitution, it runs that command and substitutes its output. print would be printf on linux.

sed or grep or awk to match very very long lines

more file
param1=" 1,deerfntjefnerjfntrjgntrjnvgrvgrtbvggfrjbntr*rfr4fv*frfftrjgtrignmtignmtyightygjn 2,3,4,5,6,7,8,
rfcmckmfdkckemdio8u548384omxc,mor0ckofcmineucfhcbdjcnedjcnywedpeodl40fcrcmkedmrikmckffmcrffmrfrifmtrifmrifvysdfn drfr4fdr4fmedmifmitfmifrtfrfrfrfnurfnurnfrunfrufnrufnrufnrufnruf"****
need to match the content of param1 as
sed -n "/$param1/p" file
but because the line length (very long line) I cant match the line
what’s the best way to match very long lines?
The problem you are facing is that param1 contains special characters which are being interpreted by sed. The asterisk ('*') is used to mean 'zero or more occurrences of the previous character', so when this character is interpreted by sed there is nothing left to match the literal asterisk you are looking for.
The following is a working bash script that should help:
#!/bin/bash
param1=' 1,deerfntjefnerjfntrjgntrjnvgrvgrtbvggfrjbntr\*rfr4fv\*frfftrjgtrignmtignmtyightygjn 2,3,4,5,6,7,8, rfcmckmfdkckemdio8u548384omxc,mor0ckofcmineucfhcbdjcnedjcnywedpeodl40fcrcmkedmrikmckffmcrffmrfrifmtrifmrifvysdfn'
cat <<EOF | sed "s/${param1}/Bubba/g"
1,deerfntjefnerjfntrjgntrjnvgrvgrtbvggfrjbntr*rfr4fv*frfftrjgtrignmtignmtyightygjn 2,3,4,5,6,7,8, rfcmckmfdkckemdio8u548384omxc,mor0ckofcmineucfhcbdjcnedjcnywedpeodl40fcrcmkedmrikmckffmcrffmrfrifmtrifmrifvysdfn
EOF
Maybe the problem is that your $param1 contains special characters? This works for me:
A="$(perl -e 'print "a" x 10000')"
echo $A | sed -n "/$A/p"
($A contains 10 000 a characters).
echo $A | grep -F $A
and
echo $A | grep -P $A
also works (second requires grep with built-in PCRE support. If you want pattern matching you should use either this or pcregrep. If you don't, use the fixed grep (grep -F)).
echo $A | grep $A
is too slow.

How can I apply Unix's / Sed's / Perl's transliterate (tr) to only a specific column?

I have program output that looks like this (tab delim):
$ ./mycode somefile
0000000000000000000000000000000000 238671
0000000000000000000000000000000001 0
0000000000000000000000000000000002 0
0000000000000000000000000000000003 0
0000000000000000000000000000000010 0
0000000000000000000000000000000011 1548.81
0000000000000000000000000000000012 0
0000000000000000000000000000000013 937.306
What I want to do is on FIRST column only: replace 0 with A, 1 with C, 2 with G, and 3 with T.
Is there a way I can transliterate that output piped directly from "mycode".
Yielding this:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
...
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
Using Perl:
C:\> ./mycode file | perl -lpe "($x,$y)=split; $x=~tr/0123/ACGT/; $_=qq{$x\t$y}"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC 1548.81
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
You can use single quotes in Bash:
$ ./mycode file | perl -lpe '($x,$y)=split; $x=~tr/0123/ACGT/; $_="$x\t$y"'
As #ysth notes in the comments, perl actually provides the command line options -a and -F:
-a autosplit mode with -n or -p (splits $_ into #F)
...
-F/pattern/ split() pattern for -a switch (//'s are optional)
Using those:
perl -lawnF'\t' -e '$,="\t"; $F[0] =~ y/0123/ACGT/; print #F'
It should be possible to do it with sed, put this in a file (you can do it command-line to, with -e, just don't forget those semicolons, or use separate -e for each line). (EDIT: Keep in mind, since your data is tab delimited, it should in fact be a tab character, not a space, in the first s//, make sure your editor doesn't turn it into spaces)
#!/usr/bin/sed -f
h
s/ .*$//
y/0123/ACGT/
G
s/\n[0-3]*//
and use
./mycode somefile | sed -f sedfile
or chmod 755 sedfile and do
./mycode somefile | sedfile
The steps performed are:
copy buffer to hold space (replacing held content from previous line, if any)
remove trailing stuff (from first space to end of line)
transliterate
append contents from hold space
remove the newline (from the append step) and all digits following it (up to the space)
Worked for me on your data at least.
EDIT:
Ah, you wanted a one-liner...
GNU sed
sed -e "h;s/ .*$//;y/0123/ACGT/;G;s/\n[0-3]*//"
or old-school sed (no semicolons)
sed -e h -e "s/ .*$//" -e "y/0123/ACGT/" -e G -e "s/\n[0-3]*//"
#sarathi
\AWK solution for this
awk '{gsub("0","A",$1);gsub("1","C",$1);gsub("2","G",$1);gsub("3","T",$1); print $1"\t"$2}' temp.txt