Data transformation using sed - perl

I have a file like this (an empty line separates each group of four lines):
A
B
C
D
E
F
G
H
I
J
K
L
and I want it to come out like
A,B,C,D
E,F,G,H
I'm assuming I'd use sed, but actually I'm not even sure if that's the best tool. I'm open to using anything commonly available on a Linux system.
In perl, I did it like this ... it works, but it's dirty and has a trailing comma. Was hoping for something simpler:
$ perl -ne 'if (/^(\w)\R/) {print "$1,";} else {print "\n";}' test
A,B,C,D,
E,F,G,H,
I,J,K,L,

Set the input record separator to paragraph mode (-00) and then split each record on any remaining whitespace:
$ perl -00 -ne 'print join("," => split), "\n"' test
Add -l to enable automatic newlines (but make sure it comes before -00, because we want $\ to be set to the value of $/ before modification):
$ perl -l -00 -ne 'print join("," => split)' test
Add -a to enable autosplit mode and implicitly split to @F:
$ perl -l -00 -ane 'print join("," => @F)' test
Swap out -n for -p for automatic printing:
$ perl -l -00 -ape '$_ = join("," => @F)' test

You could use
awk 'BEGIN {RS=""; FS="\n"; ORS="\n"; OFS=","} {$1=$1} 1' file
I see the gawk manual says this:
If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, the newline character always acts as a field separator, in addition to whatever value FS may have.
So we don't actually need to specify FS to get the desired output:
awk 'BEGIN {RS=""; ORS="\n"; OFS=","} {$1=$1} 1' file
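As a quick check of the paragraph-mode behaviour described above, here is a minimal sketch that runs the shorter awk command on sample data. The function name join_paragraphs is made up for illustration, and the sample input assumes the blocks are separated by one empty line.

```shell
# Hypothetical helper: join the lines of each blank-line-separated block with commas.
join_paragraphs() {
  # RS="" switches awk to paragraph mode; newline then always acts as a field separator.
  # $1=$1 forces awk to rebuild the record using OFS.
  awk 'BEGIN { RS = ""; OFS = "," } { $1 = $1 } 1' "$@"
}

printf 'A\nB\nC\nD\n\nE\nF\nG\nH\n' | join_paragraphs
# A,B,C,D
# E,F,G,H
```

ORS is left at its default (a newline), so it does not need to be set explicitly.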

xargs could do it,
$ xargs -n4 < file | tr ' ' ','
A,B,C,D
E,F,G,H
I,J,K,L

Replacing newlines with sed is a bit complicated (see this question). It is easier to use tr for the newlines. The rest can be done by sed.
The following command assumes that yourFile does not contain any ,.
tr '\n' , < yourFile | sed 's/,*$/\n/;s/,,/\n/g'
The tr part converts all newlines to ,. The resulting string will have no newlines.
s/,*$/\n/ removes trailing commas and appends a newline (text files usually end with a newline).
s/,,/\n/g replaces ,, by a newline. Two consecutive commas appear only where your original file contained two consecutive newlines, that is where the sections are separated by an empty line.
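To see the two substitutions at work, here is a small sketch of the same pipeline wrapped in a function. The name join_blocks is hypothetical, and the sketch assumes GNU sed (for \n in the replacement) and input that contains no commas.

```shell
# Hypothetical helper: same tr + sed pipeline as above, reading from stdin.
join_blocks() {
  # After tr, the whole input is a single line such as "A,B,,C,D,".
  # 1st substitution: trailing comma(s) become the final newline.
  # 2nd substitution: each ",," (a former empty line) becomes a newline.
  tr '\n' , | sed 's/,*$/\n/;s/,,/\n/g'
}

printf 'A\nB\n\nC\nD\n' | join_blocks
# A,B
# C,D
```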

Get version of Podspec via command line (bash, zsh) [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep selects every line that contains the string potato:. For each of those lines, sed substitutes (s///) everything (.*) from the beginning of the line (^) through the last occurrence of ": " (a colon followed by a space) with the empty string, leaving only the value.
or
grep 'potato:' file.txt | cut -d' ' -f2
For each line that contains potato:, cut splits the line into fields delimited by a space (-d' ' sets the delimiter; an escaped space, that is a backslash followed by a space, would also work) and prints the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to discard everything matched so far, so that potato: and the space are not part of the output
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
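For reference, a minimal sketch of that sed command run against the sample data from the question. The function name potato_values is only for illustration.

```shell
# Hypothetical helper: print what follows "potato:" on matching lines, nothing else.
potato_values() {
  # -n suppresses automatic printing; the p flag prints only lines where
  # the substitution succeeded.
  sed -n 's/^potato:[[:space:]]*//p' "$@"
}

printf '%s\n' 'potato: 1234' 'apple: 5678' 'potato: 5432' 'grape: 4567' | potato_values
# 1234
# 5432
```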
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of the pattern "*: " (anything up to and including a colon and a space) from the front of $line.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
Or, if you wanted the key rather than the value, %% tells bash to delete the longest match of the pattern ": *" (a colon, a space and everything after them) from the end of $line.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at The Linux Documentation Project.
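The two expansions can be tried on a single string without any loop. This is just a sketch; the variable names key and value are made up for the example. Inside ${...} the space does not strictly need the backslash, so it is written unescaped here.

```shell
line='potato: 1234'

key=${line%%: *}     # remove the longest suffix matching ": *"
value=${line##*: }   # remove the longest prefix matching "*: "

printf '%s=%s\n' "$key" "$value"
# potato=1234
```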
Modern Bash has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done < file.txt
grep potato file | grep -o "[0-9].*"

perl -lane change delimiter

I have one liner:
az account list -o table |perl -lane 'print $F[0] if /GS/i'
I want to change default delimiter from '\t' to '-'
Any hint how to do this?
Just wanted to stress that it is a one-liner I'm looking for ;)
Plain -a splits on any whitespace, not just tab. The -F option allows you to specify a different delimiter.
az account list -o table |
perl -laF- -ne 'print $F[0] if /GS/i'
perlrun is the manual page that tells you about Perl's command-line options. It says:
-a
turns on autosplit mode when used with a "-n" or "-p". An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the "-n" or "-p".
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
@F = split(' ');
print pop(@F), "\n";
}
}
An alternate delimiter may be specified using -F.
And:
-Fpattern
specifies the pattern to split on for "-a". The pattern may be surrounded by //, "", or '', otherwise it will be put in single quotes. You can't use literal whitespace or NUL characters in the pattern.
-F implicitly sets both "-a" and "-n".

Use sed to replace every character by itself followed by $n times a char?

I'm trying to run the command below to replace every character in DECEMBER with itself followed by $n question marks. I tried both escaping {$n}, like so: \{$n\}, and leaving it as is. Yet my output keeps being D?{$n}E?{$n}... Is it just not possible to do this with sed?
How should I go about this?
echo 'DECEMBER' > a.txt
sed -i "s%\(.\)%\1\(?\){$n}%g" a.txt
cat a.txt
This might work for you (GNU sed):
n=5
sed -E ':a;s/[^\n]/&\n/g;x;s/^/x/;/x{'"$n"'}/{z;x;y/\n/?/;b};x;ba' file
Append a newline to each non-newline character in a line $n times then replace all newlines by the intended character ?.
N.B. The newline is chosen as the initial substitute character as it is not possible for it to be within a line (sed uses newlines to separate lines) and if the final substitution character already exists within the current line, the substitutions are correct.
Range (also, interval or limiting quantifiers), like {3} / {3,} / {3,6}, are part of regex, and not replacement patterns.
You can use
sed -i "s/./&$(for i in {1..7}; do echo -n '?'; done)/g" a.txt
See the online demo:
#!/bin/bash
sed "s/./&$(for i in {1..7}; do echo -n '?'; done)/g" <<< "DECEMBER"
# => D???????E???????C???????E???????M???????B???????E???????R???????
Here, . matches any char, and & in the replacement pattern puts it back and $(for i in {1..7}; do echo -n '?'; done) adds seven question marks right after it.
This one-liner should do the trick:
sed 's/./&'$(printf '%*s' "$n" '' | tr ' ' '?')'/g' a.txt
with the assumption that $n expands to a positive integer and the command is executed in a POSIX shell.
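The same idea can be split into two steps so the padding string is visible. This is a sketch only: the function name pad_each_char and the variable fill are hypothetical, and it assumes the fill character is not special in a sed replacement (?, as used here, is fine).

```shell
# Hypothetical helper: pad_each_char COUNT STRING
pad_each_char() {
  # Build COUNT spaces with printf, then turn each space into "?".
  fill=$(printf '%*s' "$1" '' | tr ' ' '?')
  # & stands for the matched character; the padding is appended after it.
  printf '%s\n' "$2" | sed "s/./&$fill/g"
}

pad_each_char 2 DECEMBER
# D??E??C??E??M??B??E??R??
```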
Efficiently using any awk in any shell on every Unix box after setting n=2:
$ awk -v n="$n" '
BEGIN {
new = sprintf("%*s",n,"")
gsub(/./,"?",new)
}
{
gsub(/./,"&"new)
print
}
' a.txt
D??E??C??E??M??B??E??R??
To make the changes "inplace" use GNU awk with -i inplace just like GNU sed has -i.
Caveat - if the character you want to use in the replacement text is & then you'd need to use gsub(/./,"\\\\\\&",new) in the BEGIN section to make sure it is treated as a literal instead of a backreference metachar. You'd have that issue and more (e.g. handling \1 or /) with any sed solution. Any solution that uses double quotes around the script would have further issues with handling $s, and the solutions that have the shell expand a command substitution unquoted would have even more issues with globbing chars.

Insert linebreak in a file after a string

I have a unique (to me) situation:
I have a file - file.txt with the following data:
"Line1", "Line2", "Line3", "Line4"
I want to insert a linebreak each time the pattern ", is found.
The output of file.txt shall look like:
"Line1",
"Line2",
"Line3",
"Line4"
I am having a tough time trying to escape ", .
I tried sed -i -e "s/\",/\n/g" file.txt, but I am not getting the desired result.
I am looking for a one liner using either perl or sed.
You may use this gnu sed:
sed -E 's/(",)[[:blank:]]*/\1\n/g' file.txt
"Line1",
"Line2",
"Line3",
"Line4"
Note how you can use single quote in sed command to avoid unnecessary escaping.
If you don't have gnu sed then here is a POSIX compliant sed solution:
sed -E 's/(",)[[:blank:]]*/\1\
/g' file.txt
To save changes inline use:
sed -i.bak -E 's/(",)[[:blank:]]*/\1\
/g' file.txt
Could you please try the following? It uses awk's substitution mechanism, in case you are OK with awk.
awk -v s1="\"" -v s2="," '{gsub(/",[[:blank:]]+"/,s1 s2 ORS s1)} 1' Input_file
Here's a Perl solution:
perl -pe 's/",\K/\n/g' file.txt
The pattern matches the ", sequence, but \K tells the regex engine to keep everything matched to its left out of the replacement, so the ", itself is not replaced. The substitution therefore just inserts a newline after it.
I used the single quote for the argument to -e, but that doesn't work on Windows where you have to use ". Instead of escaping the ", you can specify it in another way. That's code number 0x22, so you can write:
perl -pe "s/\x22,\K/\n/g" file.txt
Or in octal:
perl -pe "s/\042,\K/\n/g" file.txt
Use this Perl one-liner:
perl -F'/"\K,\s*/' -lane 'print join ",\n", @F;' in_file > out_file
Or this for in-line replacement:
perl -i.bak -F'/"\K,\s*/' -lane 'print join ",\n", @F;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/"\K,\s*/' : Split into @F on a double quote, followed by comma, followed by 0 or more whitespace characters, rather than on whitespace.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. This keeps the double quote in the @F elements, while the comma and whitespace are removed during the split.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

Skip/remove non-ascii character with sed

Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa
I've been trying to use sed to modify email addresses in a .csv but the line above keeps tripping me up, using commands like:
sed -i 's/[\d128-\d255]//' FILENAME
from this Stack Overflow question doesn't seem to work, as I get an 'invalid collation character' error.
Ideally I don't want to change that combined AE character at all, I'd rather sed just skip right over it as I'm not trying to manipulate that text but rather the email addresses. As long as that AE is in there though it causes my sed substitution to fail after one line, delete the character and it processes the whole file fine.
Any ideas?
This might work for you (GNU sed):
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa
Then do what you have to do and, to revert afterwards, run:
echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa" |
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa
If you have tricky characters in strings and want to understand how sed sees them, use the l0 command (see here). It is also very useful for debugging difficult regexps.
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,cdirkland@hotmail.com,usa$
sed -i 's/[^[:print:]]//' FILENAME
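Also, this acts like dos2unix when the carriage return is the first non-printable character on the line; add the g flag to remove every non-printable character rather than just the first.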
The issue you are having is the locale.
If you want to use a range like that, you need to change the character type (LC_CTYPE) and the collation order (LC_COLLATE).
The range fails because the bytes \x80 through \xff are invalid in a UTF-8 string.
Note that \u0080 != \x80 in UTF-8.
To get this to work, just run
LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME
This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
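A small sketch of the effect, assuming GNU sed (for the \dNNN escapes). The function name strip_high_bytes is hypothetical, and the g flag is added here so that both bytes of the UTF-8 encoded æ (octal 303 246) are removed rather than only the first one.

```shell
# Hypothetical helper: delete every byte in the range 128-255.
strip_high_bytes() {
  # LC_ALL=C makes sed treat the input as plain bytes, so the range is valid.
  LC_ALL=C sed 's/[\d128-\d255]//g' "$@"
}

# \303\246 is the UTF-8 byte sequence for æ.
printf 'Drob\303\246Sphere Inc\n' | strip_high_bytes
# DrobSphere Inc
```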
I came here trying this sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.
In this case it suffices to remove \x00 from the range, yielding s/[\x01-\x1F]/ /g;
Unfortunately it seems that all characters from \x7F upwards, and some others, are disallowed as range endpoints, as can be seen with this short script:
for (( i=0; i<=255; i++ )); do
printf "== $i - \x$(echo "ibase=10;obase=16;$i" | bc) =="
echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
done
Note that the problem is only the use of those characters to specify a range. You can still list them all manually or per script. E.g. to come back to your example:
sed -i 's/[\d128-\d255]//' FILENAME
would become
c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME
which would translate to:
sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
In this case there is a way to just skip non-ASCII chars, not bothering with removing.
LANG=C sed /someemailpattern/
See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and Will sed (and others) corrupt non-ASCII files?.
How about using awk for this? We set the field separator to the empty string so that every character becomes its own field, then loop over the characters. An if statement checks whether each one matches our character class: if it does, we print it; otherwise we skip it.
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'
Test:
[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'
Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
Update:
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i; printf "\n"}' < datafile.csv > asciidata.csv
I have added printf "\n" after the loop to keep the lines separate.