AWK/Sed string manipulation - sed

I have a string in the following format and I want to convert it to CSV format (note that the separator is the underscore character "_"):
Title_YYYYMMDD_emailname convert to Title,YYYYMMDD,emailname
This is simple enough using sed ...
echo "Report_20131107_jlsmith" | sed 's/_/,/g'
Output:
Report,20131107,jlsmith
But there are complications trying to parse a string that contains underscores in the title field ..
I want to retain the underscores in the title (if any) but change the underscores to commas for the
date and emailname ...
For instance:
Report_Title_20131107_jlsmith convert to: Report_Title,20131107,jlsmith
And a related question: is there a way to compress multiple repeating instances of the underscore character for the entire string?
Report_Title____20131107_jlsmith convert to: Report_Title,20131107,jlsmith

Last request first:
echo "Report_Title____20131107_jlsmith" | awk '{gsub(/_+/,"_")}1'
Report_Title_20131107_jlsmith
First request (using GNU awk, since gensub is a GNU extension):
echo "Report_Title_more_20131107_jlsmith" | awk '{print gensub(/_([0-9]+)_/,",\\1,","g")}'
Report_Title_more,20131107,jlsmith
All in one command
echo "Report_Title___more_20131107_jlsmith" | awk '{gsub(/_+/,"_");print gensub(/_([0-9]+)_/,",\\1,","g")}'
Report_Title_more,20131107,jlsmith
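If gensub is not available (it is specific to GNU awk), a roughly equivalent result can be sketched in portable awk by splitting on runs of underscores and rebuilding the title. This is only a sketch: the function name name_to_csv is made up for illustration, and it assumes the last two fields are always the date and the email name.

```shell
# Hypothetical helper: convert Title_parts_YYYYMMDD_email to CSV.
# Assumes the final two underscore-separated fields are date and email.
name_to_csv() {
  printf '%s\n' "$1" | awk -F'_+' '{
    title = $1
    # Everything before the last two fields belongs to the title.
    for (i = 2; i <= NF - 2; i++) title = title "_" $i
    print title "," $(NF - 1) "," $NF
  }'
}

name_to_csv 'Report_Title___more_20131107_jlsmith'
```

Because the field separator is the regex _+, repeated underscores are collapsed as a side effect of the split.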

With the format you have shown, you could replace ____YYYYMMDD_ with ,YYYYMMDD, using sed as follows
echo 'Report_Title____20131107_jlsmith' | sed 's/__*\([0-9]\{8\}\)__*/,\1,/g'
Report_Title,20131107,jlsmith

Using sed
sed -r -e 's/_+/_/g' -e 's/_([^_]+)_([^_]+)$/,\1,\2/'
Or more robust with stringent regex
sed -r -e 's/_+/_/g' -e 's/^(.+)_([0-9]{8})_(\w+)$/\1,\2,\3/'
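For reuse, the stricter variant can be wrapped in a small shell function. A minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do) and using [^_]+ in place of \w+ for portability; the name title_to_csv is hypothetical.

```shell
# Hypothetical wrapper around the two-step sed approach:
# 1) squeeze runs of underscores, 2) turn the separators around the
#    8-digit date into commas.
title_to_csv() {
  printf '%s\n' "$1" |
    sed -E -e 's/_+/_/g' -e 's/^(.+)_([0-9]{8})_([^_]+)$/\1,\2,\3/'
}

title_to_csv 'Report_Title____20131107_jlsmith'
```

Input that does not end in _YYYYMMDD_name fails the second expression and is printed with only the underscores squeezed, which makes malformed lines easy to spot.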

Related

sed - Replace comma after first regex match

I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file in which one specific field contains a decimal comma. I would like to replace that comma with a dot.
I've tried the following to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.
CSV files should be parsed in GNU awk with a proper FPAT variable that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.
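FPAT exists only in GNU awk. Where only a POSIX awk is available, a different and cruder technique gives the same result for simple input: use the double quote itself as the field separator, so that even-numbered fields are the contents of quoted CSV fields. This is a sketch, not a CSV parser; the function name is invented here and it assumes quotes are balanced and never escaped.

```shell
# Hypothetical helper: split each line on ", so fields 2, 4, 6, ...
# hold the text inside quoted CSV fields; replace commas only there.
quoted_commas_to_dots() {
  awk -F'"' -v OFS='"' '{
    for (i = 2; i <= NF; i += 2) gsub(/,/, ".", $i)
    print
  }'
}

printf '%s\n' 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC' |
  quoted_commas_to_dots
```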
The following sed (GNU sed, because of \? and \+) will convert the decimal comma in quoted numeric fields:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a comma (,) within a pair of double quotes and replaces it with a dot (.). Because the regexp is anchored to the start of the line, each pass replaces only one comma, so it has to be repeated until nothing more matches; hence the :a label and the ta command, which branch back for as long as the substitution succeeds.
N.B. The solution expects that all double quotes are matched and that no double quotes are quoted i.e. \" does not appear in a line.
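To see the loop at work, it helps to feed it a line with more than one quoted field. The sketch below wraps the same expression in a function (the name fix_decimals is hypothetical) and writes the script on separate lines, which some non-GNU seds require after a label.

```shell
# Hypothetical wrapper: repeat the anchored substitution until no comma
# remains inside any pair of double quotes.
fix_decimals() {
  sed -E ':a
s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./
ta'
}

printf '%s\n' 'a,"1,5",b,"2,75",c' | fix_decimals
```

The first pass rewrites "1,5", the branch goes back to :a, the second pass rewrites "2,75", and the third pass finds nothing and falls through. Commas outside the quotes are never touched.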
If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.
Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.
To use Perl-like syntax such as unescaped ( ) and + in sed, you have to enable extended regular expressions with -r (or -E).
If you want to convert every such number and drop the surrounding " characters, you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace only the first occurrence on each line, use the numeric flag 1 instead of g:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt

Replace first word with third one in every line, but words are separated by ":"

I'm trying to learn sed but am getting stuck when trying to replace the first word with the third. I was thinking about the code below, but it doesn't work.
Also, is there any way of splitting the line if the words are separated by ":" using sed?
sed "s/\(^[a-z,0-9]*\) \(.*\) \([a-z,0-9]*\)/\1 \2 \1/"
From your comment below it sounds like you actually want to replace the third word with the first one rather than the other way around. If so then:
$ echo 'first:second:third' | sed 's/\(\([^:]*\).*:\).*/\1\2/'
first:second:first
or if you have many fields to manipulate:
$ echo 'first:second:third' | sed 's/\([^:]*\):\([^:]*\):\([^:]*\)/\1:\2:\1/'
first:second:first
but you should really use awk for anything involving fields anyway:
$ echo 'first:second:third' | awk 'BEGIN{FS=OFS=":"} {$3=$1} 1'
first:second:first
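For the literal reading of the title (first word replaced by the third), and for the side question about splitting on ":", something along these lines may serve. Both function names are invented for this sketch, and lines with fewer than three fields are passed through unchanged.

```shell
# Hypothetical helper: overwrite field 1 with field 3 on ":"-separated lines.
first_from_third() {
  awk 'BEGIN { FS = OFS = ":" } NF >= 3 { $1 = $3 } 1'
}

# Hypothetical helper: put each ":"-separated word on its own line.
# tr is used here rather than sed because "\n" in a sed replacement
# is not portable across sed implementations.
split_words() {
  tr ':' '\n'
}

echo 'first:second:third' | first_from_third
echo 'first:second:third' | split_words
```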

How to exclude end of lines of textfiles via terminal?

Given a file ./wordslist.txt containing lines of the form <word> <number_of_occurrences>, such as:
aš toto 39626
ir 35938
tai 33361
tu 28520
kad 26213
...
How can I strip the trailing digits from each line so that output.txt contains:
aš toto
ir
tai
tu
kad
...
Note:
sed, find, cut or grep preferred. I cannot use something which keeps only [a-z] characters, since my data can contain ASCII letters, non-ASCII letters, Chinese characters, digits, etc.
I suggest (note that this keeps only the first space-separated word, so aš toto becomes aš):
cut -d " " -f 1 wordslist.txt > output.txt
Or:
sed -E 's/ [0-9]+$//' wordslist.txt > output.txt
Use awk to print the first word in this case (for a multi-word entry such as aš toto, only aš is kept):
awk '{print $1}' your_file > your_new_file
awk solution to simply print input line excluding last column
$ awk '{NF--; print}' wordslist.txt
aš toto
ir
tai
tu
kad
Note:
This will only work in some awks. Per POSIX, incrementing NF adds a null field, but decrementing NF is undefined behavior (thanks @EdMorton for the info).
This doesn't check if last column is numeric and field separation in output will be single space only
If there can be empty lines in input file, use awk 'NF{NF--}1'
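Given the portability caveat about NF--, one alternative that stays within defined awk behavior is to delete the last column textually with sub(). A sketch, assuming columns are separated by plain spaces as in the sample; the function name is made up.

```shell
# Hypothetical helper: remove the last space-separated column from each
# line by deleting "spaces + non-spaces" at the end of the line.
drop_last_column() {
  awk '{ sub(/ +[^ ]+$/, "") } 1'
}

printf 'aš toto 39626\nir 35938\n' | drop_last_column
```

Unlike NF--, this leaves the spacing between the remaining words untouched and passes single-word or empty lines through unchanged. Like NF--, it does not check that the removed column is numeric.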
The following works :
sed -r 's/ [0-9]+$//g' wordslist.txt

Remove from the beginning till certain part in a string

I work with strings like
abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
and I need a new string in which everything from the beginning up to the last "_" is removed, keeping that underscore and the characters after it (there can be 3, 4, or any number of them),
so in this case I would get
_adf
How could I do it with "sed" or another bash tool?
Regular expression pattern matching is greedy. Hence ^.*_ will match all characters up to and including the last _. Then just put the underscore back in:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*_/_/'
sed -E 's/^(.*)_([^_]*)$/_\2/' < input.txt
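To see what greediness changes in practice, compare a greedy .* with a negated bracket expression that cannot cross an underscore. The two function names below are invented for this sketch.

```shell
# Greedy: ^.*_ runs to the LAST underscore on the line.
after_last_underscore() {
  sed 's/^.*_/_/'
}

# Not greedy in effect: [^_]* stops at the FIRST underscore.
after_first_underscore() {
  sed 's/^[^_]*_/_/'
}

echo 'abc_dsdsds_adf' | after_last_underscore
echo 'abc_dsdsds_adf' | after_first_underscore
```

The first command prints _adf and the second prints _dsdsds_adf.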
Do you need to modify the string, or just find everything after the last underscore? The regex to find the last _{anything} would be /(_[^_]+)$/ ($ matches the end of the string), or if you also want to match a trailing underscore with nothing after it, /(_[^_]*)$/.
If you only need to find this piece rather than modify the string in place, and a script is acceptable instead of the command line, this regex is a bit simpler (you tagged this with perl, so I wasn't sure how committed you were to using just the command line as opposed to a simple script).
If you do need to modify the file in place, use sed -i -E 's/.*(_[^_]+)$/\1/' myfile. The -i flag overwrites the old file with the new one. If you want to create a new file and not clobber the old one, use sed -E -e 's/.../.../' oldfile > newfile. A g after the s/// replaces every match on a line rather than only the first; with a pattern anchored by $ there is at most one match per line, so it makes no difference here.
If the string is not by itself at the end of the line but is embedded in other text and followed by whitespace, replace the $ with \s, which matches a whitespace character (the end of a word). Note that \s is a GNU extension.
If you have strings like these in bash variables (I don't see that specified in the question), you can use parameter expansion:
s="abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf"
t="_${s##*_}"
echo "$t" # ==> _adf
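One detail of the ${s##*_} expansion is that it returns the whole string when there is no underscore at all, so blindly prefixing "_" would invent one. A small sketch that guards against that case; the function name last_segment is hypothetical.

```shell
# Hypothetical helper: print "_" plus the text after the last underscore,
# or the string unchanged when it contains no underscore.
last_segment() {
  case $1 in
    *_*) printf '_%s\n' "${1##*_}" ;;  # ## strips the longest match of *_
    *)   printf '%s\n' "$1" ;;
  esac
}

last_segment 'abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf'
last_segment 'nounderscore'
```

This uses only POSIX parameter expansion, so it starts no external process and also works in sh.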
In Perl, you could do this:
my $string = "abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf";
if ( $string =~ m/(_[^_]+)$/ ) {
print $1;
}
[Edit]
A Perl one liner approach (ie, can be run from bash directly):
perl -lne 'm/(_[^_]+)$/ && print $1;' infile > outfile
Or using substitution:
perl -pe 's/.*(_[^_]+)$/$1/' infile > outfile
Just group the last non-underscore characters preceded by the last underscore with \(_[^_]*\), then reference this group with \1:
sed 's/^.*\(_[^_]*\)$/\1/'
Result:
$ echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*\(_[^_]*\)$/\1/'
_adf
A Perl way:
echo 'abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf' | \
perl -e 'print ((split/(_)/,<>)[-2..-1])'
output:
_adf
Just for fun:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | tr _ '\n' | tail -n 1 | rev | tr '\n' _ | rev

Sed - pattern matching with binary value as separator?

Is it possible to use binary values in sed pattern matching?
I have one-line strings which contain plain-text fields separated by the binary value 1.
Is it possible to use sed to match everything up to the binary separator 1?
Or should I use awk?
Example string where \x1 represents binary value 1:
key1=value1\x1key2=value2\x1key3=value3
Example expected output, values for key1 and key2:
value1 value2
edit: Here are a couple of options for printing the values based on a list of keys, couldn't figure out a more concise way with awk but one probably exists:
$ echo -e 'key1=value1\001key2=value2\001key3=value3' > test
$ sed 's/\x01/\n/g' test | awk -F= '{if ($1 == "key1" || $1 == "key2") print $2}'
value1
value2
$ sed 's/\x01/\n/g' test | perl -pe 's/((key1|key2)=(.*)|.*)/\3/'
value1
value2
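An awk-only variant is also possible by making the \001 byte the field separator, which keeps the values for one input line together on one output line as in the expected output. This is a sketch under the assumption that the wanted keys are hard-coded; the function name is invented.

```shell
# Hypothetical helper: split each line on byte 0x01, then split each
# field at its first "=" and collect the values of key1 and key2.
values_for_keys() {
  awk 'BEGIN { FS = "\001" } {
    out = ""
    for (i = 1; i <= NF; i++) {
      eq  = index($i, "=")
      key = substr($i, 1, eq - 1)
      val = substr($i, eq + 1)
      if (key == "key1" || key == "key2")
        out = out (out == "" ? "" : " ") val
    }
    print out
  }'
}

printf 'key1=value1\001key2=value2\001key3=value3\n' | values_for_keys
```

Splitting at the first "=" with index() rather than with -F= means a value that itself contains "=" is kept whole.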
sed has no non-greedy quantifier, so a pattern like .*\x01 matches up to the last \x01 rather than the first. Your options are to use a different language, or something like the following:
$ sed 's/\x01/\n/g' test | head -n 1
key1=value1
The answer to the following question has a good example of using a Perl regex for non-greedy matches:
Non greedy regex matching in sed?
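Within sed itself, the usual way around the missing non-greedy quantifier is to turn the problem around: instead of matching up to the first separator, delete from the first separator to the end of the line. A sketch that builds the separator with printf so that it does not depend on sed understanding \x01; the function name is hypothetical.

```shell
# Hypothetical helper: keep only the text before the first 0x01 byte.
up_to_first_sep() {
  sep=$(printf '\001')       # the literal separator byte
  sed "s/${sep}.*//"         # drop the first separator and everything after it
}

printf 'key1=value1\001key2=value2\001key3=value3\n' | up_to_first_sep
```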
If your sed does not parse \x01 itself (GNU sed does), you have to find a way to get the binary character into the command. For example, to convert them all to newlines:
sed -e "s/$(echo -e \\001)/\n/g" filename
Type Control-A at the point where you want the character \001 to appear (in bash, press Control-V first so that the character is inserted literally).
I would find this a lot easier than handling all the necessary escaping to get echo to produce the correct string if there are any backslashes in the regex - and I find there often are such backslashes.