sed command to change date format - date

Here is a snippet of the file I'm working with:
709ENVUN07,SET1,FE10,GB0009252882,GB,GBX,NULL,S,O,LO,1510.00000000,173,N,F,28022007,07:51:15,3717
208ATNHG07,SET1,FE10,GB0009252882,GB,GBX,NULL,S,O,LO,1550.00000000,1800,N,F,18012007,15:48:21,654681
As you can see the date is in this format: 28022007, 18012007
Using sed I've successfully changed to the format I wish.
gzip -dc allGlaxoOrderHistory.CSV.gz |sed 's/\([0-9]\{2\}\)\([0-9]\{2\}\)\(2[0-9]\{3\}\)/\1-\2-\3/g' > newOrderHistory.csv
However sed is also changing GB0009252882 to GB00-09-252882 as you can see below
709ENVUN07,SET1,FE10,GB00-09-252882,GB,GBX,NULL,S,O,LO,1510.00000000,173,N,F,28-02-2007,07:51:15,3717
208ATNHG07,SET1,FE10,GB00-09-252882,GB,GBX,NULL,S,O,LO,1550.00000000,1800,N,F,18-01-2007,15:48:21,654681
Question is how do I change 28022007, 18012007 to this 28-02-2007 ,18-01-2007 without GB0009252882 changing too.

Your date field is the 15th from the start. You can write your pattern like this:
sed 's/\(\([^,]*,\)\{14\}..\)\(..\)/\1-\3-/'
Where [^,]*, describes a field (with its trailing separator).
You can also work by fields more easily with awk. You only need to set the input and output delimiter to ,
With awk (Gnu), target the 15th field:
awk -F, -vOFS=, '{$15=gensub(/(..)(..)(....)/, "\\1-\\2-\\3", "g", $15)}1' yourfile
The parameter -F, sets the input delimiter and -vOFS=, the output delimiter. The 1 at the end is a shortcut for print.
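gensub is a GNU awk extension, so here is a minimal sketch of the same field-based idea using only substr. The function name reformat_date is made up for illustration, and the sketch assumes the date is always eight digits in field 15:

```shell
# Hypothetical helper: rewrite field 15 from DDMMYYYY to DD-MM-YYYY.
reformat_date() {
  awk -F, -v OFS=, '
    # Only touch the field when it is exactly eight digits.
    $15 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
      $15 = substr($15, 1, 2) "-" substr($15, 3, 2) "-" substr($15, 5, 4)
    }
    { print }
  '
}

# Sample line taken from the question; field 4 (GB0009252882) is left alone.
printf '%s\n' '709ENVUN07,SET1,FE10,GB0009252882,GB,GBX,NULL,S,O,LO,1510.00000000,173,N,F,28022007,07:51:15,3717' | reformat_date
```

Because only field 15 is edited, no other column can be changed by accident, whatever digits it contains.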

Related

sed - Replace comma after first regex match

I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file with a specific field containing a decimal comma. I would like to replace that comma with a dot.
I've tried the following to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.
CSV files should be parsed in awk with a proper FPAT variable that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.
The following sed will convert all decimal separators in quoted numeric fields:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a , within a pair of "'s and replaces it with a . (dot). The regexp is anchored to the start of the line, so it has to be repeated until nothing more matches. That is the job of the :a label and the ta command: ta jumps back to :a whenever the last substitution succeeded.
N.B. The solution expects that all double quotes are matched and that no double quotes are quoted i.e. \" does not appear in a line.
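The loop can be seen at work on a line with two quoted fields. This is a minimal sketch: the function name fix_quoted_commas and the sample line are made up for illustration, and the commands are passed as separate -e arguments so the label also works on seds that dislike :a;...;ta on one line:

```shell
# Hypothetical helper: turn every comma inside a "..." pair into a dot.
fix_quoted_commas() {
  sed -E -e ':a' \
         -e 's/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./' \
         -e 'ta'
}

# Each pass fixes one quoted comma; commas between fields stay untouched.
printf '%s\n' 'x,"1,5",y,"2,75",z' | fix_quoted_commas
```

The first pass rewrites "1,5", the second pass skips over the now comma-free first pair and rewrites "2,75", and the third pass finds nothing, so ta falls through and the line is printed.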
If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.
Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.
To write the regexp with unescaped ( ) and + as in Perl, you have to enable extended regular expressions with -r (GNU sed; -E also works there and on BSD sed).
So if you want to replace all numbers and omit the " sign, then you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace first occurrence only you can use that:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
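A short sketch of the greedy match, including what happens to a line with no hyphen at all. The function name strip_last_segment and the second sample line are made up for illustration:

```shell
# Hypothetical helper: drop everything from the last hyphen onward.
strip_last_segment() {
  # \(.*\) is greedy, so it backs off only as far as the LAST hyphen.
  sed 's/\(.*\)-.*/\1/'
}

printf '%s\n' 'Saxony-Ii-Barber-Shop-L0000392491' 'NoHyphenHere' | strip_last_segment
```

A line without a hyphen does not match the pattern, so sed prints it unchanged rather than emptying it.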
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
The perl code for this would be: perl -nle 'print $1 if m{-.*?/(.*?-.*?)-}'
We can break the Regex down to matching the following:
- matches the hyphen that's between the city and state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
awk likes these things:
$ awk -F'[/-]' -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the fields at positions NF-3 and NF-2, separated by the delimiter -. Note that $NF stands for the last field, hence $(NF-1) is the penultimate, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
Since what you want is every - separated field except the last one, reverse the line, take everything from the 2nd field onward, and reverse it back.
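The pipeline can be wrapped in a small function. This is a minimal sketch, assuming rev (from util-linux or busybox) is installed; the function name drop_last_field is made up for illustration:

```shell
# Hypothetical helper: remove the last hyphen-separated field of each line.
drop_last_field() {
  # After rev, the unwanted last field is the FIRST one, so cut -f2- drops it.
  rev | cut -d- -f2- | rev
}

printf '%s\n' 'Crew-Star-Inc-L0000391998' 'Test-L0000392334' | drop_last_field
```

The trick is needed because cut can say "from field 2 to the end" but has no way to say "all but the last field".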
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern features and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in-place but will keep a backup of the original data in a file called myfile.bak

Using sed to swap columns X and X+1 inline in delimited file

I have a file with multiple lines and for line 2 to the end of the file I want to swap fields 8 and 9. The file is comma separated and I'd like to do the swap inline so I can run it on a batch of files using * wildcard. If this can be accomplished similarly with awk then that works for me too.
example:
header1,header2,header3,...,header8,header9,...,headerN
field1.1,...,field1.9,field1.8,...,field1.N
field2.1,...,field2.9,field2.8,...,field2.N
field3.1,...,field3.9,field3.8,...,field3.N
...
I think the command would look similar to sed -r -i '2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/' temp*.log,
but \2 is not what I expect, it is the 7th field. I know that \2 will not be the 8th field because I have double parentheses there, but I'm not sure how to fix it. Could somebody please explain what this equation is doing and specifically what [^,] is doing and how the {8} is applied?
Thanks in advance.
In awk, you might use:
awk -F',' 'BEGIN {OFS=","} {t = $8; $8 = $9; $9 = t; print}'
In sed, the command is more convoluted, but it could be done.
sed -e 's/^\(\([^,]*,\)\{7\}\)\([^,]*,\)\([^,]*,\)/\1\4\3/'
Add the -i.bak option if your version of sed (e.g. GNU or BSD) supports it.
This uses the universally available sed regexes (it would work on even archaic versions of sed). You could lose most of the backslashes if you used 'extended regular expressions' instead:
sed -r -i 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/\1\4\3\5/'
Note the nested remembered (captured) patterns. The outer set is \1, the inner set would be \2 but that gets repeated 7 times, so you'd have the seventh field as \2. Anyway, that's why the eighth and ninth columns are switched with \4 and \3. \5 are the remaining columns.
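The numbering of nested groups can be made visible by printing each group in the replacement. This is a minimal sketch on a made-up four-field line, with {2} instead of {7} to keep it short; the function name show_groups is hypothetical:

```shell
# Hypothetical helper: show what each capture group holds.
show_groups() {
  # \1 = outer group (first two fields), \2 = inner group (last repetition only),
  # \3 = third field, \4 = the rest of the line.
  sed -E 's/^(([^,]*,){2})([^,]*),(.*)/1=[\1] 2=[\2] 3=[\3] 4=[\4]/'
}

printf '%s\n' 'a,b,c,d' | show_groups
# prints: 1=[a,b,] 2=[b,] 3=[c] 4=[d]
```

[^,]* matches a run of characters that are not commas (one field), the following , consumes its separator, and {2} repeats that field-plus-separator unit twice. A repeated group keeps only its last match, which is why \2 is "b," and not "a,".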
(I note in passing that it would have been helpful to have some sample data in sufficiently the correct format to test with. It was a nuisance having to edit what is shown in the question to be able to test the code.)
If you need to do much CSV work, then either use Perl and its CSV modules (Text::CSV and Text::CSV_XS) or Python and its CSV module, or get CSVfix.
\2 is the second group in the RE.
Groups are numbered by the position of their opening (.
So in
'2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/'
you can see (following the alignment):
\1 = (([^,]*,){8})
\2 = ([^,]*,) (the inner group, repeated by {8})
\3 = ([^,]*,)
\4 = ([^,]*,)
and finally \5 = (.*)
In this specific case, \2 holds the text matched by the last of the eight repetitions ({8}).
it seems that awk is the right tool:
awk -F',' -v OFS=',' '{t=$8;$8=$9;$9=t}7' file
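The question asks to leave the header line alone, which the awk commands here do not do. A minimal sketch that adds that condition follows; the function name swap_8_9 and the ten-column sample are made up for illustration:

```shell
# Hypothetical helper: swap fields 8 and 9 on every line except the first.
swap_8_9() {
  awk -F, -v OFS=, 'NR > 1 { t = $8; $8 = $9; $9 = t } { print }'
}

printf '%s\n' 'h1,h2,h3,h4,h5,h6,h7,h8,h9,h10' '1,2,3,4,5,6,7,9,8,10' | swap_8_9

# Possible batch usage (not run here), since plain awk has no in-place option:
#   for f in temp*.log; do swap_8_9 < "$f" > "$f.tmp" && mv "$f.tmp" "$f"; done
```

Writing to a temporary file and renaming it is the usual substitute for sed -i when the tool cannot edit in place.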
This might work for you (GNU sed):
sed -ri '1!s/(,[^,]*)(,[^,]*)/\2\1/4' file
This swaps the 9th field with the 8th i.e. 8 / 2 = 4, if you wanted the 7th with the 8th:
sed -ri '1!{s/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//}' file

Using sed or awk, how can I alter the first field in a delimited line?

I have a delimited file whose first few fields look like this:
2774013300|184500|2012-01-04 23:00:00|
and I want to alter certain rows whose first field equals or exceeds 8 characters.
I want to truncate the value in the first column.
In the case of 2774013300 I want its value to become 27740133.
I would like to do this in sed, preferably, or awk.
Using sed, I can find any number that exceeds 8 digits at the beginning of the line, but am not quite sure how to truncate it, using, I would assume, substitute.
sed -n -e /'^[0-9]\{10,\}/p' infile
I am thinking I could use grouping for the first 8 characters and return those in a substitute command, but I'm not quite sure how to do that.
In awk, I can detect the first field, but am not quite sure how to use substr to alter the first field and then return the remaining fields, so a full line is preserved.
awk -F'|' '{ if (length($1) > 9) { print $1; print length($1);} }' infile
Depending on the subtleties of your situation, you can use
sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' infile
or
sed 's/^\([0-9]\{8\}\)[0-9]\{1,\}/\1/' infile
which with GNU sed can be simplified to
sed -r 's/^([0-9]{8})[0-9]+/\1/' infile
or, if you need to, add -n and p.
Example:
$ sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' <<<'2774013300|184500|2012-01-04 23:00:00|'
27740133|184500|2012-01-04 23:00:00|
Using awk:
awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
example:
$ echo "2774013300|184500|2012-01-04 23:00:00|" | awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
27740133|184500|2012-01-04 23:00:00|

How do I get rid of lines not matching a timestamp via sed?

I am not sure why sed is not working as expected in this particular instance. I have lines of the form:
12:42:46.675 token
where I expect the timestamp to always have that format. Unfortunately every now and then there are lines in the file which do not begin with a timestamp and I want to get rid of those. I tried filtering out everything that does not match the above with:
sed -n /^\d{2}:\d{2}:\d{2}.\d{3}/p
but the above filters everything out, even if I give sed the -r option. What is the correct way of doing that with sed? And is there an alternative with grep?
Using grep to only display lines starting with timestamp format:
grep -E '^([0-9]{2}:){2}[0-9]{2}\.[0-9]{3} ' file
sed doesn't accept \d; use [0-9] instead. Also, in basic regular expressions (sed's default) { and } are literal characters, so you need to escape them as \{ and \} to get the repetition behaviour. The dot should be escaped as well, otherwise it matches any character. The result looks like:
sed -n '/^[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\.[0-9]\{3\}/p' infile
EDIT: Also surround the expression between quotes (better singles than double) to avoid shell expansion.
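A small sketch to check the filter against good and bad lines. The function name keep_timestamped and the two bad sample lines are made up for illustration, and the sketch escapes the dot so that it matches only a literal period:

```shell
# Hypothetical helper: keep only lines that start with HH:MM:SS.mmm
keep_timestamped() {
  sed -n '/^[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\.[0-9]\{3\}/p'
}

# Only the first line has a well-formed timestamp.
printf '%s\n' '12:42:46.675 token' 'stray line' '12:42:46x675 token' | keep_timestamped

# The same test written with grep and extended regular expressions:
printf '%s\n' '12:42:46.675 token' 'stray line' '12:42:46x675 token' |
  grep -E '^([0-9]{2}:){2}[0-9]{2}\.[0-9]{3}'
```

The third sample line shows why the escape matters: with an unescaped dot, 12:42:46x675 would also pass the filter.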