Separate date and time with a comma - sed

I have an access log with lines like this:
http://***.com ,**.**.**.**,2013-06-07 12:03:58 ,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0.
I need the date and time to be separated by a comma, using sed.

It seems you want something like this:
$ sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) \([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\)/\1,\2/g' file
http://***.com ,**.**.**.**,2013-06-07,12:03:58 ,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0.
\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) captures the date string.
The single space that follows matches the space between the date and the time.
\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\) captures the time string.
The replacement \1,\2 writes back the contents of group 1, a comma, and then the contents of group 2.
\(...\) is called a capturing group in basic regular expressions. For example, \([0-9]\{4\}\) captures a four-digit number into a group.
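For comparison, here is a minimal sketch of the same substitution written with extended regular expressions (sed -E, accepted by both GNU and BSD sed), where groups and intervals need no backslashes. The log line is made up for illustration and only mimics the shape of the one in the question.

```shell
# Hypothetical log line shaped like the one in the question.
line='http://example.com ,10.0.0.1,2013-06-07 12:03:58 ,Mozilla/5.0'

# ( ) capture the date and the time; \1,\2 writes them back with a comma between.
out=$(printf '%s\n' "$line" |
  sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})/\1,\2/g')

printf '%s\n' "$out"
```

Only the space between the date and the time is touched; the other spaces on the line stay as they are.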

This is much simpler to do with awk:
awk -F, '{sub(/ /,",",$3)}1' OFS=, file
http://***.com ,**.**.**.**,2013-06-07,12:03:58 ,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0.
Separate the fields using , and replace the first space in the 3rd field with ,.
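The field-based approach depends on the timestamp really being in field 3. A small sketch, assuming you want to leave lines alone when that is not the case: the pattern before the action is my own guard, and both input lines are invented.

```shell
# Two hypothetical lines: one log entry and one line without a timestamp field.
out=$(printf '%s\n' \
  'http://example.com ,10.0.0.1,2013-06-07 12:03:58 ,Mozilla/5.0' \
  'malformed entry' |
  awk -F, -v OFS=, '$3 ~ /^[0-9-]+ [0-9:]+/ { sub(/ /, ",", $3) } 1')

printf '%s\n' "$out"
```

Lines whose third field does not look like "date time" are printed unchanged, because awk only rebuilds the record when a field is modified.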

A somewhat safer, though more complicated, sed anchors the match on the surrounding commas:
sed 's/\(,[-0-9]\{10\}\) \{1,\}\([0-9:]\{8\} *,\)/\1,\2/' YourFile
A simpler sed, closer to the awk version, replaces the second space on each line, which in the sample is the one between the date and the time:
sed 's/ /,/2' YourFile
In both cases (as in the earlier answers) this assumes that every line in the file looks like the sample. If not, you have to show the other possible line formats (in an HTTP log, any error that can occur, such as a bad-URL message, and so on).

How to sed replace UTF-8 characters with HTML entities in another file?

I'm running Cygwin under Windows 10.
I have a dictionary file (1-dictionary.txt) that looks like this:
labelling labeling
flavour flavor
colour color
organisations organizations
végétales v&#x00E9;g&#x00E9;tales
contrôlée contr&#x00F4;l&#x00E9;e
" &#x0022
The separators between the columns are TABs (\t).
The dictionary file is encoded as UTF-8.
I want to replace words and symbols in the first column with words and HTML entities in the second column.
My source file (2-source.txt) has the target UTF-8 and ASCII symbols. The source file also is encoded as UTF-8.
Sample text looks like this:
Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
I run the following sed one-liner in a shell script (./3-script.sh):
sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt
The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.
However the substitution of ASCII symbols, such as the quote symbol, and UTF-8 words produces this result:
vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbol (not the full word), I get results like this:
vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
The ASCII quote symbol has &#x0022 appended to it; it is not replaced.
Similarly, each UTF-8 symbol has its HTML entity appended instead of being replaced by it.
The expected output would look like this:
v&#x00E9;g&#x00E9;tales
&#x0022cultivated&#x0022
contr&#x00F4;l&#x00E9;e
How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?
I tried it: replacing every & with \& in your 1-dictionary.txt solves the problem.
sed's substitute command treats the "from" part as a regex, so any regex metacharacters in the first column must be escaped with \.
The "to" part has special characters too, mainly \ and &. An unescaped & stands for the whole matched text, which is why the match was repeated in your output. Escape those with \ as well.
See the GNU sed documentation for details; for other sed versions, check man sed.
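If you would rather not edit the dictionary by hand, the escaping can be done while the sed script is generated. This is only a sketch under some assumptions: the two dictionary entries are made up, temporary files replace the process substitution so it runs in plain sh, and it escapes & only (a dictionary containing / or \ would need more care).

```shell
# Hypothetical TAB-separated dictionary; the second entry contains '&'.
dict=$(mktemp)
script=$(mktemp)
printf 'colour\tcolor\n"\t&#x0022;\n' > "$dict"

tab=$(printf '\t')
# Step 1: turn every '&' into '\&' so sed will not expand it to the match.
# Step 2: turn "from<TAB>to" into the command "s/from/to/g".
sed -e 's/&/\\&/g' \
    -e "s/^\(.*\)${tab}\(.*\)\$/s\/\1\/\2\/g/" "$dict" > "$script"

out=$(printf 'the "colour"\n' | sed -f "$script")
printf '%s\n' "$out"

rm -f "$dict" "$script"
```

The generated script contains s/"/\&#x0022;/g, so the quote is replaced by the entity rather than having the entity appended to it.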

sed text string modification by prepending a fixed value

I have a CSV file with 500 lines and 8 columns. One column is for the office phone number. About 200 of them are in the form -1234 but need to be in the form 123-456-7890.
Each of the 200 instances has a different last four digits, as they are desk phone numbers. The area code and exchange part are the same for all. So as an example, we have three data points:
-2329
-5679
-8891
These three items need to be converted to the full phone number so will need to look like:
212-598-2329
212-598-5679
212-598-8891
Does anybody have the sed wizardry to pull this off?
An example line of data looks like:
Bill,Smith,Mr.,bsmith@mydomain.org,bsmith,-5315,800-878-5554,\N,\N,\N,\N
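I would use awk:
awk -F, '$6 ~ /^-/ {$6="212-598"$6}1' OFS=, input.file
With , as both the input and output field delimiter, it is simple to access the sixth field and prepend the static value. The $6 ~ /^-/ condition limits the change to the short numbers, which start with -.
An alternative with sed: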
$ sed 's/,-/,212-598-/' missing_phone
Bill,Smith,Mr.,bsmith@mydomain.org,bsmith,212-598-5315,800-878-5554,\N,\N,\N,\N
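A slightly stricter sed variant, as a sketch: it only rewrites a field that consists of a dash and exactly four digits between two commas, so complete numbers are left alone. The two shortened records are invented for the example.

```shell
# Hypothetical records: one short number, one already complete number.
out=$(printf '%s\n' \
  'Bill,Smith,bsmith,-5315,800-878-5554' \
  'Ann,Jones,ajones,212-598-1111,800-878-5554' |
  sed 's/,-\([0-9]\{4\}\),/,212-598-\1,/')

printf '%s\n' "$out"
```

Note that this pattern needs a comma after the field, so it would not match if the phone number were the last column on the line.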

using sed to remove special chars and add spaces instead

I have a block of text I'd like to change:
^@^A^@jfits^@^A^@pin^@^A^@sadface^@^A^@secret^@^A^@test^@^A^@tools^@^A^@ttttfft^@^A^@tty^@^A^@vuln^@^A^@yes^@^
Using sed, I'd like to replace all the ^@^A^@ sequences (and variations of those characters) with a few spaces.
I tried:
cat -A file | sed 's/\^A\^\@/ /'
but that's obviously wrong. Can someone help?
If you can enumerate the allowed characters, you can do something like
sed -e 's/[^a-zA-Z0-9]/ /g'
which replaces everything outside the set of alphanumeric characters with a space.
If you just want to replace all non-printable characters with spaces, you can use a character class[1]:
sed -e 's/[^[:print:]]/ /g'
Some older versions of sed may not support this syntax, but it is standardized in the Unix specification, so you should not feel guilty about using it.[2]
[1] http://sed.sourceforge.net/sedfaq3.html
[2] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
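A quick way to try the character-class approach without the original file is to generate the control bytes with printf. This is a sketch that assumes GNU sed (which copes with NUL bytes in its input); the \{1,\} interval is my addition so that each run of control characters becomes a single space.

```shell
# printf emits NUL (\000) and Ctrl-A (\001) around two of the words.
out=$(printf '\000\001\000jfits\000\001\000pin\000\001\000\n' |
  sed 's/[^[:print:]]\{1,\}/ /g')

printf '[%s]\n' "$out"
```

Drop the \{1,\} if you want one space per control character, as in the command above.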
It looks like ^A is not two characters but a single control character, so you should write something like \x01 instead.
There are three character ranges: \x00-\x1f (and \x7f) are control characters, \x20-\x7e are printable ASCII, and everything above depends on the encoding.
I don't know sed well, but if you want printable ASCII only, this is how I would do it in Perl (\n is kept so that lines are not joined):
head /dev/urandom | perl -pe 's/[^\x20-\x7e\n]/ /g'
If you only want to replace ^A and ^@, you can use this (GNU sed):
sed 's/[\x00\x01]/ /g' file
Similar questions have already been discussed elsewhere:
https://superuser.com/questions/75130/how-to-remove-this-symbol-with-vim
Replacing Control Character in sed

Sed replace in html file

How could I append 'index.html' to all links in an HTML file that do not end with that word?
So that, for example, href="http://mysite/" would become href="http://mysite/index.html".
I am not a sed expert, but I think this works:
sed -e "s_\"\(http://[^\"]*\)/index.html\"_\"\1\"_g" \
-e "s_\"\(http://[^\"]*[^/]\)/*\"_\"\1/index.html\"_g"
The first replacement finds URLs already ending in /index.html and deletes this ending.
The second replacement adds the /index.html as required. It deals with cases that end in / and also those that don't.
More than one version of sed exists. I'm using the one that comes in XCode for OS X.
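Here is a sketch of the same two-step idea run on three made-up links, using | as the delimiter to cut down on backslashes. The only change in behaviour I intended is [^/"] in place of [^/], so the last character of the URL can never be the closing quote.

```shell
out=$(printf '%s\n' \
  '<a href="http://mysite/">' \
  '<a href="http://mysite/docs">' \
  '<a href="http://mysite/index.html">' |
  sed -e 's|"\(http://[^"]*\)/index\.html"|"\1"|g' \
      -e 's|"\(http://[^"]*[^/"]\)/*"|"\1/index.html"|g')

printf '%s\n' "$out"
```

The first expression strips an existing /index.html, and the second adds it back to every link, so links that already had it end up unchanged.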
For an href ending with /:
sed 's|\(href="http://[^"]*/\)"|\1index.html"|g' YourFile
If there are folder references without a trailing /, you should specify what counts as a file and what does not (for example, a last path component containing a dot is a file).
What about this:
echo 'href="http://mysite/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://mysite/index.html"
echo 'href="http://www.google.com/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://www.google.com/index.html"
In general this is an almost unsolvable problem. If your html is "reasonably well behaved", the following expression searches for things that "look a lot like a URL"; you can see it at work at http://regex101.com/r/bZ9mR8 (this shows the search and replace for several examples; it should work for most others)
((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_#-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\#\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?
The result of the above match should be replaced with
\1/index.html
Unfortunately this requires regex features (non-capturing groups and lookahead) that are beyond the capabilities of sed, so you will have to use perl, as follows:
perl -p -e 's/((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_#-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\#\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?/$1\/index.html/gi'
It looks a bit daunting, I know, but it works. The only problem: if a link already ends in /, the result contains //index.html. You can take the output of the above and process it with
sed 's/\/\/index\.html/\/index.html/g'
to replace the double slash before index.html with a single slash.
Some examples (several more are given in the link above):
http://www.index.com/                        add /index.html
http://ex.com/a/b/                           add /index.html
http://www.example.com                       add /index.html
http://www.example.com/something             do nothing
http://www.example.com/something/            add /index.html
http://www.example.com/something/index.html  do nothing

How can I remove all non-word characters except the newline?

I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//g'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in \W? Or is there some other way?
This removes characters that don't match \w or \n:
cat file | perl -C -pe 's/[^\w\n]//g'
@sth's solution uses Perl, which (at least on my system) is not Unicode compatible, so it loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:
$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters
In Perl, I'd just add the -l switch, which strips the newline from each input line and appends it again on every print():
perl -ple 's/\W//g' file
Notice that you don't need the cat.
The previous answer isn't echoing the "ó" character, at least in my case. This does:
sed 's/\W//g' file
Best practice in shell scripting is to use tr rather than sed for deleting or replacing single characters, because it is faster and more efficient. Use sed when replacing longer strings.
tr -d '[:blank:][:punct:]' < file
When run with time I get:
real 0m0.003s
user 0m0.000s
sys 0m0.004s
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
real 0m0.003s
user 0m0.004s
sys 0m0.004s
While not a huge difference here, it becomes noticeable when running against larger data sets. Also notice that I didn't pipe cat's output into tr but used I/O redirection instead (one less process to spawn).
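One difference between the two approaches is worth checking before swapping one for the other: the underscore counts as a word character for \W but as punctuation for tr. A small sketch on an invented ASCII line, using the POSIX bracket expression [^[:alnum:]_] in place of GNU sed's \W:

```shell
sample='my_line - some words & text'

# Delete everything that is not a letter, digit or underscore.
sed_out=$(printf '%s\n' "$sample" | sed 's/[^[:alnum:]_]//g')

# Delete blanks and punctuation; '_' is punctuation, so it goes too.
tr_out=$(printf '%s\n' "$sample" | tr -d '[:blank:][:punct:]')

printf '%s\n%s\n' "$sed_out" "$tr_out"
```

Both commands leave the newline in place, since neither set of deleted characters includes it.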