sed regexp pattern misunderstanding

I am trying to parse a log line with sed:
echo '195.236.222.1 - - [24/Jul/2012:07:35:25 +0300] "GET / HTTP/1.1" 200 387 "http://www.google.fi/url?sa=t&rct=j&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg&usg=AFQjCNE6wg5zPXup3d3PRoqU-BtpiNCccw" "Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1"' |
sed -r 's/.*(\&q=.*)\&.*/\1/'
I would like to get "&q=tarinat", but unfortunately I get:
&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg
I don't understand why I get the whole string up to the end. Any assistance or hints would be highly appreciated.

The .* is quite greedy. You could replace it with the negated character class [^&]*, which matches anything but a & character:
echo '195.236.222.1 - - [24/Jul/2012:07:35:25 +0300] "GET / HTTP/1.1" 200 387 "http://www.google.fi/url?sa=t&rct=j&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg&usg=AFQjCNE6wg5zPXup3d3PRoqU-BtpiNCccw" "Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1"' |
sed -r 's/.*(\&q=[^&]*)\&.*/\1/'

The regex .* is greedy. You don't want it to be greedy, so you should probably write:
sed -r 's/.*(\&q=[^&]*)\&.*/\1/'

A simple way using grep:
grep -o "&q=[^&]*"
Result:
&q=tarinat
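If you only want the value without the &q= prefix, and your grep supports Perl-compatible regexes (the -P option in GNU grep), a small sketch like this should also work; \K drops the already-matched prefix from the output:
grep -oP '&q=\K[^&]*'
which prints just:
tarinat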


wget many long URLs from a .txt file

I have a couple hundred 10-second mp4s to download. The URLs to these files are listed in a file called urls.txt and they look like
http://v16.muscdn.com/thirty_two_alphanumeric_characters/5cf790de/video/tos/maliva/tos-maliva-v-0068/thirty_two_alphanumeric_characters/?rc=ang7cmg8OmZtaTMzZzczM0ApQHRAbzVHOjYzMzM0NTQ2ODMzMzQ1b0BoNXYpQGczdyl2KUBmamxmc3JneXcxcHpAKTY0ZHEzY2otcTZyb18tLWIxNnNzLW8jbyM2QS8wLS00LTQtLzYzMjYtOiNvIzphLW8jOmA6YC1vI2toXitiZmBjYmJeYDAvOg%3D%3D
so the total length of the URL is 329 characters.
When I try wget -i urls.txt I get Error 414 URI Too Long
But when I try to wget a random URL from the file by copy/pasting it into my terminal it works fine and downloads the one file.
So then I tried the following bash script to wget each URL in the file, but that gave me the same error.
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Text read from file: $line"
    wget "$line" --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22"
done < "urls.txt"
I also tried to change the line-ending characters by doing dos2unix on the file but it made no difference.
What else can I try?
If all your URLs are already in a single file, why don't you simply invoke wget as:
$ wget --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22" -i urls.txt
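If you suspect stray carriage returns, or simply want several downloads to run at once, a hedged alternative (assuming an xargs that supports -P, as GNU and BSD xargs do) is to strip any CR characters and hand each line to wget one at a time:
tr -d '\r' < urls.txt | xargs -n 1 -P 4 wget --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22"
Here -n 1 passes one URL per wget invocation and -P 4 runs up to four downloads in parallel.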

How to see the results found using curl or wget?

I need to see the results found using curl or wget when they return a '301' status code.
This is my variable using curl.
website=$(curl -s --head -w %{http_code} https://launchpad.net/~[a-z]/+archive/pipelight -o /dev/null | sed 's#404##g')
echo $website
301
The above works, but it only shows the '301' status code when the site exists.
Instead, I want:
echo $website
https://launchpad.net/~mqchael/+archive/pipelight
You can add the "effective URL" to your output. Change %{http_code} to "%{http_code} %{url_effective} ".
From there, it's just a matter of adjusting the regular expression. Change the sed string to 's#404 [^ ]* ##g'. That will eliminate not just each 404 status code but also the URL that follows it.
So:
curl -s --head -w "%{http_code} %{url_effective} " https://launchpad.net/~[a-z]/+archive/pipelight -o /dev/null | sed 's#404 [^ ]* ##g'
will give you:
301 https://launchpad.net/~j/+archive/pipelight
You may want to replace the HTTP codes with newlines after that.
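One way to do that, sketched here as a variation on the answer above rather than part of it: curl's -w format accepts \n, so you can emit one "code URL" pair per line and then keep only the 301 lines (the awk filter is my assumption about the desired output):
curl -s --head -o /dev/null -w "%{http_code} %{url_effective}\n" https://launchpad.net/~[a-z]/+archive/pipelight | awk '$1 == "301" {print $2}'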

How to ignore specific type of files to download in wget?

How do I ignore .jpg and .png files in wget, as I want to include only .html files?
I am trying:
wget -R index.html,*tiff,*pdf,*jpg -m http://example.com/
but it's not working.
Use the
--reject jpg,png --accept html
options to exclude/include files with certain extensions, see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.
Put patterns with wildcard characters in quotes, otherwise your shell will expand them, see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
# -r : recursive
# -nH : Disable generation of host-prefixed directories
# -nd : all files will get saved to the current directory
# -np : Do not ever ascend to the parent directory when retrieving recursively.
# -R : don't download files matching these patterns
# -A : get only *.html files (for this case)
For instance:
wget -r -nH -nd -np -A "*.html" -R "*.gz, *.tar" http://www1.ncdc.noaa.gov/pub/data/noaa/1990/
Worked example to download all files excluding archives:
wget -r -k -l 7 -E -nc \
-R "*.gz, *.tar, *.tgz, *.zip, *.pdf, *.tif, *.bz, *.bz2, *.rar, *.7z" \
-erobots=off \
--user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" \
http://misis.ru/
This is what I get from wget --help:
Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix|pcre).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
       --trust-server-names        use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.
So you can use -R or --reject to exclude extensions this way:
wget -R "index.html,*.tiff,*.pdf,*.jpg" http://example.com/
And in my case, here is the final command I used to recursively download/update non-HTML files from an indexed website directory:
wget -N -r -np -nH --cut-dirs=3 -nv -R "*.htm*,*.html" http://example.com/1/2/3/
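Since the help output above also lists --accept-regex and --reject-regex, a hedged alternative (assuming wget 1.14 or newer, where those options exist) is to filter on the full URL instead of an extension list, for example to skip common image types:
wget -r -np --reject-regex '\.(jpg|png|gif)$' http://example.com/
The default --regex-type is posix (extended regex), which supports the alternation used here.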

diff ignore blank lines

How can I get GNU diff to ignore the blank lines in the following example?
File a:
x
do

done
File b:
x

do
done
Neither file has trailing white spaces in any line.
Using GNU diff 3.1 on Mac OS X I get:
diff -w a b
2d1
< do
3a3
> do
Same when I add various promising-looking options:
diff --suppress-blank-empty -E -b -w -B -I '^[[:space:]]*$' --strip-trailing-cr -i a b
2d1
< do
3a3
> do
What am I missing here?
diff --version
diff (GNU diffutils) 3.1
I think the problem here is that diff is seeing do as being removed from the first file, and added to the second, maybe because there isn't enough context around the change.
If you reverse the order of the files as arguments, diff reports that the blank line is added and removed, and will then ignore it with --ignore-blank-lines.
Looking at it as a unified diff, this is a little more clear:
$ diff test.txt test2.txt -u
--- test.txt 2015-10-20 10:50:52.585167600 -0700
+++ test2.txt 2015-10-20 10:51:01.042167600 -0700
@@ -1,4 +1,4 @@
 x
-do
 
+do
 done
prp@QW7PRP09-14 ~/temp
$ diff test2.txt test.txt -u
--- test2.txt 2015-10-20 10:51:01.042167600 -0700
+++ test.txt 2015-10-20 10:50:52.585167600 -0700
@@ -1,4 +1,4 @@
 x
-
 do
+
 done
And with --ignore-blank-lines and the order switched, diff reports no differences at all:
prp@QW7PRP09-14 ~/temp
$ diff test2.txt test.txt -B -u
(no output)
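If you don't want the result to depend on argument order at all, a workaround sketch (assuming bash, for the <( ) process substitution) is to strip blank lines from both files before comparing, so blank lines can never appear as differences:
diff <(grep -v '^$' a) <(grep -v '^$' b)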

Grep/Find/Xargs: Search between two strings in folder or result of Wget

I have a folder full of html files, some of which have the following line:
var topicName = "website/something/something_else/1234/12345678_.*, website/something/something_else/1234/12345678_.*//";
I need to get all instances of the text between inverted commas into a text file. I've been trying to combine FIND.exe and XARGS.exe to do this, but have not been successful.
I've been looking at things like the following, but don't know where to start to combine all three to get the output I want.
grep -rI "var topicName = " *
Ideally, I want to combine this with a call to wget also. So, in order: (a) do a recursive mirror of a website (maybe limiting the results to HTML files), i.e.:
wget -mr -k -e robots=off --user-agent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" --level=1 http://www.website.com/someUrl
(b) go through the HTML in each result and check if it contains the text 'var topicName', and (c) if so, get the text between 'var topicName =' and '"' and write all the values to a text file at the end.
I'd appreciate any help at all with this.
Thanks.
For grabbing the text from the HTML into a file:
If your version of grep supports it, the -o switch tells it to only print the matched portion of the line.
With this in mind, two grep invocations should sort you out (provided you can uniquely identify ONLY the lines you wish to grab the text from); something like this:
grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' > topicNames.dat
If it's unacceptable to leave the " symbols in there, you could pass it via sed after the second grep:
grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' | sed 's/"//g' > topicNames.dat