I have to crawl the website http://docbao.com.vn/ using wget, but wget always reports:
HTTP request sent, awaiting response... No data received.
Retrying.
For example, when I tried to crawl all the web pages in the category http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec, the result was:
congnh#congnh-pc:~/Source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" -O -
--2013-02-20 23:53:16-- http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:17-- (try: 2) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:19-- (try: 3) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:22-- (try: 4) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:27-- (try: 5) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:32-- (try: 6) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:38-- (try: 7) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:45-- (try: 8) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:53-- (try: 9) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
...
Why does wget keep retrying indefinitely? What is the problem?
Thanks
Cong
Sorry for stating the obvious, but: wget retries because it does not receive any data. It sends the HTTP request headers and the remote host closes the connection immediately afterwards. I can only guess that this non-standard behaviour is due to a misconfiguration on the server side, possibly a deliberate one.
After poking around a bit, I found that the content is served once you signal that you can handle a gzip-encoded response. You can do so by adding --header="accept-encoding: gzip" to your wget command. This in turn is a problem for crawling with wget, because it cannot recurse into gzipped content. You will need to write a script to handle this situation, or use another tool that can handle such content.
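As a rough illustration of the "write a script" route, the sketch below fetches a single page with the gzip header and decompresses it locally. The helper names decode_body and fetch_page are made up for this example, and the -f pass-through for non-gzip input is assumed to behave as in GNU gzip. It does not recurse; link extraction would still have to be added on top.

```shell
# Decompress a response body read from stdin.
# -d decompress, -c write to stdout, -f pass non-gzip input through
# unchanged (assumption: GNU gzip behaviour).
decode_body() {
    gzip -dcf
}

# Hypothetical helper: fetch one URL while advertising gzip support,
# then decompress whatever comes back.
fetch_page() {
    wget -q --header="Accept-Encoding: gzip" -O - "$1" | decode_body
}

# Usage (needs network access):
# fetch_page "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" > Quoc_te.html
```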
On a side note: please be aware that not all websites allow scraping of their content. Check their TOS before you do so.
Related
I am getting an error when using wget to download individual files instead of every file listed in the manifest.
I want to download the files as described on this website
https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/index.html
The command they give works fine:
wget -i https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/manifest.txt
However, I only want to download certain files, not every file in the manifest. I looked at the manifest file and its contents look like this:
corpus-2018-05-03/s2-corpus-00.gz
corpus-2018-05-03/s2-corpus-01.gz
corpus-2018-05-03/s2-corpus-02.gz
corpus-2018-05-03/s2-corpus-03.gz
corpus-2018-05-03/s2-corpus-04.gz
corpus-2018-05-03/s2-corpus-05.gz
So I just changed the command to something like this
wget -i https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/s2-corpus-02.gz
The command runs fine at first, but after it downloads the file I get warning and/or error messages, and I'm not sure what they mean. Here is the output:
--2018-08-11 00:03:47-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/s2-corpus-02.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.128.152
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996588773 (950M) [application/x-gzip]
Saving to: ‘s2-corpus-02.gz’
s2-corpus-02.gz 100%[===================>] 950.42M 38.5MB/s in 25s
2018-08-11 00:04:13 (37.5 MB/s) - ‘s2-corpus-02.gz’ saved [996588773/996588773]
�7�sa����=���xT���~��%����3X�M�|~�X^Z%\�?�`��Fx?�%��\���/�5/�$��P����g+�v�j: Bad port number
s2-corpus-02.gz: Invalid URL https://�*�b�:ۅF�Cg��$�Bj�H�gLM逖N�l���ZUV�[�;&mu�̸��&�y��X�%��;�˝1|)�$�d˝�: Bad port number
s2-corpus-02.gz: Invalid URL https://{Y1��&�������\�Y�Ey�Զ�:E3;ɜ Q!: Bad port number
n]��g: Invalid host name
%��)]kZ�R�e����� Ӡ�{)]��B��0��OV�%T��: Invalid host name
s2-corpus-02.gz: Invalid URL https://7�s�s����{���!ސ#: Invalid host name
s2-corpus-02.gz: Invalid URL https://���ݔ�v�G7NI:,J�����i�YKN�o�.e�N�z< R� DZ$+4;!C�B���ZJ"�>��2�#`ǼU3��x��D� bqh���: Bad port number
�5�3���݂5�LLT�]���j0)dv7:2�]�x���a���fv�#��$=!Y�ږ�9U �#H*�Ǹ: Bad port number
uc;�]*�m������:����o4Z�`c�#,U��ze"vrY;,!̝rF���aL�L��7�Ն-�zs�w;Zu\^����e��H��m��{ʪ*��l���O: Bad port number
s2-corpus-02.gz: Invalid URL https://�:�D����: Bad port number
ٶ����1�>g�y���=͛����hv���O�b�o��m���i��&��w��/���{�k| �Q(zq��ϔ���: Bad port number
���^盩Y��'DIfe*��&��ƫO�|�80��湏��~9: Invalid host name
^zs��멨�u�o\?��#`x����{�>�˝�d��CI�C��4Fg������9j?�w�(X�N���7: Bad port number
s2-corpus-02.gz: Invalid URL https://��j�q(�Ur��1�KMq�1]��#d�aԌ����:�3�pEzbaj(��B��*}kK��ΊOu;B��V: Bad port number
s2-corpus-02.gz: Invalid URL https://�����`m���<�5��!;p3���~�`�)�Q���0�:!�n��`�r���D0ǖ�&r'�*.i�!��mM����n�oڀ�Zk�l�H1���t�: Bad port number
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%1F%8B%08
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%1F%8B%08
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%C9E%DF%C4$%0C.eL%7B%93%82%F1J%04%C3m%14%8Dl%9Ckk%AB%1B%7C%9B%B4%17%A26m
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%C9E%DF%C4$%0C.eL%7B%93%82%F1J%04%C3m%14%8Dl%9Ckk%AB%1B%7C%9B%B4%17%A26m
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Warning: wildcards not supported in HTTP.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%D4%9F%B5%F8%C3j%86%86%DEm6%CB%F5%EF%CE%CF%D7qn1n~%ED%EF%FA%99]%9D%F5%AB%DB%F3%A5]%C6%D5%B9sF4%A2%B52%A5%E8%99%16%3Ey%E3%92%16%9C%7B%CB%A2%60%C2%0B%99l%AD%9E%D0C%AFB*%CF%C5%A7%3C%10_q%B7%DDn%EE%FA%15%8D%CF??Y%D8%3C%CA%DFn1]%F7%DB%EA*v%F9%81y%F0j&j%D90%F3%E4%1F%FF%F3%C9%EE%CA%AB%B3%8B%B3%EAzio%17%FD%AC%DF_+Ykpu%7Dp%ED7go%CE%AA%9B8_b%96'%97
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Warning: wildcards not supported in HTTP.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%D4%9F%B5%F8%C3j%86%86%DEm6%CB%F5%EF%CE%CF%D7qn1n~%ED%EF%FA%99]%9D%F5%AB%DB%F3%A5]%C6%D5%B9sF4%A2%B52%A5%E8%99%16%3Ey%E3%92%16%9C%7B%CB%A2%60%C2%0B%99l%AD%9E%D0C%AFB*%CF%C5%A7%3C%10_q%B7%DDn%EE%FA%15%8D%CF??Y%D8%3C%CA%DFn1]%F7%DB%EA*v%F9%81y%F0j&j%D90%F3%E4%1F%FF%F3%C9%EE%CA%AB%B3%8B%B3%EAzio%17%FD%AC%DF_+Ykpu%7Dp%ED7go%CE%AA%9B8_b%96'%97
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Warning: wildcards not supported in HTTP.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%D6H%95/%DD%CF%F7%BBr%C7%DB%D7o.%DF%BF%ADh2%AB%D3%F2%CF%EB%97/_Vo%BB%19%A6uu_]%F6%F3%F9v%D1%F92%C5%F8%B8Hq%15%17%3E%92H%D00%5E%B8%F5fe%FD%06%0F%7B%F9y%13%17%EB%7C]%EAW%D5%FB%EB%ABo%AAnQ%D9j%DE%BBn%16+%1BN%EF%20%B6q%F1%B1[%F5%8B9%C4%B9%BA%B3%1Fc%E5b%5CT!~%8C%B3~%19C%E5%EE%AB%CD],%B7%BF~y%F3M%F5%A9_%FDHb%7B%BB%EA%B7Kt%F0.%AEc%15%F7/%B3+%7C%9C%C7%D5-]d%D7U%A4)%D9tx%F2%AA,%84j=.%83%DC%B0%0D%9A%8B%1E%CD%AA%18nc%B5%88%1Bz%C1%FA%ACz%D5%7FB
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
Warning: wildcards not supported in HTTP.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/%D6H%95/%DD%CF%F7%BBr%C7%DB%D7o.%DF%BF%ADh2%AB%D3%F2%CF%EB%97/_Vo%BB%19%A6uu_]%F6%F3%F9v%D1%F92%C5%F8%B8Hq%15%17%3E%92H%D00%5E%B8%F5fe%FD%06%0F%7B%F9y%13%17%EB%7C]%EAW%D5%FB%EB%ABo%AAnQ%D9j%DE%BBn%16+%1BN%EF%20%B6q%F1%B1[%F5%8B9%C4%B9%BA%B3%1Fc%E5b%5CT!~%8C%B3~%19C%E5%EE%AB%CD],%B7%BF~y%F3M%F5%A9_%FDHb%7B%BB%EA%B7Kt%F0.%AEc%15%F7/%B3+%7C%9C%C7%D5-]d%D7U%A4)%D9tx%F2%AA,%84j=.%83%DC%B0%0D%9A%8B%1E%CD%AA%18nc%B5%88%1Bz%C1%FA%ACz%D5%7FB
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
The name is too long, 243 chars total.
Trying to shorten...
New name is R�է.�%10�����4��M?%C7%90%F9I%97%E7%D1%DF%D9E%B7%9E%9FT%DDY%3C;%A9%5E__%DC%5C%5CUO1%16+%7B%BA%EE%D0%A8%87%F7%8D%13:5%07%CF%CE%AA?F%BCj%8E%0EmrW%F2%A4%F6%D36%7BH%16%FE%FC%88f%A1%F1%D0%0C%A9C%AB%1E%AE%B3%3CAG&%7B%98%91%2F%0C%CE%FF?)%FF%DF.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/R%BF%D5%A7.%90%10%85%B8%9C%F5%F74%B1%DBM?%C7%90%F9I%97%E7%D1%DF%D9E%B7%9E%9FT%DDY%3C;%A9%5E__%DC%5C%5CUO1%16+%7B%BA%EE%D0%A8%87%F7%8D%13:5%07%CF%CE%AA?F%BCj%8E%0EmrW%F2%A4%F6%D36%7BH%16%FE%FC%88f%A1%F1%D0%0C%A9C%AB%1E%AE%B3%3CAG&%7B%98%91/%0C%CE%FF?)%FF%DF%9B%142
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
The name is too long, 243 chars total.
Trying to shorten...
New name is R�է.�%10�����4��M?%C7%90%F9I%97%E7%D1%DF%D9E%B7%9E%9FT%DDY%3C;%A9%5E__%DC%5C%5CUO1%16+%7B%BA%EE%D0%A8%87%F7%8D%13:5%07%CF%CE%AA?F%BCj%8E%0EmrW%F2%A4%F6%D36%7BH%16%FE%FC%88f%A1%F1%D0%0C%A9C%AB%1E%AE%B3%3CAG&%7B%98%91%2F%0C%CE%FF?)%FF%DF.
Incomplete or invalid multibyte sequence encountered
--2018-08-11 00:04:20-- https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/R%BF%D5%A7.%90%10%85%B8%9C%F5%F74%B1%DBM?%C7%90%F9I%97%E7%D1%DF%D9E%B7%9E%9FT%DDY%3C;%A9%5E__%DC%5C%5CUO1%16+%7B%BA%EE%D0%A8%87%F7%8D%13:5%07%CF%CE%AA?F%BCj%8E%0EmrW%F2%A4%F6%D36%7BH%16%FE%FC%88f%A1%F1%D0%0C%A9C%AB%1E%AE%B3%3CAG&%7B%98%91/%0C%CE%FF?)%FF%DF%9B%142
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.152|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-08-11 00:04:20 ERROR 400: Bad Request.
This is just a small part of the output. It continues and doesn't seem to finish running until I manually stop the execution.
This is simply a misuse of wget's -i option. From the manual page:
-i file, --input-file=file (Read URLs from a local or external file.)
So the command you used tries to parse URLs out of the binary content of https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/corpus-2018-05-03/s2-corpus-02.gz and then "wget" those URLs. The invalid URLs (taken from the binary content) just lead to more errors.
To download a single file, drop the -i and pass the URL directly to wget. To download a subset, the simple and proper solution is to edit the contents of manifest.txt before using it with wget -i.
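One way to trim the manifest is sketched below. The function name select_urls and the file name urls.txt are made up for illustration. Because the manifest entries are relative paths, the sketch prefixes each kept entry with the base URL so the resulting local list contains full URLs.

```shell
# Hypothetical helper: read manifest entries on stdin, keep those that
# match an extended regex, and prefix each with the base URL.
select_urls() {
    base=$1
    pattern=$2
    grep -E -- "$pattern" | while IFS= read -r path; do
        printf '%s%s\n' "$base" "$path"
    done
}

# Usage (needs network access):
# BASE=https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/
# wget -q -O - "${BASE}manifest.txt" | select_urls "$BASE" 's2-corpus-0[23]\.gz$' > urls.txt
# wget -i urls.txt
```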
wget http://www.icerts.com/images/logo.jpg --header "Referer: www.icerts.com"
--2018-03-16 16:41:28-- http://www.icerts.com/images/logo.jpg
Resolving www.icerts.com (www.icerts.com)... 192.243.111.11
Connecting to www.icerts.com (www.icerts.com)|192.243.111.11|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.icerts.com/images/logo.jpg [following]
--2018-03-16 16:41:30-- https://www.icerts.com/images/logo.jpg
Connecting to www.icerts.com (www.icerts.com)|192.243.111.11|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-03-16 16:41:32 ERROR 404: Not Found.
I am also unable to install any software with sudo apt-get install; it always shows the same ERROR 404: Not Found.
I'm trying to wget from the "Download Now" link on this website but it returns the following error:
$ wget https://www.spigotmc.org/resources/supervanish.1331/download?version=46330
--2015-09-27 19:13:53-- https://www.spigotmc.org/resources/supervanish.1331/download?version=46330
Resolving www.spigotmc.org (www.spigotmc.org)... 198.41.204.94, 198.41.205.94, 2400:cb00:2048:1::c629:cd5e, ...
Connecting to www.spigotmc.org (www.spigotmc.org)|198.41.204.94|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2015-09-27 19:13:53 ERROR 503: Service Temporarily Unavailable.
Here is the page:
https://www.spigotmc.org/resources/supervanish.1331
Here is the URL:
https://www.spigotmc.org/resources/supervanish.1331/download?version=46330
wget saves nothing because the server answers with an error status (503 Service Temporarily Unavailable).
If you want to retrieve the error page itself, wget's --content-on-error option may help.
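A minimal sketch of saving the body of an error response follows. The wrapper name save_error_page and the output file name are assumptions for this example, and --content-on-error requires a reasonably recent GNU Wget (1.14 or later, as far as I know). This only lets you inspect what the server sent; it does not make the download succeed.

```shell
# Hypothetical helper: save the response body even when the server
# answers with an HTTP error status such as 503.
# $1 = URL, $2 = output file.
save_error_page() {
    wget --content-on-error -O "$2" "$1"
}

# Usage (needs network access). wget still exits non-zero on an error
# response, so check the saved file rather than the exit status:
# save_error_page "https://www.spigotmc.org/resources/supervanish.1331/download?version=46330" error.html
```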
I want to download the result of an Express.js REST API call that is very slow to process (~10 minutes). I tried a few timeout options with wget, but it gives up after a few minutes even though I ask it to wait roughly 60,000 years.
wget "http://localhost:5000/slowstuff" --http-user=user --http-password=password --read-timeout=1808080878708 --tries=1
--2015-02-26 11:14:21-- http://localhost:5000/slowstuff
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:5000... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authentication selected: Basic realm="Authorization Required"
Reusing existing connection to [localhost]:5000.
HTTP request sent, awaiting response... No data received.
Giving up.
EDIT:
The problem doesn't come from the wget timeout value. With the timeout set to 4 seconds, the error is different: Read error (Connection timed out) in headers. And I have exactly the same problem with curl.
I think the problem comes from my API. It looks like Node.js sets a timeout of 2 minutes by default.
Now I need to find out how to change this value.
This
--http-password=password--read-timeout=1808080878708
is missing a blank. Use
--http-password=password --read-timeout=1808080878708
I am following the quick start guide and I am able to get both the Cobrand and User tokens. However, when I try to make a POST request to https://rest.developer.yodlee.com/services/srest/restserver/v1.0/jsonsdk/SiteTraversal/searchSite, I receive a 404 doc not found. As shown below, I am able to use wget to download the file for /authenticate/login, but for searchSite wget receives a 404.
zachallett# ~/code/yodlee/sampleapp
$ wget https://rest.developer.yodlee.com/services/srest/restserver/v1.0/jsonsdk/SiteTraversal/searchSite
--2013-12-09 14:48:02-- https://rest.developer.yodlee.com/services/srest/restserver/v1.0/jsonsdk/SiteTraversal/searchSite
Resolving rest.developer.yodlee.com... 216.35.6.163
Connecting to rest.developer.yodlee.com|216.35.6.163|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-12-09 14:48:03 ERROR 404: Not Found.
zachallett# ~/code/yodlee/sampleapp
$ wget https://rest.developer.yodlee.com/services/srest/restserver/v1.0/authenticate/login
--2013-12-09 14:48:16-- https://rest.developer.yodlee.com/services/srest/restserver/v1.0/authenticate/login
Resolving rest.developer.yodlee.com... 216.35.6.163
Connecting to rest.developer.yodlee.com|216.35.6.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘login’
[ <=> ] 16 --.-K/s in 0s
2013-12-09 14:48:16 (1.53 MB/s) - ‘login’ saved [16]
I have tried using this from an external REST client and it works.
Here is the request
POST /services/srest/restserver/v1.0/jsonsdk/SiteTraversal/searchSite HTTP/1.1
Host: rest.developer.yodlee.com
Cache-Control: no-cache
Content-Type: application/x-www-form-urlencoded
cobSessionToken=xxxxxxxxxxxxxxxxxxxxxxxx&userSessionToken=xxxxxxxxxxxxxxxxxxxxxxx&siteSearchString=dag
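Note that the raw request above is a POST with a form-encoded body, while the wget invocations shown earlier send a plain GET; that difference may be what produces the 404, though this is a guess. The sketch below reproduces the POST with wget's --post-data option. The function names are made up for illustration, and it assumes the token values are already URL-safe, since no percent-encoding is done.

```shell
# Build the form-encoded body shown in the raw request.
# Assumption: the values are already URL-safe (no percent-encoding here).
build_search_body() {
    printf 'cobSessionToken=%s&userSessionToken=%s&siteSearchString=%s' "$1" "$2" "$3"
}

# Hypothetical wrapper: send the body as a POST instead of wget's default GET.
# --post-data makes wget send Content-Type: application/x-www-form-urlencoded.
search_site() {
    wget -q -O - \
        --post-data="$(build_search_body "$1" "$2" "$3")" \
        "https://rest.developer.yodlee.com/services/srest/restserver/v1.0/jsonsdk/SiteTraversal/searchSite"
}

# Usage (needs valid session tokens and network access):
# search_site "$COB_TOKEN" "$USER_TOKEN" dag
```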