Nutch inconsistently ignores redirects

I ran into trouble crawling some pretty simple redirect cases (Nutch 1.9 / OpenJDK 7).
Here is a packet capture of the exchange:
Time Source Destination Protocol Info
12.988003 99.99.99.99 8.8.4.4 DNS Standard query 0xc165 A bloomberg.com
13.032343 8.8.4.4 99.99.99.99 DNS Standard query response 0xc165 A 69.191.212.191 A 69.191.251.238
13.124471 99.99.99.99 69.191.212.191 HTTP GET /robots.txt HTTP/1.0
13.228846 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
13.264230 99.99.99.99 8.8.4.4 DNS Standard query 0x7089 A www.bloomberg.com
13.344767 8.8.4.4 99.99.99.99 DNS Standard query response 0x7089 CNAME www.bloomberg.com.edgekey.net CNAME e4569.x.akamaiedge.net A 23.214.189.136
13.351030 99.99.99.99 23.214.189.136 HTTP GET /robots.txt HTTP/1.0
13.359121 23.214.189.136 99.99.99.99 HTTP HTTP/1.0 200 OK (text/plain)
13.448604 99.99.99.99 69.191.212.191 HTTP GET / HTTP/1.0
13.537211 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
13.640146 99.99.99.99 69.191.212.191 HTTP GET / HTTP/1.0
13.738564 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
Nutch tries to fetch http://bloomberg.com, which replies with a 301 redirect to http://www.bloomberg.com. The redirect is handled correctly for robots.txt. However, for 'GET /', the fetcher keeps requesting the original hostname, which keeps replying 301. No matter how large http.redirect.max is, fetching fails (I've tested with 10).
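For reference, the limit in question is the http.redirect.max property in conf/nutch-site.xml; a minimal sketch of the setting I tested with (the property is standard Nutch configuration, the value is just the one I tried):

<property>
  <name>http.redirect.max</name>
  <value>10</value>
  <!-- number of redirects the fetcher may follow; with 0 (the default) redirects are recorded for later fetching instead -->
</property>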
Nutch 1.9 running on
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.12.04.1)
OpenJDK Client VM (build 24.65-b04, mixed mode, sharing)
Is this a bug (and if so, could you confirm it), or just a misconfiguration?
Thanks.

This was a bug; 1.10 should be shipped with the fix:
https://github.com/apache/nutch/commit/ed052df8822380ccfa89a9ffa1df324933669a59

Related

REST client gives an error, while HTTP looks fine

My interface (an Arduino MKR WiFi 1010) runs a very simple REST API, but when I test it with MuleSoft's Advanced REST Client, I get this error:
The requested URL can't be reached
The service might be temporarily down or it may have moved permanently to a new web address.
The response status "0" is not allowed. See HTTP spec for more details: https://tools.ietf.org/html/rfc2616#section-6.1.1
When I check it with telnet though, it looks fine:
[bf#localhost ~]$ telnet 192.168.178.185 80
Trying 192.168.178.185...
Connected to 192.168.178.185.
Escape character is '^]'.
GET /api/gps HTTP/1.1
Host: 192.168.178.185

HTTP/1.1 200 OK
Connection: close
Content-Length: 9
Content-Type: application/json

"Success"
Connection closed by foreign host.
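As a cross-check with a second off-the-shelf client (same host and path as in the telnet session above), a plain curl probe shows whether a standard HTTP client parses the reply; if curl prints the 200 response cleanly, the problem is more likely on the REST client's side:

$ curl -v http://192.168.178.185/api/gps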
My question now is: is the REST client broken, or am I missing something in my reply? Of course I want any REST client to be able to process my interface correctly.

haproxy: http frontend to https backend

This is the exact same question as "http request to https request using haproxy".
However, the accepted answer does not work for me, and I don't understand why.
haproxy.cfg:
global
    daemon
    maxconn 15

defaults
    mode tcp
    balance first

frontend google
    bind *:10005
    default_backend google-url

backend google-url
    server xxx google.com:443 ssl verify none
When I call curl --location --request GET 'http://localhost:10005', I receive a response that comes from Google, but with a 404 status:
The requested URL / was not found on this server. That’s all we know.
I tried both mode tcp and mode http, with the same result.
If I activate the logs with
    mode http
    bind *:10005
    default_backend google-url
    option httplog
    log stdout format raw local0
I have this
127.0.0.1:52588 [16/Jun/2022:08:24:49.976] google google-url/xxx 0/0/49/20/69 404 1884 - - ---- 2/2/0/0/0 0/0 "GET / HTTP/1.1"
127.0.0.1:52588 [16/Jun/2022:08:24:49.938] google google/<NOSRV> -1/-1/-1/-1/1038 400 0 - - CR-- 2/2/0/0/0 0/0 "<BADREQ>"
In case this has some impact: I'm running HAProxy in Kubernetes and then "port-forward" 10005 (but this does not seem to be the issue, because the logs demonstrate that HAProxy correctly receives the request and uses the correct backend...).
Your current HAProxy configuration will accept your request:
curl --location --request GET 'http://localhost:10005'
(corresponds to the first log entry)
and proxy it to Google as:
curl --location -H 'Host: localhost' --request GET 'https://www.google.com/'
(note the implied Host header; I bet this is not what you expected).
Google will respond with 404 and HAProxy will log the BADREQ.
This happens because HAProxy can't infer that, when the client request's Host header is localhost, it should rewrite it to google.com (or better, www.google.com) simply because it proxies to a host with that name.
You need to configure:
backend google-url
    server xxx google.com:443 ssl verify none
    http-request set-header host www.google.com
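With that header rewrite in place, the original test should go through; repeating

curl --location --request GET 'http://localhost:10005'

should now reach Google with Host: www.google.com and return the homepage instead of the 404.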

Sometimes the server connection fails via the load balancer

My server is load-balanced via a GCP HTTPS LB; the backend server uses pm2 start -i to run instances on different ports and distributes connections to those Node processes using HAProxy.
Connection log, taken from the server using this code:
io.on('connection', (socket) => {
    console.debug("connection!", socket.id);
});
Once in every two attempts, the server connection fails. Below is the log from haproxy -d; it shows a failed connection followed by a successful one.

Fail:
00000041:http-in.clireq[000a:ffffffff]: GET /socket.io/?EIO=3&transport=websocket&sid=rf3JvyUz2KCKV4KDAABS HTTP/1.1
00000041:http-in.clihdr[000a:ffffffff]: User-Agent: websocket-sharp/1.0
00000041:http-in.clihdr[000a:ffffffff]: Host: mydomain.com
00000041:http-in.clihdr[000a:ffffffff]: Upgrade: websocket
00000041:http-in.clihdr[000a:ffffffff]: Sec-WebSocket-Key: 9JCkV46YNC66nIUaZQZl9w==
00000041:http-in.clihdr[000a:ffffffff]: Sec-WebSocket-Version: 13
00000041:http-in.clihdr[000a:ffffffff]: X-Cloud-Trace-Context: 1210ae7f3bb6e56c817e7f5ad30e1d24/17748396103009389644
00000041:http-in.clihdr[000a:ffffffff]: Connection: Upgrade
00000041:http-in.clihdr[000a:ffffffff]: Via: 1.1 google
00000041:http-in.clihdr[000a:ffffffff]: X-Forwarded-For: source ip, dest ip
00000041:http-in.clihdr[000a:ffffffff]: X-Forwarded-Proto: https
00000041:websockets.srvrep[000a:000b]: HTTP/1.1 400 Bad Request
00000041:websockets.srvhdr[000a:000b]: Connection: close
00000041:websockets.srvhdr[000a:000b]: Content-type: text/html
00000041:websockets.srvhdr[000a:000b]: Content-Length: 18
00000041:websockets.srvcls[000a:adfd]
00000041:websockets.clicls[adfd:adfd]
00000041:websockets.closed[adfd:adfd]
Success:

00000042:http-in.accept(0007)=000a from [130.211.3.23:53189] ALPN=<none>
00000042:http-in.clireq[000a:ffffffff]: GET /socket.io/?EIO=3&transport=websocket&sid=xmZEqtHokEmBfs2QAABT HTTP/1.1
00000042:http-in.clihdr[000a:ffffffff]: User-Agent: websocket-sharp/1.0
00000042:http-in.clihdr[000a:ffffffff]: Host: mydomain.com
00000042:http-in.clihdr[000a:ffffffff]: Upgrade: websocket
00000042:http-in.clihdr[000a:ffffffff]: Sec-WebSocket-Key: 8PKkFxEv3c3KqIUW8dQbLA==
00000042:http-in.clihdr[000a:ffffffff]: Sec-WebSocket-Version: 13
00000042:http-in.clihdr[000a:ffffffff]: X-Cloud-Trace-Context: de0fa56d1689317bb9212879da8edfcb/11012650454108009103
00000042:http-in.clihdr[000a:ffffffff]: Connection: Upgrade
00000042:http-in.clihdr[000a:ffffffff]: Via: 1.1 google
00000042:http-in.clihdr[000a:ffffffff]: X-Forwarded-For: source ip, dest ip
00000042:http-in.clihdr[000a:ffffffff]: X-Forwarded-Proto: https
00000042:websockets.srvrep[000a:000b]: HTTP/1.1 101 Switching Protocols
00000042:websockets.srvhdr[000a:000b]: Upgrade: websocket
00000042:websockets.srvhdr[000a:000b]: Connection: Upgrade
00000042:websockets.srvhdr[000a:000b]: Sec-WebSocket-Accept: tW86o/zu95tHQayPP7IlGNXi96s=
There is no difference between the two requests, yet there are two different results; I don't know why I get 400 Bad Request.
haproxy.cfg
global
    maxconn 4096

defaults
    mode http
    balance roundrobin
    option redispatch
    option forwardfor
    timeout connect 5s
    timeout queue 5s
    timeout client 50s
    timeout server 50s

frontend http-in
    bind *:80
    default_backend servers
    # Any URL beginning with socket.io will be flagged as 'is_websocket'
    acl is_websocket path_beg /socket.io
    acl is_websocket hdr(Upgrade) -i WebSocket
    acl is_websocket hdr_beg(Host) -i ws
    # The connection to use if 'is_websocket' is flagged
    use_backend websockets if is_websocket

backend servers
    server server1 10.168.0.50:80
    # server server2 [Address]:[Port]

backend websockets
    balance source
    option http-server-close
    option forceclose
    cookie io prefix indirect nocache # using the `io` cookie set upon handshake
    server ws-server1 10.168.0.50:5000 weight 1 maxconn 1024 cookie ws-server1 check
    server ws-server2 10.168.0.50:5001 weight 1 maxconn 1024 cookie ws-server2 check
    #server ws-server3 10.168.0.50:5002 weight 1 maxconn 1024 check
I originally used the cookie SRVNAME insert options with server names SA and SB, but the Socket.IO documentation says to use cookie io prefix indirect nocache, so I changed to that with server names ws-server1 and ws-server2.
What I tested: the real client uses both long polling and WebSocket. A test client given the option {transports: ['websocket']} always succeeds, but the real client does not use the WebSocket-only option (see the sketch below).
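For reference, the WebSocket-only option mentioned above is set on the client like this (a minimal sketch with socket.io-client; the URL is illustrative):

// Test client: skip long polling and connect over WebSocket only
const io = require("socket.io-client");
const socket = io("https://mydomain.com", { transports: ["websocket"] });
socket.on("connect", () => console.debug("connected!", socket.id));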
I don't know why it fails. If I use only ws-server1, the connection always succeeds, but once ws-server2 is in use the connection sometimes fails. I suspect a sticky-session problem; I tried adding the cookie option to haproxy.cfg, but the problem was not solved.
How can I solve this problem?

301 moved permanently with socket.http

In Python (and my browser), I am able to send a request to https://www.devrant.com/api/devrant/rants?app=3&sort=algo&limit=10&skip=0 and get a response, as expected, but with Lua I get HTTP/1.1 301 Moved Permanently. Here is what I have tried so far:
http = require("socket.http");
print(http.request("https://www.devrant.com/api/devrant/rants?app=3&sort=algo&limit=10&skip=0"))
which outputs an HTTP error page (moved permanently) and
301 table: 0x8f32470 http/1.1 301 Moved Permanently
the table's contents are:
location https://www.devrant.com/api/devrant/rants?app=3&sort=algo&limit=10&skip=0
content-type text/html
server nginx/1.10.0 (Ubuntu)
content-length 194
connection close
date Mon, 11 Dec 2017 01:41:35
Why does only Lua get this error? If I request Google, I get the Google home page HTML. If I request status.mojang.com, I get the Mojang server statuses as a JSON string, so the socket is certainly functional.
It's because you are using socket.http to request an https URL: since socket.http doesn't handle HTTPS, it sends the request to port 80, which replies with a redirect to the https URL; the socket library doesn't follow that redirect, as it doesn't "know" what to do with https, so it simply reports the 301.
You need to install luasec and use ssl.https instead of socket.http; that will make it work.
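A minimal sketch of that change, assuming luasec is installed (e.g. via luarocks install luasec):

-- ssl.https (from the luasec package) exposes the same request() interface as socket.http
local https = require("ssl.https")
local body, code, headers, status = https.request(
    "https://www.devrant.com/api/devrant/rants?app=3&sort=algo&limit=10&skip=0")
print(code, status) -- expect 200 rather than 301
print(body)         -- the JSON payload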

404 redirect to another server/domain

I'm looking for a way to redirect to another domain when the response from the HTTP server is 404.
acl not_found status 404
acl found_ceph status 200
use_backend minio_s3 rsprep ^HTTP/1.1\ 404\ (.*)$ HTTP/1.1\ 302\ Found\nLocation:\ / if not_found
use_backend ceph if found_ceph
But it's still not working; requests always go to the minio_s3 backend.
Thank you for your advice.
When the response from this backend has status 404, first add a Location header that will send the browser to example.com with the original URI intact, then set the status code to 302 so the browser executes a redirect.
backend my-backend
    mode http
    server my-server 203.0.113.113:80 check inter 60000 rise 1 fall 2
    http-response set-header Location http://example.com%[capture.req.uri] if { status eq 404 }
    http-response set-status 302 if { status eq 404 }
Test:
$ curl -v http://example.org/pics/funny/cat.jpg
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to example.org (127.0.0.1) port 80 (#0)
> GET /pics/funny/cat.jpg HTTP/1.1
> User-Agent: curl/7.35.0
> Host: example.org
> Accept: */*
The actual back-end returns 404, but we don't see it. Instead...
< HTTP/1.1 302 Moved Temporarily
< Last-Modified: Thu, 04 Aug 2016 16:59:51 GMT
< Content-Type: text/html
< Content-Length: 332
< Date: Sat, 07 Oct 2017 00:03:22 GMT
< Location: http://example.com/pics/funny/cat.jpg
The response body from the back-end's 404 error page will still be sent to the browser, but -- as it turns out -- the browser will not display it, so no harm done. This requires HAProxy 1.6 or later.
@Michael's answer is rather good, but it isn't working for me, for two reasons:
Mainly because the %[capture.req.uri] tag resolves to empty (HAProxy 1.7.9 Docker image).
Also because the original answer's assumptions are incomplete: the frontend section is missing...
So I struggled for a while, since you can find all kinds of answers on the Internet: some swear the 404 logic should go in the frontend, others choose the backend, each with every possible kind of tag...
This is my answer, which works for me.
My use case is that if an image is not found on the backend behind HAProxy, then an S3 bucket is checked.
The entry point is: https://myhostname:8080/path/to/image.jpeg
defaults
    mode http

global
    log 127.0.0.1:514 local0 debug

frontend come_on_over_here
    bind :8080
    # The following two lines are here to save values while we have access to them.
    # They won't be available in the backend section.
    http-request set-var(txn.path) path
    http-request set-var(txn.query) query
    http-request replace-value Host localhost:8080 dev.local:80
    default_backend onprems_or_s3_be

backend onprems_or_s3_be
    log global
    acl path_photos var(txn.path) -m beg /path/prefix/i/want/to/strip/off
    acl p_ext_jpeg var(txn.path) -m end .jpeg
    acl is404 status eq 404
    http-response set-header Location https://mybucket.s3.eu-west-3.amazonaws.com"%[var(txn.path),regsub(^/path_prefix_i_want_to_strip_off/,/)]?%[var(txn.query)]" if path_photos p_ext_jpeg is404
    http-response set-status 301 if is404
    server onprems_server dev.local:80 check
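A quick sanity check from the command line (using the illustrative hostname and path prefix from above; expect a 301 with a Location header pointing at the S3 bucket whenever the on-prem server returns 404):

$ curl -skI "https://myhostname:8080/path/prefix/i/want/to/strip/off/image.jpeg"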