How does s3fs work behind the scenes?

Does s3fs issue range requests with a default size of 10 MB per request? Also, I trained a model using data read through s3fs, and the training time, compared to EBS, increased linearly with the number of epochs. Why is that?

By default s3fs does issue range requests. You can observe this by running s3fs -f -o curldbg, which emits the HTTP requests and responses. Sample output for a 40 MB file:
> GET /filename HTTP/1.1
Range: bytes=131072-10616831
< HTTP/1.1 206 Partial Content
< Content-Range: bytes 131072-10616831/40776154
> GET /filename HTTP/1.1
Range: bytes=10616832-21102591
> GET /filename HTTP/1.1
Range: bytes=31588352-40776153
> GET /filename HTTP/1.1
Range: bytes=21102592-31588351
< HTTP/1.1 206 Partial Content
< Content-Range: bytes 21102592-31588351/40776154
< HTTP/1.1 206 Partial Content
< Content-Range: bytes 31588352-40776153/40776154
< HTTP/1.1 206 Partial Content
< Content-Range: bytes 10616832-21102591/40776154
> GET /filename HTTP/1.1
Range: bytes=0-131071
< HTTP/1.1 206 Partial Content
< Content-Range: bytes 0-131071/40776154
Note that the requests are issued out of order, and that apart from the initial 128 KiB range each request covers exactly 10 MiB (e.g., bytes 131072-10616831), matching the default size you asked about.
s3fs may be slower than EBS for your use case; s3fs works well for bulk data transfer but not for random access. If your training loop re-reads the data from S3 on every epoch, each epoch pays that transfer cost again, which would explain the linear growth you observed.
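For reference, a ranged read like the ones above can be reproduced by hand with curl; the bucket URL here is a placeholder:
curl -s -D - -o /dev/null -r 0-131071 https://BUCKET.s3.amazonaws.com/filename
The -r/--range flag requests the same bytes=0-131071 slice, and a 206 Partial Content status in the dumped headers confirms the server honored it.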

Related

Uploading a file with google cloud API with a PUT at root of server?

I have a server using the Google Drive API. I tried a curl PUT request to upload a simple file (test.txt) to http://myserver/test.txt. As you can see, I did the PUT request at the root of my server. The response I get is the following:
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UqANa4Bj6ilL7z5HZH0wlQi_ufxDiHPtb2zq1Gzcx7IxAEcOt-AOlWsbX1q_lsZUwWt_hyKOA3weAeVpQvPQTwbQhLhIA
ETag: "6e809cbda0732ac4845916a59016f954"
x-goog-generation: 1548877817413782
x-goog-metageneration: 1
x-goog-hash: crc32c=jwfJwA==
x-goog-hash: md5=boCcvaBzKsSEWRalkBb5VA==
x-goog-stored-content-length: 6
x-goog-stored-content-encoding: identity
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Via: 1.1 varnish
Content-Length: 0
Accept-Ranges: bytes
Date: Wed, 30 Jan 2019 19:50:17 GMT
Via: 1.1 varnish
Connection: close
X-Served-By: cache-bwi5139-BWI, cache-cdg20732-CDG
X-Cache: MISS, MISS
X-Cache-Hits: 0, 0
X-Timer: S1548877817.232336,VS0,VE241
Vary: Origin
Access-Control-Allow-Methods: POST,PUT,PATCH,GET,DELETE,OPTIONS
Access-Control-Allow-Headers: Cache-Control,X-Requested-With,Authorization,Content-Type,Location,Range
Access-Control-Allow-Credentials: true
Access-Control-Max-Age: 300
I know you're not supposed to use the API that way; I did it for testing purposes. I understand every header returned but can't figure out whether my file has been uploaded, because I don't have enough knowledge of this API.
My question is very simple:
Just by looking at the response, can you tell me if my file has been uploaded?
If yes, can I retrieve it, and how?
The HTTP status code indicates, for any given request, whether it was successful. The status code is always on the first line of the response:
HTTP/1.1 200 OK
Status codes in the 2xx range mean success. You should take some time to familiarize yourself with HTTP status codes if you intend to work with HTTP APIs.
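As a quick sketch, curl can print just that status code for the request from the question (assuming the endpoint still responds as above, this prints 200):
curl -s -o /dev/null -w "%{http_code}\n" -X PUT --data-binary @test.txt http://myserver/test.txt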

Play Framework chunked responses not including chunk sizes

I'm trying to test out chunked responses in the Play Framework.
I have a sample Play app that's running on Play 2.3.7.
The documentation for Play chunked responses gives an example. However, when I try it exactly as advertised, I get:
➜ ~ curl -i -v localhost:9000
* Rebuilt URL to: localhost:9000/
* Trying ::1...
* Connected to localhost (::1) port 9000 (#0)
> GET / HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
< Transfer-Encoding: chunked
Transfer-Encoding: chunked
<
* Connection #0 to host localhost left intact
kikifoobar%
I don't see the chunk lengths and CRLFs added in between the different chunks. What's going on here? Is there some kind of minimum chunk size that needs to be hit? If so, the documentation doesn't really advertise that...
Duh - curl automatically de-chunks responses, so the chunk framing never reaches your terminal.
You'll want to do:
curl -iv --raw localhost:9000 in the example above
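With --raw, curl prints the chunked body verbatim. Assuming the app writes "kiki" and "foobar" as two separate chunks, the raw body would look roughly like this, where each chunk is preceded by its length in hex and a zero-length chunk ends the stream:
4
kiki
6
foobar
0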

Questions on proper REST api design specifically on the PUT action when updating a resource

I'm creating a REST interface (aren't we all), and I want to UPDATE a resource.
So, I think to use a PUT.
So, I read this.
My takeaway is that I PUT to a URL like this:
/hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c
with a payload, then receive a permanent redirect to the URL from which it can GET an updated version of the resource.
In this case it happens to be the same URL, different action.
So my questions are:
Is my understanding correct that a PUT is used to update a resource, and am I using it correctly?
When a client gets a redirect, does it do the same action on the redirected URL as it did on the original URL? If it "depends", is there a standard most clients follow?
I ask the second question because Postman and my jQuery AJAX calls are choking, jQuery with net::ERR_TOO_MANY_REDIRECTS. So is it redirecting and retrying the PUT, which then gets another redirect?
curl blows up too, but even though it says it will switch to a GET when it gets a 301, it doesn't really seem to do that when I look at the output (below).
When curl follows a redirect and the request is not a plain GET (for example POST or PUT), it will do the following request with a GET if the HTTP response was 301, 302, or 303. If the response code was any other 3xx code, curl will re-send the following request using the same unmodified method.
curl output (edited for brevity; also note how it says it's going to switch to a GET [incorrectly, from a POST], but then seems to do a PUT anyway):
curl -X PUT -H "Authorization: Basic AUTHZ==" -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: e80657f0-a8f5-af77-1d9d-d7bc22ed0b30" -d '{ JSONDATA }' http://localhost:8080/hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c -v -L
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> PUT /hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:8080
> Accept: */*
> Authorization: Basic AUTHZ==
> Content-Type: application/json
> Cache-Control: no-cache
> Postman-Token: e80657f0-a8f5-af77-1d9d-d7bc22ed0b30
> Content-Length: 203
>
* upload completely sent off: 203 out of 203 bytes
< HTTP/1.1 301 Moved Permanently
< Connection: keep-alive
< X-Powered-By: Undertow/1
< Set-Cookie: rememberMe=deleteMe; Path=/hc; Max-Age=0; Expires=Fri, 20-Feb-2015 03:53:28 GMT
< Set-Cookie: JSESSIONID=uwI3_41LAa7vlvapTsrZdw10.macbook-air; path=/hc
* Server WildFly/8 is not blacklisted
< Server: WildFly/8
< Location: /hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c
< Content-Length: 0
< Date: Sat, 21 Feb 2015 03:53:28 GMT
<
* Connection #0 to host localhost left intact
* Issue another request to this URL: 'http://localhost:8080/hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c'
* Switch from POST to GET
* Found bundle for host localhost: 0x7f9e4b415430
* Re-using existing connection! (#0) with host localhost
* Connected to localhost (127.0.0.1) port 8080 (#0)
> PUT /hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:8080
> Accept: */*
> Authorization: Basic dGVzdHVzZXIxOlBhc3N3b3JkMQ==
> Content-Type: application/json
> Cache-Control: no-cache
> Postman-Token: e80657f0-a8f5-af77-1d9d-d7bc22ed0b30
>
< HTTP/1.1 500 Internal Server Error
< Connection: keep-alive
< Set-Cookie: JSESSIONID=fDXxlH2xI-0-DEaC6Dj5EhD9.macbook-air; path=/hc
< Content-Type: text/html; charset=UTF-8
< Content-Length: 8593
< Date: Sat, 21 Feb 2015 03:53:28 GMT
<
...failure ensues... It actually does a PUT
Thanks in advance.
I think you're reading too much into the 301 redirect section.
If you want to update a resource using PUT, return:
201: if the resource was created
200: with the updated resource
The 301 in question only applies if there actually is a redirect involved: for example, if something can be identified by name and you need to redirect it to a URL that has the ID (maybe you refactored and people are still consuming the old endpoint).
So, do you really need to redirect your PUT requests? You should be sending back the updated resource in the same response with a 200, as stated above, instead of "redirecting to GET".
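A minimal sketch of the non-redirecting flow against the endpoint from the question (the 200 response shown here is hypothetical):
curl -i -X PUT -H "Content-Type: application/json" -d '{ JSONDATA }' http://localhost:8080/hc/api/v1/organizer/event/762d36c2-afc5-4c51-84eb-9b5b0ef2990c
HTTP/1.1 200 OK
Content-Type: application/json
{ ...the updated event resource... }
No Location header means no second round trip and nothing for the client's redirect handling to choke on.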

301 curl does not show without -v

I was looking at the 301s that several second-level domains use to redirect to their www third-level domains, and I thought curl on its own was enough. For example:
curl myvote.io
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
here.
</BODY></HTML>
However, I had to use curl -v to get any output on another domain:
curl -v evitaochel.com
* Rebuilt URL to: evitaochel.com/
* Hostname was NOT found in DNS cache
* Trying 62.116.130.8...
* Connected to evitaochel.com (62.116.130.8) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: evitaochel.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Mon, 13 Oct 2014 16:18:02 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.evitaochel.com
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host evitaochel.com left intact
If anything, I was expecting myvote.io to be the weirder one,
curl -v myvote.io
* Rebuilt URL to: myvote.io/
* Hostname was NOT found in DNS cache
* Trying 216.239.36.21...
* Connected to myvote.io (216.239.36.21) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: myvote.io
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.myvote.io/
< Date: Mon, 13 Oct 2014 16:30:40 GMT
< Content-Type: text/html; charset=UTF-8
* Server ghs is not blacklisted
< Server: ghs
< Content-Length: 218
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< Alternate-Protocol: 80:quic,p=0.01
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
here.
</BODY></HTML>
* Connection #0 to host myvote.io left intact
shows that it includes some extra headers and is served by ghs (Google, I guess). Any ideas what could be the cause, and whether the cause is always visible in "curl -v", or could it be some hidden configuration?
curl doesn't show any response headers when used without options; that's just how it works. Use -v, or -i, to get the headers included in the output.
A redirect page (301, 302 or whatever) MAY contain a body but it also MAY NOT. That is up to the site.
Since you get HTTP redirects, you may want to use -L too to make curl follow them.
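For example, -i alone surfaces the headers on the domain from the question (output trimmed to the lines already seen in the -v trace above):
curl -i evitaochel.com
HTTP/1.1 301 Moved Permanently
Location: http://www.evitaochel.com
Content-Length: 0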
I use --head when testing redirects with curl. This flag makes curl issue an HTTP HEAD request, which asks for the headers without the document body, and curl doesn't follow redirects unless told to. curl then prints the HTTP headers it receives.
From the manual:
Fetch the headers only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document.
$ curl --head http://myvote.io/
HTTP/1.1 302 Found
Location: https://myvote.io/
...
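To walk the whole chain, combine --head with -L so curl repeats the HEAD request for each Location it receives; the final 200 below assumes the https URL serves the page directly:
$ curl --head -L http://myvote.io/
HTTP/1.1 302 Found
Location: https://myvote.io/
...
HTTP/1.1 200 OK
...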

curl, play & expect 100 continue header

Consider a web service written in Play which accepts POST requests (for uploads). When testing it with a medium-size image (~75K), I found some strange behaviour. Code speaks more clearly than long explanations, so:
$ curl -vX POST localhost:9000/path/to/upload/API -H "Content-Type: image/jpeg" -d @/path/to/mascot.jpg
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9000 (#0)
> POST /path/to/upload/API HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:9000
> Accept: */*
> Content-Type: image/jpeg
> Content-Length: 27442
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Content-Length: 16
<
* Connection #0 to host localhost left intact
{"success":true}
As you can see, curl decides to add the header Content-Length: 27442, but that's not right: the real size is 75211, and in Play I indeed got a body of only 27442 bytes. Of course, this is not the intended behaviour, so I tried a different tool; instead of curl I used the POST tool from libwww-perl:
cat /path/to/mascot.jpg | POST -uUsSeE -c image/jpeg http://localhost:9000/path/to/upload/API
POST http://localhost:9000/path/to/upload/API
User-Agent: lwp-request/6.03 libwww-perl/6.05
Content-Length: 75211
Content-Type: image/jpeg
200 OK
Content-Length: 16
Content-Type: application/json; charset=utf-8
Client-Date: Mon, 16 Jun 2014 09:21:00 GMT
Client-Peer: 127.0.0.1:9000
Client-Response-Num: 1
{"success":true}
This request succeeded, so I started to pay more attention to the differences between the tools. For starters, the Content-Length header was correct, but the Expect header was also missing from the second try. I want the request to succeed either way. The full list of headers as seen in Play (via request.headers) is:
for curl:
ArrayBuffer((Content-Length,ArrayBuffer(27442)),
(Accept,ArrayBuffer(*/*)),
(Content-Type,ArrayBuffer(image/jpeg)),
(Expect,ArrayBuffer(100-continue)),
(User-Agent,ArrayBuffer(curl/7.35.0)),
(Host,ArrayBuffer(localhost:9000)))
for the libwww-perl POST:
ArrayBuffer((TE,ArrayBuffer(deflate,gzip;q=0.3)),
(Connection,ArrayBuffer(TE, close)),
(Content-Length,ArrayBuffer(75211)),
(Content-Type,ArrayBuffer(image/jpeg)),
(User-Agent,ArrayBuffer(lwp-request/6.03 libwww-perl/6.05)),
(Host,ArrayBuffer(localhost:9000)))
So my current thoughts are: the simpler Perl tool used a single request, which is bad practice; the better way would be to wait for a 100 Continue confirmation (especially if you're going to upload several GB of data...), and curl would continue to send data until it receives a 200 OK or some bad-request error code. So why does Play send the 200 OK response without waiting for the next chunk? Is it because curl specifies the wrong Content-Length? If it's wrong at all... (perhaps it refers to the size of the current chunk?)
So where does the problem lie, in curl or in the Play web app? And how do I fix it?
The problem was in my curl command. I used the -d argument, which is short for --data (an alias for --data-ascii) and strips carriage returns and newlines when reading from a file, when I should have used the --data-binary argument.
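For reference, the corrected command; only the data flag changes, and with it curl sends the file byte-for-byte, so the Content-Length becomes the full 75211:
$ curl -vX POST localhost:9000/path/to/upload/API -H "Content-Type: image/jpeg" --data-binary @/path/to/mascot.jpg
(Unrelated to this bug: if you ever need to suppress the Expect: 100-continue handshake while testing, passing -H "Expect:" removes the header curl would otherwise add.)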