wget works fine for some .jpgs but downloads an .html file instead for others - wget

I want to download web images from the command line.
This works fine sometimes, other times it doesn't and I can't figure out why.
Here's an example (Wikimedia Commons picture of the day):
wget https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
This somehow gets me an .html file:
HTTP request sent, awaiting response... 200 OK
Length: 185986 (182K) [text/html]
Saving to: 'Main_Page'
The following, however (it's the same picture, but with an explicitly selected resolution), gets me a .jpg (which is what I want):
wget https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
...
HTTP request sent, awaiting response... 200 OK
Length: 118796 (116K) [image/jpeg]
Saving to: '640px-01_Calanche_Piana.jpg'
I tried adding -O test.jpg to the first example, but the result is still an .html file.
Does anyone know why the command works in one case but not in the other?

why the command works in one case but not in the other?
This one
https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
is, despite what the last characters might suggest, a link to an HTML page. Note the #, which denotes a URI fragment; fragments are handled by the client and never sent to the server, so wget fetches the Main_Page HTML. This one, on the other hand,
https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
is a URL to the actual image. If you want to know what kind of file sits behind a given URL without downloading it, you can do
wget -S --spider https://www.example.com
It will show you the response headers. There may be many of them, but for determining the type of the resource the Content-Type header should suffice.
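If you'd rather script the check, here is a minimal Python sketch (standard library only) that also shows why the first URL misbehaves; the fragment is split off client-side and never reaches the server:
from urllib.parse import urldefrag
from urllib.request import Request, urlopen

url = "https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg"
base, fragment = urldefrag(url)    # the fragment part is client-side only
print(base)                        # https://commons.wikimedia.org/wiki/Main_Page

with urlopen(Request(base, method="HEAD")) as resp:   # headers only, no body
    print(resp.headers.get("Content-Type"))           # text/html for this URL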

Related

TYPO3 7.6: 404 error page: HTML wrapped in numbers

I created my own “404 Page not found” error page on a TYPO3 website and implemented it via the /typo3conf/LocalConfiguration.php as follows, using the page’s Speaking URL path:
return [
    ...
    'FE' => [
        ...
        'pageNotFound_handling' => '/page-not-found/',
    ],
];
Now when I call a non-existing page, the error page gets displayed, but there is a 4-digit alphanumeric string (hexadecimal, as far as I've seen so far) BEFORE the HTML source code and a "0" AFTER it. Example (the number at the beginning is different after most of the reloads):
37b3
<!DOCTYPE html>
...
</html>
0
When calling the error page URL itself the page is returned correctly without those numbers.
Having the RealURL extension activated or deactivated does not make a difference.
Thanks a lot in advance!
I added the full description from the install tool and I guess we might find the solution there.
How TYPO3 should handle requests for non-existing/accessible pages.
empty (default)
The next visible page upwards in the page tree is shown.
'true' or '1'
An error message is shown.
String
Static HTML file to show (reads content and outputs with correct headers), e.g. notfound.html or http://www.example.org/errors/notfound.html.
Prefix "REDIRECT:"
If prefixed with "REDIRECT:" it will redirect to the URL/script after the prefix.
Prefix "READFILE:"
If prefixed with "READFILE" then it will expect the remaining string to be a HTML file which will be read and outputted directly after having the marker "###CURRENT_URL###" substituted with REQUEST_URI and ###REASON### with reason text, for example: READFILE:fileadmin/notfound.html.
Prefix "USER_FUNCTION:"
If prefixed with "USER_FUNCTION:" a user function is called, e.g. USER_FUNCTION:fileadmin/class.user_notfound.php:user_notFound->pageNotFound where the file must contain a class user_notFound with a method pageNotFound() inside with two parameters $param and $ref.
What you configured:
You're passing a string, so TYPO3 expects to find a file, which you don't have, because what you configured is really a URL path.
From what you try to achieve I'd go with REDIRECT:/page-not-found/.
Thanks for pointing this one out, by the way; I will remove the string configuration from the core, since it does not make sense to have more people trip over this pitfall.
In short: change the following line in the FE section of your LocalConfiguration.php:
'pageNotFound_handling' => '/your404page.html',
to
'pageNotFound_handling' => 'REDIRECT:/your404page.html',
Cause
The actual cause is a combination of chunked transfer encoding and TYPO3 not being able to decode it in some cases. In your case the page-not-found handler eventually uses GeneralUtility::getUrl() to retrieve the error page.
If you have [SYS][curlUse] enabled it will use cURL to retrieve the page and there is no problem.
If you don't have [SYS][curlUse] enabled it will open a socket, read the headers and then read the rest of the body. If the webserver uses chunked transfer encoding, the body contains blocks of data, and each block starts with a line giving its length in hexadecimal. The content ends with an empty block (with, of course, a length line of "0").
cURL apparently knows how to decode chunked data.
getUrl() itself does not know how to handle chunked data and uses the content as-is as the page content.
In TYPO3 8 LTS the Guzzle library is used to handle HTTP requests. In the Guzzle code I can't find anything about handling chunked data. Guzzle checks whether the cURL PHP extension is present and uses it as the preferred transport. In most installations cURL is present, and since it decodes chunked data automagically, no problem is visible. I still have to test Guzzle with a PHP that has cURL disabled to see whether the issue is also present in v8/master.
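To make the failure mode concrete, here is a minimal decoder sketch (in Python rather than TYPO3's PHP, and only an illustration of the wire format, not TYPO3 code). The stray "37b3" and trailing "0" in the question are exactly these length lines, passed through undecoded:
def decode_chunked(raw: bytes) -> bytes:
    """Minimal chunked-body decoder sketch (ignores trailers)."""
    body, pos = b"", 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0], 16)  # hex length line
        if size == 0:                                # the "0" line ends the body
            return body
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size + 2                       # skip the CRLF after the chunk

# "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes to "Wikipedia"
assert decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") == b"Wikipedia"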
Workaround/solution
If the cURL PHP extension is enabled in your installation, you can simply enable [SYS][curlUse] in the Install Tool. The numbers around the 404 page content will disappear.

How do I know what to name a file downloaded using HTTP?

I am creating an HTTP client downloader in Python. I am able to correctly download a file such as http://www.google.com/images/srpr/logo11w.png just fine. However, I'm not sure what to actually name the thing.
There is of course the filename at the end of the URL, but is this always reliable?
If I recall correctly, wget uses the following heuristic:
If a Content-Disposition header exists, get the filename from there.
If the filename component of the URL exists (e.g. http://myserver/filename), use that.
If there is no filename component (e.g. http://www.google.com), derive the filename from the Content-Type header (such as index.html for text/html).
In all cases, if this filename is already present in the directory use a numerical suffix, such as index (1).html, or overwrite, depending on configuration.
There are plenty of other flags that control other heuristics, such as creating .html for ASP/DHTML content-types.
In short, it really depends on how far you want to go. For most people, doing the first two plus a basic Content-Type-to-name mapping should be enough.
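A rough sketch of those first two steps plus a basic mapping might look like this in Python (the Content-Disposition regex and the fallback names are simplifications, not wget's actual logic):
import os
import re
from urllib.parse import urlsplit

def pick_filename(url, headers):
    """Sketch of wget-like filename selection; simplified, not exhaustive."""
    # 1. Content-Disposition header, if present
    match = re.search(r'filename="?([^";]+)"?', headers.get("Content-Disposition", ""))
    if match:
        return os.path.basename(match.group(1))  # basename() guards against path tricks
    # 2. Last path component of the URL, if any
    name = os.path.basename(urlsplit(url).path)
    if name:
        return name
    # 3. Fall back to a name derived from the Content-Type
    if headers.get("Content-Type", "").startswith("text/html"):
        return "index.html"
    return "download"

print(pick_filename("http://www.google.com/images/srpr/logo11w.png", {}))     # logo11w.png
print(pick_filename("http://www.google.com", {"Content-Type": "text/html"}))  # index.html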

wget --spider doesn't get file size for some links

I want to get some file sizes. Some people recommend wget --spider. However, when I run it on some links, like http://autos.cn.yahoo.com/, it says Length: unspecified [text/html]. Is there a way to solve this, or could I use another way to get a file's size without actually downloading it? Thank you!
This happens because the server doesn't send the Content-Length header, or sends it malformed. You can tell wget to ignore the header by using the --ignore-length option:
$ wget --ignore-length http://autos.cn.yahoo.com/
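If scripting is an option, you can also issue a HEAD request yourself and read Content-Length directly. A Python sketch (it will still return nothing for servers that omit the header, as this one does):
from urllib.request import Request, urlopen

def remote_size(url):
    """Return Content-Length in bytes, or None if the server omits it."""
    with urlopen(Request(url, method="HEAD")) as resp:   # headers only, no body
        length = resp.headers.get("Content-Length")
        return int(length) if length is not None else None

print(remote_size("http://autos.cn.yahoo.com/"))  # None: no Content-Length sent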

Setting a response header in a CQ5 publishing instance

I want to disable caching in a CQ component and I have the following line in my jsp (documentation):
response.setHeader("Dispatcher", "no-cache");
If I insert the component in a page and load the page in an authoring instance everything works as expected and I get an HTTP header named Dispatcher with the content no-cache.
Now if I do the same on a publishing instance (same configuration, with CQ_RUNMODE='publish', and the same content), the component works except for setting the HTTP header.
Any idea on why the two instances could behave differently?
Update
I tried to set other headers and the instance behaves the same way: in the authoring mode the headers are generated, in the publishing mode they are not (same configuration except for the CQ_RUNMODE).
Update 2
I was trying to reduce my example by removing everything unnecessary from the page (layout, code for headers, footer, ...) and I noticed that below a certain size threshold my header is correctly generated.
In other words, by removing stuff from the page (even simple HTML) I reach a point where the header appears (if the page is small enough).
Any idea on why CQ is only generating the header for very small pages?
If you're trying to set the header in a component far down the page, you may be hitting the issue that you're trying to write it after the response has already been committed.
If you need to flag the page as not cached and you cannot avoid placing the code higher up, you could instead add a check for this node type at the start of the JSP (using node.listChildren(), for example), or provide a page property that lets the editor control whether the page is cached.
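The committed-response constraint isn't CQ-specific. As an analogy only (a Python sketch, not JSP, and not CQ's actual buffering), headers have to be written before any of the body goes out:
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Dispatcher", "no-cache")  # headers must be set here...
        self.end_headers()                          # ...this line commits them
        self.wfile.write(b"Here's the content.\n")  # too late to add headers now

HTTPServer(("localhost", 8000), Handler).serve_forever()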
You didn't indicate which version of CQ5 you're using - I just tested with a minimal JSP script on a CQ 5.5 GA publish instance, and the header is correctly set:
$ curl -u admin:admin http://localhost:4503/tmp/x.tidy.json
{
"sling:resourceType": "x",
...
}
$ curl -u admin:admin http://localhost:4503/apps/x/x.jsp
<%
response.setHeader("Dispatcher","no-cache");
%>
Here's the content.
$ curl -D - -u admin:admin http://localhost:4503/tmp/x.html
HTTP/1.1 200 OK
Connection: Keep-Alive
...
Dispatcher: no-cache
Here's the content.
You might want to start with this minimal test and compare with what you're doing.

Is it ok to return application/octet-stream from a REST interface?

Am I breaking any laws in the REST bible by returning application/octet-stream for my responses? The REST endpoint receives 5 image URLs.
{ "image1": "http://ww.o.com/1.gif",
"image2": "http://www.foo.be/2.gif" }
and it will download these and return them as application/octet-stream.
CLARIFICATION: The client that invokes this REST interface is a mobile app. Every additional network connection made reduces battery life by a few milliamps. I am forced to use REST because it is a company standard. If not, I would roll my own binary protocol.
It is not so good, as the client will not know what to do with such binary data except store those bytes somewhere or pass them on to some other process (if that is all you need to do with your data, then it is fine).
You may take a look at multipart content types. IMO, a multipart message containing several image/gif parts would be a better alternative.
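As an illustration, a sketch with Python's standard email package (the GIF bytes are placeholders; a real service would send msg.as_bytes() as the HTTP response body with the generated Content-Type header):
from email.message import EmailMessage

msg = EmailMessage()
for name, data in [("1.gif", b"GIF89a..."), ("2.gif", b"GIF89a...")]:
    # add_attachment turns the message into multipart/mixed on first use
    msg.add_attachment(data, maintype="image", subtype="gif", filename=name)

print(msg["Content-Type"])  # multipart/mixed; boundary="..."
payload = msg.as_bytes()    # candidate HTTP response body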
From the sounds of it, this is much more like an RPC call; specifically, "here's a list of URLs, send me back an archive".
That process is not particularly RESTful, as REST is not an RPC-based system.
What you need to do is treat the archives as resources, and provide a way to create and then serve them up.
For example you could:
POST /archives
Content-Type: application/json
{ "image1": "http://ww.o.com/1.gif",
"image2": "http://www.foo.be/2.gif" }
As a result, you would get
HTTP/1.1 201 Created
Location: http://example.com/archives/1234
Content-Type: application/json
Then, you could make a request to http://example.com:
GET /archives/1234
Accept: multipart/mixed
Here, you will get the actual archive in a single request (like you want), only it's a multipart formatted result. (multipart/x-zip would work too, that's a zip file)
If you did:
GET /archives/1234
Accept: application/json
You would get back the JSON you sent originally (so you could, perhaps, edit and update the archive, something you may not want to support by sending up the binary images).
To change it you would simply PUT back the update:
PUT /archives/1234
Content-Type: application/json
{ "image1": "http://ww.o.com/1.gif",
"image2": "http://www.foo.be/2.gif",
"image3": "http://www.foo2.foo/4.gif" }
The resource is /archives/1234, that's its name.
It has two representations in this case: the JSON version, and the actual, binary archive. Your service distinguishes between the two using the content type specified in the Accept header. That header is the client telling you what it wants.
When you're done with the archive, simply DELETE it
DELETE /archives/1234
Or you can have the server expire the resource at some later time.
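Tying the flow together, the client side might look like this (a Python sketch using the requests library; example.com and the /archives paths are just the hypothetical names from the example above):
import requests

BASE = "http://example.com"
images = {"image1": "http://ww.o.com/1.gif",
          "image2": "http://www.foo.be/2.gif"}

created = requests.post(f"{BASE}/archives", json=images)   # create the archive resource
assert created.status_code == 201
archive_url = created.headers["Location"]

# One request fetches the whole binary archive, multipart formatted
binary = requests.get(archive_url, headers={"Accept": "multipart/mixed"})

# The same resource, as its editable JSON representation
listing = requests.get(archive_url, headers={"Accept": "application/json"}).json()

requests.delete(archive_url)   # done with the archive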
Why not have five separate REST calls?
It seems cleaner and divides things more logically. It will also let the downloads run in parallel, two or more at a time depending on the browser you are using.
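If the client is code rather than a browser, the same parallelism is easy to get explicitly; a sketch with Python's concurrent.futures (the URLs are the ones from the question):
from concurrent.futures import ThreadPoolExecutor

import requests

urls = ["http://ww.o.com/1.gif", "http://www.foo.be/2.gif"]

def fetch(url):
    resp = requests.get(url, timeout=10)   # one GET per image
    resp.raise_for_status()
    return url, resp.content

with ThreadPoolExecutor(max_workers=2) as pool:   # two downloads in flight at a time
    for url, data in pool.map(fetch, urls):
        print(url, len(data), "bytes")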
They are called REST principles, not laws, and no, you are not "breaking" them, IMO. REST is about resources being addressable by a URL and (where appropriate) available in multiple formats. It doesn't say what the format should be. There's a simple description of what REST means in this article.
However, as @Andrey says, there are nicer ways to handle sending multiple data objects than inventing your own ad-hoc format. The multipart MIME type/format is one alternative, and another is to send the objects packed up as a tar, zip, or similar archive file format.
IMO, the real problem with using "application/octet-stream" is that it doesn't tell anyone anything about how the data is actually formatted. Rather, your client has to "know" how it is formatted and interpret it accordingly. And the problems with inventing your own format are interoperability and (possibly) having to design, implement, and maintain libraries to support it, possibly many times over.