TYPO3: how to render a page on 404 with the correct status code

we are using:
TYPO3 8.7.27
RealUrl 2.5.0
Following scenario: a user enters a link that does not exist within our site, so we expect a 404 page to be rendered. We managed to achieve that, but we do not get the correct status code, because we used a simple redirect within the install tool:
[FE][pageNotFound_handling] = "REDIRECT:/404"
[FE][pageNotFound_handling_statheader] = "HTTP/1.0 404 Not Found"
We also use our 404 page for cHash comparison errors, but that's just a side note.
So what happens is: the user requests data from the wrong URL, and we send the correct 404, followed by a redirect to a certain page.
How can we directly render a page in the first place? The URL should stay the same, and we just render our whole TYPO3 page (no static file) with the 404 text information.

You should use this instead:
[FE][pageNotFound_handling] = "/404"
This will instruct TYPO3 to download the page at the given "URL" and output its content together with the expected 404 status code. Notice that it might be necessary to use an absolute URL here.
From the DefaultConfigurationDescription.php:
pageNotFound_handling
... String: Static HTML file to show (reads content and outputs with correct headers), e.g. "notfound.html" or "http://www.example.org/errors/notfound.html"
You can drop the pageNotFound_handling_statheader option since it defaults to 404.
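To make the difference concrete, here is a small Python sketch (a simplified model for illustration, not TYPO3's actual code, which lives in TypoScriptFrontendController) of how the two settings behave:

```python
# Simplified model of how TYPO3 8.x interprets pageNotFound_handling.
# Illustration only -- the real logic is inside TYPO3's frontend controller.

def classify_page_not_found_handling(value):
    """Return (action, target) for a pageNotFound_handling setting."""
    if value.startswith("REDIRECT:"):
        # Sends a Location header: the browser fetches a new URL, the
        # address bar changes, and the 404 status is lost on the final page.
        return ("redirect", value[len("REDIRECT:"):])
    # A plain path or URL: TYPO3 fetches that resource itself and serves
    # its content under the original URL, together with the 404 header.
    return ("read_and_serve", value)

print(classify_page_not_found_handling("REDIRECT:/404"))  # ('redirect', '/404')
print(classify_page_not_found_handling("/404"))           # ('read_and_serve', '/404')
```

With "read_and_serve", the requested URL stays in the address bar and the 404 status reaches the client, which is exactly what the question asks for.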

Related

TYPO3: Bingbot creates an ext_form error which gets cached

We have a problem with one of our TYPO3 installations. Bingbot, which visits the site, calls a controller of the old ext_form extension without parameters and triggers an error.
207.46.13.XXX - - [16/Oct/2018:00:18:48 +0200] "GET example.html?tx_form_form%5Baction%5D=process HTTP/1.1" 200 10256 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
The problem for us is not that this happens, but that TYPO3 caches the page with "Oops, an error occurred! Code: 20181016001848e0153dcf" as content.
Is there a way to tell TYPO3 not to cache the page if an error occurs, or to send the bot to the 404 page if it calls the site with invalid parameters?
There are several things you can do:
- exclude the page with the parameters in robots.txt (Edit: after consideration, this solution is probably unsuitable for the specific problem)
- redirect in .htaccess if the page is called without the required parameter; a redirect should be recognized by the bot too
- check why the bot is even calling the page without the required parameter; perhaps you can avoid it
- Bing can be configured for a specific URL; this page can be a starting point for you.
EDIT:
Example for .htaccess (not tested):
RewriteCond %{QUERY_STRING} tx_form_form(%5B|\[)action(%5D|\])=process
RewriteRule ^example\.html$ /example.html [L,R=301,QSD]
The target example.html can be anything: either a custom 404 page or just the list view. Note that the query string has to be matched in RewriteCond, because a RewriteRule pattern never sees it, and the QSD flag (which drops the query string) requires Apache 2.4. The code 301 in [L,R=301,QSD] can be adjusted according to the HTTP 3xx status codes. If the redirect target is a (custom) 4xx page, the HTTP status code should match accordingly (404, or perhaps another 4xx status such as 400).
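The intent of the rule can be sketched in Python (a hypothetical helper that mirrors the redirect decision, not mod_rewrite itself):

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical model of the .htaccess rule above: redirect example.html
# to a clean URL whenever the broken ext_form parameter is present,
# dropping the query string (which is what the QSD flag does).

def rewrite(url):
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # decodes %5B/%5D into [ and ]
    if parts.path.endswith("/example.html") and "tx_form_form[action]" in params:
        return (301, parts.path)  # redirect target without the query string
    return (200, url)             # serve the request as-is

print(rewrite("/example.html?tx_form_form%5Baction%5D=process"))
# (301, '/example.html')
print(rewrite("/example.html"))
# (200, '/example.html')
```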

Why does AEM return 403 for requests without extensions?

By default, all GET requests go to DefaultGetServlet first. Based on the extension, it delegates the request to renderers. Now, if there is no extension in the request URI, why does AEM send 403 (Forbidden)? At most, if AEM is unable to serve the request, it might send a 400 (Bad Request) instead. AEM sends 403 even if you are logged in as an admin user (which has the highest level of authorization, if that helps).
Example:
http://localhost:4502/content/geometrixx/en/events
This URL will receive a 403 response, whereas
http://localhost:4502/content/geometrixx/en/events.html
will be served without any problems.
Adding to the above, as mentioned by Ahmed:
With the URL "http://localhost:4502/content/geometrixx/en/events", StreamRendererServlet is executed and falls into its redirect logic, appending a trailing /.
// redirect to this with trailing slash to render the index
String url = request.getResourceResolver().map(request, resource.getPath()) + "/";
response.sendRedirect(url);
Once redirected to "http://localhost:4502/content/geometrixx/en/events/",
the same StreamRendererServlet resolves to the directory-listing logic:
// trailing slash on url means directory listing
if ("/".equals(request.getRequestPathInfo().getSuffix())) {
    renderDirectory(request, response, included);
    return;
}
In renderDirectory, since indexing is false,
if (index) {
    renderIndex(resource, response);
} else {
    response.sendError(HttpServletResponse.SC_FORBIDDEN);
}
a 403 Forbidden response will be sent.
You can change this behavior by enabling "Auto Index" for the "Apache Sling GET Servlet" in the Felix configuration console.
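Putting the snippets together, the decision flow can be modelled in Python (a toy model pieced together from the code above, not the real Sling servlet):

```python
# Toy model of StreamRendererServlet's handling of extensionless requests,
# reconstructed from the Java snippets above for illustration only.

def handle(path, has_extension, trailing_slash, auto_index_enabled):
    if has_extension:
        return "delegate to renderer for the extension"
    if not trailing_slash:
        # first pass: redirect to the same path with a trailing slash
        return ("302 redirect", path + "/")
    # a trailing slash means directory listing
    if auto_index_enabled:
        return "200 render directory index"
    return "403 Forbidden"

print(handle("/content/geometrixx/en/events", False, False, False))
# ('302 redirect', '/content/geometrixx/en/events/')
print(handle("/content/geometrixx/en/events/", False, True, False))
# '403 Forbidden'
print(handle("/content/geometrixx/en/events/", False, True, True))
# '200 render directory index'
```

This is why enabling "Auto Index" turns the 403 into a rendered directory listing.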
According to Sling ticket SLING-1231, closed in 2009, the returned status code should be 404 if no renderer is found.
You can see that in the Sling source code for DefaultGetServlet.java, in the doGet method (source).
The following was tested on AEM 6.3 but should be the same for 6.0+.
For example, if you tried to visit http://localhost:4502/content/geometrixx/en/events.something, you'd get a 404 and the Sling progress tracker would log No renderer for extension something.
Now, if I may rephrase your question, why does extension=null return a 403?
If you look at the sling progress tracker response, you'll probably notice this log:
Using org.apache.sling.servlets.get.impl.helpers.StreamRendererServlet to render for extension=null
This means that for a null extension, Sling uses the StreamRendererServlet (source) to try to render the resource. Something in that code, or probably a filter applied afterwards, causes the 403 response code you see. You'll have to debug that yourself to find out where exactly the 403 is returned.
Adding on to what Ahmed said:
Without an extension, Sling assumes that you are trying to list the contents of that directory path and looks for an index file under it. When it doesn't find that index file, it throws back the Forbidden error.
If you add an index file under the events node and request the same extensionless URL, it will serve that index file.
That is to say, when you add an index file (index.html) under /content/geometrixx/en/events,
requests to http://localhost:4502/content/geometrixx/en/events and http://localhost:4502/content/geometrixx/en/events/index.html will return the same result.

SurveyMonkey: create webhook to get responses in SugarCRM

I am trying to create a SurveyMonkey webhook to receive my survey responses, and I am passing my SugarCRM custom entry point URL as the "Subscription Url". But I am getting the error "'mycustomEntryPointUrl' did not return a success status code. Status code is 301". My entry point works fine if I open its URL in a browser, and my Sugar installation is running smoothly.
So I just want to know what other reasons could cause this error.
HTTP status code 301 means the page has moved permanently. If you visit it in your browser, for example, you would see a network request to the specified page with a status code of 301, then a second one to the new page. Our API request won't follow any redirect, so if a 301 is returned, it will raise an error.
This sometimes happens when you go to a page with http and then it redirects to https due to rules on your server.
You also want to make sure your subscription URL supports a HEAD request without any redirect.
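To see why a redirecting subscription URL fails, here is a self-contained Python demo (the handler and paths are made up for illustration). Note that http.client, unlike a browser, does not follow redirects, which mirrors the webhook check's behaviour:

```python
import http.client
import http.server
import threading

# Hypothetical subscription URL that answers a HEAD request with a 301
# redirect -- for example, an http -> https rewrite rule on the server.

class RedirectingWebhook(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(301)
        self.send_header("Location", "https://example.org/webhook")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), RedirectingWebhook)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A plain HTTP client sees the 301 itself instead of the final page.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("HEAD", "/webhook")
status = conn.getresponse().status
print(status)  # 301 -- not a success status code, so the webhook is rejected
server.shutdown()
```

The fix is to register the final (e.g. https) URL directly, so the HEAD request gets a 2xx without any redirect hop.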

Passing incorrect POST arguments in WWW::Mechanize

I am writing a web scraper using the WWW::Mechanize module. I am performing a POST and passing invalid values for its arguments. I then extract all the links from the resulting page and print them to a text file. I would say that part works, because the text file is empty, which suggests the page was not found; but my problem is that the success() method reports OK and the status() method returns 200.
I know it sounds a little strange, but I am trying to get a "page not found" status, or something similar, to know that the page is not valid.
Does anyone have any idea of what is happening?
Whether or not your code works depends on how the target site responds to requests for missing pages. If the server handles a missing page by serving up an error page, you will get a successful (200) response, even though the page you requested isn't there.
More information from Google on "soft 404s" -- missing pages that return a valid page.
Here is an example from SO of configuring Apache to return a 200 response instead of a 404:
How can I replace Apache HTTP code 404 to 200
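A sketch of a "soft 404" check, in Python for illustration (the marker phrases are assumptions and would need adjusting per target site; in WWW::Mechanize you would apply the same idea to $mech->status and $mech->content):

```python
# Heuristic soft-404 detector: a server may return 200 OK even though the
# body is really an error page. The marker phrases below are assumptions.

ERROR_MARKERS = ("page not found", "does not exist", "error 404")

def looks_like_soft_404(status, body):
    if status != 200:
        return status == 404          # a real 404 is an explicit miss
    text = body.lower()
    return any(marker in text for marker in ERROR_MARKERS)

print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))  # True
print(looks_like_soft_404(200, "<h1>Product list</h1>"))    # False
```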

Only new pages return 404 when trying to scrape - old pages scrape just fine

I am getting a 404 error when trying to have Facebook scrape my new URLs. Pages that are a couple of days old keep returning this 404 code.
Older pages from more than a week ago load just fine.
Example - Older Page URL mattlambert.info/blog/pride-is-rust-on-metal/
Result - Response Code: 200 (I dropped the http:// because apparently I can only post one link here.)
Example - Newer Page URL http://mattlambert.info/blog/the-pharisee-in-me/
Result - Response Code: 404
There is not a 404 response code in the headers of either page. Yet, only the older one can be scraped.
Steps to Reproduce: Go to Object Debugger
Enter both URLs into the field "Input URL, Access Token, or Open Graph Action ID"
Example - Older Page URL Result - Response Code: 200
Example - Newer Page URL Result - Response Code: 404
Expected behavior: both pages should return response code 200. You can visit both URLs and see that they are both working. Viewing the source of each page, it is clear that there is nothing in the headers that would cause this.
Actual behavior: the problem exists only for new blog posts I have created in the last couple of days. The Linter tool says there is no data.
Any ideas?