robots.txt - is this working? - robots.txt

I just ran into a robots.txt that looks like this:
User-agent: *
Disallow: /foobar
User-agent: badbot
Disallow: *
After disallowing only a few folders for all, does the specific badbot rule even apply?
Note: This question is merely for understanding the above ruleset. I know using robots.txt is not a proper security mechanism and I'm neither using nor advocating it.

Each bot only ever complies to at most a single record (block).
A block starts with one or more User-agent lines, typically followed by Disallow lines (at least one is required). Blocks are separated by blank lines.
A bot called "badbot" will look for a record with the line User-agent: badblock (or similar, as the bot "should be liberal in interpreting this field"). If no such line is found, it will look for a record with the line User-agent: *. If even this doesn’t exist, the bot is allowed to do everything (= default).
So in your example, the bot called "badbot" will follow only the second record (you probably mean Disallow: / instead of Disallow: *), while all other bots only follow the first record.

Related

Return Https Status Code 410 Error for all Website Pages

I want to return 410 errors from my whole website pages so how to do this in just a few lines of code. I have used this syntax Redirect 410 [page-path] in .htaccess file but it is just to block one page what if I want for the whole website?
In doing so I think I have type this syntax 100 times. So is there any code that can help me to return 410 errors for the whole website.
You can use a pattern to match any URL on your site. Such a pattern is the regular expression .*. . means "any character" and * means "zero or more of them", so .* means "anything or nothing."
There are at least two different apache modules that can do this for you. You can usually choose to use either of them. The code would work either in your virtual host conf file or in a .htaccess file.
Using mod_alias:
RedirectMatch gone .*
RedirectMatch documentation
Using mod_rewrite:
RewriteEngine on
RewriteRule .* - [G]
Rewrite gone documentation

Use perl WWW::Mechanize on a local file

I'm currently working on a Perl script and I use the CPAN module WWW:Mechanize to get HTML pages from websites.
However, I would like to be able to work on offline HTML files as well (that I would save myself beforehand most likely) so I don't need the internet each time I'm trying a new script.
So basically my question is how can I transform this :
$mech->get( 'http://www.websiteadress.html' );
into this :
$mech->get( 'C:\User\myfile.html' );
I've seen that file:// could be useful but I obviously don't know how to use it as I get errors every time.
The get() method from WWW::Mechanize takes a URL as its argument. So you just need to work out what the correct URL is for your local file. You're on the right lines with the "file://" scheme.
I think you will need:
$mech->get( 'file:///C:/User/myfile.html' );
Note two important things that people often get wrong.
URLs only understand forward slashes (/), so you need to convert Windows' warped backslash (\) monstrosities. Update: As Borodin points out in a comment, this isn't true - you can use backslashes in URLs. However, backslashes often have special meanings in Perl strings, so I'd advise using forward slashes whenever possible.
The scheme is file, which is followed by :// (with two slashes), then the hostname (which is an empty string) a slash (/) and then your local path (C:/). So that means that there are three slashes after file:. That seems wrong, so people often omit one of them. Update: description made more accurate following advice from Borodin in a comment.
Wikipedia (as always) has a lot more information - file URI scheme

Robots.txt: Disallow repeated subdirectories but allow main directories

I have these directories, there are many of them:
/dir100/media
/dir200/media
/dir300/media
i want to Disallow all */media directories
how can i do this ?
You were almost there in your question!
# User agent that should be disallowed, '*' is far 'all'
User-agent: *
Disallow: /*/media
# A less restrictive rule that would also work:
# Disallow: /dir*/media
In general search engines do want to see every resource that might be referenced from your pages, and if these resources are disallowed for crawling and are critical to understand the pages through rendering, there's a chance Google and other search engines will have a hard time understanding the pages. Keep that in mind when setting up disallow directives.

htaccess redirect code: wildcard in middle of path

I have searched for an answer here but have not been able to find one that meets my precise criteria.
I would like to redirect the following url:
http://site.com/term1/9/title
to
http://site.com/term1/title
There are two additional considerations:
The "term1" above could be any word; and
The number "9" above could be any number from 1 through 100.
I would be grateful for any assistance with this.
Many thanks in advance, david
Apache RewriteRule directives use perl compatible regular expressions. This should work:
RewriteRule ^([\w\d]+)/(\d+)/(.*) $1/$3
I tested it on an htaccess tester and I got the correct result. Of course, you may need to adjust the regular expression if you expect any other characters in the first part of the address. If you really need to catch the 1-100 range, you can use a more complex regular expression instead of just \d+, in the second parentheses: [0-9]{1,2}|100.

Magento "Cannot send headers; headers already sent in" error [duplicate]

This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
When running my script, I am getting several errors like this:
Warning: Cannot modify header information - headers already sent by (output started at /some/file.php:12) in /some/file.php on line 23
The lines mentioned in the error messages contain header() and setcookie() calls.
What could be the reason for this? And how to fix it?
No output before sending headers!
Functions that send/modify HTTP headers must be invoked before any output is made.
summary ⇊
Otherwise the call fails:
Warning: Cannot modify header information - headers already sent (output started at script:line)
Some functions modifying the HTTP header are:
header / header_remove
session_start / session_regenerate_id
setcookie / setrawcookie
Output can be:
Unintentional:
Whitespace before <?php or after ?>
The UTF-8 Byte Order Mark specifically
Previous error messages or notices
Intentional:
print, echo and other functions producing output
Raw <html> sections prior <?php code.
Why does it happen?
To understand why headers must be sent before output it's necessary
to look at a typical HTTP
response. PHP scripts mainly generate HTML content, but also pass a
set of HTTP/CGI headers to the webserver:
HTTP/1.1 200 OK
Powered-By: PHP/5.3.7
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
<html><head><title>PHP page output page</title></head>
<body><h1>Content</h1> <p>Some more output follows...</p>
and <img src=internal-icon-delayed>
The page/output always follows the headers. PHP has to pass the
headers to the webserver first. It can only do that once.
After the double linebreak it can nevermore amend them.
When PHP receives the first output (print, echo, <html>) it will
flush all collected headers. Afterward it can send all the output
it wants. But sending further HTTP headers is impossible then.
How can you find out where the premature output occurred?
The header() warning contains all relevant information to
locate the problem cause:
Warning: Cannot modify header information - headers already sent by
(output started at /www/usr2345/htdocs/auth.php:52) in
/www/usr2345/htdocs/index.php on line 100
Here "line 100" refers to the script where the header() invocation failed.
The "output started at" note within the parenthesis is more significant.
It denominates the source of previous output. In this example, it's auth.php
and line 52. That's where you had to look for premature output.
Typical causes:
Print, echo
Intentional output from print and echo statements will terminate the opportunity to send HTTP headers. The application flow must be restructured to avoid that. Use functions
and templating schemes. Ensure header() calls occur before messages
are written out.
Functions that produce output include
print, echo, printf, vprintf
trigger_error, ob_flush, ob_end_flush, var_dump, print_r
readfile, passthru, flush, imagepng, imagejpeg
among others and user-defined functions.
Raw HTML areas
Unparsed HTML sections in a .php file are direct output as well.
Script conditions that will trigger a header() call must be noted
before any raw <html> blocks.
<!DOCTYPE html>
<?php
// Too late for headers already.
Use a templating scheme to separate processing from output logic.
Place form processing code atop scripts.
Use temporary string variables to defer messages.
The actual output logic and intermixed HTML output should follow last.
Whitespace before <?php for "script.php line 1" warnings
If the warning refers to output inline 1, then it's mostly
leading whitespace, text or HTML before the opening <?php token.
<?php
# There's a SINGLE space/newline before <? - Which already seals it.
Similarly it can occur for appended scripts or script sections:
?>
<?php
PHP actually eats up a single linebreak after close tags. But it won't
compensate multiple newlines or tabs or spaces shifted into such gaps.
UTF-8 BOM
Linebreaks and spaces alone can be a problem. But there are also "invisible"
character sequences that can cause this. Most famously the
UTF-8 BOM (Byte-Order-Mark)
which isn't displayed by most text editors. It's the byte sequence EF BB BF, which is optional and redundant for UTF-8 encoded documents. PHP however has to treat it as raw output. It may show up as the characters  in the output (if the client interprets the document as Latin-1) or similar "garbage".
In particular graphical editors and Java-based IDEs are oblivious to its
presence. They don't visualize it (obliged by the Unicode standard).
Most programmer and console editors however do:
There it's easy to recognize the problem early on. Other editors may identify
its presence in a file/settings menu (Notepad++ on Windows can identify and
remedy the problem),
Another option to inspect the BOMs presence is resorting to an hexeditor.
On *nix systems hexdump is usually available,
if not a graphical variant which simplifies auditing these and other issues:
An easy fix is to set the text editor to save files as "UTF-8 (no BOM)"
or similar to such nomenclature. Often newcomers otherwise resort to creating new files and just copy&pasting the previous code back in.
Correction utilities
There are also automated tools to examine and rewrite text files
(sed/awk or recode).
For PHP specifically there's the phptags tag tidier.
It rewrites close and open tags into long and short forms, but also easily
fixes leading and trailing whitespace, Unicode and UTF-x BOM issues:
phptags --whitespace *.php
It's safe to use on a whole include or project directory.
Whitespace after ?>
If the error source is mentioned as behind the
closing ?>
then this is where some whitespace or the raw text got written out.
The PHP end marker does not terminate script execution at this point. Any text/space characters after it will be written out as page content
still.
It's commonly advised, in particular to newcomers, that trailing ?> PHP
close tags should be omitted. This eschews a small portion of these cases.
(Quite commonly include()d scripts are the culprit.)
Error source mentioned as "Unknown on line 0"
It's typically a PHP extension or php.ini setting if no error source
is concretized.
It's occasionally the gzip stream encoding setting
or the ob_gzhandler.
But it could also be any doubly loaded extension= module
generating an implicit PHP startup/warning message.
Preceding error messages
If another PHP statement or expression causes a warning message or
notice being printed out, that also counts as premature output.
In this case you need to eschew the error,
delay the statement execution, or suppress the message with e.g.
isset() or #() -
when either doesn't obstruct debugging later on.
No error message
If you have error_reporting or display_errors disabled per php.ini,
then no warning will show up. But ignoring errors won't make the problem go
away. Headers still can't be sent after premature output.
So when header("Location: ...") redirects silently fail it's very
advisable to probe for warnings. Reenable them with two simple commands
atop the invocation script:
error_reporting(E_ALL);
ini_set("display_errors", 1);
Or set_error_handler("var_dump"); if all else fails.
Speaking of redirect headers, you should often use an idiom like
this for final code paths:
exit(header("Location: /finished.html"));
Preferably even a utility function, which prints a user message
in case of header() failures.
Output buffering as a workaround
PHPs output buffering
is a workaround to alleviate this issue. It often works reliably, but shouldn't
substitute for proper application structuring and separating output from control
logic. Its actual purpose is minimizing chunked transfers to the webserver.
The output_buffering=
setting nevertheless can help.
Configure it in the php.ini
or via .htaccess
or even .user.ini on
modern FPM/FastCGI setups.
Enabling it will allow PHP to buffer output instead of passing it to the webserver instantly. PHP thus can aggregate HTTP headers.
It can likewise be engaged with a call to ob_start();
atop the invocation script. Which however is less reliable for multiple reasons:
Even if <?php ob_start(); ?> starts the first script, whitespace or a
BOM might get shuffled before, rendering it ineffective.
It can conceal whitespace for HTML output. But as soon as the application logic attempts to send binary content (a generated image for example),
the buffered extraneous output becomes a problem. (Necessitating ob_clean()
as a further workaround.)
The buffer is limited in size, and can easily overrun when left to defaults.
And that's not a rare occurrence either, difficult to track down
when it happens.
Both approaches therefore may become unreliable - in particular when switching between
development setups and/or production servers. This is why output buffering is
widely considered just a crutch / strictly a workaround.
See also the basic usage example
in the manual, and for more pros and cons:
What is output buffering?
Why use output buffering in PHP?
Is using output buffering considered a bad practice?
Use case for output buffering as the correct solution to "headers already sent"
But it worked on the other server!?
If you didn't get the headers warning before, then the output buffering
php.ini setting
has changed. It's likely unconfigured on the current/new server.
Checking with headers_sent()
You can always use headers_sent() to probe if
it's still possible to... send headers. Which is useful to conditionally print
info or apply other fallback logic.
if (headers_sent()) {
die("Redirect failed. Please click on this link: <a href=...>");
}
else{
exit(header("Location: /user.php"));
}
Useful fallback workarounds are:
HTML <meta> tag
If your application is structurally hard to fix, then an easy (but
somewhat unprofessional) way to allow redirects is injecting a HTML
<meta> tag. A redirect can be achieved with:
<meta http-equiv="Location" content="http://example.com/">
Or with a short delay:
<meta http-equiv="Refresh" content="2; url=../target.html">
This leads to non-valid HTML when utilized past the <head> section.
Most browsers still accept it.
JavaScript redirect
As alternative a JavaScript redirect
can be used for page redirects:
<script> location.replace("target.html"); </script>
While this is often more HTML compliant than the <meta> workaround,
it incurs a reliance on JavaScript-capable clients.
Both approaches however make acceptable fallbacks when genuine HTTP header()
calls fail. Ideally you'd always combine this with a user-friendly message and
clickable link as last resort. (Which for instance is what the http_redirect()
PECL extension does.)
Why setcookie() and session_start() are also affected
Both setcookie() and session_start() need to send a Set-Cookie: HTTP header.
The same conditions therefore apply, and similar error messages will be generated
for premature output situations.
(Of course, they're furthermore affected by disabled cookies in the browser
or even proxy issues. The session functionality obviously also depends on free
disk space and other php.ini settings, etc.)
Further links
Google provides a lengthy list of similar discussions.
And of course many specific cases have been covered on Stack Overflow as well.
The WordPress FAQ explains How do I solve the Headers already sent warning problem? in a generic manner.
Adobe Community: PHP development: why redirects don't work (headers already sent)
Nucleus FAQ: What does "page headers already sent" mean?
One of the more thorough explanations is HTTP Headers and the PHP header() Function - A tutorial by NicholasSolutions (Internet Archive link).
It covers HTTP in detail and gives a few guidelines for rewriting scripts.
This error message gets triggered when anything is sent before you send HTTP headers (with setcookie or header). Common reasons for outputting something before the HTTP headers are:
Accidental whitespace, often at the beginning or end of files, like this:
<?php
// Note the space before "<?php"
?>
       To avoid this, simply leave out the closing ?> - it's not required anyways.
Byte order marks at the beginning of a php file. Examine your php files with a hex editor to find out whether that's the case. They should start with the bytes 3F 3C. You can safely remove the BOM EF BB BF from the start of files.
Explicit output, such as calls to echo, printf, readfile, passthru, code before <? etc.
A warning outputted by php, if the display_errors php.ini property is set. Instead of crashing on a programmer mistake, php silently fixes the error and emits a warning. While you can modify the display_errors or error_reporting configurations, you should rather fix the problem.
Common reasons are accesses to undefined elements of an array (such as $_POST['input'] without using empty or isset to test whether the input is set), or using an undefined constant instead of a string literal (as in $_POST[input], note the missing quotes).
Turning on output buffering should make the problem go away; all output after the call to ob_start is buffered in memory until you release the buffer, e.g. with ob_end_flush.
However, while output buffering avoids the issues, you should really determine why your application outputs an HTTP body before the HTTP header. That'd be like taking a phone call and discussing your day and the weather before telling the caller that he's got the wrong number.
I got this error many times before, and I am certain all PHP programmer got this error at least once before.
Possible Solution 1
This error may have been caused by the blank spaces before the start of the file or after the end of the file.These blank spaces should not be here.
ex)
THERE SHOULD BE NO BLANK SPACES HERE
echo "your code here";
?>
THERE SHOULD BE NO BLANK SPACES HERE
Check all files associated with file that causes this error.
Note: Sometimes EDITOR(IDE) like gedit (a default linux editor) add one blank line on save file. This should not happen. If you are using Linux. you can use VI editor to remove space/lines after ?> at the end of the page.
Possible Solution 2:
If this is not your case, then use ob_start to output buffering:
<?php
ob_start();
// code
ob_end_flush();
?>
This will turn output buffering on and your headers will be created after the page is buffered.
Instead of the below line
//header("Location:".ADMIN_URL."/index.php");
write
echo("<script>location.href = '".ADMIN_URL."/index.php?msg=$msg';</script>");
or
?><script><?php echo("location.href = '".ADMIN_URL."/index.php?msg=$msg';");?></script><?php
It'll definitely solve your problem.
I faced the same problem but I solved through writing header location in the above way.
You do
printf ("Hi %s,</br />", $name);
before setting the cookies, which isn't allowed. You can't send any output before the headers, not even a blank line.
COMMON PROBLEMS:
(copied from: source)
====================
1) there should not be any output (i.e. echo.. or HTML codes) before the header(.......); command.
2) remove any white-space(or newline) before <?php and after ?> tags.
3) GOLDEN RULE! - check if that php file (and also, if you include other files) have UTF8 without BOM encoding (and not just UTF-8). That is problem in many cases (because UTF8 encoded file has something special character in the start of php file, which your text-editor doesnt show)!!!!!!!!!!!
4) After header(...); you must use exit;
5) always use 301 or 302 reference:
header("location: http://example.com", true, 301 ); exit;
6) Turn on error reporting, and find the error. Your error may be caused by a function that is not working. When you turn on error reporting, you should always fix top-most error first. For example, it might be "Warning: date_default_timezone_get(): It is not safe to rely on the system's timezone settings." - then farther on down you may see "headers not sent" error. After fixing top-most (1st) error, re-load your page. If you still have errors, then again fix the top-most error.
7) If none of above helps, use JAVSCRIPT redirection(however, strongly non-recommended method), may be the last chance in custom cases...:
echo "<script type='text/javascript'>window.top.location='http://website.com/';</script>"; exit;
It is because of this line:
printf ("Hi %s,</br />", $name);
You should not print/echo anything before sending the headers.
A simple tip: A simple space (or invisible special char) in your script, right before the very first <?php tag, can cause this !
Especially when you are working in a team and somebody is using a "weak" IDE or has messed around in the files with strange text editors.
I have seen these things ;)
Another bad practice can invoke this problem which is not stated yet.
See this code snippet:
<?php
include('a_important_file.php'); //really really really bad practise
header("Location:A location");
?>
Things are okay,right?
What if "a_important_file.php" is this:
<?php
//some php code
//another line of php code
//no line above is generating any output
?>
----------This is the end of the an_important_file-------------------
This will not work? Why?Because already a new line is generated.
Now,though this is not a common scenario what if you are using a MVC framework which loads a lots of file before handover things to your controller? This is not an uncommon scenario. Be prepare for this.
From PSR-2 2.2 :
All PHP files MUST use the Unix LF (linefeed) line ending.
All PHP files MUST end with a single blank line.
The closing ?> tag MUST be omitted from files containing only php
Believe me , following thse standards can save you a hell lot of hours from your life :)
Sometimes when the dev process has both WIN work stations and LINUX systems (hosting) and in the code you do not see any output before the related line, it could be the formatting of the file and the lack of Unix LF (linefeed)
line ending.
What we usually do in order to quickly fix this, is rename the file and on the LINUX system create a new file instead of the renamed one, and then copy the content into that. Many times this solve the issue as some of the files that were created in WIN once moved to the hosting cause this issue.
This fix is an easy fix for sites we manage by FTP and sometimes can save our new team members some time.
Generally this error arise when we send header after echoing or printing. If this error arise on a specific page then make sure that page is not echoing anything before calling to start_session().
Example of Unpredictable Error:
<?php //a white-space before <?php also send for output and arise error
session_start();
session_regenerate_id();
//your page content
One more example:
<?php
includes 'functions.php';
?> <!-- This new line will also arise error -->
<?php
session_start();
session_regenerate_id();
//your page content
Conclusion: Do not output any character before calling session_start() or header() functions not even a white-space or new-line