I am trying to search for and extract content from a site using Perl's WWW::Mechanize. It worked fine in the beginning, but after a few executions I am getting 403 Forbidden instead of the search results:
use WWW::Mechanize;

my $m = WWW::Mechanize->new();
my $url = "http://site.com/search?q=$keyword";
$m->get($url);    # after a few runs this starts returning 403 Forbidden
my $c = $m->content;
print $c;
How can I solve this problem? Please give me some suggestions.
Before beginning to scrape a site, you should make sure that you are authorized to do so. Most sites have Terms of Service (TOS) that lay out how you may use the site. Most sites disallow automated access and place strong restrictions on their intellectual property.
A site can defend against unwanted access on three levels:
Conventions: The /robots.txt that almost every site has should be honored by your programs. Do not assume that a library you are using will take care of that; honoring the robots.txt is your responsibility (a minimal check is sketched below the excerpt). Here is an excerpt from the Stack Overflow robots.txt:
User-Agent: *
Disallow: /ask/
Disallow: /questions/ask/
Disallow: /search/
So it seems SO doesn't like bots asking questions, or using the site search. Who would have guessed?
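For illustration, here is a minimal TypeScript sketch of such a check: it fetches /robots.txt and does a naive Disallow-prefix match for the "User-agent: *" group before requesting a page. It is deliberately simplified (a real crawler should use a proper robots.txt parser and honor agent-specific groups and Allow rules), and the URL is just the placeholder from the question:

// Naive robots.txt check: refuse paths listed under "User-agent: *".
async function isDisallowed(pageUrl: string): Promise<boolean> {
  const { origin, pathname } = new URL(pageUrl);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false;                        // no robots.txt: nothing is forbidden
  let applies = false;
  for (const raw of (await res.text()).split("\n")) {
    const line = raw.split("#")[0].trim();
    if (/^user-agent:/i.test(line)) applies = /\*\s*$/.test(line);
    else if (applies && /^disallow:/i.test(line)) {
      const path = line.slice("disallow:".length).trim();
      if (path && pathname.startsWith(path)) return true;
    }
  }
  return false;
}

// Hypothetical usage with the URL pattern from the question:
isDisallowed("http://site.com/search?q=keyword").then((blocked) => {
  if (blocked) console.log("robots.txt disallows this path; do not fetch it");
});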
It is also expected that a developer will use the API and similar services to access the content. E.g. Stack Overflow has very customizable RSS feeds, publishes snapshots of the database, offers an online interface for DB queries, and has an API you can use.
Legal: (IANAL!) Before accessing a site for anything other than your personal, immediate consumption, you should read the TOS, or whatever they are called. They state if and how you may access the site and reuse content. Be aware that all content has some copyright. The copyright system is effectively global, so you aren't exempt from the TOS just by being in another country than the site owner.
You implicitly accept the TOS by using a site (by any means).
Some sites license their content to everybody. Good examples are Wikipedia and Stack Overflow, which license user submissions under CC-BY-SA (or rather, the submitting users license their content to the site under this license). They cannot restrict the reuse of content, but they can restrict access to that content. E.g. the Wikipedia TOS contains this section, Refraining from certain activities:
Engaging in Disruptive and Illegal Misuse of Facilities
[…]
Engaging in automated uses of the site that are abusive or disruptive of the services […]
[…] placing an undue burden on a Project website or the networks or servers connected with a Project website;
[…] traffic that suggests no serious intent to use the Project website for its stated purpose;
Knowingly accessing, […] or using any of our non-public areas in our computer systems without authorization […]
Of course, this is just meant to disallow a DDoS, but while bots are an important part of Wikipedia, other sites do tend to frown on them.
Technical measures: … like letting connections from an infringing IP time out, or sending a 403 error (which is very polite). Some of these measures may be automated (e.g. triggered by user-agent strings, weird referrers, URL hacking, fast requests) or applied by watchful sysadmins tailing the logs.
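Assuming you are allowed to crawl at all, the practical minimum for not tripping these automated defenses is to space out requests, identify your bot honestly, and stop as soon as you receive a 403 instead of hammering the site. A small TypeScript sketch of that idea (the URL list, delay, and User-Agent string are arbitrary placeholders; the same status check is available from the Mechanize response in the question's Perl code):

// Space out requests and give up on 403 instead of retrying.
const DELAY_MS = 5000;                                    // arbitrary politeness delay
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function politeFetch(urls: string[]): Promise<void> {
  for (const url of urls) {
    const res = await fetch(url, {
      headers: { "User-Agent": "my-bot/1.0 (contact: me@example.com)" },
    });
    if (res.status === 403) {
      console.error(`Got 403 for ${url}; the site is telling you to stop.`);
      return;                                             // do not keep retrying a block
    }
    console.log(url, res.status);
    await sleep(DELAY_MS);
  }
}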
If the TOS etc. don't make it clear that you may use a bot on the site, you can always ask the site owner for written permission to do so.
If you think there was a misunderstanding, and you are being blocked despite regular use of a site, you can always contact the owner/admin/webmaster and ask them to re-open your access.
We have a corporate website that receives external email, processes them, and shows them in the browser to the user. We will be showing the emails in HTML format if they are available in this format. However, this basically means that we will be showing user-generated HTML code (you could send any HTML in an email, as far as I know).
What are the security risks here? What steps to take in order to minimize these risks?
I can currently think of:
Removing all javascript
Perhaps removing external CSS? Not sure if this is a security risk
Not loading images (to limit tracking... not sure if this poses a security risk or just a privacy risk)
Would that be all? Removing HTML tags is always error-prone, so I am wondering if there is a better way to somehow disable external scripts when displaying e-mail?
The security risks are, as far as I know, the same as with Cross-Site-Scripting (XSS).
OWASP describes the risks as follows:
XSS can cause a variety of problems for the end user that range in severity from an annoyance to complete account compromise. The most severe XSS attacks involve disclosure of the user’s session cookie, allowing an attacker to hijack the user’s session and take over the account. Other damaging attacks include the disclosure of end user files, installation of Trojan horse programs, redirect the user to some other page or site, or modify presentation of content. An XSS vulnerability allowing an attacker to modify a press release or news item could affect a company’s stock price or lessen consumer confidence.
Source
Defending against it requires layers of defense, such as but not limited to:
Sanitizing the HTML with something like DOMPurify.
Marking security-sensitive cookies as HttpOnly so they can't be read from JavaScript. Source
Adding a Content Security Policy so the browser only trusts scripts from domains you tell it to trust (a minimal sketch follows this list). Source
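As a sketch of the last two points, here is how the page that displays the email could be served with a restrictive CSP and an HttpOnly session cookie in Node/TypeScript; the header values and cookie name are only examples, not a policy recommendation for your exact site:

// Restrictive CSP plus an HttpOnly cookie on the page that embeds the email.
import * as http from "node:http";

http.createServer((req, res) => {
  res.setHeader(
    "Content-Security-Policy",
    "default-src 'self'; script-src 'self'; img-src 'none'; frame-src 'self'"
  );
  res.setHeader("Set-Cookie", "session=abc123; HttpOnly; Secure; SameSite=Lax");
  res.setHeader("Content-Type", "text/html; charset=utf-8");
  res.end("<html><body><!-- sanitized email fragment goes here --></body></html>");
}).listen(8080);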
Depending on your requirements it might also be possible to load the email content into a sandboxed iframe as an additional security measure. This can be done like this:
// Sanitize first, then hand the result to a sandboxed iframe so scripts cannot run
var sanitizedHTML = DOMPurify.sanitize('<div>...</div>');
var iframe = document.getElementById('iframeId');
iframe.setAttribute('sandbox', '');   // empty sandbox: no scripts, no forms, no same-origin access
iframe.srcdoc = sanitizedHTML;
I'm in a conundrum, and could really use some help...
I'm having difficulty finding information on how to let a site that already sends X-FRAME-OPTIONS: SAMEORIGIN be loaded into an iframe from a couple of specific domains (i.e. domain.com would be the common parent domain). This would be quite simple with X-FRAME-OPTIONS: ALLOW-FROM http://domain.com if that were the only domain that would ever need to frame the target site. In reality, however, I need to set it up for (currently) three sub-domains of the original domain (i.e. example1.domain.com, example2.domain.com, and example3.domain.com), with the possibility of allowing even more in the future, so they can all load the site inside the intended iframe. The only info I've found so far is quite outdated and says there is no possible way to allow a wildcard reference (or any other form of multiple-domain reference) covering a domain and its sub-domains that both functions as intended and still prevents clickjacking by malicious individuals. I was hoping someone more knowledgeable (and better versed in X-FRAME-OPTIONS) than myself might be able to offer a feasible resolution.
Thanks in advance.
If you can entertain approaches outside of X-Frame-Options, consider creating a server-to-server API that can be called to access the content in question, and then allow it to be displayed without requiring framing.
That is, instead of ClientSite containing an IFRAME referencing the FramedPage which does the page assembly within the web browser, ClientSite calls an API on the backend to get the content directly from you and inserts the content into the page on the server before delivering the page to the user's web browser.
This gives you substantial flexibility. You could require an API key, apply basic server-to-server IP address whitelisting, or whatever suits, to prevent unwanted callers of your API.
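A minimal TypeScript sketch of that idea (the endpoint, API key handling, and page layout are made up for illustration): ClientSite's server fetches the fragment from your API and embeds it before the page ever reaches the browser, so X-Frame-Options never comes into play.

// ClientSite's backend: fetch the content server-to-server and inline it.
const CONTENT_API = "https://provider.example/api/content/123";   // hypothetical endpoint
const API_KEY = process.env.CONTENT_API_KEY ?? "";

async function renderPage(): Promise<string> {
  const res = await fetch(CONTENT_API, {
    headers: { Authorization: `Bearer ${API_KEY}` },   // provider can also whitelist source IPs
  });
  if (!res.ok) throw new Error(`content API returned ${res.status}`);
  const fragment = await res.text();
  // No iframe: the fragment is assembled into the page on the server.
  return `<html><body><main>${fragment}</main></body></html>`;
}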
According to Roy Fielding's Hypermedia As The Engine of Application State (HATEOAS), each resource should be accompanied by a list of actions (or links) that can be performed on that resource.
If the actions are included in the entity (as opposed to using the links attribute of JSON Schema), how do I tell the user-agent that a specific option is not available to the authenticated user?
The backend could do the filtering, but then the same resource URL could have different representations depending on the authenticated user. And this does not seem REST-friendly or caching friendly.
The other option is to leave all links, and let the user-agent receive a 403 Forbidden when the action is not available to the authenticated user. This can be annoying to the user.
How to inform the user-agent on the available actions when those can change depending on the authenticated user, while remaining REST-friendly?
You are correct. Creating representations that vary based on user permission is not particularly cache-friendly. Is it possible to classify permission variants into just a few categories? E.g. resource-low-security, resource-medium-security, resource-high-security.
Sometimes this approach is possible, sometimes it is not. The other aspect to consider is whether caching is critical for this particular resource. Maybe it is not?
Also, it is not necessary to wait until the user clicks on a link to find out whether the user has permission to follow it. The client could perform an OPTIONS request on links in the background to discover which links are available and dynamically disable the links that are not accessible.
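A minimal TypeScript sketch of that background probe (the URL, status handling, and element id are placeholders; your API could equally answer with an Allow header that the client inspects):

// Probe a link with OPTIONS and disable it in the UI if the server says no.
async function linkIsAllowed(url: string): Promise<boolean> {
  const res = await fetch(url, { method: "OPTIONS", credentials: "include" });
  return res.status !== 401 && res.status !== 403;
}

linkIsAllowed("/orders/42/cancel").then((allowed) => {
  const link = document.getElementById("cancel-link");
  if (link && !allowed) link.setAttribute("aria-disabled", "true");
});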
There is no single answer to this problem. Different solutions will work in different cases depending on the requirements.
Consider that a REST API is a website for a robot to browse.
Do websites return HTML resources (pages) containing links that you're not permitted to see?
Whether it does or not, it doesn't change how "hypermediary" the website is.
but then the same resource URL could have different representations depending on the authenticated user
Consider the same about the homepage of a website. A resource is conceptual: the home page is the concept; what it looks like changes.
How does the web deal with the caching of pages for logged-in and logged-out views?
The first way is to bar caching of those resources; not everything must be cacheable, the constraint is simply that resources can be labeled accordingly.
The second is using control semantics, or headers if you're using HTTP for your REST API.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Vary
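A minimal Node/TypeScript sketch of those control semantics (the resource and link shape are made up): mark the representation as varying per user, or keep it out of shared caches entirely with Cache-Control: private.

import * as http from "node:http";

http.createServer((req, res) => {
  res.setHeader("Vary", "Authorization");                  // representation depends on who asks
  res.setHeader("Cache-Control", "private, max-age=60");   // only the client's own cache may store it
  res.setHeader("Content-Type", "application/json");
  res.end(JSON.stringify({ _links: { self: { href: "/orders/42" } } }));
}).listen(8080);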
I'm developing a small CMS in PHP and we're adding social integration.
The content is changed by a single administrator who has the right to publish news, events and so on...
I'd like to add this feature: when the admin publishes something, it is also posted to the Facebook wall. I'm not very familiar with the Facebook PHP SDK, and I'm a little bit confused about it.
If (for example) 10 different sites are using my CMS, do I have to create 10 different Facebook applications? (Let's assume the 10 websites are all on different domains and servers.)
Second, is there a way to authenticate with just PHP (something like sending username & password directly) so that the user does not need to be logged in to Facebook?
Thanks
You might want to break up your question into smaller understandable units. It's very difficult to understand what you are driving at.
My understanding of your problem could be minimal, but here goes...
1. No, you do not need to create 10 different Facebook applications. Create a single Facebook application and make it a service entry point, so that all your CMS sites talk to this one site to interact with Facebook (a REST service layer).
2. The Facebook API does not support username-and-password authentication; it only supports OAuth 2.0. Although OAuth is not trivial, they provide a library for it, so implementing authentication is pretty trivial.
Please read up on http://developers.facebook.com/docs/.
It's really easy, straightforward, and well explained.
Your question is so vague and extensive that it cannot be answered well here.
If you experience any specific implementation problems, this is the right place.
However, to answer at least a part of your question:
The most powerful tool when working with facebook applications is the Graph API.
Its principle is very simple. You can perform almost any action on behalf of any user or application. You have to generate a token first that identifies the user and the proper permissions. Those tokens can be made "permanent" so you can do background tasks. Usually they are only active for a very short time, so you can perform actions while interacting with the user. The process of generating tokens involves the user, who has to confirm the privileges you are asking for.
For websites that publish something automatically, you would probably generate a permanent token one time, which stays active until you remove the app in your privacy settings.
Basically you can work with any application on any website; there is no limitation. However, there are two ways of generating tokens. One involves an additional request, and one is done client-side, which is bound to the one domain you specified in your app's settings.
Addendum:
@ArtoAle
You are right about every app being assigned to exactly one domain. However, once you have obtained a valid token, it doesn't matter from where or by whom it is used within the Graph API.
Let me explain this a little bit:
Restricting that would make no sense, since it is you doing the request. There is no such thing as "where the request is coming from". Of course there is the "Referer" header, but it can be freely specified and is not used in any context of this.
The domain you enter in your app's settings only restricts where Facebook redirects the user to.
Why?
This ensures that some bad guy cannot set up a website on another domain, let the user authorize an app, and get an access token with YOUR application.
So this setting ensures that the user and the access token are redirected back to YOUR site and not to another, bad site.
But there is an alternative. If you use the control flow for desktop applications, you don't get an access token right after the user has been redirected back. You get a temporary SESSION TOKEN that you can EXCHANGE for an access token. This exchange is done server-side over the REST API and requires your application secret. So at this point it is ensured that it is YOU who gets the token.
This method can be done on any domain, or, in the case of desktop applications, on no domain at all.
This is a quote from the Facebook docs:
To convert sessions, send a POST request to https://graph.facebook.com/oauth/exchange_sessions with a comma-separated list of sessions you want to convert:

curl -F client_id=your_app_id \
     -F client_secret=your_app_secret \
     -F sessions=2.DbavCpzL6Yc_XGEI0Ip9GA__.3600.1271649600-12345,2.aBdC... \
     https://graph.facebook.com/oauth/exchange_sessions

The response from the request is a JSON array of OAuth access tokens in the same order as the sessions given:

[
  {
    "access_token": "...",
    "expires": 1271649600
  },
  ...
]
However, you don't need this method, as it is a bit more complex. For your use case I would suggest using a central point of authorization.
So you would specify your ONE domain as the redirect URL. This domain is then SHARED between your websites. There you can obtain the fully valid access token and seamlessly redirect the user back to your specific project website, passing along the access token.
This way you can use the traditional, easy authentication flow, which is probably also more future-proof.
The fact remains: once the access token is generated, you can perform any action from any domain; there is no difference, as there is literally no "domain" where the request is coming from (see above).
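To make that concrete, here is a minimal TypeScript sketch of publishing with a stored token. The /me/feed endpoint and fields are from the Graph API as documented at that time, so treat them as an assumption to verify against the current docs; the token, message, and error handling are placeholders, and the PHP SDK wraps the same HTTP call.

const ACCESS_TOKEN = process.env.FB_ACCESS_TOKEN ?? "";   // token obtained via the flow above

async function publishToWall(message: string): Promise<void> {
  // Works from any of your servers; no domain check applies to the API call itself.
  const res = await fetch("https://graph.facebook.com/me/feed", {
    method: "POST",
    body: new URLSearchParams({ message, access_token: ACCESS_TOKEN }),
  });
  if (!res.ok) throw new Error(`Graph API returned ${res.status}`);
}

publishToWall("New article published on the CMS").catch(console.error);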
Apart from that, if you want some nice JavaScript features to work, like the comments box or the Like button, you need to set up Open Graph tags correctly.
If you have specific implementation problems or, as you said, "domain errors", please describe them more clearly: include the steps you took and, if possible, an error message.
Our client needs shortcuts to particular pages.
We need to redirect non-existent URLs like
http://site.com/promotion1
to the actual URL similar to
http://site.com/promotions/promotion1/tabid/799/language/en-AU/Default.aspx
...
I've sent a list of appropriate DNN modules to our client but it may take them forever to get back to me.
In the meantime they are still submitting requests to us to create the redirects for them.
If there's no cost involved then I won't have to wait for them to get back to me.
So I'm looking for a quick and free way to enable the clients to set these up on their own.
I've looked at:
MAS.ActionRedirect
Ventrian Friendly URL Provider
DotNetNuke URL Rewriting HTTP Module
But I haven't had much luck in the small amount of time I have available.
Has anyone got some suggestions on how to achieve our goal with either the above resources or maybe some additional resource I haven't found yet?
(DNN v4.9)
You should be able to use the built-in friendly URL functionality within DNN, or use a URL rewriter module within IIS.
You can read my answer about using the DNN Friendly URL functionality for more details, or look into the IIS URL Rewrite module.
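If you go the IIS route, a redirect like the one in the question can be expressed as a URL Rewrite rule in web.config (inside the system.webServer section); the pattern and target below simply mirror the example URLs above, so adjust them per promotion:

<rewrite>
  <rules>
    <!-- /promotion1 -> the real DNN page -->
    <rule name="Promotion shortcut" stopProcessing="true">
      <match url="^promotion1$" />
      <action type="Redirect"
              url="/promotions/promotion1/tabid/799/language/en-AU/Default.aspx"
              redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>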