YQL robots.txt restricted URL issues - Sinatra

I'm developing a webapp that includes the following YQL query:
SELECT * FROM html WHERE url="{URL}" and xpath="*"
I deployed a new version last week and noticed that the page was hanging on the YQL query. When I came back yesterday, the problem seemed to have fixed itself over the weekend. I just deployed a new version to the server and the problem has come back. The server stack is Nginx / Passenger / Sinatra.
Punching the query into the YQL Console, I get an error:
"Requesting a robots.txt restricted URL:"
I've added the following robots.txt:
User-agent: Yahoo Pipes 2.0
Allow: /
But that doesn't seem to do anything.
Thoughts? It's pretty curious to me why YQL reports the URL as robots.txt restricted when it isn't.

I've had the same problem. I have a suspicion that this is in part a problem on Yahoo's end.
In my Sinatra apps I added...
get '/robots.txt' do
  content_type 'text/plain'    # serve as plain text
  "User-agent: *\nAllow: /"    # each directive must be on its own line
end
This would work occasionally... and then access would be denied again for a period of time.
If you are using this to avoid cross-domain issues with JavaScript... I eventually gave in and used a local PHP script to retrieve the data rather than use YQL.

Consider adding diagnostics=true to the YQL query. It worked for me.
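For reference, a minimal sketch of what that looks like against YQL's public REST endpoint, assuming Python 2's urllib2 (which another answer below also uses); the target URL is a placeholder:

import json
import urllib
import urllib2

# Build the YQL query and ask for diagnostics in the JSON response.
query = 'SELECT * FROM html WHERE url="http://example.com" AND xpath="*"'
params = urllib.urlencode({'q': query, 'format': 'json', 'diagnostics': 'true'})
response = urllib2.urlopen('http://query.yahooapis.com/v1/public/yql?' + params)
data = json.loads(response.read())

# The diagnostics element reports redirects, timings, and robots.txt checks,
# which is what you need to see why a fetch was refused.
print data['query']['diagnostics']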

Related

In Safari and Firefox, missing trailing slashes ruin requests to a server with axios

I am building an app with React and using axios there to work with an API (built with Python).
I had a weird bug, and luckily I found its cause: all requests without a trailing slash failed with a 401 error in Safari and Firefox, e.g. /users
One of my requests happened to have the slash and worked well, so when I added slashes to my other requests it made them work too, e.g. /users/ etc.
It's not hard to add it, but sometimes, when I am passing ids for example, it forces me to write /users/${id}/ or '/users/' + id + '/', which is not cool.
My question is whether this is a browser bug, an axios bug, or something that could be solved on the backend server.
I asked the backend developer to check this, and he fixed it on his side.
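For anyone hitting the same thing, a minimal sketch of one possible backend fix, assuming the API is built with Flask (the question only says Python, so the framework and routes here are assumptions):

from flask import Flask, jsonify

app = Flask(__name__)

# strict_slashes=False makes Flask match the URL with or without the
# trailing slash, instead of redirecting or rejecting the request.
@app.route('/users', strict_slashes=False)
def users():
    return jsonify(users=[])

@app.route('/users/<int:user_id>', strict_slashes=False)
def user(user_id):
    return jsonify(id=user_id)

Django, for comparison, has a built-in APPEND_SLASH setting that redirects requests to the slashed URL.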

Facebook loads HTTPS hosted iframe apps via HTTP POST (S3 & CloudFront errors)

I have been trying to write a bucket policy that will allow X-HTTP-Method-Override, because my research shows that Facebook loads HTTPS-hosted iframe apps via HTTP POST instead of HTTP GET, which causes S3 and CloudFront errors.
Can anyone please help me with this problem?
This is what's returned from S3 if I served my Facebook app directly from S3:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>MethodNotAllowed</Code>
  <Message>The specified method is not allowed against this resource.</Message>
  <ResourceType>OBJECT</ResourceType>
  <Method>POST</Method>
  <RequestId>B21565687724CCFE</RequestId>
  <HostId>HjDgfjr4ktVxqlIBeIlvXT3UzBNuPg8b+WbhtNHOvNg3cDNpfLH5GIlyUUpJKZzA</HostId>
</Error>
This is what's returned from CloudFront if I served my Facebook app from CloudFront with S3 as the origin:
ERROR
The request could not be satisfied.
Generated by cloudfront (CloudFront)
I think the solution should be to write a bucket policy that makes use of X-HTTP-Method-Override... Probably I am wrong though. A solution to this problem would be highly appreciated.
After trying many different ways to get this to work, it turns out that it simply is not possible to make a POST to static content work on S3 as things stand. Even if you allow POST through CloudFront, enable CORS, and change the bucket policy so that the CloudFront origin identity can GET/PUT, etc., it will still throw an error.
As an aside, S3 is not the only thing that balks at responding to such a POST request to static content. If you configure nginx as an origin for a Facebook iframe you will get the same 405 error, though you can work around that problem in a couple of ways (essentially rewriting it to a GET under the covers). You can also change the page (though still static) to be a dynamic extension (.aspx or .php) to work around the issue with nginx.
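For example, a hedged sketch of that nginx workaround; the location and root paths are placeholders, and the error_page trick simply re-serves the requested static file with a 200 when the POST would otherwise return a 405:

# Let nginx answer a POST for static content instead of returning 405.
location /facebook/ {
    root /var/www/static;        # placeholder document root
    error_page 405 =200 $uri;    # re-serve the file with a 200 status
}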
You can host all your other content on S3 of course, and just move the page that you POST to onto a different origin. With a decent cache time you should see minimal traffic, but it will mean keeping your content in two places. What I ended up doing was:
- Creating EC2 instances in an autoscaling group (just in case) to serve the content
- Using a cron job to sync the content from S3 every 5 minutes (see the sketch below)
- No change in workflow was required (still just upload content to S3)
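A minimal sketch of that sync job, assuming the AWS CLI is installed; the bucket name and web root are placeholders:

# Crontab entry: mirror the bucket to the local web root every 5 minutes.
*/5 * * * * aws s3 sync s3://my-bucket /var/www/html --delete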
It's not ideal, nor is it particularly efficient, but hopefully it will save others a lot of fruitless testing trying to get this to work on S3 alone.
You can set your CloudFront distribution to allow POST methods.
Go into your dashboard and edit the Behavior for the distribution, then select Allowed HTTP Methods - GET, HEAD, PUT, POST, PATCH, DELETE, OPTIONS.
This allows the POST from Facebook to go through to your origin.
I was fighting with S3 and CloudFront for the last couple of days, and I can confirm that no bucket policy lets us redirect POST calls from Facebook to S3-hosted static (JS-enriched) content.
The only solution seems to be the one Adam Comerford mentioned in this thread:
having a light application that receives the Facebook calls and then fetches the content from S3 or CloudFront.
If anyone has any other solution or idea it will be appreciated.
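A minimal sketch of that light-application approach, assuming Flask and Python 2's urllib2; the route and bucket URL are placeholders, not anything from the original thread:

from flask import Flask
import urllib2

app = Flask(__name__)

# Placeholder: the static page you would otherwise serve straight from S3.
S3_URL = 'https://my-bucket.s3.amazonaws.com/facebook/index.html'

@app.route('/facebook/', methods=['GET', 'POST'])
def facebook_tab():
    # Answer Facebook's POST (or a plain GET) by returning the static
    # page fetched from S3; add caching in front of this in practice.
    return urllib2.urlopen(S3_URL).read()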
You can't change POST to GET - that's how Facebook loads the app page, because it also sends data about the current user in the POST body (see signed_request for more details). I would suggest you look into fixing your app to make sure it properly responds to POST requests.

HTTP error: 403 while parsing a website

So I'm trying to scrape this website, http://dl.acm.org/dl.cfm. The site doesn't allow web scrapers, hence I get an HTTP 403 Forbidden error.
I'm using Python, so I tried mechanize to automate filling the form (or clicking a button), but again I got the same error.
I can't even open the HTML page using the urllib2.urlopen() function; it gives the same error.
Can anyone help me with this problem?
If the website doesn't allow web scrapers/bots, you shouldn't be using bots on the site to begin with.
But to answer your question, I suspect the website is blocking urllib's default user-agent. You're probably going to have to spoof the user-agent to a known browser by crafting your own request.
import urllib2

# Spoof a browser user-agent so the site doesn't reject urllib2's default one.
headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
req = urllib2.Request("http://dl.acm.org/dl.cfm", headers=headers)
html = urllib2.urlopen(req).read()
EDIT: I tested this and it works. The site actively blocks based on user-agent to stop badly made bots from ignoring robots.txt.

Tab Page Error: The requested method GET is not allowed

I have just set up a custom tab on my page for the first time. I have thoroughly followed the setup guide and seem to have everything on the Facebook side set up correctly.
However when I view my page it throws the following error:
Method Not Allowed
The requested method GET is not allowed for the URL /Facebook/index.html.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
Apache/1.3.41 Server at feebnaturals.com.au Port 80
I believe it may be some kind of Apache server config issue, however I'm not that Apache savvy, so not sure where to start.
I had the same problem, but instead of GET it was the POST method that was not allowed. This is a setting on your server. I'm not server savvy myself, but it seems my provider didn't allow the method on .html pages, while it had no problem doing the same for .php pages. So all I did was rename my page from .html to .php, update the app settings in Facebook, and everything works fine now.
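If renaming isn't an option, a hedged alternative sketch using an .htaccess rewrite, assuming mod_rewrite is available; the paths mirror the URL from the error above, and index.php is a wrapper you would create (e.g. the same page saved with a .php extension):

# Internally route a POST for the static page to a PHP wrapper so
# Apache doesn't reject the method on the .html file.
RewriteEngine On
RewriteCond %{REQUEST_METHOD} POST
RewriteRule ^Facebook/index\.html$ /Facebook/index.php [L]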
This is definitely an error on your side; check your server logs and see what they say. It looks like you've configured the page to only work via a POST request, and it's being requested via GET.

Why do I get this error "Unknown host http:80"?

I'm developing an application for BlackBerry. I'm displaying a webpage using Eclipse and the net.rim.device.api.browser.field.* API, and when I click a submit button in a form I get the error "Unknown host http:80". Can anyone help me?
Don't know anything about Blackberries, but it looks like you're entering a URL where your program is only expecting a host name.
It sounds like the form on the web page is not properly set up, causing the post action to target an invalid URL. It would help if you included the app code and the form HTML.
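For illustration, a purely hypothetical form whose action could be read as host "http" and port 80 by a lenient URL parser, next to a corrected version (both made up for this example):

<!-- Broken: a malformed action like this can be read as host "http", port 80 -->
<form action="http:80/submit" method="post">
  <input type="submit" value="Send">
</form>

<!-- Fixed: a fully qualified URL with a real host -->
<form action="http://example.com/submit" method="post">
  <input type="submit" value="Send">
</form>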
In this 2005 forum thread people complain about getting that kind of error on their Blackberries.
I'm on the server side, and I can see some proxy servers trying to access my server either with HTTP/1.0 and no HTTP_HOST (which my app requires) or with the wrong HTTP_HOST.
For example, I am getting requests for widgets.twimg.com, www.google-analytics.com, servedby.jumpdisplay.com. My server doesn't host those domains, so the response is obviously not any of the sites on the server; instead I'm giving back an error.
So it might be that your BlackBerry is not providing the right HTTP_HOST to the server (or any at all), and the server doesn't know what to do with it.
To me, that's the fault of the BlackBerry (or whatever proxy exists between you and the server).