Auto submit form for web crawling - forms

I've got an old ASPX+XML website created by an external agency here. I only have access to sections of the XML as the web.config is locked.
I want to crawl this site to scrape the pages and capture the relational data. A blank search returns all the data, and from there a web crawler would be fine. However, I cannot find a web crawler that will submit the search. I've tried a JavaScript snippet that submits the form on page load, but this still does not work (I guess it's not fast enough).
The URL does not contain the query string (so I can't just do a blank search and copy the results URL, for example).
Any ideas?
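One possible approach, since a search on an ASPX page is normally an HTTP POST back to the same page, is to reproduce that POST directly instead of relying on JavaScript. The sketch below is only an illustration under assumptions: it collects every input on the page (including hidden fields such as __VIEWSTATE, which ASP.NET WebForms pages typically carry) and re-submits them. The field names in the usage note are hypothetical, and a real site may also need cookies or a specific submit button name.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> element on the page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are sent as empty strings.
            self.fields[name] = attr.get("value") or ""


def build_search_post(page_html, overrides):
    """Return the url-encoded POST body: the page's own fields plus overrides."""
    parser = FormFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields.update(overrides)
    return urlencode(fields).encode("ascii")


def run_blank_search(url, overrides):
    """Fetch the search page, then POST its form back (needs network access)."""
    with urlopen(url) as response:
        page_html = response.read().decode("utf-8", errors="replace")
    body = build_search_post(page_html, overrides)
    # Passing data= makes urllib send a POST request.
    with urlopen(Request(url, data=body)) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, run_blank_search(search_page_url, {"btnSearch": "Search"}) would return the results HTML, where search_page_url and btnSearch are placeholders for the real page address and the real name of the search button. The returned page can then be handed to an ordinary crawler or parser.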

AMP errors in web master tool

I have implemented AMP successfully for my web pages and Google has started indexing them, which I learned from the Webmaster tool. I am seeing some issues that appear and then disappear within a short span of time.
The issue logged is:
User authored JavaScript found on page
The pages don't contain any script tags except the schema markup.
The error shows up for only a few of the 120 pages, even though they all follow the same template.
I have a few more questions:
I have observed that some AMP URLs get redirected to the original page when the same AMP URL is opened in a web browser.
Is Google taking care of this, or is it on us to do the redirection?
I am planning to add sign-in and share buttons to my web pages, which will use JavaScript. But if I do so, I get a validation error. So what is the right approach?
Can anyone please help me with this?
Please ensure that all script tags are of type application/ld+json. There should be no executable code in these script tags.
Redirection is something you must be doing on your end. Google doesn't do any sort of redirection from AMP to non-AMP pages if the URL is hit directly. In fact, the URL scheme that Google uses in its carousel is entirely its own, and just includes the path to your page inside it, e.g. https://cdn.ampproject.org/v/www.yoursitehere.com/path/to/article.html
Social sharing using JavaScript inserted in the page is not allowed, as no author-written JavaScript is allowed. If you want social sharing, use a non-JavaScript implementation, or try out the amp-social-share component.
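The first point (script tags) can be pre-checked locally before Google reports it. Below is a rough sketch, not a replacement for the official AMP validator: it lists script tags that are neither of type application/ld+json nor loaded from cdn.ampproject.org. Treating everything else as suspicious is a simplifying assumption of this example.

```python
from html.parser import HTMLParser


class ScriptAudit(HTMLParser):
    """Record script tags that look like author-written JavaScript."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        script_type = (attr.get("type") or "").lower()
        src = attr.get("src") or ""
        if script_type == "application/ld+json":
            return  # structured data, not executable
        if src.startswith("https://cdn.ampproject.org/"):
            return  # AMP runtime or an AMP component
        # Keep the line number so the tag is easy to find in the template.
        self.suspicious.append((self.getpos()[0], attr))


def find_suspicious_scripts(page_html):
    """Return (line_number, attributes) for every script tag worth a look."""
    audit = ScriptAudit()
    audit.feed(page_html)
    return audit.suspicious
```

Running this over the handful of pages that report the error and comparing against a page that validates may show what differs between them, for instance a script injected by a plugin or a widget.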
Thanks for the response. Regarding the points you raised:
"Please ensure that all script tags are of type application/ld+json. There should be no executable code in these script tags." - I am not using any scripts at the moment other than the AMP ones.
"Redirection is something that you must be doing on your end. Google doesn't do any sort of redirection from AMP to non-amp pages if the URL is hit directly. In fact that URL schema that Google uses in their carousel is entirely their own, and just includes the path to your page inside it. E.g. https://cdn.ampproject.org/v/www.yoursitehere.com/path/to/article.html" - Understood.
"Social sharing using Javascript inserted in the page is not allowed, as no Javascript is allowed. If you want to use social sharing, use a non-javascript implementation, or try out the amp-social-share" - I have implemented amp-social-share and it is working fine.
Can we implement AMP for eCommerce sites, where a lot of JavaScript, forms and plugins may be involved? As far as I know, AMP aims to keep pages simple and therefore restricts JavaScript heavily, and even the form tag is not valid. So is there any chance we can implement AMP on eCommerce sites?

YQL console not returning complete html page

OK, I will keep it simple.
I put the following query in the YQL console:
select * from html where url="https://twitter.com/laurenlemon/status/470403949980549121"
The Twitter page in the query is a list of tweets that I want to pull using YQL.
The response in the console contained only the HTML tags and a little of the content of some of them, but not a single tweet from any user was visible in any HTML element of the response.
I don't know what I am doing wrong.
OK guys, I figured it out. YQL can only scrape content that is present in the initially loaded HTML, not content that is loaded afterwards by AJAX-like requests, so I had to go with Selenium together with PhantomJS, which can emulate a real browser, navigate to sites, and scrape them.
Anyone looking to scrape AJAX-loaded content can refer to the Selenium docs. It is very easy to use, and the docs offer a step-by-step guide to scraping AJAX content.

Error parsing input URL, no data was scraped. only with new pages on my site

The problem is that I own a website where other people can post content, which creates new pages on my domain. All the post pages created today are malfunctioning: sharing does not load the thumbnail picture, the title, and so on. The weird thing is that all the posts (new pages) created before today are working fine.
What caused an error to occur out of nowhere?
I also cannot debug any of my website's URLs, because I get the same error: Error parsing input URL, no data was scraped
The page I'm having problems with is here: http://www.vabameedia.ee/vm/184/h%C3%A4da-ei-anna-h%C3%A4beneda.html
This is one of the pages where it says there is no error on the page, but Facebook still can't reach it: http://www.vabameedia.ee/vm/178/craig-parks-%C3%BChek%C3%A4eline-krossisoitja.html
For people experiencing the same problem but for different causes: I discovered a few interesting things about how Facebook "scrapes" pages by checking the server logs while doing some trials.
First of all, if you have never tried to share a page with FB, FB has never tried to scrape it, and it will not try to do so if you only put the URL in the Debug tool.
That's the first reason you get the error: it just states that FB has no information on the page, and you must "force" it to scrape the page.
The first time you try to share a page, FB scrapes it (it asks your server for the first 40k of the page and analyses the Open Graph tags).
What can happen is that you do not see the image: Facebook Share Dialog does not display thumbnails on first load
The reason is that FB is still scraping your page and caching the image behind the scenes. The next time you share, the image is there as well.
How to solve it? Pre-caching: https://developers.facebook.com/docs/sharing/best-practices#precaching
or simply add the image dimensions:
<meta property="og:image:width" content="450"/>
<meta property="og:image:height" content="298"/>
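To see which Open Graph tags a page actually serves, a small local check can help before reaching for the Facebook debugger. This is a minimal sketch: the list of tags it looks for is my own choice for illustration, not an official Facebook requirement.

```python
from html.parser import HTMLParser

# Tags this example looks for; adjust the tuple to taste.
RECOMMENDED = ("og:title", "og:image", "og:image:width", "og:image:height")


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dictionary."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:"):
            # Keep the first occurrence if a property is repeated.
            self.tags.setdefault(prop, attr.get("content") or "")


def missing_og_tags(page_html, required=RECOMMENDED):
    """Return the names from `required` that the page does not declare."""
    parser = OpenGraphParser()
    parser.feed(page_html)
    return [name for name in required if name not in parser.tags]
```

An empty list means all the tags in RECOMMENDED are present. Anything returned is a candidate for the missing-thumbnail-on-first-share behaviour described above.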
I was pulling my hair out trying to fix this issue. Hours and hours of troubleshooting to no avail. After speaking with one of our programmers about an unrelated topic, I thought of something to try as a long shot.
Much to my surprise, it worked!!!
This is the reason behind the problem and my solution for it:
When you draft a post in WordPress it generates a link based on your article's title (unless you manually change it). The title of my article included special characters, however the auto-generated link didn't display these special characters, only hyphens to replace the spaces. Should be fine right? Wrong! Somewhere embedded in metadata and code in the WordPress platform are those special characters and they mess up the way Facebook pulls info from the article being linked to. This is a problem because certain special characters invalidate hyperlinks.
For example:
Article Title: R[eloaded]
Auto-generated hyperlink DISPLAYED in WordPress "Permalink" field: http://www.example.com/reloaded
Actual WordPress Auto-generated hyperlink: http://www.example.com/r[eloaded]
Those brackets will invalidate the link, and Facebook will be unable to pull any information (e.g. pictures) from it.
Solution:
(1) Simply, manually change the WordPress hyperlink address to something that doesn't include any special characters (this will not change the title of your article).
(2) Click "Update" to change the post to include the new hyperlink.
(3) Click "Purge from Cache" in the WordPress window
(4) Refresh your Facebook browser window
(5) Paste the new hyperlink for your article
(6) Enjoy your Facebook post with a preview image and information
Sidenote: Don't pull your hair out over Facebook, it's not worth it. =)
If you're using WordPress, edit the post in question to change the permalink (just alter it slightly), then update the post. Using the new permalink in the Facebook OG debugger should now work.
It's a weird fix, but I think it takes care of a problem caused by special characters in the title of a post, which is then used to build the permalink.
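A quick way to spot permalinks like the bracket example above is to scan the URL path for characters outside a conservative safe set. This sketch assumes that set is the RFC 3986 unreserved characters plus "/" and "%". That is a deliberately strict choice for illustration and not a statement of what Facebook's scraper accepts.

```python
import re
from urllib.parse import urlsplit

# Safe here means: letters, digits, "-", ".", "_", "~", the "/" separator,
# and "%" so that existing percent-escapes are not flagged.
UNSAFE_PATH_CHARS = re.compile(r"[^A-Za-z0-9\-._~/%]")


def unsafe_permalink_chars(url):
    """Return the sorted, de-duplicated unsafe characters found in the path."""
    path = urlsplit(url).path
    return sorted(set(UNSAFE_PATH_CHARS.findall(path)))
```

A permalink that returns a non-empty list is a candidate for the manual permalink edit described in the steps above.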
For me it was all down to a DNS issue. I was having the same problem and resolved it by updating the domain's name servers to the actual name servers.
In my case the domain was pointed to ns1.websterz.net and ns2.websterz.net, and on that server I had a DNS redirect to my other server (where the website is hosted). I just updated the domain's name servers to those of the server where my website is actually hosted. This was an account migration, and I had forgotten to update the name servers for the new server.
Everything works fine now.

Trying to pass URL from iFrame to SharePoint site URL?

I have an application running in an iframe that is embedded in a SharePoint site. The problem is that navigation within the application does not change the SharePoint site URL. Therefore, if you refresh the overall page, you are sent back to the default page of the application instead of staying on the page you were on. This is an issue for social media sharing. I have added a Facebook Share button to the application, but it pulls the URL of the application, which does not match or reference the URL of the overall site, so it just shares the application on its own (which is not visually appealing and does not let you access the rest of the site).
Does anybody have any suggestions, or know of a place I can go for help? Thanks!
If I understand properly, the Facebook stuff is INSIDE the iframe?
If so, you can:
* Remove the iframe and integrate the application better with SharePoint, or
* Change the application so that it detects when it's running "alone" (with JavaScript, for example), and if so redirects to the "big" application.
If the Facebook stuff is in SharePoint, OUTSIDE of the iframe, you can write some JavaScript to update the URL in some way that matches the URL of the application. This requires that the SharePoint parent application and the iframe application run in the same domain. If they do not, this is not an option.
Note that changing the "parent" URL with JS will reload the page, UNLESS you only change the part of the URL after the "#", so you can do something like:
"http://sharepoint/iframe.aspx?aa=11&bb=22#iframeUrl=http://uglyapplication/"
You'll also probably want to write JS to update your iframe accordingly when the user presses "back"/"forward" etc. in the browser, because changing the URL like above will still add a "step" to the browser history.
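The "#iframeUrl=" idea comes down to encoding the inner URL into the fragment and decoding it again on load. In the browser this would be done in JavaScript. The sketch below only shows the encode/decode logic in Python, and the function names are made up for this example. Percent-encoding the inner URL is my addition, so that its own "?" and "&" cannot be confused with the parent URL.

```python
from urllib.parse import parse_qs, quote, urlsplit, urlunsplit


def with_iframe_fragment(parent_url, iframe_url):
    """Return parent_url with the iframe location stored after the '#'."""
    parts = urlsplit(parent_url)
    # safe="" also escapes "/" so the whole inner URL becomes one opaque value.
    fragment = "iframeUrl=" + quote(iframe_url, safe="")
    return urlunsplit(parts._replace(fragment=fragment))


def iframe_url_from(parent_url):
    """Read the iframe location back out of the fragment, or None if absent."""
    values = parse_qs(urlsplit(parent_url).fragment).get("iframeUrl")
    return values[0] if values else None
```

On page load the parent would call the equivalent of iframe_url_from on its own address and, if it gets a value back, point the iframe at it instead of the application's default page.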

problem getting my domain redirect to update .html files in dropbox after used iweb SEO TOOL

So I created a nice 6-page website, hutchspropertyandtree.co.nr, using freedomain.co.nr with a Dropbox public folder. Everything was working and updating properly until I updated it with iWeb SEO Tool. I added meta and title tags as well as a description, etc. The PROBLEM is that even though my .html files in Dropbox are correct and show all the new code and tags, when I open my domain hutchspropertyandtree.co.nr it doesn't show any of my recent SEO tool updates.
I'm thinking that the cheap domain name from .co.nr is the problem. Is it possible that the default tags, titles and keywords entered into the co.nr website creation boxes are overwriting the newer ones in the HTML within my Dropbox?
But that still doesn't explain why a stat counter code and the Google Analytics code, in the footer and header respectively, do not show up when I view source in the browser.
PLEASE PLEASE HELP.
It's because the page at hutchspropertyandtree.co.nr uses a frame to show the content from another location. The meta information comes from the page with the frame, not the page in the frame. You should be able to see the content of the frame using an inspector (comes with all browsers these days) or "View frame source", if your browser does that.
Note that any search engine hits to your pages will link to the dropbox URL, not the frame page (that has essentially no content from the viewpoint of a search engine). If you want search engine results to show up under that domain, you'll have to get hosting that lets you point a domain directly to it.
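To confirm that a domain is only a frame wrapper, you can list the frame sources in the HTML it returns. This is a small sketch that works on HTML you have already downloaded. The wrapper markup in the usage note is invented for illustration and is not the actual co.nr page.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collect the src of every <frame> and <iframe> on a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def framed_sources(page_html):
    """Return the URLs that the page pulls in through frames."""
    finder = FrameFinder()
    finder.feed(page_html)
    return finder.sources
```

For a wrapper such as '<frameset rows="100%"><frame src="https://example.com/public/index.html"></frameset>', the function returns the single framed URL. If the list is non-empty and the wrapper page has almost no content of its own, the title and meta tags seen by browsers and search engines come from the wrapper, not from your .html files.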