Crawling website with dynamic pages - dom

I need to crawl websites and extract some information from dynamically created pages after a form submission.
The information which I need to crawl would mostly come from databases on these sites.
Added:
Crawlers usually work by jumping from one hyperlink to another, so they mostly reach static pages. What about crawling pages that are not statically present but created on the fly?

From a crawler's point of view there's no big difference. You're still getting generated HTML.
The only thing you need to be careful about is links leading to an infinite number of pages, e.g. a calendar that's dynamically generated and has links to the next/previous month/year.
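One common way to guard against such infinite link chains is to cap both the link depth and the total number of pages. The following is only a minimal sketch of that idea, not a complete crawler: `crawl`, `LinkExtractor`, `max_depth`, `max_pages`, and the caller-supplied `fetch` function are all names made up for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=3, max_pages=100):
    """Breadth-first crawl bounded by link depth and total page count.

    `fetch` is a hypothetical caller-supplied function that takes a URL
    and returns that page's HTML as a string.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper than the limit
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != start_host:
                continue  # stay on the starting site
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

With a generated calendar whose every page links to the next and previous month, the crawl stops once either limit is reached instead of running forever.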

Related

How to add Dynamic pages on moodle?

Is there any way to add dynamic content in Moodle pages? I have used the Pages plugin to create pages, but I can't find a way to add dynamic content to them.
There are two ways to do this: a workaround, which allows you greater flexibility over the content displayed, and a plugin, which lets you insert a limited amount of HTML.
The workaround: My way around this would be to create two or three different pages and then use the 'restrict access' function to only allow certain people to see each page. This gives you the greatest flexibility over the content on the page, but you'd only be able to control access at a group level, e.g. only students with a certain grade, gender etc.
Find out more here: https://docs.moodle.org/36/en/Restrict_access_settings
HTML: The Generico plugin allows you to insert text strings into HTML on a page. It's pretty easy to use, but there's a limit to what you can do with it.
https://moodle.org/plugins/filter_generico

Should I use dynamic pages or actual files for blog?

I've seen news sites (CNN, Fox News, etc.) use HTML files as their post content. For my blog, I currently use dynamic pages (e.g. www.example.com/post/?id=3).
I'm wondering if this is the correct way to go, mostly because AdSense won't accept /post/ for ads. Is this because it's just pulling up /post/ & not the id?
So basically, which way do you recommend? Thanks
It depends on the content of your page. But basically, the good way is to create easy-to-read links like:
http://example.com/drive-to-norway
That's because such a link is easy for people to read, and before clicking the user knows what the page could be about (unlike, for example, http://example.com/id=3).
Some bigger sites do not use that convention because, for example, they sell a lot of similar items, and having named, unique links without any numbering isn't possible/easy for them. Like I wrote at the beginning, it depends on the content.
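Readable links like the one above are usually derived from the post title. A minimal sketch of one way to do that, assuming ASCII-only slugs are acceptable (`slugify` is a name chosen for this example, not something from the question):

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a post title into a lowercase, hyphen-separated URL slug."""
    # Split accented characters into base letter + accent, then drop non-ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of characters other than a-z and 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


print(slugify("Drive to Norway!"))  # drive-to-norway
```

The blog would still need to map each slug back to its post id on the server side and make sure slugs are unique.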

Does a what-links-here report for Gollum exist?

Is there any existing way to generate a what-links-here report for a gollum wiki? In other words, a list of the pages within the same wiki that link to the current page: a list of the local inbound links.
I wasn't able to spot any feature like this, nor find anything suitable in the API, but I may have missed it. Is there a third party add-on for it?
I do understand the reason it probably doesn't exist in the core: as these are plain text files, there isn't any table of links maintained anywhere. For the same reason, when a page is renamed it breaks all the inbound links to that page from other pages.
A function for this could use the API to read the generated source of each page (so that only html with normalized names needs to be parsed), producing a list of the local links from each page and the page they are on. Cache the results at page level until the next commit of that page.
This could be used to enhance the existing page rename feature as well. Has anybody already done this?
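The function described above could be sketched roughly as follows. This is only an illustration of the link-inversion step, not Gollum code: it assumes the rendered HTML of every page has already been obtained somehow and is passed in as a plain dict, and `build_backlinks` and `LocalLinkParser` are made-up names. Caching per page until the next commit is left out.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LocalLinkParser(HTMLParser):
    """Collects the targets of wiki-local links found in rendered HTML."""

    def __init__(self):
        super().__init__()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Skip external links, protocol-relative links, anchors and mail links.
        if "://" in href or href.startswith(("//", "#", "mailto:")):
            return
        # Drop fragment and query string, then surrounding slashes.
        target = href.split("#", 1)[0].split("?", 1)[0].strip("/")
        if target:
            self.targets.add(target)


def build_backlinks(rendered_pages):
    """Map each page name to the sorted list of pages that link to it.

    `rendered_pages` is assumed to be {page_name: rendered_html}.
    """
    backlinks = defaultdict(set)
    for name, html in rendered_pages.items():
        parser = LocalLinkParser()
        parser.feed(html)
        for target in parser.targets:
            if target != name:  # ignore self-links
                backlinks[target].add(name)
    return {target: sorted(sources) for target, sources in backlinks.items()}
```

Looking up the current page's name in the returned dict gives its what-links-here list. The same index would tell a rename feature which pages need their links updated.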

Cross-domain navigation within Blogger without Javascript

The setup: I have a Blogger blog set up on a domain name as blog.mydomain.com. The main site at mydomain.com is running Umbraco CMS.
The problem: I need to have the navigation from the CMS transported to Blogger somehow, so that making a change on the main website doesn't require the extra step of modifying the navigation inside Blogger.
Generating the navigation data on the CMS side in whatever format it needs to be (XML, unordered list, JSON, etc.) is not a problem. The problem is getting the data from Umbraco to Blogger after it is generated.
I'm not yet willing to use Javascript, as this would seriously impair the website for users browsing without Javascript. (Too bad because AJAX would be a very workable solution.)
I've tossed around the idea of using an iFrame. How would this work for a navigation system including sub-menus? Creating and deleting multiple iframes is out of the picture, since I don't want to use Javascript. I could use one large iframe to allow for the sub-menus, but then it would cover content at the top of the content area, rendering it unclickable.
I'm thinking about how you could do this, but while I do: in this day and age Javascript has become very common. Most users are going to have it, and those with it disabled really shouldn't be on the web. Is this the only reason you don't want to use Javascript? According to YDN, around 2% of users have JS disabled, and the figure is lower in other countries. As time goes on that 2% should get lower, so I don't see that as an issue. However, if you absolutely can't use Javascript, I'll keep thinking. I might have an idea, but I'll need to test it.
It's not possible to use an IFrame, because of the same-origin policy. The two sites are on different domains, so when a user clicks a menu item inside the IFrame, there is no way to call the parent window.
There are a few ways this can be done.
1) Javascript solution. Use JSON-RPC or another cross-domain call. Load the menu from your CMS and render it. Yes, this requires Javascript, but, seriously, show me a site that does not use Javascript.
2) Direct server communication.
Is it possible to perform an HTTP call from Blogger? If so, just perform an HTTP call to your CMS from Blogger, get the data and render it.
3) Mixed Flash/Javascript solution. Flash can perform an HTTP call regardless of the same-origin policy. Get the data with Flash, then use ExternalInterface to call a Javascript function to render it.
There is no other way to do it. I suggest you use the Javascript solution.
You could build an HTML skeleton of empty ULs in Blogger (the max that you might need) to hold your navigation contents, and then link to an Umbraco-generated external stylesheet.
This stylesheet could fill those LIs with CSS generated content using the :before and :after pseudo-elements, and hiding unused LIs with CSS display: none.
An example of this is at: http://jsfiddle.net/5bXja/1/
This works in IE8+ so depending on your clients, this may-or-may-not be more widely supported than Javascript. Likely not. ;-)
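To make the stylesheet idea more concrete, here is a rough sketch of the generation step. It is written in Python purely for illustration (Umbraco would produce the stylesheet with its own templating), and the element ids `nav-1`, `nav-2`, ... are an assumption about how the empty LIs in the Blogger skeleton might be named.

```python
def css_string(text: str) -> str:
    """Quote text for use as a CSS string (escapes backslashes and quotes)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def nav_stylesheet(labels, max_slots, id_prefix="nav"):
    """Build CSS that fills the first LIs with labels and hides the rest.

    Assumes the Blogger template contains `max_slots` empty LIs with the
    hypothetical ids nav-1 ... nav-N.
    """
    if len(labels) > max_slots:
        raise ValueError("more menu items than placeholder LIs")
    rules = []
    for slot in range(1, max_slots + 1):
        selector = f"#{id_prefix}-{slot}"
        if slot <= len(labels):
            label = css_string(labels[slot - 1])
            rules.append(f"{selector}:before {{ content: {label}; }}")
        else:
            rules.append(f"{selector} {{ display: none; }}")
    return "\n".join(rules)


print(nav_stylesheet(["Home", "About"], max_slots=3))
```

Ids are used instead of `:nth-child` selectors here because `:nth-child` is not supported in IE8. Note that generated content only supplies the visible text; the link targets themselves still have to exist in the Blogger markup.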

joomla multiple site content distribution

I'm just starting to evaluate the Joomla CMS as a tool to build out my personal site. I'd like to manage multiple sites/domains with one copy of Joomla on one host, so I'll own mysite.com and myothersite.com, which will both point to the same host/Joomla code. If I do this, I need to be able to set which domain/site the content I add shows up on. Some content will be on both sites; other content will be on only one. It would be ideal to have some kind of filtering mechanism so I don't have to manually set where the content goes.
Specifically, I'd like to set tags on the content and have each site specify which tagged content to show.
My last requirement is that I be able to have different pages on each site.
Is this possible or am I asking too much from a "free" CMS?
Thanks all
I don't know if there's a component that achieves what you're describing here. I use a multi-language component on some of my sites that shows translations, but it doesn't "suppress" articles that don't have a translation: it just says "No translations to this article". I know you're not asking for translation methods, but I think the Joomfish way of selecting content based on a chosen language would be what you want, only based on domains instead of languages.
The only component I know of that would be able to suppress articles based on predefined parameters (in its case the language) is Joomfish's "Table Localization Plugin", but you need to be a Joomfish silver member, paying $60 to Joomfish's developers.
You could write a component (see here for plugin documentation) that, by analyzing the domain, would suppress articles that shouldn't appear on that specific domain. But I think it's going to be a lot of work. You would learn a lot about Joomla's architecture, though.
How Joomla displays its content (output) is controlled entirely by parameters. So if you can control which parameters are loaded, you can create multiple displays per host.
However, that may be overkill in this case. You can just easily hack your template. Just make it load a different menu for siteA and siteB. (The host is set in $_SERVER['HTTP_HOST'])
The menu on siteA could have a tagging component item, set to display articles tagged siteA.com. The siteB will have the same for its domain.
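The logic behind this host-based approach can be sketched as follows. This is not Joomla code (a real template hack would be PHP reading `$_SERVER['HTTP_HOST']`); it is a small Python illustration with made-up names (`site_tag_for_host`, `articles_for_host`) and a made-up article structure, showing how the request host selects a tag and the tag selects the articles.

```python
def site_tag_for_host(host_header, known_sites, default_site):
    """Normalize a Host header value and map it to a site tag."""
    host = host_header.split(":", 1)[0].strip().lower()  # drop any port
    if host.startswith("www."):
        host = host[4:]
    return host if host in known_sites else default_site


def articles_for_host(articles, host_header, known_sites, default_site):
    """Return the titles of the articles tagged for the requesting site.

    `articles` is assumed to be a list of dicts with "title" and "tags" keys.
    """
    tag = site_tag_for_host(host_header, known_sites, default_site)
    return [article["title"] for article in articles if tag in article["tags"]]


articles = [
    {"title": "Only on A", "tags": {"sitea.com"}},
    {"title": "Only on B", "tags": {"siteb.com"}},
    {"title": "On both", "tags": {"sitea.com", "siteb.com"}},
]
sites = {"sitea.com", "siteb.com"}
print(articles_for_host(articles, "www.siteB.com", sites, "sitea.com"))
```

An article tagged for both domains shows up on both sites, which matches the requirement in the question that some content appears on one site and some on both.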
While there are extensions that will do what you describe (http://extensions.joomla.org/extensions/core-enhancements/multiple-sites), Joomla is really designed for one site at a time. I've done setups where I use the same codebase for Joomla and manage it with version control, but I always end up launching multiple sites with individual databases.
However, I don't know of any CMS that inherently allows you to share articles across instances while keeping the data centralized. You may be looking at an extension (or your own customization) regardless of which platform you pick.
We had a similar problem with needing to share content across multiple Joomla! sites so we developed this extension: http://extensions.joomla.org/extension/simple-sharing
It is not very robust in terms of what it can share but it does let you share Articles across multiple sites and choose which sites and categories those articles get published into. I hope it works for you.
Thanks!