I've done several sites with TYPO3 indexed_search, but I still feel I don't understand the nature of the relation between indexed_search and crawler. For instance, according to some authors, to index tt_news I just need a generic crawler configuration and an indexed_search configuration for tt_news; according to other tutorials I should create a dedicated crawler configuration for tt_news.
It is not clear to me what the relation between crawler and indexed_search is. How do they match up? Shouldn't a root crawler configuration be sufficient, one that simply runs any indexed_search configuration it finds? Or do the URLs need to be generated by both? I've managed to create an index with just one root crawler configuration, but I run the indexing through my own shell script that calls cli_dispatch.phpsh.
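For reference, my script is essentially just the two standard crawler CLI calls, something like this (the root page uid, depth and configuration key are of course specific to my site):

  #!/bin/bash
  # 1) fill the crawler queue, starting at page 1, depth 99, using my crawler configuration key
  php typo3/cli_dispatch.phpsh crawler_im 1 -d 99 -conf myIndexConf -o queue
  # 2) process the queue (run repeatedly / from cron until the queue is empty)
  php typo3/cli_dispatch.phpsh crawler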
Are indexed_search and crawler redundant in terms of functionality (generation of URLs)?
Any clues are welcome.
Best,
B.
Indexed_search can work without a crawler by indexing pages that are visited by visitors. The obvious disadvantage is that pages that aren't visited won't be indexed and thus won't show up in search results. If you have several frontend user groups configured then the chances that a page is visited are even lower.
The crawler can solve this by visiting each page. Furthermore it can visit pages as if it were member of (a combination of) FE user groups. This way it can help build the index of an entire website for all kinds of users.
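For tt_news the tutorials usually mean a crawler configuration along these lines, set in Page TSconfig (the page uids and the configuration key are placeholders, and details differ between crawler versions); the procInstrFilter line is what hands the generated URLs over to indexed_search:

  tx_crawler.crawlerCfg.paramSets {
    # build one URL per tt_news record stored below page 123
    ttnews = &tx_ttnews[tt_news]=[_TABLE:tt_news;_PID:123]
    ttnews {
      # only append these parameters on the news single-view page (uid 45)
      pidsOnly = 45
      # hand each generated URL to indexed_search for (re)indexing
      procInstrFilter = tx_indexedsearch_reindex
    }
  }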
Most of the details are explained in a tutorial by Xavier Perseguers. It's written for an older version but I guess that most of it is still valid.
(It's been a while since I last used indexed_search, but at that time the tutorial helped a lot).
We got a notice from Google's Search Console that one of our blog posts couldn't be crawled. When inspecting the URL from Google Search Console, it reports that the page was blocked by the following line in our robots.txt file.
Disallow: /*?s=
I'm also asking why "Disallow: /*?s=" would be used. Why worry about paths that contain the letter "s"? If we remove it, what's the risk? Thanks so much in advance for any additional insight that can be shared - P
That rule targets WordPress's built-in search: on WordPress-based sites, internal search result pages use the query string ?s=term, so this pattern is commonly used to keep crawlers out of those results.
There may be several types of content on your site, and the site builder may have wanted to allow searching only certain types of content through a different search mechanism.
It makes sense, for example, on a store site that wants users to search for products only through a customized search form, so that they do not wander behind the scenes of the site.
Google's robot has a number of ways to identify whether a site is WordPress-based, which is probably why it tries that kind of URL.
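For example, the rule blocks internal search result URLs while leaving normal pages alone:

  # robots.txt
  User-agent: *
  # matches e.g. https://example.com/?s=shoes or https://example.com/blog/?s=foo
  Disallow: /*?s=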
I have a problem whose solution is certainly very simple, but it does not come to my mind at the moment :/
I have a multi-domain TYPO3 (6.1) installation, and on one of the websites I need to temporarily show only one subpage; I will be working on/updating the rest of the pages, so I cannot delete them. It is important that anyone who enters a URL directly or comes to the site from Google search results does not open those pages, but is redirected to this temporary one instead.
I've tried the mount points but something does not work ...
Please help.
You can exchange the domain records.
Make a new page on its own (independent of the configuration of the domain it should replace), so it is a root page. Give it a domain record and disable the domain record of the page tree it should replace.
Remember to change the rootpage_id configuration in RealURL.
You may also need a special 404-handling configuration for this domain, as most requests will result in a 404 (or better, a 503).
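As a rough sketch, assuming the temporary root page has uid 456 and RealURL is configured per domain (on 6.x this would go into typo3conf/AdditionalConfiguration.php or wherever you keep your RealURL configuration):

  // point RealURL at the new temporary root page for this domain
  $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['www.example.com'] = array(
      'pagePath' => array(
          'rootpage_id' => 456,
      ),
  );
  // send requests for the old pages back to the temporary page instead of showing an error
  $GLOBALS['TYPO3_CONF_VARS']['FE']['pageNotFound_handling'] = 'REDIRECT:/';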
And hurry up and update your system: TYPO3 6.1 has been out of service for a long time.
For building an extensive list of links (the source site is a thematic portal), I am looking for a suitable extension that also runs under TYPO3 7.6 LTS.
It would be nice if the link list supported categories, including assigning multiple categories to a single link. Furthermore, the links should not be described only by the destination address and an alias; it should also be possible to add a short description of the target page (possibly with a photo).
Additional functions such as letting users suggest links, reporting broken links, or even user voting would be nice extra features.
There used to be the Modern Linklist, but it is no longer being developed for TYPO3 6.x and later.
Is there perhaps an alternative somewhere, or how could one realize this with existing solutions? It would of course be nice to manage without any programming knowledge, since I'm not a programmer.
P.S.: This is not about building a spam list, but about high-quality links on topics related to the original site.
As this seems to be a straightforward use case, you could try to build that extension yourself with the Extension Builder.
Just model the records necessary for your data and let the EB generate all the useful actions: list & show; even create, edit and delete in the FE would be possible.
Afterwards you just need to edit the generated Fluid templates.
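The generated list template then mostly boils down to a loop over your link records; assuming the model has url, title, description and image fields, something like:

  <f:for each="{links}" as="link">
    <div class="linklist-item">
      <f:link.external uri="{link.url}">{link.title}</f:link.external>
      <p>{link.description}</p>
      <f:if condition="{link.image}">
        <f:image image="{link.image}" width="100" />
      </f:if>
    </div>
  </f:for>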
These links may help:
Overview
EB manual
Small remark: if you want the newest code state, use the EB from Git instead of TER.
I'm not aware of an existing extension for this, but it could be a good project to learn Extbase/Fluid.
You should also take a look at
typo3/sysext/fluid_styled_content/Resources/Private/Partials/Menu
and
typo3/sysext/fluid_styled_content/Classes/ViewHelpers/Menu
Fluid Styled Content contains everything you need to create a list like that; you "just" have to combine the necessary bits and pieces.
You can do a lot with TYPO3 core functionality: there is a page type "external URL", pages can have categories by default, there are plenty of menu options (TypoScript HMENU, menu content elements, Fluid menu Viewhelpers). The Linkvalidator can periodically check all links and report broken links.
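For example, if you collect the "External URL" pages under one parent page, a plain TypoScript menu over that folder already renders the list, with the page abstract as the link description (uid 123 is a placeholder, untested sketch):

  lib.linklist = HMENU
  lib.linklist {
    special = directory
    # parent page that holds the "External URL" pages
    special.value = 123
    1 = TMENU
    1 {
      wrap = <ul class="linklist">|</ul>
      NO {
        wrapItemAndSub = <li>|</li>
        # print the page abstract below each link
        after.field = abstract
        after.wrap = <p>|</p>
      }
    }
  }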
For suggestions you could add a form. Powermail for example can also store submitted info in database records, so your visitors could prepare page records (they are hidden until you make them visible).
I am trying to understand what Google CSE (Custom Search Engine) is doing. I use the free version and submit a sitemap.php.
Google CSE takes this and indexes 200 (out of 2500 pages). I did this some time ago and am starting to wonder if it will ever index the rest.
If I look in Google Webmaster Tools, dashboard for the site in question it says 200 pages are indexed.
If I look in Google Webmaster Tools, Index Status it tells me that 0 pages are indexed. That looks incorrect to me. 200 is what I guess is correct at the moment, but I really do not know.
I suspect that the difference is due to Google already knowing about the website. However, the sitemap.php points to pages it cannot find without this file.
I am starting to wonder if this will work at all. Google CSE has previously sometimes returned 0 and sometimes a lot of hits. I have not been able to understand what is going on and that is why I am adding this sitemap. The sitemap presents the pages in question in a new way that I think is better for Google. (The same pages are also in a different form on http://zotero.org/.)
Any suggestion for what I can do to get this search working? (I am considering using OpenSearchEngine, but I do not have a web host available at the moment where I can run Java. And this is a free project, done in my spare time, so I do not have a lot of economic resources for this. Maybe I can get Apache Lucy to work, but I am unsure. I tried to compile it under Cygwin, but it failed due to a problem with the gcc-4 link which is fixed in Perl 5.18, but Cygwin only has 5.14. My web host of course runs Linux, but it looks a bit early for Lucy. Maybe I am wrong?)
Every free Custom Search Engine is assigned a quota of 200 pages for immediate indexing:
https://support.google.com/customsearch/answer/115958?hl=en
But I think on-demand indexing may not be what you want; you simply want your 2,500 URLs to be searchable by CSE (not crawled as soon as possible). And this could be the problem: "If I look in Google Webmaster Tools, Index Status it tells me that 0 pages are indexed".
If your site is not indexed by Google, i.e. it doesn't appear in www.google.com results, then you probably can't use CSE (yet). You can see how many pages you have indexed using the site: operator - https://www.google.com/webhp#q=site%3Azotero.org (and in Google Webmaster Tools, Index Status, as you said).
I think you should submit the sitemap in Webmaster Tools and make sure your site is easy to crawl (pages load fine, they are interlinked, navigation is "hard coded" in plain HTML and not generated by JavaScript, or you provide AJAX HTML snapshots, etc.) and that there are no technical issues (like an invalid robots.txt file and similar). When you see your 2,500 pages in a site:your-domain.com search on www.google.com, they will automatically appear in your CSE, too.
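For the sitemap itself, any output that follows the standard sitemaps.org protocol is fine, so your sitemap.php should emit something along these lines (URLs are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/items/item-1.html</loc>
      <lastmod>2013-11-01</lastmod>
    </url>
    <!-- one <url> entry per page, up to 50,000 per sitemap file -->
  </urlset>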
I have a big TYPO3 instance that has existed for ages. The site has always used RealURL, but now we want to migrate to CoolURI because we have better experience with it. The problem is that all old links must remain available even after switching the URL extension.
The CoolURI documentation states
Migrating from RealURL
The field Speaking URL path segment (tx_realurl_pathsegment) is kept with its values, but make sure it's listed in the element.
I have the tables tx_realurl_pathcache and tx_realurl_uniqalias, besides some other tables like redirects, etc. But I don't really understand the function of and differences between these two tables and can't find any in-depth documentation on them. So I'm a bit afraid I'll have to reverse engineer the whole extension and then write a script which exports all the old URLs and imports them into the new CoolURI tables, because we also use tt_news and those URLs have to keep working, too.
So does anyone have experience with this? Does CoolURI automatically handle everything so that the old links stay valid, and if not, could someone give me a detailed explanation of all the RealURL tables in the database?
I wouldn't migrate if there is no really important reason (like, e.g., a missing feature). To make sure that you'll be able to generate all links properly and then map them into CoolURI, you'll need to learn the RealURL logic anyway.
Reason: RealURL generates links on the fly when required and then caches them in its tables; some tables store links for regular pages, others for extensions. You would need to write a custom extension which visits each page to make sure that RealURL has cached every possible link, and then rewrite all results into, for example, a list of redirects. IMHO it's not worth the time.
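If you do go down that road anyway, the cached data is at least easy to inspect; roughly (column names are from a RealURL 1.x schema and may differ in your version):

  -- speaking path segments cached for normal pages
  SELECT page_id, language_id, pagepath FROM tx_realurl_pathcache WHERE expire = 0;
  -- aliases generated for extension records, e.g. tt_news titles
  SELECT tablename, value_id, value_alias FROM tx_realurl_uniqalias WHERE expire = 0;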
Note, I don't want to say that CoolURI is bad :) I actually don't know it. I just want to recall Voltaire's famous words: "the better is the enemy of the good".