Nutch crawler: Configure to accept only pages in English

How can I configure the Nutch crawler to crawl only English pages?
This is what I set in nutch-site.xml, but it does not work:
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.
</description>
</property>

The value you set, <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>, means that English is preferred but any other language (*) is still accepted with a lower weight (q=0.3). To request only English pages, set the value as follows:
<value>en-us,en-gb,en</value>
To make sure the setting takes effect, change the value in nutch-default.xml as well.
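Put together, the property in nutch-site.xml might look like the sketch below. It only combines the property name from the question with the English-only value suggested above, and the description is shortened. Keep in mind that Accept-Language is a request header, so a server is free to ignore it.

```xml
<!-- nutch-site.xml: sketch combining the property from the question
     with the English-only value suggested in the answer -->
<property>
  <name>http.accept.language</name>
  <value>en-us,en-gb,en</value>
  <description>Value of the "Accept-Language" request header field.</description>
</property>
```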
Hope this helps
-Le Quoc Do

Related

Can exams2moodle export additional metainfo such as idnumber and tags?

When I export the xml file of a multiple choice question, it contains the following lines:
<idnumber>arbitrary_id_set_by_user</idnumber>
<answernumbering>ABCD</answernumbering>
<tag></tag>
Is there a way to add idnumber, answernumbering and tag to the metainformation section of the question so that r-exams can export to moodle XML as <idnumber>idnumber</idnumber>,<answernumbering>ABCD</answernumbering>, <tag>tag1</tag>, and <tag>tag2</tag> etc?
The <answernumbering> tag can be set in exams2moodle() via the answernumbering= argument, see ?exams2moodle. The reason for this is that this is set in the same way for all exercises in a quiz. This is more consistent than setting it individually and potentially inconsistently in the meta-information of the different exercises.
The <idnumber> tag appears to be used by Moodle only for internal purposes. It is also not mentioned in the official Moodle XML documentation at https://docs.moodle.org/311/en/Moodle_XML_format. Hence we did not implement it in exams2moodle().
The <tag> is currently not supported in exams2moodle() because we felt that it would be more important to have tags in the Rmd (or Rnw) exercise itself and not the Moodle version of the exercise. For structuring the content on the Moodle side the exsection meta-information can be used, see boxhist for a worked example.
Finally, you can add arbitrary metainformation by using the exextra tag. This is used, for example, in the essayreg exercise template. However, there is no general way of using this extra metainformation to insert additional XML code in the exams2moodle() output. To do that, the source code underlying exams2moodle() would have to be adapted correspondingly.

Polylang: How does the linking of right page by changing the attribute "href" work?

I have a question about linking the right page with Polylang.
I have a hard-coded anchor, which is basically a “back home” link.
It looks like this:
<a href="magazin" class="article-type-inner"><?php pll_e('Close'); ?></a>
I have already implemented a string and it works fine in the posts in both languages, but how can I change the “Href” to the right language?
For example, my default language is English and the other language is French. If I am on a French post, the link takes me back to the English page. Is there any solution?
Thank you.
Ideally you should not hardcode any strings or URLs. Here are some options:
Use pll_home_url() if your intention is to link to the home page. pll_home_url() accepts an optional parameter $slug (the 2-letter language code) to switch between languages if needed.
Use get_permalink(), the_permalink() or get_the_permalink(). You can pass the page ID as the first argument, e.g. the_permalink(100). Make sure the posts, pages and custom post types are linked to their translations properly.
Worst case scenario: use if/else in combination with pll_current_language(). Not recommended.

Sulu CMS 1.6: Is there a possibility to limit the number of images in media_selection?

Is there a possibility to limit the number of selectable images for the media_selection content type? According to the documentation there is none, but maybe there is still a way?
The reason is that I want to allow adding an image to a text, but only one.
Maybe:
<property name="image" type="media_selection">
<param name="maxSelectionAmount" value="1"/>
</property>
There is nothing like that at the moment. What we have implemented in the alphas of 2.0 is a separate single_media_selection content type. This works well for limiting the assigned media to one, but it still does not allow restricting the selection to an arbitrary number.
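In a Sulu 2.0 template, the single-image case described above might be declared roughly as in the sketch below. The property name image comes from the question; the title "Image" is a hypothetical label added for illustration, and the exact template XML may differ between the 2.0 alphas.

```xml
<!-- Sketch for a Sulu 2.0 template; "image" is the property name from the question,
     the title text is a hypothetical label -->
<property name="image" type="single_media_selection">
    <meta>
        <title lang="en">Image</title>
    </meta>
</property>
```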

Rendering telephone links in HTL based on input from a Rich Text widget

I have a component using the Rich Text Edit widget (xtype="richtext") in my project that's used across the entire site as the default text component.
The users would like to be able to insert phone links using the tel URI scheme into the text entered using this component.
The dialog allows them to do so but when the contents of the Rich Text Edit are rendered in Sightly/HTL later on, the html context is used:
${text @ context='html'}
Once this is done, the value of my attribute is ignored.
The HTML stored in the repository is:
Call us!
And what's actually rendered on the page on the author instance is:
<a>Call us!</a>
On the publish instance, the tag gets removed altogether because of the link checker.
Changing the context to unsafe causes the href to render but it's not a solution I'm willing to accept. The component is used in a lot of places and I want to be sure the XSS protection is sufficient.
Is there a way I can affect the way the html context in HTL treats telephone links?
I tried adding an extra regular expression to the overlay of apps/cq/xssprotection/config.xml:
<regexp name="onsiteURL" value="([\p{L}\p{N}\\\.\#@\$%\+&amp;;\-_~,\?=/!]+|\#(\w)+)"/>
<regexp name="offsiteURL" value="(\s)*((ht|f)tp(s?)://|mailto:)[\p{L}\p{N}]+[\p{L}\p{N}\p{Zs}\.\#@\$%\+&amp;;:\-_~,\?=/!]*(\s)*"/>
<regexp name="telephoneLink" value="tel:\+?[0-9]+"/>
and further on:
<attribute name="href">
<regexp-list>
<regexp name="onsiteURL"/>
<regexp name="offsiteURL"/>
<regexp name="telephoneLink"/>
</regexp-list>
<!-- Skipped for brevity -->
</attribute>
but that doesn't seem to affect the way Sightly/HTL escapes strings in the html context.
I've also tried overlaying the Sling xss rules located in /libs/sling/xss/config.xml but had no luck either.
How can it be done?
There are two xss protection config files:
/libs/cq/xssprotection/config.xml
/libs/sling/xss/config.xml
Sightly uses the second one, which means that you need to overlay it at the path /apps/sling/xss/config.xml
It is worth mentioning that the new configuration seems to be applied only after a restart of your AEM instance.
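A minimal sketch of the relevant parts of such an overlay, assuming /libs/sling/xss/config.xml has been copied to /apps/sling/xss/config.xml and that it follows the same AntiSamy layout as the file shown in the question. The telephoneLink name and pattern are the ones from the question; the section names in the comments are an assumption about that layout, and everything else in the copied file stays unchanged.

```xml
<!-- /apps/sling/xss/config.xml (overlay of /libs/sling/xss/config.xml), excerpt only -->

<!-- 1. Assumed location: the common-regexps section, next to onsiteURL and offsiteURL -->
<regexp name="telephoneLink" value="tel:\+?[0-9]+"/>

<!-- 2. Assumed location: the common-attributes section; other children of href stay as copied -->
<attribute name="href">
    <regexp-list>
        <regexp name="onsiteURL"/>
        <regexp name="offsiteURL"/>
        <regexp name="telephoneLink"/>
    </regexp-list>
</attribute>
```

As noted above, the instance may need a restart before the overlay is picked up.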

Finding a typoscript code on big pages

There's a meta tag with charset ISO and I need to change it to utf8. I browsed through templates and configs, and cannot find the meta tag where it is set. Also I tried to override it with:
config.metaCharset = utf-8
and other config settings on that page without any success.
What's the best way to find a section of Typoscript I'm searching for?
The encoding could also be specified through:
config.additionalHeaders = Content-Type:text/xml;charset=utf-8
config.renderCharset = utf-8
Use TypoScript Object Browser to inspect the elements on the page in question.
I did not find it, but it was possible to convert the files with broken characters to the charset given in this meta tag (ISO-8859-1), so the characters are displayed correctly now without any changes to the config or charsets.