Autocomplete via shingles and the terms component

One way to get Google-like auto-completion in Solr 1.4 is to combine shingles with the terms component.
First we generate all the n-grams at index time with the ShingleFilter, and then use the terms component to return the completions closest to the user's typed sequence (ranked by document frequency).
Schema:
<fieldType name="shingle_text_fivegram" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Solr config:
<searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <str name="terms.fl">shingleContent_fivegram</str>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>
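For reference, a completion request against this handler would then look something like the following (host, port and the typed prefix are illustrative; terms.prefix matches the user's input so far and terms.limit caps the number of suggestions):
http://localhost:8983/solr/terms?terms.prefix=india+an&terms.limit=10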
With the above setup I need to drop stopwords at the edges of the n-grams while keeping them inside the n-gram sequence.
For example, from the sequence "india and china" I need the following shingles:
india
china
india and china
and skip the rest.
Is this doable in combination with other Solr components/filters?
UPD: here is one possible solution in Lucene 4 (it should be possible to wire it into Solr):
"Couldn't you make a custom stop filter that only removed stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (capture/restoreState) but it seems doable?" -- Michael McCandless
from: http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html
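For what it's worth, later Lucene/Solr releases ship a SuggestStopFilter along these lines, exposed in Solr as solr.SuggestStopFilterFactory: it behaves like a normal stop filter except that it keeps a trailing stop word the user may still be typing. A sketch of how it could sit in a query-time analyzer, assuming a Solr version that bundles it:
<filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
It operates on the plain token stream at query time rather than on index-time shingles, though, so it only covers part of what is asked here.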

The best way to do multi-word auto-complete in Solr 1.4 is with EdgeNGramFilterFactory, since you need to match the user's input as he/she types it. So you need to match "i", "in", "ind" and so on in order to suggest "india".

Use a separate query-time analyzer with KeywordTokenizerFactory, like this (using your example):
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
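Note that the index-time chain above does not yet include the EdgeNGramFilterFactory mentioned at the start. If you go that route, it would sit after the shingle filter, roughly like this (a sketch only; the minGramSize/maxGramSize values are illustrative, and side="front" is the pre-4.4 syntax appropriate to Solr 1.4):
<analyzer type="index">
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
This indexes every front prefix of every shingle, so the single keyword token produced by the query-time analyzer can match partially typed input.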

Related

EdgeNGramFilterFactory change in solr5

Short version:
Does anyone know whether something happened to EdgeNGramFilterFactory in Solr 5? It used to work fine on Solr 4, but I just upgraded to Solr 5 and the cores that have fields using this filter refuse to load ...
Long story:
This configuration used to work in Solr 4.10 (schema.xml):
<field name="NAME" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="PP" type="text_prefix" indexed="true" stored="false" required="false" multiValued="false"/>
<copyField source="NAME" dest="PP">
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
And the documentation says I did it right (with no clear mention of whether it applies to Solr 4 or Solr 5).
However, when I try to add a collection using this configuration, it fails with the following message:
<lst name="failure">
<str>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'test_collection': Unable to create core [test_collection] Caused by: Unknown parameters: {side=front}</str>
</lst>
I removed the side=front "unknown" parameter, started from scratch and it worked, meaning no more errors.
So, while it used to work in Solr 4 without any additional change, in Solr 5 it no longer works. Did something change? Did I miss some documentation regarding this filter? Is there any extra library I need to load to make this work?
And finally, if the above is intended behaviour (bug/feature/whatever), is there any workaround to get this "side-substring" indexing functionality without having to generate the values myself when adding documents to Solr?
Update: with the "hacked" schema (i.e. without side=front), I indexed the documents and changed the PP field to be stored. When I searched, it looks like it indexed the entire value; for example, for NAME:ELEPHANT, I found PP:ELEPHANT ...
The side attribute was removed as part of LUCENE-3907, in version 4.4. The filter now always behaves as if you had specified side="front", so you can simply remove that attribute and you are fine, since you were using the "front" behaviour anyway.
As you can read in the discussion on the linked Lucene issue:
If you need reverse n-grams, you could always add a filter to do that
afterwards. There is no need to have this as separate logic in this
filter. We should split logic and keep filters as simple as possible.
And that is what was done: the side attribute was removed from the filter.
This was done in Lucene, not directly in Solr. Since Lucene is a Java API, the change is mentioned in the filter's Javadoc:
As of Lucene 4.4, this filter does not support
EdgeNGramTokenFilter.Side.BACK (you can use ReverseStringFilter
up-front and afterward to get the same behavior), handles
supplementary characters correctly and does not update offsets
anymore.
This may be why you do not find a word about it in the Solr documentation, but the change is also mentioned in Lucene's change log.
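If you ever do need the old side="back" behaviour, the workaround hinted at in the Javadoc would look roughly like this (a sketch, not tested against your schema): reverse the token, take front n-grams, then reverse again:
<analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.ReverseStringFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  <filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
Regarding the update in the question: a stored field always returns the original input value, and analysis (including the edge n-grams) only affects the indexed terms. Seeing PP:ELEPHANT in the stored output is therefore expected; the prefixes exist only in the index, visible for example via faceting or the terms component.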

Remove email addresses from Solr indexing

When Solr builds the index, it picks up parts of email addresses.
For example, if I have an email address like foo@bar.com, Solr indexes the words "foo" and "barcom".
I want to remove these words but I don't know how to do this. I tried modifying the configuration file schema.xml, adding this rule to my indexed field:
<filter class="solr.PatternReplaceFilterFactory" pattern=" (.*)@(.*) " replacement=" " replace="all"/>
However, it doesn't work.
You can detect tokens that are email addresses and blacklist them using:
<fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="false"/>
  </analyzer>
</fieldType>
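The types file lists the token types to blacklist (with useWhitelist="false", which is also the default, the listed types are removed). UAX29URLEmailTokenizer tags email tokens with the type <EMAIL>, so email_type.txt would contain a single line:
<EMAIL>
URLs could be dropped the same way by adding <URL> to the file.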

ActiveMQ splitter / aggregator using JMS transport

I have a problem with the ActiveMQ *aggregator* and would be very thankful if someone could help me out. I am marshalling the messages into XML.
So I have my route configured like this:
<route id="myRoute">
<from uri="timer:someScheduler?period=5000" />
<bean ref="someBean" method="someMethod" />
<marshal>
<jaxb contextPath="some package" />
</marshal>
<split streaming="true">
<tokenize token="#id" group="1000" />
<to uri="activemq:topic:some_topic" />
</split>
</route>
This works and splits my XML into messages of 1,000 rows each, but I don't know how to configure the aggregator so that all the messages are put back together before their processing continues.
This is what I have so far (it doesn't work):
<route id="myRoute">
<from uri="activemq:topic:some_Topic" />
<aggregate completionSize="5">
<correlationExpression>
<constant>true</constant>
</correlationExpression>
<to uri="mock:aggregated"/>
</aggregate>
<unmarshal>
<jaxb contextPath="some_package" />
</unmarshal>
<bean ref="someBean" method="someMethod" />
</route>
Thanks in advance!
What you need to do is provide the aggregator with an implementation of an AggregationStrategy: a class that tells the pattern how to merge two exchanges that match the correlationExpression. See the Camel Aggregator documentation for an example of how to do this.
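A minimal sketch of what that could look like in the Spring XML DSL (the bean id, the class name com.example.MyAggregationStrategy and the completion settings are illustrative, not something Camel ships):
<!-- the strategy bean must implement AggregationStrategy, i.e. Exchange aggregate(Exchange oldExchange, Exchange newExchange);
     in Camel 2.x the interface lives in org.apache.camel.processor.aggregate -->
<bean id="myAggregationStrategy" class="com.example.MyAggregationStrategy"/>

<route id="myAggregateRoute">
  <from uri="activemq:topic:some_topic" />
  <aggregate strategyRef="myAggregationStrategy" completionSize="5" completionTimeout="5000">
    <correlationExpression>
      <constant>true</constant>
    </correlationExpression>
    <to uri="mock:aggregated"/>
  </aggregate>
</route>
Camel also provides ready-made strategies such as GroupedExchangeAggregationStrategy if you just want to collect the split messages into a list before processing them further.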

Weird behaviour in Solr when using a custom indexing plugin on a multivalued field

I'm using my custom plugin to index a bunch of XML documents in Solr.
What this plugin does is "tag" documents and add those tags (comma-separated) to a multivalued field.
This is what I have in my log:
...
[MULTIVALUE CAR TYPE - final result] -> 4 Dr. Wagon with Wagon, 4X4,
...
This is what I actually get from the Solr instance when faceting:
<lst name="car_type_multivalue">
<int name="convertible">331</int>
<int name="4">152</int>
<int name="x">152</int>
<int name="wagon">121</int>
This is how the field is defined:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
</analyzer>
</fieldType>
As you can see, 4x4 is added correctly to the document's tags, but when faceting it is actually split into "4" and "x". My field type doesn't seem to be set up to allow something like this, so the question is: why is Solr behaving this way? All the other values work correctly, but not "4x4". Can I presume that every time there is an "x" in my tags it is going to be split, no matter what? Thanks all!
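A likely explanation, assuming the facet field uses the "text" type above: WordDelimiterFilterFactory splits tokens on letter/digit transitions, so "4X4" is analysed into the parts "4", "x" and "4" (plus catenated variants), and faceting on an analysed field counts those indexed terms rather than the original tag. If the tags should facet as literal values, the usual approach is to facet on an untokenized copy of the field, e.g. (a sketch; the tag_facet and car_type_facet names are made up):
<fieldType name="tag_facet" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="car_type_facet" type="tag_facet" indexed="true" stored="false" multiValued="true"/>
<copyField source="car_type_multivalue" dest="car_type_facet"/>
Alternatively, adding "4x4" to the protected words of the WordDelimiterFilterFactory (it accepts a protected="protwords.txt" attribute) keeps that token intact while leaving the rest of the analysis unchanged.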

Word 2007, Open XML - embedding an image

Do you know what basic information a paragraph (<w:p/>) in document.xml inside a *.docx document MUST include in order to specify an image? I know there must be:
<a:blip r:embed="rId4" />
specifying the relationship ID, but what else?
It's very hard to find this on Google, and experimenting by cutting tags out of an existing document, or reading the specification, takes a lot of time.
An example with all the required tags would be greatly appreciated.
Word is rather picky about the input XML it is given. To embed an image, you have to provide quite a bit of information. Here is a simple example:
document.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:drawing>
          <wp:inline distT="0" distB="0" distL="0" distR="0">
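            <!-- wp:extent is given in EMUs (914400 EMUs = 1 inch), so cx="5943600" is 6.5 inches wide -->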
            <wp:extent cx="5943600" cy="3717290"/>
            <wp:docPr id="1" name="Picture 0" descr="vlcsnap-325726.png"/>
            <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
              <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                  <pic:nvPicPr>
                    <pic:cNvPr id="0" name="myImage.png"/>
                    <pic:cNvPicPr/>
                  </pic:nvPicPr>
                  <pic:blipFill>
                    <a:blip r:embed="rId4"/>
                    <a:stretch>
                      <a:fillRect/>
                    </a:stretch>
                  </pic:blipFill>
                  <pic:spPr>
                    <a:xfrm>
                      <a:off x="0" y="0"/>
                      <a:ext cx="5943600" cy="3717290"/>
                    </a:xfrm>
                    <a:prstGeom prst="rect">
                      <a:avLst/>
                    </a:prstGeom>
                  </pic:spPr>
                </pic:pic>
              </a:graphicData>
            </a:graphic>
          </wp:inline>
        </w:drawing>
      </w:r>
    </w:p>
  </w:body>
</w:document>
document.xml.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<!-- other relationships go here -->
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
</Relationships>
And of course the image itself must be added to the package at the correct location (media/image1.png).
Since all of this is rather complicated, I would recommend using the Open XML SDK 2.0 provided by Microsoft, or another library such as OpenXML4J. These libraries, especially the one from Microsoft, can make your work a lot easier.