EdgeNGramFilterFactory change in solr5

Short version:
Does anyone know if something happened to EdgeNGramFilterFactory in Solr 5? It used to work fine on Solr 4, but I just upgraded to Solr 5 and the cores with fields that use this filter refuse to load ...
Long story:
This configuration used to work in solr4.10 (schema.xml):
<field name="NAME" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="PP" type="text_prefix" indexed="true" stored="false" required="false" multiValued="false"/>
<copyField source="NAME" dest="PP"/>
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
And the documentation says I did it right (no clear mention if it is for solr4 or solr5).
However, when I am trying to add a collection using this configuration, it fails with the following message:
<lst name="failure">
<str>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'test_collection': Unable to create core [test_collection] Caused by: Unknown parameters: {side=front}</str>
</lst>
I removed the "unknown" side=front parameter, started from scratch, and it worked, meaning no more errors.
So, while this used to work on Solr 4 without any additional change, it no longer works on Solr 5. Did something change? Did I miss any documentation regarding this filter? Is there an extra library I need to load to make this work?
And finally, if this is the intended behaviour (bug/feature/whatever), is there any workaround to get this "side-substring" indexing functionality without having to generate the values myself when adding docs to Solr?
Update: with the "hacked" schema (i.e. without side=front), I indexed the documents and changed the PP field to be stored. When I searched, it looked like it had indexed the entire value. For example, for NAME:ELEPHANT, I found PP:ELEPHANT ...

The side attribute was removed in Lucene 4.4 as part of LUCENE-3907. The filter now always behaves as if you had specified side="front". Since that is how you are using it, you can simply remove the attribute.
As you can read in the discussion on the linked Lucene issue:
If you need reverse n-grams, you could always add a filter to do that
afterwards. There is no need to have this as separate logic in this
filter. We should split logic and keep filters as simple as possible.
And this is what has been done. The side attribute has been removed from the filter.
This was done in Lucene, not directly in Solr. As Lucene is a Java API, the change is mentioned in the Javadoc of the filter:
As of Lucene 4.4, this filter does not support
EdgeNGramTokenFilter.Side.BACK (you can use ReverseStringFilter
up-front and afterward to get the same behavior), handles
supplementary characters correctly and does not update offsets
anymore.
This may be the reason why you do not find a word about it in the Solr documentation. But this change has also been mentioned in Lucene's Change Log.
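For illustration, here is a sketch of how the schema from the question could look on Solr 5. The first type is the original text_prefix with the side attribute dropped. The second type, named text_suffix here purely as a hypothetical example, wraps the filter in two ReverseStringFilterFactory steps to emulate the removed side="back" behaviour, along the lines the Javadoc suggests. Gram sizes are copied from the question; adjust as needed.

```xml
<!-- Front edge n-grams: same as the Solr 4 config, minus the removed side attribute -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Back edge n-grams (hypothetical type name): reverse, take front grams, reverse again -->
<fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Regarding the update in the question: a stored field returns the original input value, not the analyzed tokens, so seeing PP:ELEPHANT in the results does not by itself mean the n-grams are missing. The Analysis screen in the Solr admin UI shows which tokens are actually indexed for a given field type.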

Difference between "group" and "component" in QuickFIX/J

I am new to the FIX world. I am writing an application processing FIX messages in Java and for that I am using QuickFIX/J. I have downloaded the DataDictionary from the homepage (http://quickfixengine.org/). I am using the version 4.4.
The XML file contains both groups and components, but a component can in turn contain groups.
What's the exact difference between them?

Components aren't really... things. They're like macros in the FIX DataDictionary (DD). Many messages need the same set of fields, so instead of specifying the same fields in every message, the DD defines a component that other messages can include.
A Group, on the other hand, is a very real thing. It's a repeating sequence of fields that will appear 0 or more times in a message.
QuickFIX's (QF) programming interface largely ignores components as a concept. You can't extract a component from a message because a component isn't a concept in QF; you just extract the fields like any other field.
A hypothetical example: The following two message definitions are exactly the same.
With a component
<message name="Automobile" msgtype="X" msgcat="app">
<field name="Wheel" required="Y"/>
<field name="Bumper" required="Y"/>
<component name="Dashboard" required="Y"/>
</message>
<component name="Dashboard">
<field name="Radio" required="Y"/>
<field name="AirConditioner" required="Y"/>
<field name="Heater" required="Y"/>
</component>
Without a component
<message name="Automobile" msgtype="X" msgcat="app">
<field name="Wheel" required="Y"/>
<field name="Bumper" required="Y"/>
<field name="Radio" required="Y"/>
<field name="AirConditioner" required="Y"/>
<field name="Heater" required="Y"/>
</message>
See? A component is pretty much just a macro.
Either way it's defined, you just end up calling msg.GetHeater() (or whatever).
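For contrast, a repeating group in the same hypothetical dictionary might be declared as in the sketch below. The names NoPassengers, PassengerName and SeatPosition are invented for this example and do not exist in any real FIX dictionary. The group's name is the counter field, and the first field listed inside it acts as the delimiter that marks the start of each repetition.

```xml
<message name="Automobile" msgtype="X" msgcat="app">
  <field name="Wheel" required="Y"/>
  <field name="Bumper" required="Y"/>
  <!-- NoPassengers is the counter field: its value says how many entries follow -->
  <group name="NoPassengers" required="N">
    <!-- The first field in the group starts each repetition -->
    <field name="PassengerName" required="Y"/>
    <field name="SeatPosition" required="N"/>
  </group>
</message>
```

Unlike a component, this changes what appears on the wire: the message carries the counter followed by zero or more PassengerName/SeatPosition sets, and each set is read back as a separate group entry rather than as plain top-level fields.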

From the FIXWiki for Components:
Component blocks are sets of related data fields grouped together and are referenced by the component block name in messages that they are used in. Sometimes they are referred to as "Groups".
Component blocks are practical to be defined, and then reused in different message types. Sometimes a repeating group is just for one particular message and then it is not defined as a Component block.
View a component block as a reusable definition of fields. Such a component block may or may not contain a repeating group of fields.
For instance, take the Parties component block, which is used in many different message types (see "Used In" on that page). It is easy to define once and reuse in many message definitions.

Just going to add some information since the accepted answer is missing this information (probably due to the fact that it is about five years old now).
In QuickFIX/J you are actually able to get and set components. So you can for example simply copy the Instrument component from one message to another.
@Test
public void testComponent() throws Exception {
    // Build an Instrument component and populate a few of its fields
    final Instrument instrument = new Instrument();
    instrument.set(new Symbol("DELL"));
    instrument.set(new CountryOfIssue("USA"));
    instrument.set(new SecurityType(SecurityType.COMMON_STOCK));
    // Attach the component to a NewOrderSingle
    final quickfix.fix44.NewOrderSingle newOrderSingle = new quickfix.fix44.NewOrderSingle();
    newOrderSingle.set(instrument);
    // Copy the same component from the order into an ExecutionReport
    final quickfix.fix44.ExecutionReport executionReport = new quickfix.fix44.ExecutionReport();
    executionReport.setComponent(newOrderSingle.getInstrument());
    // Print both messages with the SOH delimiter replaced by '|'
    System.out.println("NOS: " + newOrderSingle.toString().replace('\001', '|'));
    System.out.println("ER: " + executionReport.toString().replace('\001', '|'));
}
Output:
NOS: 8=FIX.4.4|9=28|35=D|55=DELL|167=CS|470=USA|10=233|
ER: 8=FIX.4.4|9=28|35=8|55=DELL|167=CS|470=USA|10=221|
Maybe this is also possible in the other QuickFIX language variants.

Can I read the maxOccurs property for a segment from the stream being processed?

I am trying to create a mapping file for a fixed-length file that contains multiple repeating segments. The problem is that more than one of these segments repeats an indefinite number of times, which BeanIO does not support for flat files. I understand that there is a good reason for this, as BeanIO can only do so much guesswork about how often a segment repeats.
However, the number of repetitions for each segment is present in the file, at a position before the repeating segments occur, so I am trying to figure out whether there is a way to read that number from the stream and then populate the "minOccurs" and "maxOccurs" properties of the following segment with the correct value.
Basically my mapping file looks like:
<beanio>
<stream name="employeeFile" format="fixedlength">
<record name="record1" class="example.Record1">
<field name="field1" length="10"/>
<field name="field2" length="2"/>
<field name="length1" length="2"/>
<segment name="list1" collection="list" minOccurs="1" maxOccurs="unbounded" class="example.List1">
...
</segment>
<field name="length2" length="2"/>
<segment name="list2" collection="list" minOccurs="1" maxOccurs="unbounded" class="example.List2">
...
</segment>
</record>
</stream>
</beanio>
I now need some way to use the value of fields length1 and length2 as "maxOccurs" property in the segments. I am fairly certain that there is no "official" way to get this behavior, but I have so far failed to come up with an even remotely elegant solution for this.
One idea I had was to create a procedure that loads the number of repetitions for each segment from the file and then does a search-and-replace on the mapping file before loading it into BeanIO; however, this seems like a very complicated way of doing things.
Thanks,
Sönke

Found the answer myself. I was reading the BeanIO reference documentation for version 2.0, not 2.1, which introduced the feature I was looking for.
The reference document states:
If a field repeats a fixed number of times based on a preceding field
in the same record, the occursRef attribute can be used to identify
the name of the controlling field. If the controlling field is not
bound to a separate property of its parent bean object, be sure to
specify ignore="true". The following mapping file shows how to
configure the accounts field occurrences to be dependent on the
numberOfAccounts field. If desired, minOccurs and maxOccurs may still
be specified to validate the referenced field occurrences value.
So one can use:
<field name="accounts" type="int" collection="list" occursRef="numberOfAccounts" />
to get the intended result.
I don't think this property works with XML streams, as it is not really needed there. I accidentally tried to add it to an XML stream mapping and got an exception instead of a proper error message.
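Applied to the mapping from the question, this might look roughly like the sketch below. It assumes BeanIO 2.1 or later and that occursRef is accepted on segment elements in the same way the quoted passage describes for fields; that is an assumption to verify against the 2.1 reference. The type="int" and ignore="true" attributes on the counter fields are also assumptions (drop ignore if Record1 has matching properties).

```xml
<beanio>
  <stream name="employeeFile" format="fixedlength">
    <record name="record1" class="example.Record1">
      <field name="field1" length="10"/>
      <field name="field2" length="2"/>
      <!-- Counter field that controls how often list1 repeats -->
      <field name="length1" length="2" type="int" ignore="true"/>
      <segment name="list1" collection="list" occursRef="length1" class="example.List1">
        <!-- segment fields as in the original mapping (elided in the question) -->
      </segment>
      <!-- Counter field that controls how often list2 repeats -->
      <field name="length2" length="2" type="int" ignore="true"/>
      <segment name="list2" collection="list" occursRef="length2" class="example.List2">
        <!-- segment fields as in the original mapping (elided in the question) -->
      </segment>
    </record>
  </stream>
</beanio>
```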

Solr query on Europe character (Beklædning)

In a Solr query, a search such as
q=*%3A*&fq=grand_cat_str%3ABeklædning
is read by Solr with the fq as: <str name="fq">grand_cat_str:BeklÃ¦dning</str>
and returns no results. A wildcard search for Bekl*dning returns the correct result.
[edit]
I added
<fieldType name="string" class="solr.StrField" sortMissingLast="true" >
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
but got an error:
org.apache.solr.common.SolrException: FieldType: StrField (string) does not support specifying an analyzer

This is related to how Solr handles characters outside the 7-bit ASCII range (the first 128 code points). The best recommendation is to add the ASCIIFoldingFilterFactory filter to the analyzer of your field grand_cat_str in your schema.
Please refer to Specifying an Analyzer in the Schema if you need guidance on adding an analyzer.
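Because StrField does not accept an analyzer (as the error in the question shows), the folding filter has to live on a solr.TextField type instead. A minimal sketch, with the type name string_folded chosen here purely for illustration and the field attributes assumed: KeywordTokenizerFactory keeps the whole value as a single token, so the field still behaves like an exact-match string, while ASCIIFoldingFilterFactory folds characters such as æ to their ASCII equivalents at both index and query time.

```xml
<!-- Hypothetical type name; an exact-match field with ASCII folding -->
<fieldType name="string_folded" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- Emit the entire field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Fold non-ASCII characters, e.g. "Beklædning" becomes "Beklaedning" -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Field attributes are assumed; keep whatever the existing definition uses -->
<field name="grand_cat_str" type="string_folded" indexed="true" stored="true"/>
```

Existing documents would need to be reindexed after changing the field type, since the folding only applies to newly analyzed values.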

If most documents in the corpus are in that same language (Danish?), then applying ASCIIFoldingFilterFactory may well be a bad option; it depends on how users are expected to enter their queries.
Have you tried just URL-encoding the query?
q=*%3A*&fq=grand_cat_str%3ABekl%C3%A6dning
should work just fine

It is indeed an escaping problem.
Using org.apache.solr.client.solrj.util.ClientUtils.escapeQueryChars(String)
makes the string readable.

Using SOLR Autocomplete for multiple terms (i.e. comma-separated locations)

I've got SOLR up and running, indexing data via the DIH, and properly returning results for queries. I'm trying to set up another core to run the suggester, in order to autocomplete geographical locations. We have a web application that needs to take a city, state / region, country input. We'd like to do this in a single entry box. Here are some examples:
Brooklyn, New York, United States of America
Philadelphia, Pennsylvania, United States of America
Barcelona, Catalunya, Spain
Assume for now that every location around the world can be split into this 3-part form. I've set up my DIH to create a TemplateTransformer field that combines the 4 tables (city, state and country are all independent tables connected to each other by a master places table) into a field called "fullplacename":
<field column="fullplacename" template="${city_join.plainname},
${region_join.plainname}, ${country_join.plainname}"/>
I've defined a "text_auto" field in schema.xml:
<fieldType class="solr.TextField" name="text_auto">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
and have defined these two fields as well:
<field name="name_autocomplete" type="text_auto" indexed="true" stored="true" multiValued="true" />
<copyField source="fullplacename" dest="name_autocomplete" />
Now, here's my problem. This works fine for the first term, i.e. if I type "brooklyn" I get the results I'd expect, using this URL to query:
http://localhost:8983/solr/places/suggest?q=brooklyn
However, as soon as I put a comma and/or a space in there, it breaks them up into 2 suggestions, and I get a suggestion for each:
http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny
Gives me a suggestion for "brooklyn" and a suggestion for "ny" instead of a suggestion that matches "brooklyn, ny". I've tried every solution I can find via google and haven't had any luck. Is there something simple that I've missed, or is this the wrong approach?
Thanks!
EDIT: Just in case, here's the searchComponent and requestHandler definition:
<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
<searchComponent name="suggest" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">name_autocomplete</str>
</lst>
</searchComponent>

The problem lies in the suggester. Like the spellchecker it tokenizes on whitespace.
http://lucene.472066.n3.nabble.com/suggester-issues-tp3262718p3266140.html has a solution for this problem.

You are using the KeywordTokenizer which will not create separate tokens for "Brooklyn", "NY" and "United States".
Your example queries do not look so much like autocomplete but more like regular searches.
Autocomplete query (IMHO) contains only partial terms:
http://localhost:8983/solr/places/suggest?q=brook
for type ahead lists. You want to use EdgeNGram for that: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
Most probably in combination with StandardTokenizer and/or WordDelimiterFilterFactory.
For your query example:
http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny
StandardTokenizer in combination with LowerCaseFilter and the dismax request handler with a good configuration of the mm parameter (restricting hits to those that contain all input terms) would work well, see: http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
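As a rough sketch of the field type this answer describes: the type name text_auto_ngram and the gram sizes below are illustrative assumptions, not taken from the question. The index analyzer splits each place name into words, lowercases them and stores their front edge n-grams; the query analyzer only tokenizes and lowercases, so partial input such as "brook" matches the indexed grams directly.

```xml
<!-- Hypothetical type name and gram sizes; tune to the data -->
<fieldType name="text_auto_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "brooklyn" is indexed as "br", "bro", "broo", ... up to 15 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the typed prefix is matched against the indexed grams -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="name_autocomplete" type="text_auto_ngram" indexed="true" stored="true" multiValued="true"/>
```

A field like this is meant to be queried through a regular search handler (for example dismax with mm set so that all terms must match) rather than through the Suggester component shown in the question.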

I feel the accepted answer is a bit too complex. An elegant way of doing it would be to use http://localhost:8983/solr/places/suggest?spellcheck.q=brooklyn in place of http://localhost:8983/solr/places/suggest?q=brooklyn. As mentioned here

Creating Lucene Index for Sitecore causes "Could not find add method" errors

I have a Sitecore 6.2 site that had no Lucene indexes besides the system index. I tried to add this new simple index:
<index id="videoIndex" type="Sitecore.Search.Index, Sitecore.Kernel" >
<param desc="name">$(id)</param>
<param desc="folder">IndexFolder</param>
<Analyzer ref="search/analyzer" />
<templates hint="list:AddTemplate">
<template>{854D2F45-3261-45A8-9E52-64D96B5D54E5}</template>
</templates>
<fields hint="raw:AddField">
<field target="category">Categories</field>
<field target="date">__updated</field>
</fields>
</index>
Once I add this, browsing to any page on the sitecore site gives the following error:
Could not find add method: AddTemplate (type: Sitecore.Search.Index)
Using lucene 2.3.1.3, .NET 3.5.

The 'type' attribute of the <index/> element references Sitecore.Search.Index class, which doesn't contain methods like AddTemplate and AddField. It seems you should reference Sitecore.Data.Indexing.Index instead. Take a look at <index id="system" ... /> in web.config.
Hope this helps.
