Solr indexing of MongoDB collection - mongodb

Suppose I have a test application representing some friends list. The application uses a collection where all documents are in the following format:
{
  _id: ObjectId("someString"),
  name: "George",
  description: "some text",
  age: 35,
  friends: [
    {
      name: "Peter",
      age: 30,
      town: {
        name_town: "Paris",
        country: "France"
      }
    },
    {
      name: "Thomas",
      age: 25,
      town: {
        name_town: "Berlin",
        country: "Germany"
      }
    }, ... // more friends
  ]
}
... // more documents
How can I describe such a collection in schema.xml? I need to run facet queries like: "Give me countries, where George's friends live". Another use case may be: "Return all documents(persons), whose friend is 30 years old." etc.
My initial idea is to declare the "friends" attribute as a text field with this schema.xml definition:
<fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100">
....
<field name="friends" type="text_wslc" indexed="true" stored="true" />
and try to search for words such as "age" and "30" in the text, but this is not a very reliable solution.
Please set aside the fact that the design of the collection is not logically well-formed. It is only an example of a similar problem I am currently facing.
Any help or ideas will be highly appreciated.
EDIT:
Sample 'schema.xml'
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="text-schema" version="1.5">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
<fieldType name="trInt" class="solr.TrieIntField" precisionStep="0" omitNorms="true" />
<fieldType name="text_p" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="_id" type="string" indexed="true" stored="true" required="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_ts" type="long" indexed="true" stored="true"/>
<field name="ns" type="string" indexed="true" stored="true"/>
<field name="description" type="text_p" indexed="true" stored="true" />
<field name="name" type="text_p" indexed="true" stored="true" />
<field name="age" type="trInt" indexed="true" stored="true" />
<field name="friends" type="text_p" indexed="true" stored="true" /> <!-- Here is the problem: when the type is text_p, all nested fields are treated as text. The optimal solution would be something like a "collection" tag that marks name_town and town as descendants of the field 'friends', but unfortunately this is not how Solr works. -->
<field name="town" type="text_p" indexed="true" stored="true"/>
<field name="name_town" type="string" indexed="true" stored="true"/>
<field name="town" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>_id</uniqueKey>

Since Solr is document-centric, you will need to flatten the data as much as you can. Based on the sample you have given, I would create a schema.xml like the one below.
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="friends" version="1.0">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="name"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="age"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="town"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="townRaw"
type="string" indexed="true" stored="true" multiValued="false" />
<field name="country"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="countryRaw"
type="string" indexed="true" stored="true" multiValued="false" />
<field name="friends"
type="int" indexed="true" stored="true" multiValued="true" />
</fields>
<copyField source="country" dest="countryRaw" />
<copyField source="town" dest="townRaw" />
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField"
precisionStep="0" positionIncrementGap="0" />
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
</types>
</schema>
I would model each person as its own document. The relationship between two persons is modelled via the friends field, which holds an array of IDs. So at index time you need to fetch the IDs of all of a person's friends and put them into that field.
Most of the other fields are straightforward. The two Raw fields are the interesting ones. Since you said you want to facet on the country, you need the country value unchanged, or otherwise optimized for faceting. Usually the field types differ depending on their purpose (searching, faceting, autosuggest, etc.). In this case, countryRaw and townRaw are string fields, so the country and town values are indexed exactly as they are passed in.
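To illustrate the flattening step described above, here is a minimal Python sketch. It is not taken from the linked project. It assumes persons can be identified by name, that IDs are assigned as simple counters, and that friendship is treated as mutual so that a query such as friends:1 matches George's friends. The function name flatten_people is hypothetical.

```python
from itertools import count


def flatten_people(mongo_docs):
    """Turn nested person documents into flat, one-per-person Solr documents."""
    ids = {}        # person name -> numeric id (assumption: names are unique)
    solr_docs = {}  # numeric id -> flat document
    next_id = count(1)

    def doc_for(name):
        # Create the flat document for a person on first sight, reuse it later.
        if name not in ids:
            ids[name] = next(next_id)
            solr_docs[ids[name]] = {"id": ids[name], "name": name, "friends": []}
        return solr_docs[ids[name]]

    def link(a, b):
        # Store the relationship only as an id, never as a nested object.
        if b["id"] not in a["friends"]:
            a["friends"].append(b["id"])

    for person in mongo_docs:
        owner = doc_for(person["name"])
        owner["age"] = person.get("age")
        owner["description"] = person.get("description")
        for friend in person.get("friends", []):
            flat = doc_for(friend["name"])
            flat["age"] = friend.get("age")
            town = friend.get("town", {})
            flat["town"] = town.get("name_town")
            flat["country"] = town.get("country")
            # Assumption: friendship is mutual, so link both directions.
            link(owner, flat)
            link(flat, owner)
    return list(solr_docs.values())
```

Each returned dictionary has only scalar fields plus the multi-valued friends list, which is the shape the schema above expects.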
Now to your use cases.
Give me countries, where George's friends live
This can be done by faceting. You would:
- query for the ID of George
- facet on countryRaw
Such a query would look like q=friends:1&rows=0&facet=true&facet.field=countryRaw&facet.mincount=1, where 1 stands for George's ID.
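As a rough sketch of how a client could assemble that query string and read the result, the following Python uses only the standard library. The function names are hypothetical, and the response helper assumes Solr's default flat JSON layout for facet_fields (value, count, value, count, ...).

```python
from urllib.parse import urlencode


def friends_country_facet_query(person_id):
    """Build the query string for faceting on countryRaw among a person's friends."""
    params = [
        ("q", f"friends:{person_id}"),
        ("rows", 0),               # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "countryRaw"),
        ("facet.mincount", 1),     # hide countries with no matching person
    ]
    return urlencode(params)


def pair_facet_counts(flat_list):
    """Convert a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat_list[::2], flat_list[1::2]))
```

Note that urlencode escapes the colon in friends:1, which Solr decodes again on the server side.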
Return all documents(persons), whose friend is 30 years old.
This one is harder. First off you will need Solr's join feature. You need to configure this in your solrconfig.xml.
<config>
<!-- loads of other stuff -->
<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin" />
<!-- loads of other stuff -->
</config>
The corresponding join query would look like this: q={!join from=id to=friends}age:[30 TO *]
This works as follows:
- with age:[30 TO *] you search for all persons that are 30 or older (use age:30 to match exactly 30 years)
- then you take their ids and join them on the friends attribute of all other persons
- this returns all persons that have any of the ids matched by the initial query in their friends attribute
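The join semantics described in the steps above can be mimicked in memory. This Python sketch is only a model of what the join does, not how Solr implements it; the function name and the sample documents are made up for illustration.

```python
def join_from_id_to_friends(docs, inner_matches):
    """Model of {!join from=id to=friends}<inner query> over a list of dicts."""
    # Step 1: run the inner query and collect the "from" field (id).
    matched_ids = {d["id"] for d in docs if inner_matches(d)}
    # Step 2: keep every document whose "to" field (friends) contains
    # at least one of the collected ids.
    return [d for d in docs if matched_ids.intersection(d.get("friends", []))]


people = [
    {"id": 1, "name": "George", "age": 35, "friends": [2, 3]},
    {"id": 2, "name": "Peter", "age": 30, "friends": [1]},
    {"id": 3, "name": "Thomas", "age": 25, "friends": [1]},
]

# Persons with a friend who is exactly 30 years old.
exactly_30 = join_from_id_to_friends(people, lambda d: d["age"] == 30)
```

With this sample data, only George has a friend aged exactly 30, while the range form (30 or older) also matches Peter and Thomas because George himself is 35.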
As I did not just write this off the top of my head, you may want to have a look at my solrsample project on GitHub. I have added a test case there that deals with this question:
https://github.com/chriseverty/solrsample/blob/master/src/main/java/de/cheffe/solrsample/FriendJoinTest.java

Related

Solr Dataimport nested entity from PostgreSQL

I want to create a nested entity with DataImportHandler.
I use Solr 8.6, PostgreSQL 12, and openjdk-11.
My config (schema.xml) looks like this:
<schema name="products" version="1.5">
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="int" indexed="true" stored="false"/>
<uniqueKey>id</uniqueKey>
<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="price" type="float" indexed="true" required="true" stored="true"/>
<field name="categories" type="int" indexed="false" stored="true" required="true" multiValued="true"/>
<field name="pictures" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="pid" type="int" indexed="true" stored="true" />
<field name="previewUrl " type="string" indexed="true" stored="true" />
</schema>
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://${db.host}/myDB"
user="user"
password="myPassword"
/>
<document>
<entity name="products"
pk="id"
transformer="DateFormatTransformer"
query="SELECT * from products"
deltaQuery="SELECT id FROM products WHERE updated > '${dataimporter.last_index_time}'::timestamp"
deltaImportQuery="SELECT * FROM products WHERE id=${dataimporter.delta.id}"
/>
<field column="id" name="id"/>
<field column="price" name="price"/>
<entity name="categories"
query="SELECT category_id FROM product_category WHERE product_id='${products.id}'">
<field column="category_id" name="categories"/>
</entity>
<entity name="pictures"
child="true"
pk="pid"
query="SELECT * FROM pictures WHERE product_id='${products.id}'"
>
<field column="id" name="pid"/>
<field column="preview_url" name="previewUrl"/>
</entity>
</entity>
</document>
</dataConfig>
This is the result I expect:
[
  {
    "id": 1,
    "price": 10,
    "categories": [1, 2],
    "pictures": [
      {
        "pid": 1,
        "previewUrl": "/url"
      },
      {
        "pid": 2,
        "previewUrl": "/url"
      }
    ],
    "_version_": 1674819829308063744
  }
]
But I get the following error:
org.apache.solr.common.SolrException: [doc=null] missing required field: price
What am I doing wrong?

Error: text index required for $text query

I would like to know how to add text indexes in my ODM XML configuration to solve this problem and search by name.
Thanks for everything.
Regards.
P.S. I'm sorry for my English.
<document name="App\Document\Doc" db="db" collection="collection"
repository-class="App\Repository\DocRepository">
<id field-name="id" strategy="INCREMENT" type="int"/>
<field field-name="code" name="code" type="string"/>
<field field-name="name" name="name" type="string"/>
<field field-name="type" name="type" type="string"/>
<indexes>
???
</indexes>
</document>
After digging into some code I found this works:
<document name="App\Document\Doc" db="db" collection="collection"
repository-class="App\Repository\DocRepository">
<id field-name="id" strategy="INCREMENT" type="int"/>
<field field-name="code" name="code" type="string"/>
<field field-name="name" name="name" type="string"/>
<field field-name="type" name="type" type="string"/>
<indexes>
<index name="fts">
<key name="code" order="text" />
<key name="name" order="text" />
<key name="type" order="text" />
</index>
</indexes>
</document>
However, using the order attribute to specify the text index type seems counterintuitive.
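One possible way to read the mapping: in MongoDB's own index specification, the value next to each key is either a sort direction (1 or -1) or an index type such as "text", which may be why the XML mapping reuses the order attribute for it. A minimal Python sketch of the equivalent specification follows; the helper name is hypothetical and no driver call is made.

```python
def text_index_spec(fields):
    """Build a MongoDB-style index specification for a compound text index."""
    # Each key maps to the index type "text" in the slot where an
    # ordinary index would carry 1 (ascending) or -1 (descending).
    return {field: "text" for field in fields}


fts_spec = text_index_spec(["code", "name", "type"])
```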

Data type not working in Solr

I want to fetch records that include a date column from Cassandra into Solr. The following is my configuration:
in dataconfig.xml:
<entity name="artist" query="SELECT artist_id, name, email, total_jobs, created FROM artist_list">
<field column="artist_id" template="ARTIST_${artist.artist_id}" name="id"/>
<field column="created" name="artist_created" />
</entity>
in schema.xml:
<fieldType name="tdate" class="solr.TrieDoubleField" omitNorms="true" />
<field name="artist_created" type="tdate" indexed="false" stored="true"/>
But the result does not contain the created field. Can anyone tell me what the problem is? Thanks very much!
You are defining the tdate field type as solr.TrieDoubleField, which is why the result doesn't contain the artist_created data.
Change your schema to:
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<field name="artist_created" type="date" indexed="false" stored="true"/>
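If the value is ever converted to a string on its way into Solr (for example in custom import code, which is an assumption and not something the question shows), note that Solr date fields expect ISO 8601 timestamps in UTC with a trailing Z. A small Python sketch of that formatting, with a hypothetical helper name:

```python
from datetime import datetime, timezone


def to_solr_date(value):
    """Format a datetime the way Solr date fields expect: YYYY-MM-DDThh:mm:ssZ."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```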

Getting a JSONParseException when indexing fields from MongoDB collection in SOLR using DataImportHandler

I am seeing this exception while trying to index data from a MongoDB collection:
Exception while processing: products document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: com.mongodb.util.JSONParseException:
{idStr,name,code,description,price,brand,size,color}
^
at org.apache.solr.handler.dataimport.MongoEntityProcessor.initQuery(MongoEntityProcessor.java:46)
at org.apache.solr.handler.dataimport.MongoEntityProcessor.nextRow(MongoEntityProcessor.java:54)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: com.mongodb.util.JSONParseException:
{idStr,name,code,description,price,brand,size,color}
^
at com.mongodb.util.JSONParser.parseString(JSON.java:387)
The following is my data-source-config file in the dataimport directory in the conf folder of my core:
<dataConfig>
<dataSource name="mymongodb" type="MongoDataSource" database="mongodb://*.*.*.*/testdb" />
<document name="data">
<entity
name="products"
processor="MongoEntityProcessor"
query="{idStr,name,code,description,price,brand,size,color}"
collection="products"
datasource="mymongodb"
transformer="MongoMapperTransformer" >
<field column="idstr" name="idstr" mongoField="idStr"/>
<field column="name" name="name" mongoField="name"/>
<field column="code" name="code" mongoField="code"/>
<field column="description" name="description" mongoField="description"/>
<field column="price" name="price" mongoField="price"/>
<field column="brand" name="brand" mongoField="brand"/>
<field column="size" name="size" mongoField="size"/>
<field column="color" name="color" mongoField="color"/>
<entity
name="categories"
processor="MongoEntityProcessor"
query="{'idStr':'${categories.idstr}'}"
collection="categories"
datasource="mymongodb"
transformer="MongoMapperTransformer">
<field column="type" name="type" mongoField="type"/>
</entity>
</entity>
</document>
</dataConfig>
I am trying to join the idStr field of the categories collection with the idStr of the products collection (field name => idstr) and get the above fields (name, description, ... from products and the type field from categories).
Any comments or solutions regarding this exception would be really appreciated. Thanks!
Your Solr field is declared as idstr, but you are referencing it in the query attribute of dataConfig as idStr (a difference in camel case).
I was able to resolve this ...
Following is the working configuration in the data-source-config file :
<entity
name="products"
query="select idStr,name,code,description,price,brand,size,color from products">
<field name="prodidStr" column="idStr" />
<field name="name" column="name" />
<field name="code" column="code" />
<field name="description" column="description" />
<field name="price" column="price" />
<field name="brand" column="brand" />
<field name="size" column="size" />
<field name="color" column="color" />
<entity
name="categories"
dataSource="mongod"
query="select idStr,ancestors from categories where idStr = '${products.idStr}'">
<field name="catidStr" column="idStr" />
<field name="ancestors" column="ancestors" />
</entity>
</entity>
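The stack trace in the question suggests that the parser rejected the query attribute itself: {idStr,name,code,...} is a bare list of field names, not a JSON document of key/value pairs. As a rough check, and only as an approximation, the following Python sketch tests whether a query string is valid JSON. Python's json module is stricter than the MongoDB driver's parser (which, for instance, accepts the single-quoted strings used in the inner entity), so a failure here is a hint rather than proof. The function name is hypothetical.

```python
import json


def is_valid_json_query(query):
    """Return True if the query string parses as strict JSON."""
    try:
        json.loads(query)
    except json.JSONDecodeError:
        return False
    return True


# The original attribute value: a list of names without values.
original_ok = is_valid_json_query("{idStr,name,code,description,price,brand,size,color}")
# An empty filter document is valid JSON.
empty_ok = is_valid_json_query("{}")
```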

not all fields are copied from mongodb to solr after integration using mongo-connector

I was able to integrate MongoDB and Solr successfully using mongo-connector. However, whenever I add or update anything in the sample collection I created, only two or three fields of a document are copied, and the data in the remaining fields is not copied into Solr. I have not been able to get this working.
This is my collection and its document details. Name of the collection: testdb
The document was inserted as follows:
db.testdb.insert( {
... _id: "101",
... name: "test",
... description: "descr",
... mydesc: "mydescr",
... nmdsc: "nmdsc1",
... coords: "coords1"
... })
And the log of the data sync between Solr and Mongo reports success:
2014-01-17 19:35:38,462 - INFO - Finished 'http://<hostname>:<port>/solr/update/?commit=true' (post) with body '<add><doc>' in 0.210 seconds.
But when I run a query to see the document data, it returns only these fields:
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"wt": "json"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "101",
"description": "descr",
"name": "test",
"_version_": 1457486601392226300
}
]
}
}
Clearly, I can see that the following fields and their data are not copied into Solr:
... mydesc: "mydescr",
... nmdsc: "nmdsc1",
... coords: "coords1"
Following is my schema.xml:
<schema name="narayana" version="1.5">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
<fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0" />
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate" />
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0" />
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_wslc" indexed="true" stored="true" />
<field name="description" type="text_wslc" indexed="true" stored="true" />
<field name="mydesc" type="text_wslc" indexed="true" stored="true" />
<field name="nmdsc" type="text_wslc" indexed="true" stored="true" multiValued="true" />
<field name="coords" type="string" indexed="true" stored="true" multiValued="true" />
<field name="_version_" type="long" indexed="true" stored="true" />
<dynamicField name="*" type="string" indexed="true" stored="true" />
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>nmdsc</defaultSearchField>
<!-- we don't want too many results in this usecase -->
<solrQueryParser defaultOperator="AND" />
<copyField source="name" dest="nmdsc" />
<copyField source="description" dest="nmdsc" />
</schema>
Your configuration only copies two specific fields. If you want to index additional fields, you should add them to your configuration file and run the same import cycle again:
<!-- we don't want too many results in this usecase -->
<solrQueryParser defaultOperator="AND" />
<copyField source="name" dest="nmdsc" />
<copyField source="description" dest="nmdsc" />