Create Lucene Indexes in an Apache Geode Region

I'm trying to create Lucene Indexes on an Apache Geode Region.
I have all the Region definitions in cache.xml. This cache.xml is read by the cache server and the Regions are created.
If I define a Region in cache.xml like the one below,
<region name="trackRegion" refid="PARTITION_PERSISTENT">
    <lucene:index name="myIndex">
        <lucene:field name="tenant" />
    </lucene:index>
</region>
the Region is created with the Lucene Index, but this approach doesn't let me add other Region properties such as an index on the primary key, a Region compressor, etc.
Geode says the Lucene Index should be created first and then the Region. How should I define the Lucene Index for a Region like the one below?
<region name="trackRegion" refid="PARTITION_PERSISTENT">
    <region-attributes>
        <compressor>
            <class-name>org.apache.geode.compression.SnappyCompressor</class-name>
        </compressor>
    </region-attributes>
    <index name="trackRegionKeyIndex" from-clause="/trackRegion" expression="key" key-index="true"/>
</region>
I also tried creating the Region with Java annotations, following this document: https://github.com/spring-projects/spring-data-gemfire/blob/main/src/main/asciidoc/reference/lucene.adoc#annotation-configuration-support.
Even with this I get the "The Lucene Index must be created before Region" error.

Regarding the Spring configuration model for defining Lucene Indexes and using Apache Geode's Lucene support...
Since I am not familiar with how you set up and arranged your application configuration, have a look at a few SDG integration tests to see if they help you identify your problem.
First, have a look at the LuceneOperationsIntegrationTests class in the SDG test suite. This test class shows how to configure a Spring application using JavaConfig.
Next, have a look at EnableLuceneIndexingConfigurationIntegrationTests in the SDG test suite. This test class shows how your Spring application would be configured using SDG annotations.
Keep in mind that 1) Lucene Indexes on Apache Geode Regions can only be created on PARTITION Regions, and 2) PARTITION Regions can only be created on the peer servers in your cluster. That is, Lucene Indexes cannot be applied to client Regions in a ClientCache application.
I suspect your application configuration is missing Spring's @DependsOn annotation, either on the template or on the Region containing the Lucene Index. For example:
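A minimal JavaConfig sketch of that arrangement, assuming SDG's LuceneIndexFactoryBean and PartitionedRegionFactoryBean (the bean and field names here are only illustrative, not taken from your configuration), might look like this:

import org.apache.geode.cache.GemFireCache;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.DependsOn;
import org.springframework.data.gemfire.PartitionedRegionFactoryBean;
import org.springframework.data.gemfire.search.lucene.LuceneIndexFactoryBean;

@Configuration
public class GeodeConfiguration {

    // Declare the Lucene Index bean...
    @Bean(name = "myIndex")
    public LuceneIndexFactoryBean trackRegionLuceneIndex(GemFireCache gemfireCache) {
        LuceneIndexFactoryBean luceneIndex = new LuceneIndexFactoryBean();
        luceneIndex.setCache(gemfireCache);
        luceneIndex.setFields("tenant");
        luceneIndex.setRegionPath("/trackRegion");
        return luceneIndex;
    }

    // ...and make the Region bean explicitly depend on it so the index is created first.
    @Bean(name = "trackRegion")
    @DependsOn("myIndex")
    public PartitionedRegionFactoryBean<Object, Object> trackRegion(GemFireCache gemfireCache) {
        PartitionedRegionFactoryBean<Object, Object> region = new PartitionedRegionFactoryBean<>();
        region.setCache(gemfireCache);
        region.setPersistent(true);
        return region;
    }
}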

Related

"winning-configuration-property" algorithm : spring configuration properties run time determination based on application specific qualifiers

I need to implement a "winning-configuration-property" algorithm in my application.
For example, for the property dashboard_material I would create a file dashboard_material.yml (I am planning to represent each property as a file; this is slightly negotiable) with the following values. I believe this format presents a consolidated view of the variants of the property and is more suitable for impact analysis when someone changes values for a particular scenario:
car_type=luxury&car_subtype=luxury_sedan&special_features=none : leather
car_type=luxury&car_subtype=luxury_sedan&special_features=limited_edition : premium_leather
car_type=economy : pseudo_leather
default : pseudo_leather
I need the closest match. A luxury car can be a sedan or a compact.
I am assuming these are "decorators" of a car in object-oriented terms, but I am not finding any useful implementation for the above problem among sample decorator patterns.
For example, an API call
GET /configurations/dashboard_material should return the following values based on the input parameters:
car_type=economy : pseudo_leather
car_type=luxury & car_subtype=luxury_sedan : leather
car_type=luxury : pseudo_leather (from default; error or null if the value does not exist)
This looks very similar to the "specifications" or "QueryDSL" problem with GET APIs, in terms of slicing and dicing based on criteria,
but basically I am looking for run-time determination of a config value from a single microservice
(which is a Spring Config client; I use git2consul to push values from Git into the Consul KV store, and the Spring Config client is tied into the Consul KV. I am open to any equivalent or better alternatives).
I would ideally like to do all the processing as post-processing after the configuration is read (from either a Spring Config Server or the Consul KV),
so that no real processing happens after a query is received. The same post-processing will also have to happen after every Spring Config client "refresh" interval,
based on configuration property updates.
I have previously seen such an implementation (as an ops engineer) with Netflix Archaius, but again I am not finding any suitable text about it on the Archaius pages.
My trivial/naive solution would be to create multiple maps/trees to store the data and then consolidate them into a single map based on the API request, effectively overriding some of the values from the lower-priority maps.
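Purely as an illustration of that naive "most specific criteria set wins" idea (all class, method and key names here are hypothetical), a sketch could look like this:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pick the value whose criteria set is fully satisfied by the
// request and has the largest number of matching qualifiers; fall back to a default.
public class WinningPropertyResolver {

    // criteria (e.g. {car_type=luxury, car_subtype=luxury_sedan}) -> property value
    private final Map<Map<String, String>, String> variants = new LinkedHashMap<>();
    private final String defaultValue;

    public WinningPropertyResolver(String defaultValue) {
        this.defaultValue = defaultValue;
    }

    public void addVariant(Map<String, String> criteria, String value) {
        variants.put(criteria, value);
    }

    public String resolve(Map<String, String> request) {
        String winner = defaultValue;
        int bestScore = -1;
        for (Map.Entry<Map<String, String>, String> entry : variants.entrySet()) {
            Map<String, String> criteria = entry.getKey();
            // every qualifier of the variant must be present in the request with the same value
            boolean matches = criteria.entrySet().stream()
                    .allMatch(c -> c.getValue().equals(request.get(c.getKey())));
            if (matches && criteria.size() > bestScore) {
                bestScore = criteria.size();
                winner = entry.getValue();
            }
        }
        return winner;
    }
}

The same resolution could be precomputed per property after each configuration load or refresh, so that serving a GET request is just a map lookup.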
I am looking for any open-source implementations or references so I can avoid having to create new code.

Hibernate Search 6.0.2.Final with opendistro

I have a Spring Boot 2.4.2 application integrated with Hibernate Search 6.0.2.Final.
When using standard Elasticsearch everything works fine (read/write) when persisting new entities. The index also gets created as expected as myindex-000001, based on the default simple index strategy.
However, when I switched the backend to OpenDistro (latest), I only see a single index created, named myindex-write (different from the expected myindex-000001).
The write operations work as expected (due to the -write suffix); however, the read operations fail with the error:
"root_cause": [
  {
    "type": "index_not_found_exception",
    "reason": "no such index [myindex-read]",
    "resource.type": "index_or_alias",
    "resource.id": "myindex-read",
    "index_uuid": "_na_",
    "index": "myindex-read"
  }
]
GET /_cat/aliases on OpenDistro shows that there are no aliases for the index.
What is the best way to resolve this? The no-alias strategy shown here? The downside of using no-alias is the lack of blue-green-deployment-style re-indexing. Is a custom index layout strategy the best way to solve this?
All the above issues were caused by my setting hibernate.search.schema_management.strategy to none. With that strategy the indexes need to be created manually, as mentioned in the docs here.
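If you would rather have Hibernate Search manage the schema instead of creating the indexes by hand, one option (shown here only as a sketch using Spring Boot's HibernatePropertiesCustomizer; the chosen strategy value is just an example) is to switch the strategy away from none:

import org.springframework.boot.autoconfigure.orm.jpa.HibernatePropertiesCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HibernateSearchConfig {

    // Example only: have Hibernate Search create indexes that do not exist yet,
    // instead of leaving all schema management to manual index creation ("none").
    @Bean
    public HibernatePropertiesCustomizer hibernateSearchSchemaManagement() {
        return properties ->
                properties.put("hibernate.search.schema_management.strategy", "create-or-validate");
    }
}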

REST API for processing data stored in HBase

I have a lot of records (millions) in an HBase store, like this:
key = user_id:service_id:usage_timestamp value = some_int
That means a user used service_id for some_int at usage_timestamp.
Now I want to provide a REST API for aggregating that data, for example "find the sum of all values for the requested user" or "find the max of them" and so on. So I'm looking for best practice. A simple Java application doesn't meet my performance expectations.
My current approach aggregates the data via an Apache Spark application. It looks good enough, but there are some issues using it behind a Java REST API, since Spark doesn't support a request-response model (I have also taken a look at spark-job-server, which seems raw and unstable).
Any ideas?
Thanks.
I would suggest HBase + Solr if you are using Cloudera (i.e. Cloudera Search):
use the SolrJ API for aggregating data (instead of Spark) and to interact with REST services.
Solr solution (in Cloudera it's Cloudera Search):
Create a collection (similar to an HBase table) in Solr.
Indexing: use the NRT Lily indexer or a custom MapReduce Solr document creator to load the data as Solr documents.
If you don't like the NRT Lily indexer, you can use a Spark or MapReduce job with SolrJ to do the indexing. For example, Spark-Solr:
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Data retrieval: use SolrJ to get the Solr docs from your web service call.
In SolrJ:
There is FieldStatsInfo, through which sum, max, etc. can be obtained.
There are facets and facet pivots to group data.
Pagination is supported for REST API calls.
You can integrate the Solr results with Jersey or some other web service framework, as we have already implemented it this way:
/**
 * Returns the records for the specified rows from the Solr server; you can integrate
 * this with any REST framework such as Jersey.
 */
public SolrDocumentList getData(int start, int pageSize, SolrQuery query) throws SolrServerException {
    query.setStart(start);   // start of your page
    query.setRows(pageSize); // number of rows per page
    LOG.info(ClientUtils.toQueryString(query, true));
    // POST is important if you are querying a huge result set; GET will fail for huge results.
    final QueryResponse queryResponse = solrCore.query(query, METHOD.POST);
    final SolrDocumentList solrDocumentList = queryResponse.getResults();
    if (isResultEmpty(solrDocumentList)) { // check if the result list is empty
        LOG.info("hmm.. No records found for this query");
    }
    return solrDocumentList;
}
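As a follow-up to the FieldStatsInfo point above, a small sketch of requesting field statistics (sum, max, ...) with SolrJ might look like this (the field, collection and client names are made up for illustration):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.FieldStatsInfo;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UsageStatsExample {

    // Sketch: ask Solr's StatsComponent to aggregate the "usage_value" field for one user.
    public void printUsageStats(SolrClient solrClient) throws Exception {
        SolrQuery query = new SolrQuery("user_id:42");
        query.setGetFieldStatistics("usage_value"); // enable stats for this field
        query.setRows(0);                           // we only want the aggregates, not the documents

        QueryResponse response = solrClient.query("usage_collection", query);
        FieldStatsInfo stats = response.getFieldStatsInfo().get("usage_value");

        System.out.println("sum = " + stats.getSum());
        System.out.println("max = " + stats.getMax());
    }
}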
Also look at
my answer in "Create indexes in solr on top of HBase"
https://community.hortonworks.com/articles/7892/spark-dataframe-to-solr-cloud-runs-on-sandbox-232.html
Note: I think the same can be achieved with Elasticsearch as well, but based on my experience I'm confident with Solr + SolrJ.
I see two possibilities:
Livy REST Server - a new REST server created by Cloudera. You can submit Spark jobs in a RESTful way. It is new and developed by Cloudera, one of the biggest Big Data / Spark companies, so it is very likely that it will keep being developed rather than abandoned.
You can run the Spark Thrift Server and connect to it just like to a normal database via JDBC. Here you've got the documentation. Workflow: read the data, preprocess it, and then share it via the Spark Thrift Server.
If you want to isolate third-party apps from Spark, you can create a simple application that exposes a user-friendly endpoint and translates the queries received by that endpoint into Livy Spark jobs or into SQL for the Spark Thrift Server.
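As a rough sketch of the second option, talking to the Spark Thrift Server is plain JDBC through the Hive JDBC driver (the host, port, table and column names below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftServerClient {

    public static void main(String[] args) throws Exception {
        // The Spark Thrift Server speaks the HiveServer2 protocol (default port 10000),
        // so the hive-jdbc driver must be on the classpath.
        try (Connection connection =
                     DriverManager.getConnection("jdbc:hive2://thrift-server-host:10000/default");
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(
                     "SELECT SUM(some_int) FROM usage_data WHERE user_id = '42'")) {
            while (resultSet.next()) {
                System.out.println("sum = " + resultSet.getLong(1));
            }
        }
    }
}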

Titan - How to Use 'Lucene' Search Backend

I am attempting to use the Lucene search backend with Titan. I am setting the index.search.backend property to lucene like so:
TitanFactory.Builder config = TitanFactory.build();
config.set("storage.backend", "hbase");
config.set("storage.hostname", "node1");
config.set("storage.hbase.table", "titan");
config.set("index.search.backend", "lucene");
config.set("index.search.directory", "/tmp/foo");
TitanGraph graph = config.open();
GraphOfTheGodsFactory.load(graph);
graph.getVertices().forEach(v -> System.out.println(v.toString()));
Of course, this does not work because this setting is of the GLOBAL_OFFLINE variety. The logs make me aware of this. Titan ignores my 'lucene' setting and then attempts to use Elasticsearch as the search backend.
WARN com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration
- Local setting index.search.backend=lucene (Type: GLOBAL_OFFLINE)
is overridden by globally managed value (elasticsearch). Use
the ManagementSystem interface instead of the local configuration to control
this setting.
After some reading, I understand that I need to use the ManagementSystem to set index.search.backend. I need some code that looks something like the following:
graph.getManagementSystem().set("index.search.backend", "lucene");
graph.getManagementSystem().set("index.search.directory", "/tmp/foo");
graph.getManagementSystem().commit();
I am confused about how to integrate this into my original example code above. Since this is a GLOBAL_OFFLINE setting, I cannot set it on an open graph. At the same time, I do not know how to get a graph unless I open one first. How do I set the search backend correctly?
There is no inmemory search backend. The supported search backends are Lucene, Solr, and Elasticsearch.
Lucene is a good option for a small-scale, single-machine search backend. You need to set two properties to do this, index.search.backend and index.search.directory:
index.search.backend=lucene
index.search.directory=/path/to/titansearchindexdir
As you've noted, the search backend is a GLOBAL_OFFLINE setting, so you should configure it before initially creating your graph. Since you've already created a titan table in HBase, either disable and drop the titan table, or point your graph configuration at a new storage.hbase.table.
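Concretely, a sketch of the original snippet adjusted for a fresh graph (the new HBase table name is just an example) would be:

// Open a brand-new graph (new HBase table) so the GLOBAL_OFFLINE
// index.search.backend setting takes effect on first initialization.
TitanFactory.Builder config = TitanFactory.build();
config.set("storage.backend", "hbase");
config.set("storage.hostname", "node1");
config.set("storage.hbase.table", "titan_lucene"); // new, previously unused table
config.set("index.search.backend", "lucene");
config.set("index.search.directory", "/tmp/foo");
TitanGraph graph = config.open();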

How does DSpace process a query in JSPUI?

How is a query processed in DSpace, and how is the data managed between the front end and PostgreSQL?
Like every other webapp running in a Servlet Container like Tomcat, the file WEB-INF/web.xml controls how a query is processed. In case of DSpace's JSPUI you'll find this file in [dspace-install]/webapps/jspui/WEB-INF/web.xml. The JSPUI defines several filters, listeners and servlets to process a request.
The filters are used to report that the JSPUI is running, to ensure that restricted areas can be seen by authenticated users or by authenticated administrators only, and to handle content negotiation.
The listeners ensure that DSpace has started correctly. During its start, DSpace loads the configuration, opens the database connections that it uses in a connection pool, lets Spring do its IoC magic, and so on.
For a start, the most important parts for seeing how a query is processed are the servlets and the servlet-mappings. A servlet-mapping defines which servlet is used to process a request with a specific request path: e.g. all requests to example.com/dspace-jspui/handle/* will be processed by org.dspace.app.webui.servlet.HandleServlet, while all requests to example.com/dspace-jspui/submit will be processed by org.dspace.app.webui.servlet.SubmissionController.
The servlets use their Java code ;-) and the DSpace Java API to process the request. You'll find most of it in the dspace-api module (see [dspace-source]/dspace-api/src/main/java/...) and some smaller parts in the dspace-services module ([dspace-source]/dspace-services/src/main/java/...). Within the DSpace Java API there are two important classes if you're interested in the communication with the database:
One is org.dspace.core.Context. The context contains information about whether and which user is logged in, an initialized and connected database connection (if all went well) and a cache. The methods Context.abort(), Context.commit() and Context.complete() are used to manage the database transaction. That is the reason why almost all methods manipulating the database request a Context as a method parameter: it controls the database connection and the database transaction.
The other one is org.dspace.storage.rdbms.DatabaseManager. The DatabaseManager is used to handle database queries, updates, deletes and so on. All DSpaceObjects contain a TableRow object which holds the information about the object stored in the database. Inside the DSpaceObject classes (e.g. org.dspace.content.Item, org.dspace.content.Collection, ...) the TableRow may be manipulated and the changes stored back to the database by using DatabaseManager.update(Context, DSpaceObject). The DatabaseManager provides several methods to send SQL queries to the database and to update, delete, insert or even create data in the database. Just take a look at its API, or look for "SELECT" in the DSpace source to get an example.
In the JSPUI it is important to use Context.commit() if you want to commit the database state. If a request is processed and Context.commit() was not called, the transaction will be aborted and the changes get lost. If you call Context.complete(), the transaction will be committed, the database connection will be freed, and the context is marked as finished. After you call Context.complete(), the context cannot be used for a database connection anymore.
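To make the Context lifecycle concrete, a small sketch against the pre-DSpace-6 API (the item id and metadata values are placeholders) might look like this:

import org.dspace.content.Item;
import org.dspace.core.Context;

public class UpdateItemExample {

    public void renameItem(int itemId) throws Exception {
        Context context = new Context();
        try {
            Item item = Item.find(context, itemId);
            item.addMetadata("dc", "title", null, "en", "A new title");
            item.update();       // writes the changed TableRow back via DatabaseManager
            context.complete();  // commits the transaction and frees the connection
        } catch (Exception e) {
            context.abort();     // rolls the transaction back, changes are lost
            throw e;
        }
    }
}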
DSpace is quite a huge project and a lot more could be written about its ORM, the initialization of the database and so on, but this should already help you start developing for DSpace. I would recommend reading the "Architecture" part of the DSpace manual: https://wiki.duraspace.org/display/DSDOC5x/Architecture
If you have more specific questions, you are always invited to ask them here on Stack Overflow or on our mailing lists (http://sourceforge.net/p/dspace/mailman/): dspace-tech (for any question about DSpace) and dspace-devel (for questions regarding the development of DSpace).
It depends on the version of DSpace you are running, along with your configuration.
In DSpace 4.0 or above, by default, the DSpace JSPUI uses Apache Solr for all searching and browsing. DSpace performs all indexing and querying of Solr via its Discovery module. The Discovery (Solr-based) search/indexing classes are available under the "org.dspace.discovery" package.
In earlier versions of DSpace (3.x or below), by default, the DSpace JSPUI uses Apache Lucene directly. In these older versions, DSpace called Lucene directly for all indexing and searching. The Lucene-based search/indexing classes are available under the "org.dspace.search" package.
In both situations, queries are passed directly to either Solr or Lucene (again depending on the version of DSpace). The results are parsed and displayed within the DSpace UI.