I am exploring Apache OpenNLP in my project, and one of the requirements is to persist the trained model in a DB - MongoDB or Couchbase in my case.
Right now I am primarily looking to store the document categorizer model output in the DB so that I do not have to rerun it unless it is modified.
I see that the library classes (e.g. DocumentCategorizerME) are not serializable, and I get a JSON deserialization exception when I try to retrieve the persisted records, so I want to know if someone is already doing this.
In general, what would be the approach to persisting models, even if I wanted to use another open-source NLP product?
One approach that can be followed is to use DoccatModel.serialize to serialize the model and store it in MongoDB GridFS (a rough sketch follows below).
Note that Couchbase has a hard limit of 20 MB on the size of a stored binary value.
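A minimal sketch of that GridFS approach, assuming the current MongoDB Java driver's GridFSBucket API and an already trained DoccatModel (the bucket and file names are placeholders):

```java
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import opennlp.tools.doccat.DoccatModel;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DoccatModelStore {

    private final GridFSBucket bucket;

    public DoccatModelStore(MongoDatabase db) {
        // "doccat_models" is an arbitrary bucket name for this sketch
        this.bucket = GridFSBuckets.create(db, "doccat_models");
    }

    /** Serialize the trained model in OpenNLP's own binary format and store it as a GridFS file. */
    public void save(String name, DoccatModel model) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        model.serialize(out);
        bucket.uploadFromStream(name, new ByteArrayInputStream(out.toByteArray()));
    }

    /** Load the model back from GridFS; no JSON (de)serialization involved. */
    public DoccatModel load(String name) throws IOException {
        try (InputStream in = bucket.openDownloadStream(name)) {
            return new DoccatModel(in);
        }
    }
}
```

Rebuilding the categorizer from the loaded model (new DocumentCategorizerME(model)) sidesteps serializing DocumentCategorizerME itself, which is what was failing with JSON.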
Related
I need to keep data in Elasticsearch in sync with the data I have and maintain in MongoDB.
Currently I have a batch job that finds all the changed data and updates it in Elasticsearch using Spring Batch and Spring Data Elasticsearch.
This works, but I'm looking for a solution where every change is directly mirrored in Elasticsearch.
Give this a go: mongo connector.
Have a read through this: 5 ways to sync data.
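If you are on MongoDB 3.6+ and would rather hand-roll the mirroring, change streams let you react to every write as it happens. A rough sketch with the MongoDB Java driver and the Elasticsearch RestHighLevelClient (database, collection and index names are placeholders, and change streams require a replica set):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.apache.http.HttpHost;
import org.bson.Document;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class MongoToElasticMirror {

    public static void main(String[] args) throws Exception {
        MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> products =            // database/collection names are placeholders
                mongo.getDatabase("shop").getCollection("products");

        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // UPDATE_LOOKUP makes the stream deliver the full document for updates as well
            try (MongoCursor<ChangeStreamDocument<Document>> cursor =
                         products.watch().fullDocument(FullDocument.UPDATE_LOOKUP).iterator()) {
                while (cursor.hasNext()) {
                    Document doc = cursor.next().getFullDocument();
                    if (doc == null) {
                        continue;                        // e.g. deletes; handle separately if needed
                    }
                    String id = doc.getObjectId("_id").toHexString();
                    doc.remove("_id");                   // _id is a reserved metadata field in Elasticsearch
                    es.index(new IndexRequest("products")          // index name is a placeholder
                                    .id(id)
                                    .source(doc.toJson(), XContentType.JSON),
                            RequestOptions.DEFAULT);
                }
            }
        }
    }
}
```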
I am experimenting a lot these days, and one of the things I wanted to do is combine two popular NoSQL databases, namely Neo4j and MongoDB, simply because I feel they complement each other perfectly. The first-class citizens in Neo4j, the relations, are in my opinion exactly what's missing in MongoDB, whereas MongoDB allows me to not put large amounts of data in my node properties.
So I am trying to combine the two in a Java application, using the Neo4j Java REST binding and the MongoDB Java driver. All my domain entities have a unique identifier which I store in both databases. The other data is stored in MongoDB and the relations between entities are stored in Neo4j. For instance, both databases contain a user id, MongoDB contains the profile information, and Neo4j contains the friendship relations. With the custom data access layer I have written, this works exactly like I want it to. And it's fast.
BUT... When I want to create a user, I need to create both a node in Neo4j and a document in MongoDB. Not necessarily a problem, except that Neo4j is transactional and MongoDB is not. If both were transactional, I would just roll back both transactions when one of them fails. But since MongoDB isn't transactional, I cannot do this.
How do I ensure that whenever I create a user, either both the node and the document are created, or neither? I don't want to end up with a bunch of documents that have no matching node.
On top of that, not only do I want my combined database interaction to be ACID compliant, I also want it to be thread-safe. Both the GraphDatabaseService and the MongoClient / DB are provided as singletons.
I found something about creating "transaction documents" in MongoDB, but I really don't like that approach. I would like something nice and clean like the Neo4j beginTx, tx.success, tx.failure, tx.finish setup. Ideally, something I can implement in the same try/catch/finally block.
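To make the shape I am after concrete: the best I have managed so far is a compensating delete around the Neo4j transaction. A rough sketch (assuming the GraphDatabaseService API and the legacy MongoDB driver, with made-up collection and property names):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class UserRepository {

    private final GraphDatabaseService graphDb;   // both injected as singletons
    private final DB mongoDb;

    public UserRepository(GraphDatabaseService graphDb, DB mongoDb) {
        this.graphDb = graphDb;
        this.mongoDb = mongoDb;
    }

    public void createUser(String userId, BasicDBObject profile) {
        DBCollection users = mongoDb.getCollection("users");       // placeholder collection name
        BasicDBObject doc = new BasicDBObject(profile).append("uid", userId);
        users.insert(doc);                                          // Mongo write first, no transaction

        Transaction tx = graphDb.beginTx();
        try {
            Node node = graphDb.createNode();
            node.setProperty("uid", userId);                        // graph keeps only the id + relations
            tx.success();
        } catch (RuntimeException e) {
            tx.failure();
            users.remove(new BasicDBObject("uid", userId));         // compensate: undo the Mongo insert
            throw e;
        } finally {
            tx.finish();                                            // tx.close() on newer versions
        }
    }
}
```

The obvious gap is that a crash between the Mongo insert and the compensation still leaves an orphaned document, which is exactly why I am looking for something better.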
Should I perhaps make a switch to CouchDB, which does appear to be transactional?
Edit: After some more research, sparked by a comment, I came to realize that CouchDB is also not suitable for my specific needs. To clarify: the Neo4j part is set in stone; the document-store database is not, as long as it has a Java library.
Pieter-Jan,
If you are able to use Neo4j 2.0, you can implement a Schema Index Provider (which is really easy) that creates your documents transactionally in MongoDB.
Neo4j has made its index providers transactional since the beginning; we did that with Lucene, and there is one for Redis too (it needs to be updated). But it is much easier with Neo4j 2.0; if you want, you can check out my implementation for MapDB (https://github.com/jexp/neo4j-mapdb-index).
Although I'm a huge fan of both technologies, I think a better option for you could be OrientDB. It's a graph (like Neo4j) and document (like MongoDB) database in one and supports ACID transactions. It sounds like a perfect match for your needs.
As posted here https://stackoverflow.com/questions/23465663/what-is-the-best-practice-to-combine-neo4j-and-mongodb?lq=1, you might have a look at Structr.
Its backend can be regarded as a document database around Neo4j. It's fully transactional and open source.
I have some data in MongoDB GridFS. I am using the Spring Data GridFsOperations class to do my GridFS read/writes.
I have a requirement to replace the content of an existing GridFS file i.e. the _id and filename should stay the same, but the file content should be updated.
Spring Data's GridFsOperations primarily allows find, which returns a Mongo GridFSDBFile, and store. GridFSDBFile does not allow updating content. The store method could in theory be used if the file were deleted first and then stored with the same _id as the previous file; however, store does not allow specifying the _id field.
The only solution I have found so far is to use the Mongo API directly to delete the existing file, and store a new one with the same _id. Answers to this effect are not useful: the question is specific to Spring Data MongoDB.
The reason there's no API exposed yet is that there's no support for that in MongoDB GridFS itself. You essentially work around this issue by implementing a pattern like the one described here. But as this boils down to a non-atomic operation, we decided not to expose it as an operation in the first place.
In case you think there's a reliable implementation of this pattern (plus the appropriate handling of error cases), feel free to open a ticket in our JIRA to discuss options.
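For reference, a rough sketch of that non-atomic pattern against the plain driver's GridFSBucket API (newer than the GridFSDBFile era; the bucket name is just the default). Between the delete and the re-upload the file simply does not exist, which is the atomicity problem mentioned above:

```java
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;
import org.bson.BsonObjectId;
import org.bson.types.ObjectId;

import java.io.InputStream;

public class GridFsReplacer {

    private final GridFSBucket bucket;

    public GridFsReplacer(MongoDatabase db) {
        this.bucket = GridFSBuckets.create(db, "fs");   // default bucket name
    }

    /** Non-atomic "replace": drop the old file and chunks, re-upload under the same _id and filename. */
    public void replaceContent(ObjectId id, String filename, InputStream newContent) {
        bucket.delete(id);                              // removes the files entry and its chunks
        bucket.uploadFromStream(new BsonObjectId(id),   // re-use the previous _id
                filename, newContent, new GridFSUploadOptions());
    }
}
```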
I've started a new job where they are using MongoDB in a Java environment.
They have implemented a pattern using DTOs and factories with the Morphia driver; this may be due to an earlier migration onto MongoDB from a key-value store. The client is a JSON client.
It seems to me that jackson-mongo-mapper would be a better approach because it just maps POJOs between JSON and BSON; couldn't it do away with the whole DTO/factory facade?
Anyone know any pros and cons with these different approaches?
Spring Data for MongoDB is very nice, since you can even use another data store or mix them, and the repository interface is very helpful.
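A minimal sketch of that repository style (entity, field and collection names are made up):

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;

import java.util.List;

// Plain POJO mapped straight to a collection; no DTO/factory layer in between.
@Document(collection = "users")
class User {
    @Id
    private String id;
    private String email;
    private String displayName;

    // getters and setters omitted for brevity
}

// Spring Data derives the query from the method name at runtime.
interface UserRepository extends MongoRepository<User, String> {
    List<User> findByDisplayName(String displayName);
}
```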
Kundera is an option through JPA2
http://agilemobiledeveloper.wordpress.com/2013/08/22/working-with-mongodb-using-kundera/
There are a lot of Java-to-MongoDB options.
http://www.agilemobiledeveloper.com/2013/01/31/hibernate-ogm-mongodb-vs-kundera-vs-jongo-vs-mongodb-api-vs-morphia-vs-spring-data-mongo-mongodb-drivers-for-java/
Adding your own data layer and making sure you use DI and test it fully is very helpful.
NoSQLUnit is awesome -> https://github.com/lordofthejars/nosql-unit
DTOs are good for keeping a separation between implementation and design, so when they need or want to switch from Mongo to some other NoSQL or SQL database, it can be done cleanly.
I am working on a project where we have millions of entries stored in a MongoDB database, and I want to index all this data using SOLR.
After extensive searching I came to know that there is no proper "Data Import Handler" for the MongoDB database.
Can anyone tell me what the proper approaches are for indexing data in MongoDB using SOLR?
I want to use all the features of SOLR and want it to be scalable in real time. I saw one or two approaches in different posts, but I am not sure how they will work in real time.
Many Thanks
10gen introduced the MongoDB Connector. You can integrate MongoDB with Solr using this tool.
Blog post : Introducing Mongo Connector
Github page : mongo-connector
I have created a plugin that allows you to load data from MongoDB using the Solr data import handler.
Check it out at:
https://github.com/james75/SolrMongoImporter
I wrote a response to a similar question, except it was about how to import data from MySQL into SOLR. The example code is in PHP, but it should give you a general idea. All you would need to do is set up an iterator to step through your MongoDB assets, extract the data to SOLR datatypes, and then save it to your SOLR index.
If you want it to be real-time, you could add some custom code to the save mechanism (assuming this can be done with MongoDB), and save directly to the SOLR index, then run a commit script to commit data every 15 minutes (via cron).
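A rough Java equivalent of that iterate-and-index loop, assuming a recent SolrJ and MongoDB Java driver (URLs, collection and field names are placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.bson.Document;

public class MongoToSolrIndexer {

    public static void main(String[] args) throws Exception {
        MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> articles =            // database/collection names are placeholders
                mongo.getDatabase("cms").getCollection("articles");

        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build()) {

            // Step through every MongoDB document and map it to Solr fields.
            for (Document doc : articles.find()) {
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.addField("id", doc.getObjectId("_id").toHexString());
                solrDoc.addField("title", doc.getString("title"));   // field names are placeholders
                solrDoc.addField("body", doc.getString("body"));
                solr.add(solrDoc);
            }
            solr.commit();   // or rely on autoCommit / a scheduled commit, as suggested above
        }
    }
}
```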