Crawler4j with MongoDB

I was researching crawler4j. I found that it uses BerkeleyDB as its database. I am developing a Grails app using MongoDB and was wondering how flexibly crawler4j will work within my application. I basically want to store the crawled information in the MongoDB database. Is it possible to configure crawler4j in such a way that it uses MongoDB as the default datastore rather than BerkeleyDB? Any suggestions would be helpful. Thanks

There is no configurable DAO layer, but you can manipulate it.
There are 3 DAO classes. The Counters class saves the total 'Scheduled' and 'Processed' page counts (this is just for statistics). The DocIDServer class holds URL-ID pairs for resolving new URLs. The Frontier class holds the queue of pages to crawl. Just keep the method logic and transaction blocks intact.
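For illustration, a MongoDB-backed stand-in for the Counters logic might look roughly like this (a sketch using the MongoDB Java driver; the class, database, and collection names are assumptions, not crawler4j's actual API):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

// Hypothetical MongoDB-backed replacement for the Counters DAO:
// keeps per-name counters ("Scheduled", "Processed") in one collection.
public class MongoCounters {

    private final MongoCollection<Document> counters;

    public MongoCounters(String connectionString) {
        this.counters = MongoClients.create(connectionString)
                .getDatabase("crawler")      // assumed database name
                .getCollection("counters");  // assumed collection name
    }

    // Atomically increment a named counter, creating it on first use.
    public void increment(String name, long amount) {
        counters.updateOne(
                Filters.eq("_id", name),
                Updates.inc("value", amount),
                new UpdateOptions().upsert(true));
    }

    public long getValue(String name) {
        Document doc = counters.find(Filters.eq("_id", name)).first();
        return doc == null ? 0L : doc.getLong("value");
    }
}

The DocIDServer and Frontier replacements would follow the same pattern, preserving the original method semantics (and transactional behaviour) on top of MongoDB operations.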

Issue in MongoDB document search

I am new to MongoDB, and I have the following issue in a web application we are currently developing.
We have an application where we use MongoDB to store data.
And we have an API where we search for documents via text search.
As an example: if the user types "New York", the request should return all the data available in the collection for the keyword "New York". (Here we call the API for each letter typed.) We have nearly 200,000 documents in the DB. For some keywords, a search returns nearly 4,000 documents. We tried limiting the results to 5, so it returns only the top 5 documents and not the other available data. And we tried without limiting the data; it then returns hundreds or thousands of documents, as I mentioned, and that causes the request to slow down.
On the frontend we bind the search results to a dropdown. (Next.js)
My question:
Is there a way to optimize the document search?
Are there any suggestions for a suitable way to implement this requirement using MongoDB and .NET 5?
Or any other implementation approaches for this requirement?
The following code segment shows the query that retrieves the data for the incoming keyword.
var hotels = await _hotelsCollection
    .Find(Builders<HotelDocument>.Filter.Text(keyword))
    .Project<HotelDocument>(hotelFields)
    .ToListAsync();

var terminals = await _terminalsCollection
    .Find(Builders<TerminalDocument>.Filter.Text(keyword))
    .Project<TerminalDocument>(terminalFields)
    .ToListAsync();

var destinations = await _destinationsCollection
    .Find(Builders<DestinationDocument>.Filter.Text(keyword))
    .Project<DestinationDocument>(destinationFields)
    .ToListAsync();
So this is a classic "autocomplete" feature, and there are some known best practices you should follow:
On the client side you should use a debounce; this is a must. There is no reason to send a request for each letter typed. This is the most critical part of an autocomplete feature.
On the backend things can get a bit more complicated. Naturally you want to use a database suited to this task; specifically, MongoDB has a service called Atlas Search, which is a Lucene-based text search engine.
That will get you autocomplete support out of the box. However, if you don't want to make big changes to your infra, here are some suggestions (see the sketch after this list):
1. Make sure the field you're searching on is indexed.
2. You're executing 3 separate requests; consider using something like Task.WhenAll to execute all of them at once instead of one by one. I am not sure how the client side is built, but if all 3 entities are shown in the same list, then ideally you merge the labels into one collection so you can paginate the search properly.
3. As mentioned in #2, you must add server-side pagination; no search engine can exist without one. I can't give specifics on how you should implement it, as you have 3 separate entities and this could potentially make the pagination implementation harder; I'd consider whether or not you need all 3 of these in the same API route.
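A rough sketch of suggestions 2 and 3, using the same collections as in the question (the pageIndex/pageSize parameters are assumptions for illustration):

// Start all three queries, each with server-side pagination...
var hotelsTask = _hotelsCollection
    .Find(Builders<HotelDocument>.Filter.Text(keyword))
    .Project<HotelDocument>(hotelFields)
    .Skip(pageIndex * pageSize)
    .Limit(pageSize)
    .ToListAsync();

var terminalsTask = _terminalsCollection
    .Find(Builders<TerminalDocument>.Filter.Text(keyword))
    .Project<TerminalDocument>(terminalFields)
    .Skip(pageIndex * pageSize)
    .Limit(pageSize)
    .ToListAsync();

var destinationsTask = _destinationsCollection
    .Find(Builders<DestinationDocument>.Filter.Text(keyword))
    .Project<DestinationDocument>(destinationFields)
    .Skip(pageIndex * pageSize)
    .Limit(pageSize)
    .ToListAsync();

// ...then await them together instead of one by one.
await Task.WhenAll(hotelsTask, terminalsTask, destinationsTask);

var hotels = await hotelsTask;
var terminals = await terminalsTask;
var destinations = await destinationsTask;

Skip/Limit here is the simplest form of pagination; with a merged result collection you could paginate the combined list instead of paginating each entity separately.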

Meteor - using synchronised non-persistent / in-memory MongoDB on the server

In a Meteor app, real-time reactive updates between all connected clients are achieved by writing to collections and publishing and subscribing to the right data. In the normal case this also means database writes.
But what if I would like to sync particular data which does not need to be persistent, and I would like to save the overhead of writing to the database? Is it possible to use Minimongo or other in-memory caching on the server while still preserving DDP synchronisation to all clients?
Example
In my app I have multiple collapsed threads and I want to show which users have currently expanded a particular thread:
Viewed by: Mike, Johny, Steven ...
I can store the information in the threads collection or make a separate viewers collection and publish the information to the clients. But there is actually no point in making this information persistent and having the overhead of database writes.
I am confused by the collections documentation, which states:
OPTIONS
connection Object
The server connection that will manage this collection. Uses the default connection if not specified. Pass the return value of calling DDP.connect to specify a different server. Pass null to specify no connection.
and
... when you pass a name, here’s what happens:
...
On the client (and on the server if you specify a connection), a Minimongo instance is created.
But if I create a new collection and pass the options object with connection: null,
// Creates a new Mongo collection and exports it
export const Presentations = new Mongo.Collection('presentations', { connection: null });

/**
 * Publications
 */
if (Meteor.isServer) {
  // This code only runs on the server
  Meteor.publish(PRESENTATION_BY_MAP_ID, (mapId) => {
    check(mapId, nonEmptyString);
    return Presentations.find({ matchingMapId: mapId });
  });
}
no data is being published to the clients.
TLDR: it's not possible.
There is no magic in Meteor that allows data to be synced between clients without the data transiting through the MongoDB database. The whole sync process through publications and subscriptions is triggered by MongoDB writes. Hence, if you don't write to the database, you cannot sync data between clients (using the native pub/sub system available in Meteor).
After countless hours of trying everything possible I found a way to what I wanted:
export const Presentations = new Mongo.Collection('presentations', Meteor.isServer ? {connection: null} : {});
I checked MongoDB and no presentations collection is being created. Also, on every server restart the collection is empty. There is a small downside on the client: even when collectionHandle.ready() is truthy, findOne() first returns undefined, and the data is synced afterwards.
I don't know if this is the right/preferable way, but it was the only one that worked for me so far. I tried to leave {connection: null} in the client code as well, but wasn't able to achieve any sync, even though I implemented the added/changed/removed methods.
Sadly, I wasn't able to get any further help, even in the Meteor forum here and here.
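For reference, the low-level publish API mentioned above (this.added/changed/removed) can push purely in-memory data over DDP without touching the database; a minimal server-side sketch for the thread-viewers example (the collection name, publication name, and data structure are illustrative):

// Server only: an in-memory map of threadId -> Set of viewer names.
const viewersByThread = new Map();

Meteor.publish('threadViewers', function (threadId) {
  check(threadId, String);
  const viewers = viewersByThread.get(threadId) || new Set();

  // Send the initial document to this client; no MongoDB involved.
  this.added('threadViewers', threadId, { viewers: Array.from(viewers) });
  this.ready();

  // Whenever the in-memory data is mutated, call
  // this.changed('threadViewers', threadId, { viewers: [...] })
  // (or this.removed) to push the update over DDP.
});

On the client, a regular new Mongo.Collection('threadViewers') is then populated by the subscription as usual.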

Strapi: Initialize / populate database

When I deploy Strapi to a new server, I want to create and populate the database tables (PostgreSQL), particularly categories. How do I access production config, and create tables and category entries?
A hint on how to approach this would be much appreciated!
I know this is an old question, but I recently came upon the same issue.
Basically you should create the collections first, which results in the creation of models. Of course, you could also create the models manually.
In the recent documentation you will find a section about a bootstrap function.
docs bootstrap
The function is called at the start of the server.
The docs list the following use cases:
Here are some use cases:
Create an admin user if there isn't one.
Fill the database with some necessary data.
Load some environment variables.
The bootstrap function can be synchronous or asynchronous.
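For example, a minimal bootstrap sketch that seeds categories (written against Strapi v4's entity service; the api::category.category UID and the name field are assumptions about your content types):

// ./src/index.js – bootstrap() runs once when the server starts.
module.exports = {
  async bootstrap({ strapi }) {
    // Seed default categories only if none exist yet.
    const count = await strapi.entityService.count('api::category.category');
    if (count === 0) {
      for (const name of ['News', 'Tutorials', 'Releases']) {
        await strapi.entityService.create('api::category.category', {
          data: { name },
        });
      }
    }
  },
};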
A great example can be found in the plugin strapi-plugin-users-permissions.
You can implement a new service or overwrite a function of an existing plugin.
The function initialize is implemented here: async initialize
and called in the bootstrap function here:
await ...initialize()
The initialize function is used to populate the database with the two roles
Authenticated and Public.
Hope that helps whoever stumbles upon this question.

Storing custom temporary data in Sitecore xDB

I am using Sitecore 8.1 with xDB enabled (MongoDB). I would like to store the user-roles of the visiting users in the xDB, so I can aggregate on these data in my reports. These roles can change over time, so one user could have one set of roles at some point in time and another set of roles at a later time.
I could go and store these user roles as custom facets on the Contact entity, but as they may change for a user from visit to visit, I will lose historical data if I update the data in the facet every time the user logs in (e.g. I will not be able to tell which roles a given user had at some given visit).
Instead, I could create a custom IElement for my facet data, and store the roles along with a timestamp saying when the given roles were registered, but this model may be hard to handle during the reporting phase, where I would need to connect the interaction data with the role-data based on timestamps every time I generate a report.
Is it possible to store these custom data in the xDB in something else than the Contact collection? Can I store custom data in the Interactions collection? There is a property called Tracker.Current.Session.Interaction.CustomValues which sounds like what I need, but if I store data here, will I be able to perform proper aggregation/reporting on the data? Any other approaches I haven't thought about?
CustomValues
Yes, the CustomValues dictionary is what I would use in your case. This dictionary will get serialized to MongoDB as a nested document of every interaction (unless the dictionary is empty).
Also note that, since CustomValues is a member of the base class Sitecore.Analytics.Model.Entity, this dictionary is available in many other data classes of xDB. For example, you can store custom values in PageData and PageEventData objects.
Since CustomValues takes an object of any class, your custom data class needs some extra things for it to be successfully saved to and subsequently loaded from MongoDB:
It has to be marked as [Serializable].
It needs to be registered in the MongoDB driver like this:
using Sitecore.Analytics.Data.DataAccess.MongoDb;
// [...]
MongoDbObjectMapper.Instance.RegisterModelExtension<YourCustomClassName>();
This needs to be done only once per application lifetime - for example, in an initialize pipeline processor.
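Putting those requirements together, a minimal sketch (the VisitRoles class, the dictionary key, and the role lookup are illustrative, not a Sitecore API):

using System;
using Sitecore.Analytics;

// Illustrative custom data class; must be [Serializable] to round-trip
// through MongoDB, as noted above.
[Serializable]
public class VisitRoles
{
    public string[] Roles { get; set; }
    public DateTime RecordedAt { get; set; }
}

// At login time (for example), attach the current roles to the interaction:
Tracker.Current.Session.Interaction.CustomValues["VisitRoles"] = new VisitRoles
{
    Roles = System.Web.Security.Roles.GetRolesForUser(),
    RecordedAt = DateTime.UtcNow
};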
Your own storage
Of course, you don't have to use Sitecore's API to store your custom data. The alternative would be to manually save data to a custom MongoDB collection or an SQL table. You can then read that data in your aggregation processor, finding it by the ID of the currently processed interaction.
The benefit of this approach is that you can decide where and how your data is stored. The downside is extra work of implementing and maintaining this data storage.

Building REST web services with nested collections + collections inside entries

I would like to build a REST webservice that would provide :
nested collections,
collections inside entries.
Nesting collections would be used to refine a concept from general to particular, for example :
/vehicles/road_vehicles/cars/AB-123-CD
The idea is to limit the number of concepts appearing at the root of the webservice.
Collections inside entries would be used to access parts of the entries, for example :
/cars/AB-123-CD/engine/spark_plugs/1
could be a good URI for the first spark plug of the car whose id is "AB-123-CD". A nested collection makes sense whenever the deletion of the "container" means the deletion of all its parts.
DELETE /cars/AB-123-CD
would obviously delete :
/cars/AB-123-CD/engine/spark_plugs/1
and all other parts of the car (think of the car as being sent to scrap by the DELETE).
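As an illustration of how this nesting can be wired up, here is a sketch using JAX-RS sub-resource locators (the framework choice and all class/method names are assumptions, since no stack is specified):

import javax.ws.rs.DELETE;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;

@Path("/vehicles/road_vehicles/cars")
public class CarsResource {

    // DELETE /vehicles/road_vehicles/cars/AB-123-CD
    // "Scrapping" the car is where the cascade to all its parts would happen.
    @DELETE
    @Path("{carId}")
    public void deleteCar(@PathParam("carId") String carId) {
        // carRepository.deleteWithParts(carId); // hypothetical cascading delete
    }

    // Sub-resource locator (no HTTP method annotation): requests under
    // {carId}/engine/spark_plugs are delegated to a resource scoped to one car.
    @Path("{carId}/engine/spark_plugs")
    public SparkPlugsResource sparkPlugs(@PathParam("carId") String carId) {
        return new SparkPlugsResource(carId);
    }

    public static class SparkPlugsResource {
        private final String carId;

        public SparkPlugsResource(String carId) {
            this.carId = carId;
        }

        // GET /vehicles/road_vehicles/cars/AB-123-CD/engine/spark_plugs/1
        @GET
        @Path("{index}")
        public String get(@PathParam("index") int index) {
            return "Spark plug " + index + " of car " + carId; // placeholder payload
        }
    }
}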
Question: while this kind of "clean URI" is quite a common need, is there any software to simplify the building of such a webservice?
It seems that the Atom Publishing Protocol (AtomPub) could have been a good candidate, since its vision of webservices is very close to what I want, but it doesn't seem to support nested collections.
The article The future of API design: The orchestration layer indicates that query-based APIs put the power in the hands of the requesting developer, although that power is limited. Using AtomPub along with query parameters is available to you in AtomPub build tools. I found some examples searching for "APIs Query-based" but could not find an actual definition:
https://developers.google.com/google-apps/contacts/v3/#retrieving_contacts_using_query_parameters
http://resources.arcgis.com/en/help/arcgis-rest-api/index.html#/Query_Related_Records_Map_Service_Dynamic_Layer/02r3000000nt000000/
The author goes on to say that experience-based APIs have device-specific wrappers, but they are designed, implemented, and owned by the device teams. Maybe the way to approach your issue is to create different AtomPub endpoints which present different views of the information?