How are searches implemented in a Flat File CMS?

Flat file CMSs don't use databases, so how are searches implemented? Is searching more or less computationally expensive with this type of setup than with a database-powered search?

The problem with a static site and search together is that one is by definition static, while the other is highly dynamic, so out of the box there is no simple way to make the two live happily together.
Flat file CMSs aren't static websites. While parsing files is (usually) more costly than querying a database, search functionality can easily be provided by the underlying CMS. Look for plugins that provide what you want.
However, there are some non-trivial solutions that can achieve what you want, depending on your infrastructure, your data volume, and whether your site can perform server-side computation (Grav can; Gatsby and Hugo can't).
The simplest approach is to create an index of all your content in a special file, then load that file and do the search client side. You can even use a ready-made package to speed up development (for example: https://www.npmjs.com/package/react-fuzzy-search ).
The pro is that it's quite trivial to do. The cons are that the index gets quite big for a large site and all the search work is done client side (so the user may wait a long time if the index is large enough). This solution will also NOT scale well.
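As a rough illustration, here is a minimal sketch of that approach in TypeScript. It assumes the site's build step emits a /search-index.json file with title, url and text fields (a hypothetical schema); a fuzzy-search package like the one linked above would replace the naive substring match with ranked, typo-tolerant results.

```typescript
// Minimal client-side search sketch, assuming the build step emits an
// index file of the form [{ title, url, text }, ...] (hypothetical schema).

interface IndexEntry {
  title: string;
  url: string;
  text: string;
}

let index: IndexEntry[] = [];

// Load the prebuilt index once, up front (or lazily on first search).
async function loadIndex(): Promise<void> {
  const res = await fetch("/search-index.json");
  index = await res.json();
}

// Naive substring match; a fuzzy-search library would rank and
// typo-tolerate these results instead.
function search(query: string): IndexEntry[] {
  const q = query.toLowerCase();
  return index.filter(
    (e) => e.title.toLowerCase().includes(q) || e.text.toLowerCase().includes(q)
  );
}

loadIndex().then(() => console.log(search("flat file cms")));
```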
Another way to do it is to use a search service (as SaaS or on your own premises) to externalize the search functionality. Basically, this service runs on a server, offers a way to index your content (via an API) and a way to search it (via an API). Just make sure the search API is public so you can query it in real time from the client side.
This solution scales really well because these sorts of services are built from the ground up to scale! However, the setup cost is high, and it's not worth it if you don't plan to scale to millions of pages.
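For illustration, a client-side query against such a service might look like the sketch below. The endpoint, request body and key are hypothetical placeholders for whatever hosted or self-hosted engine you pick; the one real constraint is that the key shipped to the browser must be search-only.

```typescript
// Hypothetical search-service client: endpoint, body shape and key are
// placeholders, not any specific vendor's API.

const SEARCH_ENDPOINT = "https://search.example.com/indexes/site/search"; // placeholder
const PUBLIC_SEARCH_KEY = "public-search-only-key";                       // placeholder

interface SearchHit {
  title: string;
  url: string;
}

async function remoteSearch(query: string): Promise<SearchHit[]> {
  const res = await fetch(SEARCH_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // This key ships to the browser, so it must only grant search rights.
      "Authorization": `Bearer ${PUBLIC_SEARCH_KEY}`,
    },
    body: JSON.stringify({ q: query, limit: 10 }),
  });
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return (await res.json()).hits;
}
```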

Related

Outsourcing web content versus maintaining local content

I am developing a full web application...
I am considering using prismic.io to outsource some web content, which I will query through GraphQL, but I would store personal information about users in a local instance of MongoDB.
What's the long-term benefit, if I can just store all of the content myself in a MongoDB instance that holds it all for me?
This is mostly my opinion: if you're a developer working alone or just with other developers, and you are only looking for a place to store data, then you're probably better off not using a CMS. One of a CMS's main purposes is to extend the ability to significantly modify an application to non-technical individuals.
For example, say you build a website for a local restaurant and want to allow the owners to change their menu without you having to build out a UI to enable it. With a CMS they'd be able to easily change the text and other content on their platform, whereas interacting with a Mongo backend might be a bit less straightforward for them. For a more industrial example, say you have a marketing team who need to run A/B tests to determine the optimal content for a site: they can perform their tests and have their changes reflected in a template you set up, without them (and, if you set it up cleverly, you) having to write any extra code.
There are more advantages and disadvantages to using a CMS, but I think accessibility is the main reason to consider one, especially long-term.
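To make the trade-off concrete, fetching content from a headless CMS over GraphQL tends to look like the hedged sketch below. The endpoint and query shape are hypothetical placeholders rather than prismic.io's actual schema; the key point is that content edited by non-technical users shows up here without a redeploy.

```typescript
// Hedged sketch of pulling page content from a headless CMS over GraphQL.
// Endpoint and query fields are made-up placeholders.

const CMS_GRAPHQL_ENDPOINT = "https://cms.example.com/graphql"; // placeholder

async function fetchMenuPage(): Promise<unknown> {
  const query = `
    query {
      page(uid: "menu") {
        title
        body
      }
    }`;
  const res = await fetch(CMS_GRAPHQL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  // Whatever the restaurant owner last published comes back here.
  return (await res.json()).data;
}

fetchMenuPage().then(console.log);
```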

Creating a database of many products

I am currently creating an inventory app for iPhone, using Parse, for companies to keep track of all of their tools, supplies, and inventory. When a user/company adds a new item to their database, I'd like to give them the option to search a pre-made database of items. For example, a construction company adding a simple Dewalt drill battery to their inventory would search the pre-made database for "Dewalt #DC9096 18V XRP 2.4A Battery", and an office would search for pencils by brand/serial number/name. I am looking for a simple way to make a database, or even a table, containing multiple brands' products, including their prices, product specifications, website for ordering more, company website, warranty phone number, etc. I have considered parsing all of the retail websites for information, but I don't know the legalities behind it, and if the websites change I'd need to update code. If there is ANY (easier/better) way to do this, assistance or direction would be great!
Thanks always
I would not go down the route of trying to parse websites; that will be a huge pain in the neck and impossible to maintain unless you have extensive resources (and, as you mention, it probably violates most sites' terms of service anyway). Your best bet would be to hook into existing product databases via an API, such as Google's Search API for Shopping, or maybe Amazon's API. Here's where you can start if you want to use Google:
https://developers.google.com/shopping-search/
Hopefully that gets you going in the right direction.
Edit: Here's a list of a lot more shopping APIs that could be good options:
http://www.programmableweb.com/apis/directory/1?apicat=Shopping
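For a rough sense of what that integration looks like, querying a product-data API usually boils down to something like the sketch below. The endpoint, parameters and response fields are hypothetical placeholders rather than Google's or Amazon's real API, so check each provider's docs and terms before relying on them.

```typescript
// Illustration only: a generic product-catalog lookup over HTTP.
// Endpoint, query parameter and response shape are made-up placeholders.

interface Product {
  name: string;
  brand: string;
  price: number;
  productUrl: string;
}

async function lookupProducts(query: string, apiKey: string): Promise<Product[]> {
  const url = `https://products.example.com/v1/search?q=${encodeURIComponent(query)}&key=${apiKey}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`product lookup failed: ${res.status}`);
  return (await res.json()).items;
}

// Hypothetical usage: prefill the "add item" form from the catalog.
lookupProducts("Dewalt DC9096 18V XRP battery", "YOUR_API_KEY").then(console.log);
```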
If you did find yourself needing to parse many different vendor websites (we'd call this "screen scraping") and you have the legal right to do so, you should use a tool like SelectorGadget to get your XPaths; it's much faster, easier, and less error-prone than doing it by hand.
If you're doing more than a couple websites, though, you'll probably find that you'll have to update the scraping rules pretty often, it definitely won't be a set-and-forget operation.
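If you do go that way, applying an XPath you copied out of SelectorGadget is straightforward from browser devtools, as in the sketch below. The expression is a made-up placeholder; every site needs its own, and the legal caveats above still apply.

```typescript
// Illustration only: run an XPath expression against the current page
// (browser/devtools context) and collect the matched text.

function scrapeByXPath(expression: string): string[] {
  const result = document.evaluate(
    expression,
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const texts: string[] = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    texts.push(result.snapshotItem(i)?.textContent?.trim() ?? "");
  }
  return texts;
}

// Hypothetical usage: pull every product name off a listing page.
console.log(scrapeByXPath("//div[@class='product']//h2"));
```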

PDF Storage System with REST API

I have hundreds of thousands of PDFs that are presently stored in the filesystem. I have a custom application that, as an afterthought to its actual purpose, provides access to these PDFs. I would like to take the "storage & retrieval" part out of the custom application and use an OpenSource document storage backend.
Access to the PDF store should be via a REST API, so that users would not need a custom client for basic document browsing and viewing. Programs that store PDFs should also be able to work via the REST API. They would provide the actual binary or ASCII data plus structured metadata, which could later be used for retrieval.
A typical query for retrieval would be "give me all documents that were created between days X and Y with document types A or B".
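For concreteness, that retrieval might be expressed over REST roughly as in the sketch below. The path, parameter names and dates are hypothetical placeholders; the point of the question is to find an existing system that already exposes something along these lines.

```typescript
// Hedged sketch of the desired retrieval call: documents created between
// two dates with one of several document types. All names are placeholders.

async function findDocuments(baseUrl: string): Promise<unknown> {
  const params = new URLSearchParams({
    createdAfter: "2013-01-01",   // day X (example value)
    createdBefore: "2013-03-31",  // day Y (example value)
    type: "A,B",                  // document types A or B
  });
  const res = await fetch(`${baseUrl}/documents?${params}`);
  return res.json();
}

findDocuments("https://pdfstore.example.com/api").then(console.log);
```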
My research into whether such a storage backend exists has come up empty. Do any of you know a system that provides these features? Open source preferred; reasonably priced systems considered.
I am not looking for advice on how to "roll my own" using available technologies. Rather, I'm trying to find out whether that can be avoided. Many thanks in advance.
What you describe sounds like a document management or asset management system, of which there are many, and many work with PDF files. I have some fleeting experience with commercial offerings such as Xinet (http://www.northplains.com/xinet - now acquired, apparently) or Elvis (http://www.elvisdam.com). Both might fit your requirements, but they're probably too big and likely too expensive.
Have you looked at Alfresco? This is an open source alternative I came into contact with years ago while serving on a selection committee. As far as I remember, it definitely goes in the direction of what you are looking for, and it is open source, so it might fit that angle as well: http://www.alfresco.com.

What are the challenges in embedding text search (Lucene/Solr/Hibernate Search) in applications that are hosted at client sites

We have an enterprise Java web app that our customers (external) deploy on their intranets. I am exploring different full-text search options (Lucene/Solr/Hibernate Search), and one common concern is the deployment/administration/tuning overhead this adds.
This is particularly challenging in our case, since we do not host these applications. From what I have seen, most uses of these technologies have been in hosted applications. Our customers typically deploy our application in a clustered environment and do not have any experience with Lucene/Solr.
Does anyone have any experience with this? What challenges have you encountered with this approach? How did you overcome them? At this point I am trying to determine if this is feasible.
Thank you
It is very feasible to deploy applications that use Lucene (or Solr) onto client sites.
Some things to keep in mind:
Administration
* You need a way to version your index, so that if/when you change the document structure in the index, it can be upgraded.
* You therefore need a good way to force a re-index of all existing data. It is probably also a good idea to provide an option that allows an Admin to trigger a re-index.
* You could also provide an Admin option to allow optimize() to be called on your index, or have this scheduled. It is best to test the actual impact this will have first, since it may not be needed depending on the shape of your index.
Deployment
If you are deploying into a clustered environment, the simplest (and fastest in terms of dev speed and runtime speed) solution could be to create the index on each node.
Tuning
* Do you have a reasonable approximation of the dataset you will be indexing? You will need to ensure you understand how your index scales (in both speed and size), since what you consider a reasonable dataset size may not be the same as your clients'. You at least need to be able to let clients know what factors will lead to an overly large index and possibly slower performance.
There are two advantages to embedding Lucene in your app over sending queries to a separate Solr cluster: performance and ease of deployment/installation. Embedding Lucene means running Lucene in the same JVM, so there are no additional server round trips; commits should be batched in a separate thread. Embedding Lucene also just means including a few more JAR files on your classpath, so there is no separate Solr installation.
If your app is cluster-aware, then the embedded Lucene option becomes highly problematic. An update to one node in the cluster needs to be searchable from any node in the cluster, and synchronizing the Lucene index across all nodes yields no better performance than using Solr. With Solr 4, you may find the administration to be less of a barrier to entry for your customers. Check out the literature on the (grossly misnamed) SolrCloud.
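As a point of comparison, querying a standalone Solr node from any part of the cluster is just an HTTP call to its standard /select handler. A hedged sketch (host, port and core name are placeholders):

```typescript
// Query a Solr core over its standard HTTP select handler.
// "solr.internal" and "mycore" are placeholders for your deployment.

async function solrSearch(query: string): Promise<unknown[]> {
  const params = new URLSearchParams({ q: query, rows: "10", wt: "json" });
  const res = await fetch(`http://solr.internal:8983/solr/mycore/select?${params}`);
  if (!res.ok) throw new Error(`Solr query failed: ${res.status}`);
  const body = await res.json();
  return body.response.docs; // standard Solr response envelope
}

solrSearch("title:lucene").then(console.log);
```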

What's the benefit of Connectedness?

What is the benefit of Connectedness as defined by Resource Oriented Architecture (ROA)? The way I understand it, the crux of Connectedness is the ability to crawl the entire application state using only the root URIs.
But how useful is that really?
For example, imagine that HTTP GET http://example.com/users/joe returns a link to http://example.com/users/joe/bookmarks.
Unless you're writing a dumb web crawler (and even then I wonder), you still need to teach the client what each link means at compile-time. That is, the client needs to know that the "bookmarks URI" returns a URI to Bookmark resources, and then pass control over to special Bookmark-handling algorithms. You can't just pass links blindly to some general client method. Since you need this logic anyway:
What's the difference between the client figuring out the URI at runtime versus providing it at compile-time (making http://example.com/users/bookmarks a root URI)?
Why is linking using http://example.com/users/joe/bookmarks/2 preferred to id="2"?
The only benefit I can think of is the ability to change the path of non-root URIs over time, but this breaks cached links so it's not really desirable anyway. What am I missing?
You are right that changing URIs is not desirable, but it does happen, and using complete URIs instead of constructing them makes such a change easier to deal with.
One other benefit is that your client application can easily retrieve resources from multiple hosts. If you allowed your client to build the URIs, the client would need to know on which host certain resources reside. This is not a big deal when all of the resources live on a single host, but it becomes trickier when you are aggregating data from multiple hosts.
My last thought is that maybe you are oversimplifying the notion of connectedness by looking at it as a static network of links. Sure, the client needs to know about the possible existence of certain links within a resource, but it does not necessarily need to know exactly what the consequences of following that link are.
Let me try and give an example: a user is placing an order for some items and is ready to submit their cart. The submit link may actually go to two different places depending on whether the order will be delivered locally or internationally, or maybe orders over a certain value need to go through an extra step. The client just knows that it has to follow the submit link, but it does not have compiled-in knowledge of where to go next. Sure, you could build a common "next step" type of resource so the client could have this knowledge explicitly, but by having the server deliver the link dynamically you introduce a lot less client-server coupling.
I think of the links in resources as placeholders for what the user could choose to do. Who will do the work and how it will be done is determined by what URI the server attaches to that link.
It's easier to extend, and you could write small apps and scripts to work along with the core application fairly easily.
Added: Well, the whole point starts with the idea that you don't specify at compile time how to convert URIs to UIDs in a hardcoded fashion; instead you might use dictionaries or parsing to do that, giving you a much more flexible system.
Then, if someone later decides to change the URI syntax, that person could write a small script that translates URIs without ever touching your core application. Another benefit is that if your URIs are logical, other users, even within a corporate scenario, can easily write mash-ups that make use of your system without touching your original app or even recompiling it.
Of course, the counter side to the whole argument is that it will take you longer to implement a URI-based system than a simple UID-based system. But if your app will be used by others on a regular basis, that initial time investment will pay back greatly (it could be said to have a good extensibility-based ROI).
Added: Another point, which is a matter of taste to some degree, is that the URI itself will be a better name, because it conveys a logical and defined meaning.
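As a tiny illustration of that "small translation script" idea, a single parsing rule like the sketch below (using the hypothetical bookmark URI pattern from the question) is the only place that would need to change if the URI syntax changed:

```typescript
// Map a bookmark URI onto an internal ID with one parsing rule.
// The /users/{name}/bookmarks/{id} pattern is the hypothetical one
// from the question, not a fixed convention.

function bookmarkIdFromUri(uri: string): string | null {
  const match = new URL(uri).pathname.match(/^\/users\/[^/]+\/bookmarks\/([^/]+)$/);
  return match ? match[1] : null;
}

console.log(bookmarkIdFromUri("http://example.com/users/joe/bookmarks/2")); // "2"
```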
I'll add my own answer:
It is far easier to follow server-provided URIs than construct them yourself. This is especially true as resource relationships become too complex to be expressed in simple rules. It's easier to code the logic once in the server than re-implement it in numerous clients.
The relationship between resources may change even if individual resource URIs remain unchanged. For example, imagine Google Maps indexes their map tiles from 0 to 100, counting from the top-left to the bottom-right of the screen. If Google Maps were to change the scale of their tiles, clients that calculate relative tile indexes would break.
Custom IDs identify a resource. URIs go a step further by identifying how to retrieve the resource representation. This simplifies the logic of read-only clients such as web-crawlers or clients that download opaque resources such as video or audio files.
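To tie these answers together, a connected client looks roughly like the sketch below: it never constructs the bookmarks URI itself, it just follows whatever link the user representation provides. The links.bookmarks field name is a hypothetical convention, not something ROA mandates.

```typescript
// Minimal sketch of following server-provided links instead of building URIs.

interface UserResource {
  name: string;
  links: { bookmarks: string }; // hypothetical link field returned by the server
}

async function fetchBookmarks(userUri: string): Promise<unknown> {
  const user: UserResource = await (await fetch(userUri)).json();
  // Follow the server-provided link; if the server moves bookmarks to
  // another path or even another host, this code keeps working.
  return (await fetch(user.links.bookmarks)).json();
}

fetchBookmarks("http://example.com/users/joe").then(console.log);
```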