Best way to generate static version of "dynamic" web site - perl

I have a website that is dynamic in the sense that a lot of data is generated from a database, but the contents of the database changes rarely (about 1-3 times a week). These changes are manual and controlled.
Instead of having the overhead of a dynamic website, I prefer to use static pages. I'm debating which is the best solution:
curl/wget/spider
This question mentions it. The disadvantages I see might be:
manual clean up needed (links, missing images, etc.)
cannot mix static and dynamic pages
proxy
I could use a proxy to cache the static pages for a certain number of days. Disadvantages:
hard to manage the cache of each page
need to clear the cache after each manual change?
Use program to generate static pages
My current choice: I use Perl programs to generate static pages from dynamic content. This doesn't scale very well, as I have to hard-code a lot of HTML, especially the page structure.
Any other ways to do it? What would you/do you prefer?

A memcached-based full-page cache with a long expiry time. A tag extension could allow you to invalidate only a selected range of pages.
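For illustration, a minimal sketch of a full-page cache in Perl using the Cache::Memcached module; the key scheme, server address and one-week expiry are assumptions, not part of the answer above:

    use strict;
    use warnings;
    use Cache::Memcached;

    # Connect to a local memcached instance (address is an assumption).
    my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    # serve_page: return cached HTML if present, otherwise render and cache it.
    # render_page() stands in for whatever currently builds the dynamic page.
    sub serve_page {
        my ($path) = @_;
        my $key  = "page:$path";
        my $html = $memd->get($key);
        unless (defined $html) {
            $html = render_page($path);             # expensive dynamic rendering
            $memd->set($key, $html, 7 * 24 * 3600); # long expiry: one week
        }
        return $html;
    }

    # After a manual content change, invalidate the affected pages explicitly.
    sub invalidate_pages {
        my (@paths) = @_;
        $memd->delete("page:$_") for @paths;
    }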

Any particular reason you want to do it this way instead of just setting up a database caching solution to stop the queries from actually having to hit the database?
Whether it's possible or not depends on the amount of dynamic data on your site and the amount of memory available on your server, but it wouldn't have any of the problems you're worried about.

I would do it the same way you're doing it right now, using a script to generate static pages. You can use a templating system to avoid having to write new HTML every time.
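As a rough illustration of that approach, here is a minimal Perl sketch using the Template Toolkit and DBI; the table, template and output paths are made up for the example:

    use strict;
    use warnings;
    use DBI;
    use Template;

    # Fetch the rarely-changing content once (connection details are placeholders).
    my $dbh   = DBI->connect('dbi:SQLite:dbname=site.db', '', '', { RaiseError => 1 });
    my $pages = $dbh->selectall_arrayref(
        'SELECT slug, title, body FROM pages', { Slice => {} }
    );

    # One shared template defines the page structure, so no HTML is hard-coded here.
    my $tt = Template->new({ INCLUDE_PATH => 'templates' })
        or die Template->error;

    for my $page (@$pages) {
        $tt->process('page.tt', $page, "public/$page->{slug}.html")
            or die $tt->error;
    }

Re-running such a script after each manual database change regenerates the whole site, so it could be wired to a cron job or a post-edit hook.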

You have not mentioned how important it is to show the changed data to your users as soon as possible.
We have used a proxy cache successfully on our website to handle dynamic pages that get lots of hits. Depending on how soon we want the updated data to be seen by customers, we set a different cache age for each category.
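For illustration only, a small Perl sketch of how an application might vary the Cache-Control max-age by content category, so an upstream proxy keeps each category for a different length of time; the category names and ages are invented for the example:

    use strict;
    use warnings;

    # Max-age per content category, in seconds (example values only).
    my %max_age = (
        news     => 15 * 60,        # news pages: 15 minutes
        products => 24 * 3600,      # product pages: 1 day
        static   => 7 * 24 * 3600,  # rarely-changing pages: 1 week
    );

    sub cache_header {
        my ($category) = @_;
        my $age = $max_age{$category} // 300;   # default: 5 minutes
        return "Cache-Control: public, max-age=$age\r\n";
    }

    # CGI-style output: headers, blank line, then the page body would follow.
    print cache_header('news'), "Content-Type: text/html\r\n\r\n";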

Related

Outsourcing web content versus maintaining local content

I am developing a full web application...
I am considering using prismic.io to outsource some web content, which I will query through GraphQL. But I would store personal information about users in a local instance of MongoDB.
What's the long-term benefit, if I can just store all of the content myself in an instance of MongoDB which holds it all for me?
This is mostly my opinion: if you're a developer working alone or just with other developers, and are only looking for a place to store data, then you're probably better off not using a CMS. One of a CMS's main purposes is to extend the ability to significantly modify an application to non-technical people. For example, say you build a website for a local restaurant and want to allow them to change their menu without you having to build a UI for it. With a CMS they'd be able to easily change the text and other content on their platform, whereas interacting with a Mongo back-end might be a bit less straightforward for them.
For a more industrial example, say you have a marketing team who need to run A/B tests to determine the optimal content for a site. They can perform their tests and have their changes reflected in a template you set up, without them (and you, if you set it up cleverly) having to write any extra code.
There are more advantages and disadvantages to using a CMS, but I think accessibility is the main reason to consider one, especially long-term.

How are searches implemented in a Flat File CMS

Flat-file CMSs don't use databases. So how are searches implemented? Is searching more or less computationally expensive with this type of setup compared to a database-powered search?
The problem with a static site and search together is that one is by definition static, while the other is highly dynamic. So out of the box there is no simple way to make the two live happily together.
Flat-file CMSs aren't static websites. While parsing files is usually more costly than querying a database, search functionality can easily be provided by the underlying CMS. Look for plugins that provide what you want.
However, there are some non-trivial solutions that can achieve what you want, depending on your infrastructure, your data volume, and whether your site can perform server-side computations or not (Grav can; Gatsby and Hugo can't).
The simplest way to do it is to create an index of all your content in a dedicated file, then load that and do the search client-side. You can even use a ready-made package to speed up dev time for this option (for example: https://www.npmjs.com/package/react-fuzzy-search ).
The pro is that it's quite trivial to do. The cons are that the index will get quite big for a large site, and all the search is done client-side (so the user may wait a long time if the index is large enough). This solution will also NOT scale well.
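To make that first option concrete, here is a small hedged Perl sketch that walks a directory of Markdown content and writes a JSON index for a client-side search widget to load; the directory layout, URL scheme and field names are assumptions:

    use strict;
    use warnings;
    use File::Find;
    use JSON::PP;

    my @index;

    # Walk content/ and record title, URL and raw text of every Markdown file.
    find(sub {
        return unless /\.md$/;
        open my $fh, '<', $_ or die "$File::Find::name: $!";
        my @lines = <$fh>;
        close $fh;

        my $title = $lines[0] // '';
        chomp $title;
        $title =~ s/^#+\s*//;                    # strip a leading Markdown heading marker

        (my $url = $File::Find::name) =~ s{^content}{};
        $url =~ s/\.md$/.html/;

        push @index, { url => $url, title => $title, text => join('', @lines) };
    }, 'content');

    # The static pages can then fetch /search-index.json and search it in the browser.
    open my $out, '>', 'public/search-index.json' or die $!;
    print {$out} JSON::PP->new->utf8->encode(\@index);
    close $out;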
Another way to do it is to use a search service (as SaaS or on your own premises) to externalize the search functionality. Basically this service runs on a server, has a way to index your content (via an API), and a way to search it (via an API). Just make sure the search API is public and you can query it in real time from the client side.
This solution scales really well, because these sorts of services are built from the ground up to scale! However, the setup costs are really high, and not worth it if you don't plan to scale to millions of pages.

What is the maximum versions a page can have in AEM?

Is there a limit to the number of versions a content item can have in AEM? I want to retain all the versions of my page. As in, unlimited.
I want to know whether AEM has an internal limit after which it automatically removes older versions.
Appreciate any thoughts on this.
Although this is not recommended, you can disable the version manager by setting versionmanager.purgingEnabled to false. You will need to configure this as described in the document below:
https://docs.adobe.com/docs/en/aem/6-3/deploy/configuring/version-purging.html#Version Manager
Retaining lots of versions will gradually slow down your instance and result in poor authoring performance as the storage (Tar or Mongo) will grow large with stale data.
It is normally recommended to retain versions by a fixed number of days or fixed number of version counts.
For performance reasons, it is better to backup your AEM instance for older archived versions and rely on a restore function to access those versions.
I once asked this question of Adobe Day Care and received a similar response to the one in i.net's post: it is possible to disable purging of page versions, but it comes with the risk of authoring performance issues, and pages can start loading very slowly.
The solutions that were suggested (depending on the requirements):
backing up an instance, which is not the best option if you need to be able to retrieve or compare old content at any time, or recover it if needed; the disadvantage is that a full copy of the instance needs to be stored, and the backup needs to be repeated from time to time (whenever you notice performance issues)
designing and implementing a custom solution with an additional instance responsible for storing these versions; I don't have many details on that solution, but as I understood it, it would require deep analysis of how it could be done
if access to previous content is needed only for historical reasons (no need to retrieve it and publish it again), making use of the page-to-PDF extraction mechanism and storing the history in the DAM or another place; you can then also consider saving a PDF screenshot of the whole page with its design (not content only), presenting different browser breakpoints, annotations, etc., depending on requirements

Redis / Memcached ReST caching for an external service

Question here about caching data from calls to an external ReST API.
There is currently a ReST service set up to generate and retrieve some specific types of reports that the UI must consume. However, this service is not meant for high-volume usage or to be exposed to the public, and the reports are fairly static, possibly only changing every 10-20 minutes. The web application resides on a separate server.
What I would like to do, using memcached or Redis, is this: when a request for data comes in from the UI to the web back-end, make a call from the web application back-end to the report server to get the specified report, transform the data into the appropriate format for the UI to consume, cache it with a timestamp, and return it to the UI, so that subsequent requests can be served from memory on the web application's back-end without having to re-request it from the report server. I would also need to check this timestamp and make a new request if the cached report has been held for longer than the specified time. The data that will be cached is fairly minuscule: just some smallish JSON objects with only a handful of values holding the information the UI needs, and there are NOT a ton of these objects. I would not be surprised if they could all easily be stored in memory at once, so the timestamping is the only invalidation that should be necessary.
I have almost zero experience here when it comes to caching / memcached / Redis. Are there advantages to one or the other? Is something like this possible? How would I go about implementing it? Are there other options?
Appreciate the help!
Server-caching these kinds of RESTful query responses is very possible and quite common.
With any server based caching, you should also think hard about whether you really need it, as it does add complexity. It can certainly make a huge improvement, but since your usage volume is low, it might actually be overkill. You may also be able to use HTTP caching protocols to avoid the need for caching on the server. If the data doesn't change very often and you use eTags or modified dates correctly, along with an intermediary proxy like AWS CloudFront, users will rarely experience that delay.
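As a hedged illustration of the ETag idea, a tiny CGI-style Perl sketch that derives an ETag from the response body and answers conditional requests with 304; the hashing choice and header handling are simplified assumptions:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Compare the client's If-None-Match header with an ETag derived from the body.
    sub respond_with_etag {
        my ($body, $if_none_match) = @_;
        my $etag = '"' . md5_hex($body) . '"';
        if (defined $if_none_match && $if_none_match eq $etag) {
            return "Status: 304 Not Modified\r\nETag: $etag\r\n\r\n";
        }
        return "Status: 200 OK\r\nETag: $etag\r\n"
             . "Content-Type: application/json\r\n\r\n$body";
    }

    # Usage in a CGI context (hypothetical $json payload):
    # print respond_with_etag($json, $ENV{HTTP_IF_NONE_MATCH});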
Also, if you are finding your database to be a bottleneck, you might be able to get away with just configuring it to cache more aggressively.
Assuming you do want to cache in memory ...
For server-side caching, the normal approach is to cache results for some time period or manually clear them from the cache. But a more modern and, in my opinion, better approach is to use Russian-doll caching, where you key items according to the time their inputs changed. Then you never need to worry about manually clearing them; you just make sure timestamps are correct and synchronised.
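To make that keying idea concrete, a minimal sketch: assuming each report exposes a last-modified timestamp, fold it into the cache key so a changed report produces a new key and the stale entry simply expires on its own (the names here are illustrative):

    use strict;
    use warnings;

    # Key the cached, transformed report by its id *and* its last-modified time.
    # When the report changes, the key changes too, so nothing needs explicit clearing.
    sub report_cache_key {
        my ($report_id, $last_modified_epoch) = @_;
        return "report:$report_id:$last_modified_epoch";
    }

    print report_cache_key(42, 1718000000), "\n";   # e.g. "report:42:1718000000"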
Memcached versus Redis versus something else? For this usage, Memcached is probably best, as it's extremely simple and you don't need persistence (persistence being a big advantage of Redis over Memcached). Redis is well engineered and would work fine too, but I don't see the benefit of using something considerably more feature-rich and complex if you don't need it and there's a good alternative. That said, the one big advantage of Redis is that it now has excellent built-in clustering support, so it's easy to scale and stay online. But that would be overkill for your use case.
Something else? There are plenty of other in-memory databases, but I think Memcached and Redis are probably best if you want to avoid the problems of relying on cutting-edge frameworks without too much support. However, there is something else: boring old files. If you're generating reports, you might want to consider just generating them as temporary files. If your OS is doing its job, the files will end up being cached anyway.
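Putting the pieces together, here is a rough Perl sketch of the flow described in the question, using Cache::Memcached and LWP::UserAgent, and letting the cache entry's own expiry stand in for the manual timestamp check; the report URL, key scheme and 15-minute lifetime are assumptions:

    use strict;
    use warnings;
    use Cache::Memcached;
    use LWP::UserAgent;
    use JSON::PP;

    my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });
    my $ua   = LWP::UserAgent->new(timeout => 30);

    # get_report: serve from cache when fresh, otherwise call the report
    # service, transform the payload for the UI, and cache the result.
    sub get_report {
        my ($report_id) = @_;
        my $key    = "report:$report_id";
        my $cached = $memd->get($key);
        return decode_json($cached) if defined $cached;

        # Hypothetical report endpoint; replace with the real service URL.
        my $res = $ua->get("http://reports.internal/api/reports/$report_id");
        die $res->status_line unless $res->is_success;

        my $raw    = decode_json($res->decoded_content);
        my $for_ui = transform_for_ui($raw);   # whatever reshaping the UI needs

        # The expiry handles staleness, so no manual timestamp bookkeeping is needed.
        $memd->set($key, encode_json($for_ui), 15 * 60);
        return $for_ui;
    }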

How do you CM an application with managed content

We have a web application which contains a bunch of content that the system operator can change (e.g. news and events). Occasionally we publish new versions of the software. The software is being tagged and stored in subversion. However, I'm a bit torn on how to best version control the content that may be changed independently. What are some mechanisms that people use to make sure that content is stored and versioned in a way that the site can be recreated or at the very least version controlled?
When you identify two sets of files which have their own life cycles (software files on one side, "news and events" on the other), you know that:
you cannot version them together at the same time
you should not put the same label on both
You need to save the "news and events" files separately (either in the VCS, or in a DB as Ian Jacobs suggests, or in a CMS, a Content Management System), and find a way to link the two together (an id, a timestamp, a meta-label, ...).
Do not forget you are not only talking about two different sets of files in terms of life cycle, but also about different sets of files in terms of their very nature:
Consider the terminology introduced in this SO question "Is asset management a superset of source control" by S.Lott
software files: Infrastructure information, that is "representing the processing of the enterprise information asset". Your code is part of that asset and is managed by a VCS (Version Control System), as part of the Configuration management discipline.
"news and events": Enterprise Information, that is data (not processing); this is often split between Content Managers and Relational Databases.
So not everything should end up in Subversion.
Keep everything in the DB, and give every transaction to the DB a timestamp. That way you can keep standard DB backups and load the site content as it was at whatever date you want if the worst happens.
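A minimal sketch of that idea in Perl/DBI, assuming a simple history table where every change is appended as a new timestamped row rather than overwriting the old one; the schema and names are invented for the example:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=content.db', '', '', { RaiseError => 1 });

    # Every edit becomes a new row; nothing is ever updated in place.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS content_history (
            slug       TEXT NOT NULL,
            body       TEXT NOT NULL,
            changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
    });

    sub save_content {
        my ($slug, $body) = @_;
        $dbh->do('INSERT INTO content_history (slug, body) VALUES (?, ?)',
                 undef, $slug, $body);
    }

    # Reconstruct the site content as it looked at a given moment.
    sub content_as_of {
        my ($timestamp) = @_;
        return $dbh->selectall_arrayref(q{
            SELECT h.slug, h.body
            FROM content_history h
            WHERE h.changed_at = (
                SELECT MAX(changed_at) FROM content_history
                WHERE slug = h.slug AND changed_at <= ?
            )
        }, { Slice => {} }, $timestamp);
    }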
I suppose part of the answer depends on what CMS you're using, and how your web app is designed, but in general, I'd regard data such as news items or events as "content". In other words, it's not part of your application - it's the data which your application processes.
Of course, there will be versioning issues between your CMS code and your application code. You could manage this by defining the interface between the two. Personally, I'd publish the data to the web app as XML, which gives you the possibility of using XML schema to define exactly what the CMS is required to produce, and what the web app should expect to process.
This ought to mean that most changes in the web app can be made without a corresponding alteration in the rendering of the data. When functionality changes require this, you can create a new version of the schema and continue to make progress. In this scenario, I'd check the schema in with the web app code, but YMMV.
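To give a flavour of that, a small hedged Perl sketch using XML::LibXML to check the CMS export against the agreed schema before the web app consumes it; the file names are placeholders:

    use strict;
    use warnings;
    use XML::LibXML;

    # Load the schema that defines exactly what the CMS is required to produce.
    my $schema = XML::LibXML::Schema->new(location => 'content.xsd');

    # Parse the exported content and validate it; validate() dies on failure.
    my $doc = XML::LibXML->load_xml(location => 'news-export.xml');
    eval { $schema->validate($doc); 1 }
        or die "CMS export does not match the agreed schema: $@";

    print "Export is valid against content.xsd\n";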
It isn't easy, and it gets more complicated again if you need additional data fields in your CMS. Expect to plan for a fairly complex release process (also depending on how complex your Dev-Test-Acceptance-Production scenario is.)
If you aren't using a CMS, then you should consider it. (Of course, if the operation is very small, it may still fall into the category where doing it by hand is acceptable.) Simply putting raw data into a versioning system doesn't solve the problem - you need to be able to control the format in which your data is published to the web app. Almost certainly this format should be something intended for consumption by software, and therefore not usually suitable for hand-editing by the kind of people who write news items or events.