Question here about caching data from calls to an external ReST API.
There is currently a ReST service set up to generate and retrieve some specific types of reports that the UI must consume. However, this service is not meant for high volume usage, or to be exposed to the public and these reports are fairly static. Possibly only changing every 10-20 minutes. The web application resides on a separate server.
What I would like to do is, using memcached or Redis, is when a request for data comes in from the UI to the web back-end, make a call from the web application back-end to the report server to get the specified report, transform the data to the appropriate format for the UI to consume, cache it with a timestamp, and return it to the UI so subsequent requests will be available in memory on the web applications back-end without having to re-request from the report server. I would also need to check this timestamp and make a new request if the cached report has been held for longer than the specified time. The data that will be cached is fairly minuscule just some smallish JSON objects with only a handful of values holding the information the UI needs and there is NOT a ton of these objects, I would not be surprised if they could all be easily stored in memory at once so the time stamping is the only invalidation that should be necessary.
I have almost 0 experience here when it comes to caching / memcached / Redis. Is there advantages to one or the other? Is something like this possible? How would I go about implementing this? Are there other options?
Appreciate the help!
Server-caching these kinds of RESTful query responses is very possible and quite common.
With any server based caching, you should also think hard about whether you really need it, as it does add complexity. It can certainly make a huge improvement, but since your usage volume is low, it might actually be overkill. You may also be able to use HTTP caching protocols to avoid the need for caching on the server. If the data doesn't change very often and you use eTags or modified dates correctly, along with an intermediary proxy like AWS CloudFront, users will rarely experience that delay.
Also, if you are finding your database to be a bottleneck, you might be able to get away with just configuring it to cache more aggressively.
Assuming you do want to cache in memory ...
For server-side caching, the normal approach is to cache results for some time period or manually clear them from the cache. But a more modern and better approach imo is to use Russian-doll caching, where you key items according to the time their inputs changed. Then you never need to worry about manually clearing them, you just make sure timestamps are correct and synchronised.
Memcached versus Redis versus something else? For this usage, Memcached is probably best as it's extremely simple and you don't have to worry about persistence, which is a big advantage of Redis over Memcached. Redis is well-engineered and would work fine too, but I don't see the benefit to use something that's considerably more feature-rich and complex if you don't need it and there's a good alternative. That said, the one big advantage of Redis is it now has excellent built-in clustering support, so it's easy to scale and stay online. But that would be overkill for your use case.
Something else? There are plenty of other in-memory databases, but I think Memcached and Redis are probably best if you want to avoid the problems of relying on cutting-edge frameworks without too much support. However, there is something else: boring old files. If you're generating reports, you might want to consider just generating them as temporary files. If your OS is doing its job, the files will end up being cached anyway.
Related
Event sourcing and CQRS is great because it gets rids developers being stuck with one pre-modelled database which the developer has to work with for the lifetime of the application unless there is a big data migration project. CQRS and ES also has other benefits like scaling eventstore, audit log etc. that are already all over the internet.
But what are the disadvantages ?
Here are some disadvantages that I can think of after researching and writing small demo apps
Complex: Some people say ES is complex. But I'd say having a complex application is better than a complex database model on which you can only run very restricted queries using a query language (multiple joins, indexes etc). I mean some programming languages like Scala have very rich collection library that is very flexible to produce some seriously complex aggregations and also there is Apache Spark which makes it easy query distributed collections. But databases will always be restricted to it's query language capabilities and distributing databases are harder then distributed application code (Just deploy another instance on another machine!).
High disk space usage: Event store might end up using a lot of disk space to store events. But we can schedule a clean up every few weeks and creating snapshot and may be we can store historical events locally on an external HD just incase we need old events in the future ?
High memory usage: State of every domain object is stored in memory which might increase RAM usage and we all how expensive RAM is. BIG PROBLEM!! because I'm poor! any solution to this ? May be use Sqlite instead of storing state in memory ? Am I making things more complex by introducing multiple Sqlite instances in my application ?
Longer bootup time: On failure or software upgrade bootup is slow depending on the number of events. But we can use snapshots to solve this ?
Eventual consistency: Problem for some applications. Imagine if Facebook used Event sourcing with CQRS for storing posts and considering how busy facebook's system is and if I posted a post I would see my fb post the next day :)
Serialized events in Event store: Event stores store events as serialized objects which means we can't query the content of events in the event store which is discouraged anyway. And we won't be able to add another attribute to the event in the future. Solution would be to store events as JSON objects instead of Serialized events ? But is that a good idea ? Or add more Events to support the change to orignal event object ?
Can someone please comment on the disadvantages I brought up here and correct me if I am wrong and suggest any other I may have missed out ?
Here is my take on this.
CQRS + ES can make things a lot simpler in complex software systems by having rich domain objects, simple data models, history tracking, more visibility into concurrency problems, scalability and much more. It does require a different way thinking about the systems so it could be difficult to find qualified developers. But CQRS makes it simpler to separate responsibilities across developers. For example, a junior developer can work purely with the read side without having to touch business logic.
Copies of data will require more disk space for sure. But storage is relatively cheap these days. It may require the IT support team to do more backups and planning how to restore the system in a case in things go wrong. However, server virtualization these days makes it a more streamlined workflow. Also, it is much easier to create redundancy in the system without a monolithic database.
I do not consider higher memory usage a problem. Business object hydration should be done on demand. Objects should not keep references to events that have already been persisted. And event hydration should happen only when persisting data. On the read side you do not have Entity -> DTO -> ViewModel conversions that usually happened in tiered systems, and you would not have any kind of object change tracking that full featured ORMs usually do. Most systems perform significantly more reads than writes.
Longer boot up time can be a slight problem if you are using multiple heterogeneous databases due to initialization of various data contexts. However, if you are using something simple like ADO .NET to interact with the event store and a micro-ORM for the read side, the system will "cold start" faster than any full featured ORM. The important thing here is not to over-complicate how you access the data. That is actually a problem CQRS is supposed to solve. And as I said before, the read side should be modeled for the views and not have any overhead of re-mapping data.
Two-phase commit can work well for systems that do not need to scale for thousands of users in my experience. You would need to choose databases that would work well with the distributed transaction coordinator. PostgreSQL can work well for read and write separate models, for example. If the system needs to scale for a high number of concurrent users, it would have to be designed with eventual consistency in mind. There are cases where you would have aggregate roots or context boundaries that do not use CQRS to avoid eventual consistency. It makes sense for non-collaborative parts of the domain.
You can query events in serialized a format like JSON or XML, if you choose the right database for the event store. And that should be only done for purposes of analytics. Nothing inside the system should query event store by anything other than the aggregate root id and the event type. That data would be indexed and live outside the serialized event.
Just to comment on point 5. I've been told that Facebook does use ES with Eventual Consistency, which is why you can sometimes see a post disappear and reappear after you've posted it.
Usually the read-model your browser is accessing is located 'close' to you, but after you make a post the SPA switches over to a read-model that is close to your write-model. The close proximity between the write-model (events) and the read-model mean you get to see your own post.
However, 15 minutes later your SPA switches back to the first, closer, read-model. If the event containing your post hasn't yet propagated to that read-model you'll see your own post disappear only to reappear sometime later.
I know it's been almost 3 years since this question was asked, but still this article may be useful for someone. Key points are
Scaling with snapshots
Visibility of data
Schema changing
Dealing with complex domains
Need to explain it to most new team members
Event sourcing and CQRS is great because it gets rids developers being stuck with one pre-modeled database which the developer has to work with for the lifetime of the application unless there is a big data migration project.
This is a big misconception. The relational databases were invented exactly for the evolution of the model (thanks to simple two-dimensional tables as opposed to pre-defined hierarchical structures). With views and procedures ensuring the encapsulation of data access, the logical and physical model can evolve independently. This is also why SQL defines DDL and DML in the same language. Some RDBMS also allow all those evolutions to be versioned and deployed online (continuous delivery) as Oracle Edition Based Redefinition.
Big data structures are predefined and can be read only with the code developed for this structure. Ok when consumed immediately but you will have hard time to read it 10 years later without the exact version, and language compiler or interpreter.
I hope to not be late to try to give an answer. In these months I've done a lot of research on that argument with the goal of implementing a production-grade solution for some parts of my architecture where ES can make sense
Complex: Actually, it should not be complex, its mission is to be deadly simple. How? pushing all the complexity from business logic code to infrastructure code. The data access should be done by frameworks that are not enough mature yet. Still, there is no clear winner in the ES/CQRS race, maybe because is still a niche/hipster approach (?) So some team is rolling its own solution or adopting some ready-made technology such as Axon
High disk space usage: I would say more, I would say * potentially infinite* Disk Usage. But if you go towards ES, you also have a very good reason to tolerate this apparent drawback. Let's give some of them:
Audit Logs : The datastore is an event log, we already know it. Financial apps or every mission/safety critical could need a centralized audit log that enables to state Who made What in Which moment. ES provides this capability of the box...you can also decorate your event entries with some business meaningful metadata (eg. a transaction Id correlated with some API consumer identity, A severity level of the operation...)
High Concurrency: there are systems where logical resource states are mutated by many clients in a concurrent way. These are games, IoT platforms, and so on. Logging events instead of change a state representation could be a smart way to provide a total order of events. The other way is to delegate to DB the synchronization stuff. But this is not what you want if you're into ES
Analytics Let's say you have a lot of data with a lot of business value, but you still don't know which. For years we extracted knowledge from applications information by translating data organization with different information models (OLAP cubes). The event store provides something similar out of the box again. Event logs is the rawest form of representation of information And you can have many ways to process them, in batch or reacting to events stored
High memory usage: I think it should be the same once you have built your projection
Longer bootup time: If the read side caches its projections and "remembers" the last update event, it should not re-apply the entire event sequence. Snapshots are mitigation but if you do a lot of snapshots maybe you made a bad choice with ES. I think that this problem is minor in microservices ecosystems, where the boot time can be masked without service interruption. In fact, you get the most out of ES/CQRS when you apply it so microservices
Eventual consistency: Blame CAP theorem for this, not ES. Many non ES/CQRS have to deal with this, but there are a lot of scenarios where it is not a real problem. These are the scenarios where ES fits well. And you can mix ES and non ES services into the same platform
Serialized events in Event store: if it's important to have a non-serialized event representation, you could use a document-oriented DB, but if you do this to make queries over events payload, you are missing the point of ES/CQRS. ES means to move all data manipulation from the DB side to the application tier, where every piece changes fastly, and all are stateless. This enhances scalability and fault tolerance and provides means to shape the organization of your team, doing things like let the frontend guy/girl write his/her BFF in javascript easily.
I hope to put into practices this principles with good results and draw on the benefits of this exciting approach
For a single-page app to be RESTful it needs access to API endpoints from which to retrieve its content. This mean's that you'll have, say, a API/user_information endpoint, and a API/article/ endpoint, and an API/comments endpoint and so on. For large applications these ajax calls will slow down rendering, especially over slow (mobile) connections. Ironically, this is not as much of a problem in the old-world of doing everything server-side, since the entire html is delivered in bulk upon the first request. Thus, sigle-page apps, which are supposed to be faster and more responsive than old-school server-rendered apps, may end up being considerably slower (admittedly, only if they require a large enough number of API calls).
The options, as far as I can see, seem to be:
1) Forget single-page apps, go back to doing everything server-side.
2) Consolidate all the API endpoints into fewer (possibly one) endpoint(s) so all content is retrieved with minimum latency. But now you're not longer RESTful.
These are both unsatisfying and sad. Is there a better solution?
PS: caching content in local storage seems to cause more problems than it solves because now you have to worry about invalidating client-side caches every time you push out fresh content.
I'm working now in a similar architecture as you commented, and I agree with you in your thoughts. What we are doing is to try to find a compromise between the both options you described, even if sometimes we are not always 100% RESTful.
I think it not always necessary to be 100% RESTful if your needs require it. Using the DTO pattern you can define the DTOs in a way that optimizes the performance of your application minimizing the number of calls, offering still reusability.
I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend on storing the scraped data as long as possible, but would also like to store the full HTML file for at least a month for every page.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the specific page data for a url and a date, it will be good. The data too is mostly write once, read once, and then store for possible reads in the future.
Any advice pertaining to storing such a large amount of HTML files would be helpful.
Assuming 50kB per HTML page, 15GB daily gives us 300.000+ pages per day. About 10 million monthly.
MongoDB will definitely work well with this data volume. Concerning its limitations, all depends on how do you plan to read and analyze the data. You may take advantage of map/reduce features given that amount of data.
However if your problem size may further scale, you may want to consider other options. It might be worth noting that Google search engine uses BigTable as a storage for HTML data. In that sense, using Cassandra in your use case can be a good fit. Cassandra offers excellen, persistent write/read performance and scales horizontally much beyond your data volume.
I'm not sure what deployment scenario you did when you used Cassandra to give you those errors .. may be more investigation is required to know what is causing the problem. You need to trace back the errors to know their source, because, as per requirements described above, Cassandra should work fine, and shouldn't stop after 5 hours (unless you have a storage problem).
I suggest that you give MongoDB a try, it is very powerful and optimized to what you need, and shouldn't complain of the requirement you mentioned above.
You can use HDFS, but you don't really need it while MongoDB (or even Cassandra) can do it.
I want to develop one multimedia system, the system need to save millions videos and images, so I want to select a distributed storage subsystem. who can give me some suggestion ? thanks!
I guess that best option for the 'millions videos and images' is content distribution/delivery network (CDN):
CDN is a server setup which allows for
faster, more efficient delivery of
your media files. It does this by
maintaining copies of your media at
different points of presence (POPs)
along a global network to ensure quick
client access and the fastest delivery
possible
If you will use CDN you no need care about many problems(distribution, fast access). Integration with CDN also should be very simple.
#yi_H
You can configure your writes to be first replicated to multiple nodes before it return to the client. Now whether or not that is needed is of course unto the use case. And definitely involves a performance hit. So if you are implementing a write heavy analytical database, it will have a significant impact on write throughput.
All other points you make about the question in terms of lack of requirements etc, I second that.
Having replicated file system with metadata in a nosql database is a very common way of doing things. #why did you consider this kinda approach?
Have you taken a look at Mongodb gridfs? I have never used it, but it is something I would take a look at to see if it gives you any ideas.
Yo gave us (near) zero information about what your requirements are. Eg:
Do you want atomic transactions?
Is the system read or write heavy?
Do you need fast queries or want to batch-process the data set?
How big are the videos?
Do you want to distribute data locally (on a LAN) or spanning multiple data centers / continents?
How are we supposed to pick the right tool if we don't know what it needs to support?
Without any knowledge of the system I would advise using some kind of FS replication for the videos and images and then storing the metadata associated with the items either in MongoDB, MySQL Master-Master or MySQL Cluster.
Distributed related to what?
If you are talking of replication to distribute:
MongoDb only restricted to Master-Slave replication, so only one node is able to read/write which leaves you with a single point of failure for a really distributed system.
CouchDB is able to peer-to-peer replicate.
Find a very good comparison here and here also compared with hbase.
With CouchDB you also have to be aware that you are going to talk http to the database and have build in webservices.
Regards,
Chris
An alternative is to use MongoDB's GridFS, serving as a (very easily manageable) redundant and distributed filesystem.
Some will say that it's slow on reads, (and it is, mostly because of the nature of its design) but that doesn't have to mean it's a dealbreaker for your system in whole, because if you need performance later on, you could always put Varnish or Squid in front of the filesystem tier.
For all I know, Squid also supports on-disk cache for all the less-hot files.
Sources:
http://www.mongodb.org/display/DOCS/GridFS
http://www.squid-cache.org/Doc/config/cache_dir/
I would like to get feedback from all you seasoned developers as to what methodology would be the more "correct" or "efficient" way of implementing my solution.
I have a 4.5 MB flat file that is some 16,000 rows with 13 columns. I know I can import this into SQLite and create my data model but would it be more iPhone efficient to use this file locally on the iPhone or have the application read the data from a web service?
Thanks.
If you are not going to update the data (or only update it when you are updating the app) the local sqlitedb is going to simpler and more responsive. You would probably be even better off importing the data into CoreData, that way you won't need to directly manipulate sqlite or deal with things like synchronous read APIs.
If you want to be able to have the app download updated data the choice because a lot more difficult, depending on the quantity of data, the frequency of updates, how large the changes tend to be, etc.
a local database should always be more efficient in terms of user experience than a web service
I'd use both.
A remote source allowing for a dynamic datastore, and a local datastore with local cacheing seems like a pretty safe bet.
As for the web service. Unless there is any server-side only business logic, maybe give a cloud solution a try. Something like Amazon's SimpleDB comes to mind.
It of course really depends on how static your data is. As everyone has mentioned already if you don't need many updates the most effective solution is a sole local datastore.
Cheers
I guess it depends a bit on how much of the data you need at any one time. If your users need to download a lot of data just to use your application, that would make your app potentially very slow and also unusable without a network connection.
How often do you need to update the data? Frequent updates would favour a web service solution. Otherwise you'd need to update your app and resubmit every time a bit of your data changes.
Another thing to think about: how much do you pay for web traffic for your website? It could become quite expensive if a lot of users constantly need to download data. Unless you use some kind of subscription you only get money once, when you sell the app.
Personally, I'd probably lean towards putting the data on the phone and not using a web service.