Smart URL Synchronization/Download Engine available open source? - perl

I am wondering if anybody is aware of an open-source URL synchronization/download engine, or if I will end up writing my own.
Expected functionality:
- Provide a plain list of HTTP URLs to be synchronized to a local disk store
- The engine takes care of synchronizing URL content effectively (HEAD requests to check for changes) and efficiently (all downloads performed using gzip compression)
- Preferably it would be smart enough to optimize its behavior by recognizing URLs that change frequently
Is anyone aware of an existing implementation of such an engine? Preferably in Perl?

Have you looked at the lwp-mirror program that comes with LWP? That might be a good place to start. Alternatively, wget has more features.
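To give a feel for what such an engine has to do per URL, here is a minimal sketch of the conditional-download logic in TypeScript/Node (lwp-mirror does the equivalent in Perl using If-Modified-Since; the validator storage and naming here are illustrative assumptions, not any library's actual API):

import { promises as fs } from "fs";

// Re-download a URL only if it changed since the last fetch, using a
// conditional GET. The server answers 304 Not Modified when the stored
// validator still matches, so unchanged content is never re-transferred.
// Node's fetch negotiates gzip transparently via Accept-Encoding.
async function mirror(url: string, localPath: string): Promise<void> {
  let etag: string | undefined;
  try {
    etag = await fs.readFile(localPath + ".etag", "utf8");
  } catch {
    // First fetch: no validator stored yet.
  }

  const res = await fetch(url, {
    headers: etag ? { "If-None-Match": etag } : {},
  });

  if (res.status === 304) return; // unchanged, keep the local copy

  await fs.writeFile(localPath, Buffer.from(await res.arrayBuffer()));
  const newEtag = res.headers.get("etag");
  if (newEtag) await fs.writeFile(localPath + ".etag", newEtag, "utf8");
}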

Related

How to integrate localization (i18n) so that it scales with a React application?

I am currently looking at various i18n npm packages, and most seem to insist that the translations are stored in a flat file, e.g. a .json-formatted file. My question is whether this has a performance overhead greater than storing the languages in a database, e.g. MongoDB.
For example, if I have 10,000 translations (assume that in this particular application only one language file is needed at a time, i.e. most users will use the application in English and some may set it to a different language), this equates to approximately 200 KB of data to download before the application can even start being used.
In a React application, a suggested design pattern is to load data using container components, which then pass data to 'dumb' child components. So would it not make sense to also load translations in the same manner, i.e. group the translations by usage, or by component, so that the data is sent down the wire only when needed, say, from a call to MongoDB?
I would integrate it into your API. That means you can create e.g. a REST or GraphQL API which handles this for you. In i18n it is often reasonable to store the data in a hierarchy, which means you can split your translations into different categories (such as pages) and request only the translations you actually need.
I really like the way the react-starter-kit does it. In that example you can see how they handle it with a GraphQL API and request only the translations that are actually required to render the page. Hope this helps.
Important files of the i18n implementation of the react-starter-kit:
GraphQL Query: https://github.com/kriasoft/react-starter-kit/blob/feature/react-intl/src/data/queries/intl.js
Example component implementation: https://github.com/kriasoft/react-starter-kit/blob/feature/react-intl/src/components/Header/Header.js
Of course, with this number of translations I would use a database for better resource usage (the react-starter-kit uses simple file storage, which does not really hold up with so many translations). MongoDB would be my first choice there, though that may just reflect my own preference for its flexibility and my own familiarity with it. A sketch of what such an endpoint could look like follows below.
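To make that concrete, here is a hypothetical REST endpoint that serves a single category of translations from MongoDB (Express/TypeScript; the route, database, collection, and field names are all invented for illustration):

import express from "express";
import { MongoClient } from "mongodb";

const app = express();
const client = new MongoClient("mongodb://localhost:27017");

// GET /i18n/:locale/:category -> { key: translation, ... }
// Only the requested slice of the catalogue goes over the wire.
app.get("/i18n/:locale/:category", async (req, res) => {
  const docs = await client
    .db("app")
    .collection("translations")
    .find({ locale: req.params.locale, category: req.params.category })
    .toArray();
  res.json(Object.fromEntries(docs.map((d) => [d.key, d.text])));
});

async function main() {
  await client.connect();
  app.listen(3000);
}
main();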
Obviously, you don't want each and every language to be loaded on the client. My understanding of the pattern you described is to use a container component to load the relevant language for the whole app on startup.
When a user switches language, your container will load the relevant language file from the server.
This should work just fine for a small or medium app, but it has a drawback: you'll need another request to the server, after the JS code has loaded, to fetch the i18n data.
Another way to solve this is to use code splitting (and possibly server-side rendering) techniques, which could allow this workflow (sketched after the list):
Server builds a small bundle containing a portion of the i18n data
Client loads the rest of your app code and associated i18n data on demand, as the user navigates through your app
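A minimal sketch of the on-demand half of that workflow, using a dynamic import (the file layout and names are hypothetical; bundlers such as webpack split each matching JSON file into its own chunk):

// Load the translations for one page only when that page is visited.
// The bundler code-splits './i18n/<page>.<locale>.json' into separate
// chunks, so the initial bundle ships with no translation payload.
async function loadPageTranslations(
  page: string,
  locale: string
): Promise<Record<string, string>> {
  const mod = await import(`./i18n/${page}.${locale}.json`);
  return mod.default;
}

// Usage: const t = await loadPageTranslations("checkout", "fr");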
If you have not done so yet, having a look at https://react.i18next.com/ might be good advice. It is based on i18next: learn once, translate everywhere.
Your code will look something like:
<div>{t('simpleContent')}</div>
<Trans i18nKey="userMessagesUnread" count={count}>
Hello <strong title={t('nameTitle')}>{{name}}</strong>, you have {{count}} unread message. <Link to="/msgs">Go to messages</Link>.
</Trans>
Comes with samples for:
- webpack
- cra
- expo.js
- next.js
- storybook integration
- razzle
- dat
- ...
https://github.com/i18next/react-i18next/tree/master/example
Besides that, you should also consider the workflow during development and, later, for your translators -> https://www.youtube.com/watch?v=9NOzJhgmyQE

How to refuse wget?

I am uploading images to a public directory and I would like to prevent users from downloading the whole lot using wget. Is there a way to do this?
As far as I can see, there must be. I have found a number of sites where, as a public browser, I can download a single image, but as soon as I run wget against them I get a 403 (Forbidden). I have tried using the no-robot argument, but I'm still not able to download them. (I won't name the sites here, for security reasons).
You can restrict access based on the User-Agent string; see Apache 2.4's mod_authz_core for an example. A hypothetical middleware version of the same check is sketched below.
Wget also respects robots.txt directives by default, which should deter any casual user.
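As an illustration of that first idea in code (the answer points at Apache's mod_authz_core; this hypothetical Express/TypeScript middleware performs the equivalent check):

import express from "express";

const app = express();

// Reject requests whose User-Agent identifies itself as wget.
// This is trivial to spoof with wget's --user-agent option,
// which is exactly the weakness discussed below.
app.use((req, res, next) => {
  const ua = req.headers["user-agent"] ?? "";
  if (/wget/i.test(ua)) {
    res.status(403).send("Forbidden");
    return;
  }
  next();
});

app.use("/images", express.static("public/images"));
app.listen(8080);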
However, a careful look at the wget manual shows how to bypass both of these restrictions. Wget also lets you add random delays between requests, so even more advanced techniques based on access-pattern analysis can be defeated.
So the proper way is to defeat wget's link-recognition engine. Namely, the content you want to keep unmirrored should be loaded dynamically using JavaScript, and the URLs must be encoded in a way that requires JS code to decode. This would protect your content, but it would require you to manually provide an unobfuscated version for the web bots you do want to index your site, such as Googlebot (and no, it is not the only one you should care about). Also, some people do not run JS by default (esoteric browsers, low-end machines and mobile devices may demand such a policy).
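A minimal sketch of that obfuscation idea, using Base64 purely for illustration (a real deployment would want something less guessable; the point is only that a client which does not execute JavaScript, like wget, never sees the real URL):

// The page ships only an encoded form of the image URL; a scraper that
// does not run JavaScript sees just the opaque string below.
const encoded = "aW1hZ2VzL3Bob3RvLTAxLmpwZw=="; // btoa("images/photo-01.jpg")

function revealImage(encodedUrl: string): void {
  const img = document.createElement("img");
  img.src = atob(encodedUrl); // decoded at render time, in the browser
  document.body.appendChild(img);
}

revealImage(encoded);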

Uploading images to a PHP app on GCE and storing them on GCS

I have a PHP app running on several instances of Google Compute Engine (GCE). The app allows users to upload images of various sizes, resizes them, and then stores the resized images (and their thumbnails) on the storage disk and their metadata in the database.
What I've been trying to find is a method for storing the images in Google Cloud Storage (GCS) through the PHP app running on the GCE instances. A similar question was asked here, but no clear answer was given. Any hints or guidance on the best way to achieve this are highly appreciated.
You have several options, all with pros and cons.
Your first decision is how users upload data to your service. You might choose to have customers upload their initial data to Google Cloud Storage, where your app would then fetch it and transform it, or you could choose to have them upload it directly to your service. Let's assume you choose the second option, and you want users to stream data directly to your service.
Your service then transforms the data into a different size. Great. You now have a new file. If this was video, you might care about streaming the data to Google Cloud Storage as you encode it, but for images, let's assume you want to process the whole thing locally and then store it in GCS afterwards.
Now we have to get a file into GCS. It's a PHP app, and so as you have identified, your main three options are:
Invoke the GCS JSON API through the Google API PHP client.
Invoke either the GCS XML or JSON API via custom code.
Use gsutil.
Using gsutil will be the easiest solution here. On GCE it automatically picks up the appropriate credentials for your service account, and it has several useful performance optimizations and tunings that raw use of the API would not get without extra work (for example, multithreaded uploads). Plus, it's already installed on your GCE instances. A sketch of invoking it from code follows below.
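For instance, shelling out to gsutil once an image has been resized might look like this (shown in TypeScript/Node for brevity; a PHP app would do the same via exec(). The local path and bucket name are placeholders; gsutil's -m flag enables parallel transfers):

import { execFile } from "child_process";

// Copy a locally resized image into a GCS bucket via gsutil.
execFile(
  "gsutil",
  ["-m", "cp", "/tmp/resized/photo-01.jpg", "gs://my-image-bucket/images/"],
  (err, stdout, stderr) => {
    if (err) {
      console.error("upload failed:", stderr);
      return;
    }
    console.log("upload complete");
  }
);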
The upside of the PHP API is that it's in-process and offers more fine-grained, programmatic control. As your logic gets more complicated, you may eventually prefer this approach. Getting it to perform as well as gsutil may take some extra work, though.
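By way of contrast, the in-process API route looks roughly like this (sketched with the Node.js client library @google-cloud/storage rather than PHP, purely for illustration; the bucket and object names are placeholders):

import { Storage } from "@google-cloud/storage";

// On GCE, the client picks up the instance's service-account
// credentials automatically, just as gsutil does.
const storage = new Storage();

async function uploadResizedImage(): Promise<void> {
  await storage.bucket("my-image-bucket").upload("/tmp/resized/photo-01.jpg", {
    destination: "images/photo-01.jpg",
  });
}

uploadResizedImage();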
This choice is comparable to copying files via SCP with the "scp" command line application or by using the libssh2 library.
tl;dr; Using gsutil is a good idea unless you have a need to handle interactions with GCS more directly.

Why do RTMP streaming protocol URL paths differ from each other?

Recently I've been doing some work on RTMP streaming, namely using Flowplayer with the Edgecast streaming service and the CloudFront streaming service.
The basic concept is easy to follow, but the differing formats between providers wasted a lot of my time to figure out.
For example, to make Edgecast happy, according to the documentation you need to specify the filename in the format mp4:filename.mp4, flv:filename (without the .flv extension) or mp3:filename (without the .mp3 extension).
But for CloudFront it's a different story: mp4:filename.mp4, filename (no flv: prefix and no .flv extension) or mp3:filename (without the .mp3 extension).
The formats get even more frustrating: when I tried Edgecast's loadToEdge function today, the formats it accepts are filename.mp4 (without the mp4: prefix), filename.flv (without the flv: prefix) and mp3:filename.mp3.
As you can see, there is basically no logic to it, and you have to guess and try all the different combinations until it finally works.
I would just like to know whether anyone has an idea why different providers implement their streaming in such customized ways. Is it Adobe's fault for not defining a unified form, or is it just up to service providers to use whatever they like?
Thank you!
It's all about implementation. The URL format, including the extensions, has nothing to do with the protocol itself.
As an analogy, your question is like asking "Why do some websites have different URLs than others?" Example of two different yet viable ways of serving up an image:
http://server.com/question/87/why/65.png
http://server.com/image/question?number=87&image=65
It's all about how the coders at EdgeCast, Amazon, et al. wanted to implement their CDNs. I'm sure there was some logic to it, well thought out or not, and probably some need to deal with legacy systems, clients and URLs.
It has nothing to do with FMS itself, just as the URLs in the analogy above have nothing to do with the web server they are served from.

How do I sync an offline web app (HTML+JS+CSS) with my server?

Do I need to implement my own sync methods in order to make an offline web app (HTML+CSS+JS) stay up to date with changes made on the server (and vice versa)? I'm using MySQL on the server side.
I read Two-way sync between iPhone application and web application, which has some pointers, but I think they're talking about native applications when they mention CFUUIDCreate, and I wonder if this is possible for the Web.
Does someone have some code to share or maybe can point me in the right direction?
Thank you!
P.S.: I hope my English is not that rusty ;)
To store static content on the client side, as Jethro Larson said, the Application Cache manifest is the way to go for caching the static parts of your website (HTML, CSS, JS and images); a minimal manifest is shown below.
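A minimal cache manifest, with hypothetical file names (the HTML page references it via <html manifest="app.appcache">):

CACHE MANIFEST
# v1 - bump this comment to force clients to re-download

index.html
app.js
style.css

NETWORK:
*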
To handle dynamically generated contents offline, you can use javascript templates. There are several solutions for this.
To sync the two databases, there is a project called persistence.js (persistencejs.org), a JavaScript library which offers a single API for working with WebSQL databases, Local Storage, etc. It has a plugin called persistence.sync (persistencejs.org/plugin/sync) which syncs the browser-side database with the server's. It consists of POST and GET requests to a specific URL that you can configure (for example yourapp.dev/sync). They have an example back-end written in node.js, and there is one for Rails. It's simple to understand, and persistence.sync is well documented.
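The request pattern is simple enough to sketch. Here is a hypothetical client-side version in TypeScript (the endpoint and field names are invented for illustration; persistence.sync's actual protocol is documented on its site):

// Push local changes, then pull everything the server has seen since
// our last sync. Per-row timestamps let each side compute the delta.
interface Change {
  table: string;
  id: string;
  data: unknown;
  updatedAt: number;
}

async function sync(
  lastSyncedAt: number,
  localChanges: Change[]
): Promise<Change[]> {
  await fetch("/sync", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ since: lastSyncedAt, changes: localChanges }),
  });
  const res = await fetch(`/sync?since=${lastSyncedAt}`);
  return (await res.json()) as Change[];
}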
Look at the offline cache:
http://www.webreference.com/authoring/languages/html/HTML5-Application-Caching/
http://www.google.com/search?q=offline+cache+html5
http://www.slideshare.net/search/slideshow?q=offline+cache