Why does wget add .html extensions to every file?

I'm using the following command to download all files from a server:
wget -R "index.*" -m -np -e robots=off http://robotics.ethz.ch/~asl-datasets/ijrr_euroc_mav_dataset/
All files are recognized correctly, but wget adds .html to all files. For example: ijrr_euroc_mav_dataset/calibration_datasets/cam_april/cam_april.bag becomes ijrr_euroc_mav_dataset/calibration_datasets/cam_april/cam_april.bag.html
Why is that?
Also, wget creates the folder ~asl-datasets which I didn't ask for. I just wanted to download all files below ijrr_euroc_mav_dataset.

These are two separate questions, but both are easy to answer. (I already solved this in the comments, but I'm answering properly since that observation was apparently spot-on.)
The first question is why Wget is adding a .html suffix to your files. The reason is most likely that you have adjust_extension = on in your ~/.wgetrc file. This option is disabled by default for obvious reasons, but it is useful in many cases. Try modifying the ~/.wgetrc file, or use --no-config (or --config=/dev/null if you're using a version of Wget more than about five years old).
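To check whether that's the culprit, you can look for the option in your startup file and re-run the command with the configuration disabled (a minimal sketch):
# Look for the offending option (either spelling variant):
grep -i 'adjust.extension' ~/.wgetrc
# Re-run without reading any wgetrc at all:
wget --no-config -R "index.*" -m -np -e robots=off http://robotics.ethz.ch/~asl-datasets/ijrr_euroc_mav_dataset/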
The second question is why Wget is creating a directory. Well, the answer to that is simple: you asked to mirror a website which has that directory. You can use the --cut-dirs option to fine-tune which directories you want Wget to create on disk. (In your case, I think --cut-dirs=2 --no-host-directories might be appropriate, since you don't care about preserving the directory structure. However, remember that this means files in different directories with the same name will likely be overwritten.)
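Putting both fixes together, the invocation might look something like this (a sketch; use --cut-dirs=1 instead if you want to keep the ijrr_euroc_mav_dataset directory itself):
wget --no-config -R "index.*" -m -np -e robots=off --no-host-directories --cut-dirs=2 http://robotics.ethz.ch/~asl-datasets/ijrr_euroc_mav_dataset/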


How can I move PERLBREW_ROOT to another directory?

I use perlbrew to manage my Perl environment.
When I installed perlbrew the first time as per the documentation, it installed everything to ~/perl5/perlbrew, which I now find undesirable.
The documentation states:
The directory ~/perl5/perlbrew will contain all install perl executables, libraries, documentations, lib, site_libs. In the documentation, that directory is referred as "perlbrew root". If you need to set it to somewhere else because, say, your HOME has limited quota, you can do that by setting PERLBREW_ROOT environment variable before running the installer:
export PERLBREW_ROOT=/opt/perl5/perlbrew
curl -kL http://install.perlbrew.pl | bash
Question: How can I move PERLBREW_ROOT directory to be /opt/perl5/perlbrew instead of ~/perl5/perlbrew?
Unfortunately, you cannot simply move an installed Perl. For starters, the paths added to @INC are hardcoded. I present you four solutions, of which I recommend the third.
But first, I recommend using /opt/perlbrew instead of /opt/perl5/perlbrew since there's no need for the extra level. The code snippets below assume you followed this recommendation.
Start from scratch, reinstalling any build of perl you had.
Con: For each build, you'll also have to reinstall any modules that build had installed. This means you'll need to retest all your applications. This is time-consuming, and not without risk.
Move the perlbrew directory, but attempt to fix the installations.
Move the installation as follows:
mv ~/perl5/perlbrew /opt/
# Adjust file ownership and permissions as desired.
Then, edit the paths in each of the files printed by the following:
for q in /opt/perlbrew/perls/* ; do
    # Print the two Config files with hardcoded paths for this build:
    "$q/bin/perl" -le'
        use Config;
        require "Config_heavy.pl";
        print $INC{"Config.pm"};
        print $INC{"Config_heavy.pl"};
    '
done
You'll also need to edit the shebang (#!) line of many scripts.
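For the shebang lines, something like this might work as a starting point (a hedged sketch; it assumes the scripts live under each build's bin directory and that a plain textual substitution of the old path is safe; back up first):
for f in /opt/perlbrew/perls/*/bin/* ; do
    # Rewrite shebang lines that still point at the old perlbrew root:
    perl -i.bak -pe 's{^#!\Q$ENV{HOME}\E/perl5/perlbrew}{#!/opt/perlbrew}' "$f"
done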
Con: Lots of work (though not nearly as much as the first option), fragile, and not guaranteed to work.
Create future builds in /opt/perlbrew, but keep existing builds where they are.
After installing perlbrew in /opt/perlbrew, run the following:
cd /opt/perlbrew/perls
for q in ~/perl5/perlbrew/perls/* ; do
    ln -s "$q"
done
Pro: Super simple and quick. Over time, you can phase out your ~/perl5/perlbrew (by deleting unneeded builds, by replacing them as per option 1, or by moving them as per option 2).
Con: Everyone that should have access to /opt/perlbrew also needs access to your ~/perl5/perlbrew.
Don't change PERLBREW_ROOT. Simply make /opt/perlbrew a symlink.
ln -s ~/perl5/perlbrew /opt/perlbrew
Pro: Super simple and quick.
Con: Everyone that should have access to /opt/perlbrew also needs access to your ~/perl5/perlbrew.

Using external files and modules in perl PAR Packer

I'm having some trouble using the pp command to create standalone executables on a Linux machine. It seems that every tutorial says something different, and I'm a bit confused. I'd like your help regarding two issues:
1. I'm trying to include a module I created (a .pm file), but I'm not sure how to do so and I keep getting error messages. Should I use the -M option? Or should it be -B? And once the module is included, how do I call it from the script? The usual way (i.e. "use module" and then "module::sub")?
2. I want to include some text files too. So far, I've tried the -a and -l options, but I'm not sure if they actually work. Which one should I use? Also, how do I open these files? For instance, if I pack the file tmp.txt, what should the open command look like?
Thank you very much!
Add modules with the -M option and use them as usual.
Add your text files with the -a option. From pp's manual:
By default, files are placed under / inside the package with their original names.
so you should be able to read these text files with:
my $content = PAR::read_file('your_file.txt');
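Putting both together, an invocation might look like this (a sketch; My::Module, tmp.txt, and script.pl are placeholder names for your own module, data file, and script):
pp -M My::Module -a tmp.txt -o myapp script.pl
Inside script.pl you write "use My::Module;" and call My::Module::sub as usual, and you read the packed text file with PAR::read_file('tmp.txt') as shown above.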

How to make my Perl module's README file compatible with Github's Markdown display?

I've authored the README file in my Perl module in Markdown. Github treats this README file as plain text. I tried renaming the file to "README.md"—which looks great on Github, but is invisible to Perl tools that look for a file named "README."
Is there any way I can have both a README file, and have my Markdown formatting be interpreted correctly by Github?
The only option I could come up with was to have both a README and a README.md, but I'd prefer not to have to manually keep the two files in sync.
Thanks for your help.
Format your README in POD, rename it README.pod, and then it works in both places!
For my purposes, I actually just generate my README.pod from the main POD by doing:
$ podselect lib/My/Main/Module.pm > README.pod
One caveat: named external links don't work correctly. L<GitHub|http://github.com> will unfortunately point to search.cpan.org, looking for a GitHub module. I have tried to inform them of this glitch, but it got me nowhere. Instead you can just use plain external links (i.e. GitHub: L<http://github.com>), and they work fine.
Good news, it appears that they have fixed this since the last time I checked!
Just a question, what parts of the Perl toolchain expect a README file? If you mean including it in your tarball, just be sure to add the file to your MANIFEST and it should get included.
Have you heard of POD? This is the standard documentation tool in Perl. POD is a simple text documentation format that actually lives in your code. One of the commands that come with perl is perldoc. You can use it to get the information of any Perl command. Try these:
$ perldoc File::Find
$ perldoc -f split
All Perl modules on CPAN are required to incorporate POD documentation. In fact, this is how the CPAN web pages themselves are built.
So, where am I going with this and how is this going to help you?
You should include POD documentation in your Perl program. Then, you can use the pod2text command to create your README for your Perl program:
$ pod2text myperl.pl > README
That handles half of your issue.
The other half is a bit more tricky. You need to install the Pod::Markdown module from CPAN. Then, you can run the pod2markdown command that comes with this module to create the Markdown version of your file:
$ pod2markdown myperl.pl > README.md
The results:
Your documentation lives, as it should, in your Perl program.
Users can use the perldoc program to print out complete documentation of your program.
You can use the pod2text tool to create your README file.
You can use the pod2markdown tool to create your README.md file.
As a bonus, you can use the Pod::Usage module that comes with Perl to show the POD documentation (or bits and pieces of it) as help text that's displayed when a user runs your program with the -help parameter.
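A minimal sketch of that bonus, assuming a -help flag handled with the core Getopt::Long and Pod::Usage modules:
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

# -help prints the full embedded POD and exits.
GetOptions('help' => \my $help) or pod2usage(2);
pod2usage(-verbose => 2) if $help;

# ... the rest of your program ...

__END__

=head1 NAME

myperl.pl - one-line description of the program

=head1 SYNOPSIS

myperl.pl [-help]

=head1 DESCRIPTION

The same POD that pod2text and pod2markdown turn into README files.

=cut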
So, one place where your documentation lives, and you're using a couple of helper programs to create the files Github and whatever Perl tools you use need.
If you don't mind using Dist::Zilla you can pretty much do away with maintaining a README entirely. For example Dist::Zilla::Plugin::ReadmeFromPod can create your README file by extracting the Pod from your main module. This means never having to write a README again.
I've never tried it myself, but you could look at something like Dist::Zilla::Plugin::ReadmeMarkdownFromPod to create your README automatically in markdown.
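With either plugin, enabling it in dist.ini is a one-line stanza (a hypothetical minimal excerpt):
; in dist.ini
[ReadmeFromPod]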
This may not be the exact answer you're looking for, but I think using this sort of a tool can save you a lot of time as it allows you avoid repeating yourself in your documentation.
Another solution, if you really want to distribute your module with a Markdown README and don't want to involve POD, is to:
rename your README file to README.md
record that change in the MANIFEST file
I think it can be an interesting solution because more people know Markdown syntax than POD. As the aim of the README file is to be read by anyone, Markdown should be considered.
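Concretely, that's just (a sketch):
mv README README.md
# then in MANIFEST, change the line "README" to "README.md"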
I was just looking for a solution for this problem and decided to use Dist::Zilla::Plugin::ReadmeAnyFromPod as it understands =attr and =method tags from Pod::Weaver.
The only option I could come up with was to have both a README and a README.md, but I'd prefer not to have to manually keep the two files in sync.
Then automatically keep them in sync?

Why doesn't wget get java files recursively?

I am trying to download all the folder structure and files under a folder in a website using wget.
Say there is a website like:
http://test/root. Under root, the structure looks like:
/A
/A1/file1.java
/B
/B1/file2.html
My wget cmd is:
wget -r http://test/root/
I got all the folders and the html files, but no java files. Why is that?
UPDATE1:
I can access the file in the browser using:
http://test/root/A/A1/file1.java
I can also download this individual file using:
wget http://test/root/A/A1/file1.java
wget can only follow links.
If there is no link to the files in the subdirectories, then wget will not find those files. wget will not guess any file names; it will not test exhaustively for filenames, and it does not practice black magic.
Just because you can access the files in a browser does not mean that wget can necessarily retrieve them. Your browser has code able to recognize the directory structure; wget only knows what you tell it.
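One quick way to see what wget can see is to fetch the directory page yourself and look for a link (a sketch):
# If this prints nothing, there is no link for wget to follow:
wget -q -O - http://test/root/A/A1/ | grep -i 'file1\.java'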
You can try adding the java file to an accept list first, perhaps that's all it needs:
wget -r -A "*.java" http://test/root
But it sounds like you're trying to get a complete offline mirror of the site. Let's start, as with any command we're trying to figure out, with man wget:
Wget can follow links in HTML, XHTML, and CSS pages, to create local
versions of remote web sites, fully recreating the directory structure
of the original site. This is sometimes referred to as "recursive
downloading." While doing that, Wget respects the Robot Exclusion
Standard (/robots.txt). Wget can be instructed to convert the links in
downloaded files to point at the local files, for offline viewing.
What We Need
1. Proper links to the file to be downloaded.
In your index.html file, you must provide a link to the Java file; otherwise wget will not recognize it as needing to be downloaded. For your current directory structure, ensure file2.html contains a link to the Java file, formatted to point at a directory above the current one:
<a href="../../A/A1/file1.java">JavaFile</a>
However, if file1.java is not sensitive and you routinely do this, it's cleaner and less code to put an index.html file in your root directory and link from there:
<a href="A/A1/file1.java">JavaFile</a>
If you only want the Java files and want to ignore HTML, you can use --reject like so:
wget -r -nH --reject="file2.html" http://test/root/
### Or to reject ALL html files ###
wget -r -nH --reject="*.html" http://test/root/
This will recursively (-r) go through all directories starting at the point we specify.
2. Respect robots.txt
Ensure that if you have a robots.txt file in your */root/* directory, it does not prevent crawling. If it does, you need to instruct wget to ignore it by adding the following option to your wget command:
wget ... -e robots=off http://test/root
3. Convert remote links to local files.
Additionally, wget must be instructed to convert links so they point at the downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is to use mirror mode.
Try this:
wget -mpEk http://test/root/
# If robots.txt is present:
wget -mpEk -e robots=off http://test/root/
Using -m instead of -r is preferred, as it doesn't have a maximum recursion depth and it downloads all assets. Mirror mode is pretty good at determining the full depth of a site; however, if you have many external links you could end up downloading more than just your site, which is why we also use -p -E -k. The output should be all the files required to render the pages, with the directory structure preserved. -k converts the links to point at the local files.
Since you should have a link set up, you should get file1.java inside the A1/ directory. This command should also work as is, without a specific link to the Java file inside your index.html or file2.html, but an explicit link doesn't hurt, since it preserves the rest of your directory structure. Mirror mode also works with a directory structure served over ftp://.
General rule of thumb:
Depending on the size of the site you are mirroring, you're sending many calls to the server. In order to prevent yourself from being blacklisted or cut off, use the wait option to rate-limit your downloads. For a site the size of the one you posted you shouldn't have to, but for any large site you're mirroring you'll want to use it:
wget -mpEk --no-parent -e robots=off --random-wait http://test/root/

Should I check in *.mo files?

Should I check in *.mo translation files into my version control system?
This is a general question. But in particular I'm working on Django projects with git repositories.
The general answer is:
if you do need those files to compile or deploy (in short: to "work with") your component (the set of files queried from your VCS), then yes, they should be stored in it (here: in Git).
This is the same for other kinds of files (project files, for instance).
.mo files are particular: they are generated by the django-admin.py compilemessages utility.
This tool runs over all available .po files and creates .mo files, which are binary files optimized for use by gettext.
Meaning:
you should be able to rebuild them every time you need them (guaranteeing, in effect, that they stay in sync with their .po counterparts)
Git is not so good with binary storage, and regenerating the files avoids having Git store a full new version for every change
So the specific answer is not so clear-cut:
if your .po files are stable and will not evolve too often, you could definitely store the .mo files
you should absolutely store a big README file explaining how to generate the .mo files from the .po ones.
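For a Django project, that README boils down to one regeneration step (a sketch; run it from the directory containing your locale/ tree):
django-admin.py compilemessages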
The general answer is to not store generated content in version control.
You can include it in the tarball if it requires rare tools to build, or even keep a separate repository or a disconnected branch with only those generated files (like the 'html' and 'man' branches in the git.git repository).
For the question as asked, Jakub's answer is pretty neat.
But one could ask:
So where should I store such files? Should I generate them every time I deploy my code?
And for that... it depends. You could ship them in the tarball (as Jakub suggested) or, even better, create a pip or system package (an RPM for Fedora, a DEB for Debian, etc.).