Why is "package" keyword sometimes separated by a comment from the package name? - perl

Analyzing the sources of CPAN modules, I see something like this:
package # hide from PAUSE
Obviously, it's taken from Try::Tiny, but I have seen this kind of comment between the package keyword and the package identifier in other modules too.
Why is this procedure used? What is its goal and what benefits does it have?

It is indeed a hack to hide a package from PAUSE's indexer.
When a distribution is uploaded to PAUSE, the indexer will examine each file in the upload, looking for the names of packages that are included in the distribution. Any indexed packages can show up in CPAN search results.
There are many reasons for not wanting the indexer to discover your packages. Your distribution may have many small or insignificant packages that would clutter up the search results for your module. You may have packages defined in your t (test) directory or some other non-standard directory that are not meant to be installed as part of the distribution. Your distribution may include files from a completely different distribution (that somebody else wrote).
The hack works because the indexer strictly looks for the keyword package and an expression that looks like a package name on the same line.
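To make the same-line requirement concrete, here is a minimal sketch in Python of a line-oriented scan. It is not the PAUSE indexer itself (which is written in Perl and uses a more elaborate regex); the pattern and the package names `My::Module` and `My::Hidden::Helper` are assumptions chosen for illustration.

```python
import re

# Simplified stand-in for the indexer's pattern: the keyword and the name
# must appear on the same line. This regex is an assumption for
# illustration, not the one PAUSE actually uses.
PACKAGE_LINE = re.compile(r"^\s*package\s+([A-Za-z_]\w*(?:::\w+)*)")


def indexed_packages(source):
    """Return the package names a line-by-line scan would discover."""
    found = []
    for line in source.splitlines():
        match = PACKAGE_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found


# Hypothetical Perl sources: one ordinary declaration, one split in two.
visible = "package My::Module;\n"
hidden = "package # hide from PAUSE\n  My::Hidden::Helper;\n"

print(indexed_packages(visible))  # ['My::Module']
print(indexed_packages(hidden))   # []
```

Perl itself does not care about the line break, so the split declaration still compiles; only the line-by-line scan misses it.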
Nowadays, you can include a META.yml file with your distribution. The PAUSE indexer will look for and respect a no_index specification in this file. But this is a relatively new capability of the indexer so older modules and old-timer CPAN contributors will still use the line break hack.
Here's an example of a no_index spec from Forks::Super:
no_index:
  directory:
    - t
    - inc
  package:
    - Sys::CpuAffinity
    - Signals::XSIG
    - Signals::XSIG::Default
    - Signals::XSIG::TieArray56
Sys::CpuAffinity and Signals::XSIG are separate distributions that are also packaged with Forks::Super. Some of the test scripts contain package declarations (e.g., Arbitrary::Test::Package) that shouldn't be indexed.

Okay, here's another shot at this phenomenon ... I've been whacky-hacking Perl for a dozen years and I've rarely seen this packy hack, and possibly simply ignored it and never bothered to investigate. One thing seems clear, though. There's some hackish processing going on at PAUSE that's been crafted in the good ol' Perl'n'UNIX school of thought that without the shadow of a doubt involves line-oriented text parsing, so they parse those Perl files, possibly even using grep, but rather perl itself, who knows, to extract package names and then kick off some procedure or get some stats or whatnot. And to trip up this procedure and hack around its ways, the author splits the package declaration into two lines so the hacky packy grep job doesn't have a clue that there's a package declared right under its nose, and the programmer is happy about his hacky skills and the PAUSE stats or whatever it is they're cobbling together are as they should be. Does that make sense?


How can I have 2 versions of Gensim for summarization in one Jupyter notebook?

I want to have 2 versions of Gensim installed so that I can use the summarization and keyword functions from the old Gensim.
How can I set up this scenario?
In general, a single Jupyter notebook is backed by a single Python interpreter/environment, and popular packages at their 'official' installation paths can only be installed once.
There are a few hackish workarounds suggested in answers like:
Installing multiple versions of a package with pip
However, each workaround presents operational problems.
One approach is to install the older package to a non-standard path (directory) that's still found by Python importing logic (controlled by PYTHONPATH). For example, put/move the older copy of Gensim to a gensim_old package directory. But: this is only likely to work well with very simple (single-.py-file) packages.
With any significant library (like Gensim) which cross-imports a lot of things from its own utility modules, using the standard paths, lots of things are likely to break unless you dig into all involved individual files to change their import paths. That's kind of kludgey & hard-to-maintain. (Though, to the extent you're just using one old version, say gensim-3.8.3 for the removed summarization feature, perhaps it'd be worth fighting through this process once, then keeping the changes around.)
Another approach is to create a totally-separate Python environment with the alternate version, and only use that other environment from the notebook by a system-call – via either something in Python-code like subprocess.call(), or the notebook-cell ! or !! magic-escapes to run a shell command. That is, you give up the ability to run individual interactive lines of Python in that alt environment - but could still send it batches of data, and either capture the console output or observe its output files to continue processing in your notebook.
I'd expect this to be a better option – cleaner & more-maintainable – provided that either the old-version-functionality (summarization) or new-version-functionality (whatever else) can be condensed into one (or a few) single-step scripts.
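As a rough sketch of that subprocess approach: the helper below runs a snippet under a different interpreter, feeds it text on stdin, and returns whatever it prints. The helper name `run_in_env`, the worker string, and the commented-out interpreter path are all hypothetical; the worker assumes the separate environment has gensim 3.8.3 installed, where `gensim.summarization.summarize` still exists.

```python
import subprocess
import sys


def run_in_env(python_exe, code, stdin_text, *args):
    """Run `code` under another interpreter, feed it text, return its stdout."""
    result = subprocess.run(
        [python_exe, "-c", code, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout


# Worker meant for the old environment; assumes gensim 3.8.3 is installed there.
SUMMARIZE_WORKER = (
    "import sys\n"
    "from gensim.summarization import summarize\n"
    "print(summarize(sys.stdin.read(), ratio=float(sys.argv[1])))\n"
)

# Hypothetical path to the interpreter of the separate environment:
# OLD_PYTHON = "/path/to/old-gensim-env/bin/python"
# summary = run_in_env(OLD_PYTHON, SUMMARIZE_WORKER, long_text, "0.2")

# Self-check that needs no second environment: reuse the current interpreter.
print(run_in_env(sys.executable, "import sys; print(sys.stdin.read().upper())", "ok"))
```

From a notebook cell, the commented-out call is the part you would actually use; the rest of the notebook keeps running against the new Gensim.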
Another option would be to try to completely copy the gensim.summarization source code files to some new location inside your own project – performing whatever (few, minor) edits are necessary to ensure it works from the alternate location.
One of the reasons that functionality was removed was that its approach to things like tokenization was not consistent/integrated with other Gensim practices – which actually means it's likely to be a little easier to keep it working (given its use of its own idiosyncratic approaches) separately.
Personally, I'd rank the desirability of these three options as:
(best) Section off the summarization tasks to be run via subprocess executions in a separate Python environment, which has only the older package installed.
(maybe ok) Copy the 10 .py files that implement gensim.summarization to your own local module. Edit lightly as necessary to ensure they still work. (That should mainly be updating import lines, but might require a few other adaptations to other Python 3.x/Gensim 4.x changes.)
(probably too messy) Install the whole old package to a non-standard directory, edit lots of files to ensure anything you're using still works.
Finally, note that the main reason the feature was removed is that it did not offer very impressive or adaptable results. While I've seen some people say it's worked OK for their applications, I've never seen even so much as a demo where its practices/algorithm – which can only extract some subset of important sentences, never paraphrase – gave impressive results.
So unless you already know that its approach works well for your needs, don't get your hopes up! Good luck.

Simplest way to get a comprehensive listing of package names available in CPAN?

Suppose that, as a private project, I have implemented a Perl package, and tested it, both formally and through extensive everyday use. I find the package useful and solid enough to warrant submitting it to CPAN.
Up to this point, since the package has been a private project, I have not worried too much about the package's name, but now that I want to submit it to CPAN, I would like the package's name to fit well within the ecology of package names already in CPAN.
In order to find a suitable "CPAN name" for my package, I would have to inspect a comprehensive listing of all these package names [1].
What is the simplest way to get this comprehensive listing of names of packages in CPAN?
(IOW, if the question above is already clear enough for you, you may safely ignore what follows.)
I don't think that I can give a technically correct formal definition of what I mean here by "package name", so let me at least give an "operational definition".
If, for example, the one-liner
$ perl -MFoo::Bar::Baz -c -e 1
fails with an error beginning with
Can't locate Foo/Bar/Baz.pm in @INC ...
..., but after installing some distributions from CPAN, the same one-liner succeeds with
-e syntax OK
...then I'll say that "Foo::Bar::Baz is a package name in CPAN".
(We could split hairs over the package/module distinction, and consider scenarios in which the distinction matters, but please let's not.)
Furthermore, if after inspecting the list this question asks about I discover that, on the one hand, there are in fact many eminent package names in CPAN that begin with the prefix Foo::Bar::, and on the other, there are none (or negligibly few) that begin with the prefix Fubar::, then this would be a good enough reason for me to change the name of my Fubar::Frobozz package to Foo::Bar::Frobozz before submitting it to CPAN.
[1] Of course, after inspecting such a list, I may discover that my package does not add sufficiently new functionality relative to what's already available in CPAN to warrant submitting my package to CPAN after all.
If you have run cpan before, you have downloaded a comprehensive package and distribution list under <cpan-home>/sources/modules/02packages.details.txt.gz.
A fresh copy is available on any CPAN mirror, e.g.
http://www.cpan.org/modules/02packages.details.txt.gz .
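Once you have that file, tallying names by prefix takes only a few lines. The sketch below is in Python and assumes the usual layout of 02packages.details.txt: a block of header lines, a blank line, then one whitespace-separated record per package (name, version, distribution path). The sample data and the function names are made up for illustration.

```python
import gzip
from collections import Counter


def parse_packages(lines):
    """Yield (package, version, dist_path) tuples from index lines."""
    in_body = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_body:
            # Assumption: the header block ends at the first blank line.
            in_body = line.strip() == ""
            continue
        fields = line.split()
        if len(fields) == 3:
            yield tuple(fields)


def read_index(path):
    """Parse a local copy of 02packages.details.txt.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(parse_packages(handle))


def top_level_counts(packages):
    """Count packages by their first namespace component."""
    return Counter(name.split("::")[0] for name, _, _ in packages)


# Made-up sample in the assumed layout, so the sketch runs without a download.
SAMPLE = """File:         02packages.details.txt
Description:  sample data for illustration

Foo::Bar::Baz    1.02  A/AU/AUTHOR/Foo-Bar-Baz-1.02.tar.gz
Foo::Bar::Qux    undef A/AU/AUTHOR/Foo-Bar-Baz-1.02.tar.gz
Fubar::Frobozz   0.01  O/OT/OTHER/Fubar-Frobozz-0.01.tar.gz
"""

counts = top_level_counts(parse_packages(SAMPLE.splitlines()))
print(counts.most_common())  # [('Foo', 2), ('Fubar', 1)]
```

For the real index, `top_level_counts(read_index("02packages.details.txt.gz"))` would give the same kind of tally over every indexed package.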
PAUSE::Packages can do what you want. However, rather than just working from this list, you may want to use http://prepan.org/, which can provide advice and review before you submit to CPAN. Read up on the naming of modules first, of course.
Are you sure that's a thing you want? There are 33,623 distributions on CPAN at the time of writing. Within cpan you can enter
cpan> d /./
That's d for distributions, followed by a regex pattern that matches the names you're interested in.
If you're really interested in packages -- and a distribution may contain multiple package names -- you need
cpan> m/./
where m is for modules. There are 163,136 of those, which means there's an average of four or five packages per distribution, and it takes cpan a few minutes to generate the list. (I'm sorry, I didn't monitor the exact time.)
You could use MetaCPAN::Client
I found this article which gives the idea about using this module.
use strict;
use warnings;
use MetaCPAN::Client;

my $mcpan = MetaCPAN::Client->new();
my $release_results = $mcpan->release({ status => 'latest' });
while ( my $release = $release_results->next ) {
    printf "%s v%s\n", $release->distribution, $release->version;
}
Currently this gave me 32601 results like this:
Proc-tored v0.11
Locale-Utils-PlaceholderBabelFish v0.004
Perinci-To-Doc v0.83
Mojolicious-Plugin-Qooxdoo v0.905
App-cdnget v0.05
Baal-Parser v0.01
Acme-DoOrDie v0.001
Net-Shadowsocks v0.9.0
MetaCPAN-Client v2.006000
This module also gives information about releases, modules, authors, and files, and it uses Elasticsearch.
It also gets updated regularly on every MetaCPAN API change.

Why isn't my CPAN distribution indexed by PAUSE?

I've uploaded my stasis distribution to PAUSE, but it isn't in the index.
I thought this was because it didn't contain a package, so I added a package declaration to the stasis script in v0.04 like this:
#!/usr/bin/env perl
package stasis;
package main;
but it still wasn't indexed.
Is there any way to get this distribution indexed that doesn't involve creating a boilerplate module file (e.g. adding lib/stasis.pm to the distribution)?
I believe CPAN does not index scripts.
IMO your best option is to make a module that allows doing programmatically what your script does (and make the script use it).
You could put in a fake module or make it think your script is a module (I think listing it in provides works), but I wouldn't if I were you.
Because your package statement was not in a *.pm file.
The PAUSE indexer is open source. It is a little complicated to unpack, but the regex for extracting a package name in a distribution is in PAUSE::pmfile::packages_per_pmfile, which is a method and a package that is meant to process *.pm files only.
The PAUSE::dist::_index_by_meta method provides the alternate method of declaring a package through the provides keyword in the metafile.

How do I find and delete duplicate Perl modules from the library?

I've been using Module::Build to manage my module installation and I've found that I have duplicate versions of a given module in different parts of the library:
This is a concern to me, because I can't be sure that they will stay in sync.
Is there a way to find and remove duplicate files of this sort? Or are these here because of something that I did and can I control that?
That's probably because you added/removed XS code from your module. If you added XS code, the first one is the current one (good), if you removed all XS code, the second one is the current one (not good). This is unlikely to be a long-term problem, so a long-term solution may be unnecessary.
If you run perl -V, you'll see the order of the paths in your @INC. Perl will likely have your i686-linux/site_perl directory before the normal one, so your XS version will get loaded, and the other one will be ignored. It doesn't matter if they're in sync or not; only one will get loaded. Thus, the important thing is that if you remove all the XS code so it becomes a pure-perl module, you'll have to delete the XS version from your tree. This is rare: once you start doing XS, usually it's not removed. Even dual-life modules (List::MoreUtils) keep their XS code and merely have a way to determine if it got installed or not, and have a way to disable the XS code for testing purposes. But they don't actually get rid of the XS code.
Most likely, you added XS code so it was no longer pure-perl, and everything will be fine.
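The "only one will get loaded" point comes down to a first-match search along @INC. Here is a small Python sketch of that lookup; the directory names and the set of "installed" files are hypothetical stand-ins, and the real search is of course done by perl, not by this function.

```python
def resolve_module(inc_dirs, rel_path, exists):
    """Return the first candidate path that exists, mimicking an @INC search."""
    for directory in inc_dirs:
        candidate = f"{directory}/{rel_path}"
        if exists(candidate):
            return candidate
    return None


# Hypothetical install: the same module present in two library directories.
installed = {
    "/site_perl/i686-linux/My/Module.pm",  # copy installed with XS code
    "/site_perl/My/Module.pm",             # pure-Perl copy
}
inc = ["/site_perl/i686-linux", "/site_perl"]  # arch-specific directory first

print(resolve_module(inc, "My/Module.pm", installed.__contains__))
# /site_perl/i686-linux/My/Module.pm
```

The second copy is never consulted as long as the first directory contains the file, which is why the two copies drifting apart does not affect what gets loaded.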
Why? Is it causing a problem? Identifying all the files from a distribution could be tricky, so trying to remove an installed distribution is more likely to cause a problem than leaving things be.
You said you're worried about the installations getting out of sync, but that makes no sense. Why do you care about the state of an installation you're not using?
You will most likely always have some duplicate-looking modules, because modules are installed in one (or more) of the paths defined in the @INC list. If you inspect the module versions with cpan -l, you will probably discover that they have different versions. Please see this answer for many more details.
However, one can fairly wonder why Perl never supplied a saner (and more people-friendly) way to organize and inspect the modules that have been installed.

Different architectures in the same or different directory trees?

At $work, we maintain a set of Perl modules at a central location for easy inclusion via PERL5LIB. As there is a re-installation ahead and we need to provide the modules for both 32 and 64 bit architecture, we are wondering if it's better to install them into the same directory tree, relying on the $archname subdirectories, or keep the two architectures entirely separate and duplicate each module.
I was not very successful at researching the inner workings of the Perl module lookup process involving $archname, maybe someone can point me in the right direction.
In your experience, what are the pros and cons of the two approaches?
From perldoc lib:
When using use lib LIST;
For each directory in LIST (called $dir here) the lib module also checks to see if a directory called $dir/$archname/auto exists. If so the $dir/$archname directory is assumed to be a corresponding architecture specific directory and is added to @INC in front of $dir.
lib.pm also checks if directories called $dir/$version and $dir/$version/$archname exist and adds these directories to @INC.
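The rule quoted above can be sketched as a small function. This is a Python illustration of the described behaviour, not lib.pm's code: the function name, the example directory, the archname and the version are hypothetical, and placing the version directories in front of $dir is an assumption (the quoted text only says they are added).

```python
def expand_lib_dirs(dirs, archname, version, is_dir):
    """Build the search-path entries that `use lib` would contribute."""
    inc = []
    for d in dirs:
        entries = [d]
        # $dir/$archname goes in front of $dir if $dir/$archname/auto exists.
        if is_dir(f"{d}/{archname}/auto"):
            entries.insert(0, f"{d}/{archname}")
        # Version-specific directories are added when present
        # (putting them in front is an assumption of this sketch).
        for extra in (f"{d}/{version}", f"{d}/{version}/{archname}"):
            if is_dir(extra):
                entries.insert(0, extra)
        inc.extend(entries)
    return inc


# Hypothetical shared tree that holds a 64-bit arch subdirectory.
tree = {"/opt/perl5lib/x86_64-linux/auto"}
print(expand_lib_dirs(["/opt/perl5lib"], "x86_64-linux", "5.30.0", tree.__contains__))
# ['/opt/perl5lib/x86_64-linux', '/opt/perl5lib']
```

With both architectures installed under one tree, each perl only picks up the subdirectory matching its own $archname, which is what makes the shared-tree layout workable.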
IMHO, it is more idiomatic - and dare I say, neater - to use the per-architecture subdirectories, like Perl's standard libraries would.
However, it might be more straightforward to manage an entirely separate tree per architecture for your own libraries, though not by a large margin once you create a few basic tools/scripts to do so.
Build the modules separately on each system so that you only get the files needed there. Or use a packaging system that distinguishes between architectures. Don't try to provide the files for all architectures to all systems.