I've been using Module::Build to manage my module installation and I've found that I have duplicate versions of a given module in different parts of the library:
./site_perl/5.14.2/i686-linux/site_perl/Support/Dump.pm
./site_perl/5.14.2/Support/Dump.pm
This is a concern to me, because I can't be sure that they will stay in sync.
Is there a way to find and remove duplicate files of this sort? Or are these here because of something that I did and can I control that?
That's probably because you added or removed XS code from your module. If you added XS code, the first one is the current one (good); if you removed all XS code, the second one is the current one (not good). This is unlikely to be a long-term problem, so a long-term solution may be unnecessary.
If you run perl -V, you'll see the order of every path in your @INC. Perl will likely have your i686-linux/site_perl directory before the normal one, so your XS version will get loaded, and the other one will be ignored. It doesn't matter whether they're in sync or not; only one will get loaded. Thus, the important thing is that if you remove all the XS code so it becomes a pure-Perl module, you'll have to delete the XS version from your tree. This is rare; once you start doing XS, it usually doesn't get removed. Even dual-life modules (List::MoreUtils) keep their XS code and merely have a way to determine whether it got installed, plus a way to disable the XS code for testing purposes. But they don't actually get rid of the XS code.
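If you want to confirm which copy actually wins, Perl records the resolved path of every file it loads in %INC, so a quick check like this (using the Support::Dump name from the question) will show which file got picked up:

use Support::Dump;
print $INC{'Support/Dump.pm'}, "\n";   # full path of the copy Perl actually loaded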
Most likely, you added XS code so it was no longer pure-perl, and everything will be fine.
Why? Is it causing a problem? Identifying all the files from a distribution could be tricky, so trying to remove an installed distribution is more likely to cause a problem than leaving things be.
You said you're worried about the installations getting out of sync, but that makes no sense. Why do you care about the state of an installation you're not using?
You will most likely always have some duplicate-looking modules, because modules are installed into one (or more) of the paths in the @INC list. If you inspect the module versions with cpan -l, you will probably discover that they have different versions. Please see this answer for many more details.
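To see which directories are being combined, and in what order Perl searches them (the first match in @INC wins), a one-liner is enough:

perl -le 'print for @INC'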
However, one can always wonder why Perl never supplied a saner (and more user-friendly) way to organize and inspect the modules that have been installed.
I want to have two versions of Gensim installed so that I can keep using the summarization and keyword functions from the old Gensim.
How can I set up this scenario?
In general, a single Jupyter notebook is backed by a single Python interpreter/environment, and popular packages at their 'official' installation paths can only be installed once.
There are a few hackish workarounds suggested in answers like:
Installing multiple versions of a package with pip
However, each workaround presents operational problems.
One approach is to install the older package to a non-standard path (directory) that's still found by Python's importing logic (controlled by PYTHONPATH). For example, put/move the older copy of Gensim into a gensim_old package directory. But: this is only likely to work well with very simple (single-.py-file) packages.
With any significant library (like Gensim) which cross-imports a lot of things from its own utility modules, using the standard paths, lots of things are likely to break unless you dig into all of the individual files involved to change their import paths. That's kind of kludgy and hard to maintain. (Though, to the extent you're just using one old version, say gensim-3.8.3 for the removed summarization feature, perhaps it'd be worth fighting through this process once, then keeping the changes around.)
Another approach is to create a totally separate Python environment with the alternate version, and only use that other environment from the notebook via a system call – either something in Python code like subprocess.call(), or the notebook-cell ! or !! magic escapes to run a shell command. That is, you give up the ability to run individual interactive lines of Python in that alt environment – but you could still send it batches of data, and either capture the console output or observe its output files to continue processing in your notebook.
I'd expect this to be a better option – cleaner & more-maintainable – provided that either the old-version-functionality (summarization) or new-version-functionality (whatever else) can be condensed into one (or a few) single-step scripts.
Another option would be to try to completely copy the gensim.summarization source code files to some new location inside your own project – performing whatever (few, minor) edits are necessary to ensure it works from the alternate location.
One of the reasons that functionality was removed was that its approach to things like tokenization was not consistent/integrated with other Gensim practices – which actually means it's likely to be a little easier to keep it working (given its use of its own idiosyncratic approaches) separately.
Personally, I'd rank these three options' desirability as:
(best) Section off the summarization tasks to be run via subprocess executions in a separate Python environment, which has only the older package installed.
(maybe ok) Copy the 10 .py files that implement gensim.summarization into your own local module. Edit lightly as necessary to ensure they still work. (That should mainly be updating import lines, but might require a few other adaptations to other Python 3.x/Gensim 4.x changes.)
(probably too messy) Install the whole old package to a non-standard directory, edit lots of files to ensure anything you're using still works.
Finally, note that the main reason the feature was removed is that it did not offer very impressive or adaptable results. While I've seen some people say it's worked OK for their applications, I've never seen even so much as a demo where its practices/algorithm – which can only extract some subset of important sentences, never paraphrase – gave impressive results.
So unless you already know that its approach works well for your needs, don't get your hopes up! Good luck.
I'm working on a Perl module that has a lot of XS code and also uses Dist::Zilla to manage packaging. What's the best way to test things efficiently? I know about dzil test, but that's pretty slow because it does a full build/compile/test cycle every time it's invoked.
It would be nice to only update the parts that need updating since last test, and also to be able to run only certain t/*.t test scripts rather than all of them. Anyone have a solution that they like?
I have, in the past, just taken a Build.PL/Makefile.PL as generated by dzil and dropped it into the source repository as a "Makefile_dev.PL" (or "Build_dev.PL"), added it to MANIFEST.SKIP (or the dzil-based, generated equivalent) and used it during development.
For my XS modules, I use either MakeMaker::Custom or ModuleBuild::Custom (both by me). If you set things up properly, you can run Makefile.PL or Build.PL directly in your repo without invoking dzil at all. To run specific tests, you just build the dist and use prove -b testname.
Some examples using ModuleBuild::Custom: Media-LibMTP-API, Win32-IPC.
An example using MakeMaker::Custom: Win32-Setupsup.
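With that setup, a typical edit/test cycle looks something like the following (assuming a generated Build.PL; the test file name is just a placeholder):

perl Build.PL
./Build                 # rebuilds only what changed, including the XS parts
prove -b t/basic.t      # -b adds blib/ to @INC, so only this one test script runs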
I know I'm pigeonholing myself as old-school, but it's for these very reasons that I don't use Dist::Zilla: when it works it's great; when it doesn't, it can be really hard to make it do what you want.
I guess that means my answer is: when it gets too hard, just move to one of the primary tools that dzil generates, i.e. EUMM or MB, directly.
Analyzing sources of CPAN modules I can see something like this:
...
package # hide from PAUSE
Try::Tiny::ScopeGuard;
...
Obviously, it's taken from Try::Tiny, but I have seen this kind of comment between the package keyword and the package identifier in other modules too.
Why is this technique used? What is its goal, and what benefits does it have?
It is indeed a hack to hide a package from PAUSE's indexer.
When a distribution is uploaded to PAUSE, the indexer will examine each file in the upload, looking for the names of packages that are included in the distribution. Any indexed packages can show up in CPAN search results.
There are many reasons for not wanting the indexer to discover your packages. Your distribution may have many small or insignificant packages that would clutter up the search results for your module. You may have packages defined in your t (test) directory or some other non-standard directory that are not meant to be installed as part of the distribution. Your distribution may include files from a completely different distribution (that somebody else wrote).
The hack works because the indexer strictly looks for the keyword package and an expression that looks like a package name on the same line.
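In other words, all that matters is where the line break falls. A minimal sketch (the package name here is just a placeholder):

package Some::Helper;    # keyword and name on one line: the indexer sees and indexes it

package                  # hide from PAUSE
    Some::Helper;        # name on the next line: the indexer misses it, but perl compiles it exactly the same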
Nowadays, you can include a META.yml file with your distribution. The PAUSE indexer will look for and respect a no_index specification in this file. But this is a relatively new capability of the indexer so older modules and old-timer CPAN contributors will still use the line break hack.
Here's an example of a no_index spec from Forks::Super
no_index:
  directory:
    - t
    - inc
  package:
    - Sys::CpuAffinity
    - Signals::XSIG
    - Signals::XSIG::Default
    - Signals::XSIG::TieArray56
Sys::CpuAffinity and Signals::XSIG are separate distributions that are also packaged with Forks::Super. Some of the test scripts contain package declarations (e.g., Arbitrary::Test::Package) that shouldn't be indexed.
Okay, here's another shot at this phenomenon... I've been whacky-hacking Perl for a dozen years and I've rarely seen this packy hack; possibly I simply ignored it and never bothered to investigate. One thing seems clear, though. There's some hackish processing going on at PAUSE, crafted in the good ol' Perl'n'UNIX school of thought, that without the shadow of a doubt involves line-oriented text parsing: they parse those Perl files, possibly even using grep rather than perl itself, who knows, to extract package names and then kick off some procedure or gather some stats or whatnot. To trip up this procedure and hack around its ways, the author splits the package declaration across two lines so the hacky packy grep job doesn't have a clue that there's a package declared right under its nose, the programmer is happy about his hacky skills, and the PAUSE stats or whatever it is they're cobbling together are as they should be. Does that make sense?
I'm about to rewrite a large portion of a project that I have developed over the last 10 years while learning Perl. There is a lot of optimisation that can be gained.
A key part of the code is a large if/elsif block that requires xxx.cgi files depending on a POST value. E.g.:
if($FORM{'action'} eq "1"){require "1.cgi";}
elsif($FORM{'action'} eq "2"){require "2.cgi";}
elsif($FORM{'action'} eq "3"){require "3.cgi";}
elsif($FORM{'action'} eq "4"){require "4.cgi";}
It has many more branches like this, but just how expensive is using "require" in Perl?
require itself has a relatively low cost in any case and, if you require the same file more than once within a single run of your program, it will detect that the file has already been loaded and not attempt to load it a second time. However, if you have a long and highly-populated search path (@INC) and you require (or use) a lot of files, it's possible that all of the directory searches could add up; this isn't common (and doesn't sound likely in your case), but it can be improved by reorganizing your module directories so that the things you're loading show up earlier in @INC.
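Here is a small sketch of that load-once behaviour, borrowing the file names from the question (and assuming 1.cgi can be found via @INC, e.g. because the current directory is on the search path):

require "1.cgi";              # first time: searches @INC, then compiles and runs 1.cgi
require "1.cgi";              # second time: %INC already has an entry, so this is a cheap no-op
print $INC{"1.cgi"}, "\n";    # the resolved path Perl recorded for the first load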
The potentially-major performance hit referred to by earlier answers is the cost of compiling the code in the files you require. Getting rid of the require by moving the code into your main program will not help with this, as the code will still need to be compiled. In your case, it would probably make things worse, as it would cause the code for all options to be compiled on every run rather than only compiling the code used by the one action selected by the user.
As has been said, it really depends on the actual code in those files. Your best bet would be to do tests using Devel::NYTProf and/or Benchmark to see where the most time is being spent in your code if you are unhappy with its performance.
You can also read Profiling Perl on perl.com, but it is a bit outdated as it uses Devel::DProf.
Not an answer to your primary question, but still a good idea for a code refactor that I read recently on Ovid's blog.
The first time, possibly expensive; Perl has to search a path to find the file and load it up. Subsequent times, it's cheap -- a table is consulted and the file isn't actually loaded a second time. If this is in a CGI that is run once per request and then exited, then this is not too good.
It's really going to depend on the size of the files you're calling. If you have massive CGI files, then it might hurt the performance of your software. If we're talking 6 or 7 lines of code each, then it's no issue. Try benchmarking your program's performance with and without, and make your own judgement.
I have gone through the source code of Data::Dumper. In this package, I didn't understand what's going on with DumpXS. What is the use of DumpXS?
I have searched for information about this and read that it is equivalent to the Dump function but faster. But I didn't understand it.
The XS language is a glue between normal Perl and C. When people want to squeeze every last bit of performance out of an operation, they try to write it as close to the C code as possible. Python and Ruby have similar mechanisms for the same reason.
Some Perl modules have an XS implementation to improve performance. However, you need a C compiler to install it. Not everyone is in a position to install compiled modules, so the modules also come in a "PurePerl" or "PP" version that does the same thing just a bit slower. If you don't have the XS implementation, a module such as Data::Dumper can automatically use the pure Perl implementation. In this case, Data::Dumper also lets you choose which one you want to use.
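For example, a minimal sketch (Dumpxs is the actual method name inside Data::Dumper, and it only exists when the XS part was compiled; $Data::Dumper::Useperl forces the pure-Perl code path):

use Data::Dumper;

my $data = { list => [ 1, 2, 3 ] };

print Data::Dumper->Dump([$data], ['data']);   # uses the XS implementation when available
$Data::Dumper::Useperl = 1;                    # ask for the pure-Perl implementation instead
print Data::Dumper->Dump([$data], ['data']);   # same output, just a bit slower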
A lot of Perl modules have "XS" versions, like JSON::XS. The XS in the name means that it partly uses C in order to increase the speed or other efficiency of the module. I don't know this particular case, but it is probably that.
And if you want a bit more info on XS go to http://perldoc.perl.org/perlxs.html
But I am curious what led you to this question.