Should Nix packages set `LC_ALL` themselves?

At least on non-NixOS systems, it looks like Nix-installed packages do not use the proper encoding; see, for example, https://github.com/srid/emanote/issues/125
They do not work unless something like LC_ALL=C.UTF-8 is set manually. What is the Nix way to spare users from having to do this manual configuration?

Related

How can I have 2 versions of Gensim for summarization in one Jupyter notebook?

I want to have 2 versions of Gensim so that I can use the summarization and keyword functions from the old Gensim.
How can I set up this scenario?
In general, a single Jupyter notebook is backed by a single Python interpreter/environment, and popular packages at their 'official' installation paths can only be installed once.
There are a few hackish workarounds suggested in answers like:
Installing multiple versions of a package with pip
However, each workaround presents operational problems.
One approach is to install the older package to a non-standard path (directory) that's still found by Python's importing logic (controlled by PYTHONPATH). For example, put/move the older copy of Gensim into a gensim_old package directory. But: this is only likely to work well with very simple (single-.py-file) packages.
With any significant library (like Gensim) that cross-imports a lot of things from its own utility modules via the standard paths, lots of things are likely to break unless you dig into all the individual files involved and change their import paths. That's kind of kludgey and hard to maintain. (Though, to the extent you're just using one old version, say gensim-3.8.3 for the removed summarization feature, perhaps it'd be worth fighting through this process once, then keeping the changes around.)
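As a minimal sketch of that idea (the old_pkgs directory name is hypothetical, and this assumes you have already fixed up the relocated package's internal imports):

import sys

# Prepending here has the same effect as putting the directory on PYTHONPATH.
sys.path.insert(0, "./old_pkgs")

import gensim_old  # the relocated/renamed copy of the old package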
Another approach is to create a totally separate Python environment with the alternate version, and use that other environment from the notebook only via system calls: either something in Python code like subprocess.call(), or the notebook-cell ! or !! magic escapes that run a shell command. That is, you give up the ability to run individual interactive lines of Python in that alternate environment, but you can still send it batches of data, and either capture its console output or observe its output files to continue processing in your notebook.
I'd expect this to be a better option, cleaner and more maintainable, provided that either the old-version functionality (summarization) or the new-version functionality (whatever else) can be condensed into one (or a few) single-step scripts.
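A minimal sketch of that subprocess pattern, assuming a separate environment at ./gensim38 with gensim 3.8.3 installed, and a hypothetical worker script summarize_worker.py (both names are illustrative):

import subprocess

def old_gensim_summarize(text):
    # Run the worker under the old environment's interpreter, feeding the
    # text on stdin and reading the summary back from stdout.
    result = subprocess.run(
        ["./gensim38/bin/python", "summarize_worker.py"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout

The worker script itself needs only a few lines, run entirely inside the old environment:

import sys
from gensim.summarization import summarize  # present up through gensim 3.8.3
print(summarize(sys.stdin.read()))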
Another option would be to try to completely copy the gensim.summarization source code files to some new location inside your own project – performing whatever (few, minor) edits are necessary to ensure it works from the alternate location.
One of the reasons that functionality was removed was that its approach to things like tokenization was not consistent/integrated with other Gensim practices – which actually means it's likely to be a little easier to keep it working (given its use of its own idiosyncratic approaches) separately.
Personally, I'd rank these three options' desirability as:
(best) Section off the summarization tasks to be run via subprocess executions in a separate Python environment, which has only the older package installed.
(maybe OK) Copy the 10 .py files that implement gensim.summarization into your own local module. Edit lightly as necessary to ensure they still work. (That should mainly be updating import lines, but might require a few other adaptations for other Python 3.x/Gensim 4.x changes.)
(probably too messy) Install the whole old package to a non-standard directory, edit lots of files to ensure anything you're using still works.
Finally, note that the main reason the feature was removed is that it did not offer very impressive or adaptable results. While I've seen some people say it's worked OK for their applications, I've never seen even so much as a demo where its practices/algorithm – which can only extract some subset of important sentences, never paraphrase – gave impressive results.
So unless you already know that its approach works well for your needs, don't get your hopes up! Good luck.

How do I detect which HPC scheduler (Torque, Sun Grid Engine, etc.) I am using?

I need to run a different script depending on the type of scheduler, which necessitates a reliable way to detect whether the scheduler is Torque, SGE, or something else: something like $SHELL telling me which shell I am using, or something like uname.
I am aware of the environment variables the two systems set, but they don't offer a reliable or elegant test: the commands and environment variables are named similarly or identically, so it takes several ifs and buts before one can conclude which scheduler it is.
Set an environment variable explicitly in your .bashrc that you can later query.
e.g.
export RUNNING_ON="moms_gpu_cluster5"
export THIS_SYSTEMS_SCHEDULER="SGE"
You won't have to rely on what the sysadmin gives you, or what the scheduler does or doesn't do.
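For example, here is a hedged sketch of querying that variable from Python, with heuristic fallbacks based on markers each scheduler commonly provides (SGE installations usually export SGE_ROOT and ship qconf; Torque/PBS jobs usually set PBS_ENVIRONMENT and ship pbsnodes). Treat the fallbacks as heuristics, not guarantees:

import os
import shutil

def detect_scheduler():
    # Prefer the explicit variable set in .bashrc, as suggested above.
    explicit = os.environ.get("THIS_SYSTEMS_SCHEDULER")
    if explicit:
        return explicit
    # Heuristic fallbacks; these markers are common but not guaranteed.
    if "SGE_ROOT" in os.environ or shutil.which("qconf"):
        return "SGE"
    if "PBS_ENVIRONMENT" in os.environ or shutil.which("pbsnodes"):
        return "Torque"
    return "unknown"

print(detect_scheduler())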

Why is the "package" keyword sometimes separated from the package name by a comment?

Analyzing sources of CPAN modules I can see something like this:
...
package # hide from PAUSE
Try::Tiny::ScopeGuard;
...
Obviously, this is taken from Try::Tiny, but I have seen this kind of comment between the package keyword and the package identifier in other modules too.
Why is this done? What is its goal, and what benefits does it have?
It is indeed a hack to hide a package from PAUSE's indexer.
When a distribution is uploaded to PAUSE, the indexer will examine each file in the upload, looking for the names of packages that are included in the distribution. Any indexed packages can show up in CPAN search results.
There are many reasons for not wanting the indexer to discover your packages. Your distribution may have many small or insignificant packages that would clutter up the search results for your module. You may have packages defined in your t (test) directory or some other non-standard directory that are not meant to be installed as part of the distribution. Your distribution may include files from a completely different distribution (that somebody else wrote).
The hack works because the indexer strictly looks for the keyword package and an expression that looks like a package name on the same line.
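To illustrate the idea (PAUSE's real indexer is written in Perl; this Python sketch of a line-oriented matcher is purely illustrative, not PAUSE's actual rule):

import re

# A simplified stand-in for a line-oriented "package" matcher.
ONE_LINE_PACKAGE = re.compile(r"^\s*package\s+([A-Za-z_][\w:]*)")

visible = "package Try::Tiny;"
hidden = "package  # hide from PAUSE\nTry::Tiny::ScopeGuard;"

for source in (visible, hidden):
    found = [m.group(1) for line in source.splitlines()
             if (m := ONE_LINE_PACKAGE.match(line))]
    print(found)  # ['Try::Tiny'], then []: the split declaration is invisible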
Nowadays, you can include a META.yml file with your distribution. The PAUSE indexer will look for and respect a no_index specification in this file. But this is a relatively new capability of the indexer so older modules and old-timer CPAN contributors will still use the line break hack.
Here's an example of a no_index spec from Forks::Super
no_index:
  directory:
    - t
    - inc
  package:
    - Sys::CpuAffinity
    - Signals::XSIG
    - Signals::XSIG::Default
    - Signals::XSIG::TieArray56
Sys::CpuAffinity and Signals::XSIG are separate distributions that are also packaged with Forks::Super. Some of the test scripts contain package declarations (e.g., Arbitrary::Test::Package) that shouldn't be indexed.
Okay, here's another shot at explaining this phenomenon. I've been whacky-hacking Perl for a dozen years, and I've rarely seen this packy hack; possibly I simply ignored it and never bothered to investigate. One thing seems clear, though: there's some hackish processing going on at PAUSE, crafted in the good ol' Perl'n'UNIX school of thought, that without a shadow of a doubt involves line-oriented text parsing. PAUSE parses those Perl files (possibly even using grep rather than perl itself, who knows) to extract package names, and then kicks off some procedure, or gathers some stats, or whatnot. To trip up this procedure and hack around its ways, the author splits the package declaration across two lines, so the hacky packy grep job doesn't have a clue that there's a package declared right under its nose; the programmer is happy about his hacky skills, and the PAUSE stats, or whatever it is they're cobbling together, come out as they should. Does that make sense?

What is a good method for inventing a command name?

We're struggling to come up with a command name for our all-purpose "developer helper" tool, which we are using on our project. It's like a wrapper for our existing tools like cmake and hg. The purpose of the command is really just to make our lives easier by combining multiple commands into one (for example, publishing packages). For example, we have commands like:
do conf
do build
do install
do publish
We've considered a few ambiguous names like do (as above) and run, but obviously do is a bash reserved word and run is pretty ambiguous.
We'd prefer our command to be just 2 characters long, but perhaps we're asking the impossible? Is there a practical way to check the availability of command names (other than just typing them into your terminal), or is it just a case of choosing one and hoping nobody else uses it? Are we worrying about nothing?
Since it's a "developer helper" tool, why not use hm [run|build|port|deploy|test], as in "Help Me ..."?
Give it a verbose name, then let everyone alias it to whatever they want. Make sure you use the verbose name in other scripts so that it removes ambiguity.
This way, each user gets to use whatever makes sense to him/her, and the scripts are more readable and more easily searchable (for example, grepping for "our_cool_tool" will usually yield better results than grepping for "run").
How many 2-character words are useful in this context? I think you need four characters. With that in mind, here are some suggestions.
omni
torq
fluf
mega
spif
crnk
splt
argh
quat
drul
scud
prun
sqat
zoom
sizl
I have more if you need them.
Pick one: http://en.wikipedia.org/wiki/List_of_all_two-letter_combinations
To check the availability of command names, I suggest looking for all two-letter filenames in the directories on your path. You can use a script like this:
for item in `echo $PATH | sed 's/:/ /g'` ; do
  ls -1d $item/?? 2>/dev/null   # ignore directories with no two-letter entries
done
It won't show builtins in your shell (like "do" as you mentioned) but it's a good start.
Change ?? to ??? for three-letter files, etc.
I'm going to vote for qp (quick package?) since it's easy to pronounce, easy to type, and easy to remember where the keys are on the keyboard.
I use "asd". It's short, and most developers can type it without thinking.
(oh, and you can always claim later that it stands for some "Advanced Script for Developers" if you need to justify yourself a few years from now)
How about fu? As in Kung Fu. It's a special purpose tool. And it's really easy to type.
I think that run is a good name; at least anybody who downloads your project will know what to do. Calling it without parameters should reveal your options.
Even 'do' will do; I think you can quote it (e.g. 'do' or \do) in bash scripts, since quoting keeps the shell from treating it as a reserved word.
Also remember that running the tools without parameters will tell you what options you have.
Use makefiles to do everything for you.
How about calling it something descriptive, like 'build_runner', and then just aliasing it to 'br' (or preferred acronym) in your .bashrc?
There is a really crappy tool called cleartool (part of clearcase), and people will alias it on their machine to "ct". Perhaps you can have a longer command and suggest users alias it.
It would probably be best to do something like ire_and_curses suggested, name it descriptively then alias it to a 2 letter command. If I was choosing, I would name it dev_help and alias it to dh.
I think you're worrying about nothing. Install the program as 'the-command-to-do-everything-and-if-you-dont-make-your-own-alias-for-it-you-should'. I don't think that will be too long for any modern filesystem, but you might need to shorten it to 'tctdeaiydmyoafiys'. See what common aliases people end up using, and then change the program's name to that. In other words: don't decide; let natural selection decide for you. If you are working with a team of < 10, this should not even remotely cause any problems.
Call it devtool and alias it to dt.
Custom tools like that I like to start with the prefix 'jj-'. I can type 'jj-' (with big index-finger power) and hit tab to see all my personal commands. Also, they group together in alphabetical lists. 'j' is not a very common character for built-in commands, but you can pick your own.
Since you want two characters, you can use just 'zz', or something starting with 'z'.
Are you sure you want to put all your functionality in one command? That might be simultaneously over-constraining and over-loading the interface a little.

Are there any good automated frameworks for applying coding standards in Perl?

One I am aware of is Perl::Critic
And my googling has turned up no results despite multiple attempts so far. :-(
Does anyone have any recommendations here?
Any resources for configuring Perl::Critic to match our coding standards and running it on our code base would be appreciated.
In terms of setting up a profile, have you tried perlcritic --profile-proto? This will emit to stdout all of your installed policies with all their options with descriptions of both, including their default values, in perlcriticrc format. Save and edit to match what you want. Whenever you upgrade Perl::Critic, you may want to run this command again and do a diff with your current perlcriticrc so you can see any changes to existing policies and pick up any new ones.
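For instance, here is a hypothetical perlcriticrc fragment; the policy names are real ones that ship with Perl::Critic, but the specific choices are only illustrative:

# Report everything of severity 3 ("harsh") or stricter by default.
severity = 3

# Raise one policy above its default severity.
[ValuesAndExpressions::ProhibitMagicNumbers]
severity = 4

# A leading "-" disables a policy entirely.
[-CodeLayout::RequireTidyCode]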
In terms of running perlcritic regularly, set up a Test::Perl::Critic test along with the rest of your tests. This is good for new code.
For your existing code, use Test::Perl::Critic::Progressive instead. T::P::C::Progressive will succeed the first time you run it, but will save counts on the number of violations; thereafter, T::P::C::Progressive will complain if any of the counts go up. One thing to look out for is when you revert changes in your source control system. (You are using one, aren't you?) Say I check in a change and run tests and my changes reduce the number of P::C violations. Later, it turns out my change was bad, so I revert to the old code. The T::P::C::Progressive test will fail due to the reduced counts. The easiest thing to do at this point is to just delete the history file (default location t/.perlcritic-history) and run again. It should reproduce your old counts and you can write new stuff to bring them down again.
Perl::Critic has a lot of policies that ship with it, but there are a bunch of add-on distributions of policies. Have a look at Task::Perl::Critic and Task::Perl::Critic::IncludingOptionalDependencies.
You don't need to have a single perlcriticrc handle all your code. Create separate perlcriticrc files for each set of files you want to test and then a separate test that points to each one. For an example, have a look at the author tests for P::C itself at http://perlcritic.tigris.org/source/browse/perlcritic/trunk/Perl-Critic/xt/author/. When author tests are run, there's a test that runs over all the code of P::C, a second test that applies additional rules just on the policies, and a third one that criticizes P::C's tests.
I personally think that everyone should run at the "brutal" severity level, but knock out the policies that they don't agree with. Perl::Critic isn't entirely self-compliant; even the P::C developers don't agree with everything Conway says. Look at the perlcriticrc files used on Perl::Critic itself, and search the Perl::Critic code for instances of "## no critic"; I count 143 at present.
(Yes, I'm one of the Perl::Critic developers.)
There is perltidy for most stylistic standards. perlcritic can be easily configured using a .perlcriticrc file. I personally use it at severity level one, but I've disabled a few policies.
In addition to 'automated frameworks', I highly recommend Damian Conway's Perl Best Practices. I don't agree with 100% of what he suggests, but most of the time he's bang on.
The answer mentioning Devel::Prof probably really means Devel::Cover (to get the code coverage of a test suite).
For example:
http://metacpan.org/pod/Perl::Critic
http://www.slideshare.net/joshua.mcadams/an-introduction-to-perl-critic/
Looks like a nice tool!
A nice combination is perlcritic with EPIC for Eclipse - hit CTRL-SHIFT-C (or your preferred configured shortcut) and your code is marked up with warning indicators wherever perlcritic has found something to complain about. Much nicer than remembering to run it before checkin. And as normal with perlcritic, it will pick up your .perlcriticrc so you can customise the rules. We keep our .perlcriticrc in version control so everyone gets the same standards.
In addition to the cosmetic best practices, I always find it useful to run Devel::Prof on my unit test suite to check test coverage.