I have run SSVD in Mahout to apply LSA (latent semantic analysis). I have text documents, each containing many features (from 100 to 2000 terms).
I would like to use LSA on the documents to get the top terms or phrases that appear together as "concepts". Does anyone have an idea how I can do that?
I applied preprocessing filters (tokenization, stopword removal, stemming, ...), created TF-IDF vectors with Mahout, and then ran the ssvd command: bin/mahout ssvd -i termVectors/tfidf-vectors/part-r-00000 -o <output folder> -c 200 -us true -U false -V false -t 1 -ow -pca true
I use clusterdump in Mahout to parse the results, but all the terms in the results start with the letter "a*" and do not represent any concept.
Does anyone have experience with SSVD for reducing the number of features before clustering, or any idea how to use SSVD to show the concepts in a text corpus?
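For example, would something along these lines work? The -o output path, flipping -V to true, and the seqdumper step below are guesses on my part, not something I have verified.

# Guesswork: re-run SSVD keeping the V matrix (right singular vectors),
# then dump it and try to join each row against the seq2sparse dictionary
# to read off the top terms per concept.
bin/mahout ssvd -i termVectors/tfidf-vectors/part-r-00000 -o ssvdOut \
  -c 200 -us true -U false -V true -t 1 -ow -pca true
bin/mahout seqdumper -i ssvdOut/V | head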
Thank you
The UUID standard has several versions. Version 4, for example, is based on completely random input. But it still encodes the version and variant information, so only 122 of the possible 128 bits are random.
For transferring these via HTTP, it is more efficient to encode them in Base64. There are libraries for this (https://github.com/skorokithakis/shortuuid).
But what I am wondering is: is there an alternative standard for shorter ID strings? Of course I could slap together a version byte + n random bytes and encode them in Base64, giving my own working 'short, random ID scheme', but I wonder whether there is an alternative that someone has already specified before I make my own.
There is no standard for anything shorter.
Numerous folks have asked the same question and all come to the same conclusion: UUIDs are overkill for their specific requirements. And they developed their own alternatives, which essentially boil down to a random string of some length (based on the expected size of their dataset) in a more efficient encoding like base64. Many of them have published libraries for working with these strings. However, all of them suffer from the same two problems:
They cannot handle arbitrarily large datasets (the reason UUIDs are so big)
They are not standardized, which means there is no real interop.
If neither of these problems affects you, then feel free to pick any of the various schemes available or roll your own. But do consider the future cost if you discover you're wrong about your requirements, and whether the expense of having to change later justifies whatever minuscule advantage in space or complexity you gain by not using a proven, universal system from the start.
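As a minimal sketch of the "roll your own" approach described above (not a standard, just a shell one-liner): 15 random bytes encoded as URL-safe Base64 give a 20-character ID with 120 bits of entropy, and no padding since 15 is divisible by 3.

# 15 random bytes -> 20 URL-safe Base64 characters (swap +/ for -_)
head -c 15 /dev/urandom | base64 | tr '+/' '-_'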
I have just found https://github.com/ai/nanoid
It is not really a 'standard', but at least it is not an arbitrary scheme that I would come up with myself. It is shorter thanks to smarter encoding (a larger alphabet), and faster.
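For a quick feel for the output (assuming Node.js is available; the package ships a small command-line wrapper, though check its README if this invocation differs):

$ npx nanoid    # prints one 21-character ID using the default URL-safe alphabet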
A quick and dirty alternative is mktemp, depending on your requirements for security, uniqueness and your access to a shell.
Use the form mktemp -u XXXXXXXX
-u: dry-run, don't create a file
XXXXXXXX is the template; in this case, eight random characters
$ echo `mktemp -u XXXXXXXX`
KbbHRWYv
$ echo `mktemp -u XXXXXXXX`
UnyO2eH8
$ echo `mktemp -u XXXXXXXX`
g6NiTfvT
I've been digging quite a bit into IPMItool commands and have yet to find a comprehensive list of raw hex commands. We have approximately 90 Dell C6220 II machines on which I need to set a trigger (Dell calls these Platform Event Filters) to have the system shut down upon reaching the Upper Critical Threshold that I set (ironically, with IPMItool) for inlet temperature. Our Dell rep tells me this isn't possible and that I'll have to pull up the web interface for all 90 machines and set this by hand. They also told me it wasn't possible to set the inlet temperature thresholds with IPMItool, and I did exactly that, so my faith in Dell is dwindling. From what little I've been able to find on the internet, it looks like I might be able to make it happen with raw hex commands. Can anyone in the great internet wild help me?
I ended up using the FreeIPMI tools ipmi-sensors-config and ipmi-pef-config. First I ran ipmi-sensors-config -L | grep Inlet to find which sensor number corresponded to the inlet temperature (for my C6220 II machines it was sensor 16, but for my C6320s it was 110, or sometimes 10, so be sure to check this). I then ran ipmi-sensors-config -c -e '16_Inlet_Temp:Upper_Non_Critical_Threshold=30' &&
ipmi-sensors-config -c -e '16_Inlet_Temp:Upper_Critical_Threshold=32'. This sets the thresholds to what you want, but we're not done: we still have to set an event to react to them. For that I ran ipmi-pef-config -c -e 'Event_Filter_4:Event_Filter_Action_Power_Off=Yes' &&
ipmi-pef-config -c -e 'Event_Filter_5:Event_Filter_Action_Power_Off=Yes'. Event filters 4 and 5 in my system correspond to the Temp Non-Critical and Temp Critical events for all temperature sensors. To find these I ran ipmi-pef-config -o > pefconf.txt, and then used Vim to search for "Temp".
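To push the same settings out to all ~90 nodes without touching the web UI, a hedged sketch using FreeIPMI's out-of-band options (the BMC hostnames and credentials below are placeholders, and the sensor/event-filter numbers must be verified per model as described above):

# Placeholders throughout: substitute your BMC hostnames, credentials, and
# the sensor/filter numbers you found via -L and the checked-out PEF config.
for host in node01-bmc node02-bmc node03-bmc; do    # ...all ~90 BMCs
  ipmi-sensors-config -h "$host" -u ADMIN -p PASSWORD -c \
    -e '16_Inlet_Temp:Upper_Non_Critical_Threshold=30' \
    -e '16_Inlet_Temp:Upper_Critical_Threshold=32'
  ipmi-pef-config -h "$host" -u ADMIN -p PASSWORD -c \
    -e 'Event_Filter_4:Event_Filter_Action_Power_Off=Yes' \
    -e 'Event_Filter_5:Event_Filter_Action_Power_Off=Yes'
done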
Other than generating an output file, then using wc -l output.txt, subtracting 1 and dividing by 2, and running head -50 on output.txt, is there any easy way to automatically count the number of solutions in MiniZinc and print the first 50 solutions?
My program runs for 12 hours in one scenario, and the other one is expected to run for 2 days!
Also, is there any way in batch mode (not the IDE) to report resource usage, other than using time minizinc ...?
Thanks for advice
The command-line program "minizinc", as well as most FlatZinc solvers, supports the parameter "-n <n>", which is the number of solutions to show. The MiniZinc IDE has the option "Stop after this many solutions:".
Note that this is relevant for satisfaction problems. For optimization problems, however, there is no consensus on how different solvers handle "-n".
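For example (model.mzn and data.dzn are placeholder file names, and flag spellings vary between versions, so check minizinc --help):

# Stop after the first 50 solutions; --statistics asks for search statistics
# and is an assumption -- verify the exact flag for your MiniZinc version.
minizinc --solver gecode -n 50 --statistics model.mzn data.dzn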
is there a "proper" or "canonical" markup for a command (section 1) "accumulating" option with argument? (or without for that matter)
an accumulating option can be given multiple times and the effects add up: think gcc's -I or -W.
let's say i'm documenting ssh(1). i want the SYNOPSIS to give away that -v and -o accumulate, this is usually done with ellipses:
ssh [-o option]... [-v]...
I'd like to tack the ellipsis onto the idiomatic
.Op Fl o Ar option
The closest I can get is
.Oo
. Fl o Ar option
.Oc Ns \&...
because the shorthand Op coopts it.
What do other people do?
I'm a novice at man pages, but aside from your workaround, the closest I can get is:
.Op [ Fl o Ar option ] No ...
However, this results in:
[[-o option ] ...]
It's not exactly canonical or precisely what you're hoping for, but it seems unambiguous. (See http://docopt.org for other examples of how this can be expressed.)
What is the best module for parallel processing in Perl? I have never done parallel processing in Perl.
What is a good Perl module for parallel processing that will be used for DB access and mailing?
I have looked at the module Parallel::ForkManager. Any ideas appreciated.
Well, it all depends on the particular case. Depending on what exactly you need to achieve, you might be better off with one of the following:
Parallel::TaskManager
POE
AnyEvent
For some tasks there are also specialized modules, for example LWP::Parallel::UserAgent. This basically means you have to give us much more detail about what you want to achieve to get the best possible answer.
Parallel::ForkManager, as the POD says, can limit the number of processes forked off. You could then use the children to do any work. I remember using the module for a downloader.
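For instance, a throwaway sketch of that pattern, run straight from the shell (the worker count and the fake job list are made up):

# Cap at 4 concurrent children; each child handles one "job" and exits.
perl -MParallel::ForkManager -E '
  my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once
  for my $job (1 .. 10) {
      $pm->start and next;                  # parent: fork a child, move on
      say "child $$ working on job $job";   # child: do the real work here
      $pm->finish;                          # child: exit
  }
  $pm->wait_all_children;                   # parent: reap every child
'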
Many-core Engine for Perl (MCE) has been posted on CPAN.
http://code.google.com/p/many-core-engine-perl/
https://metacpan.org/module/MCE
MCE comes with various examples showing real-world use cases, from parallelizing something as small as cat (try with -n) to grepping for patterns and word-count aggregation.
barrier_sync.pl
A barrier sync demonstration.
cat.pl Concatenation script, similar to the cat binary.
egrep.pl Egrep script, similar to the egrep binary.
wc.pl Word count script, similar to the wc binary.
findnull.pl
A parallel driven script to report lines containing
null fields. It is many times faster than the binary
egrep command. Try against a large file containing
very long lines.
flow_model.pl
Demonstrates MCE::Flow, MCE::Queue, and MCE->gather.
foreach.pl, forseq.pl, forchunk.pl
These take the same sqrt example from Parallel::Loops
and measures the overhead of the engine. The number
indicates the size of #input which can be submitted
and results displayed in 1 second.
Parallel::Loops: 600 Forking each #input is expensive
MCE foreach....: 34,000 Sends result after each #input
MCE forseq.....: 70,000 Loops through sequence of numbers
MCE forchunk...: 480,000 Chunking reduces overhead
interval.pl
Demonstration of the interval option appearing in MCE 1.5.
matmult/matmult_base.pl, matmult_mce.pl, strassen_mce.pl
Various matrix multiplication demonstrations benchmarking
PDL, PDL + MCE, as well as parallelizing Strassen
divide-and-conquer algorithm. Also included are 2 plain
Perl examples.
scaling_pings.pl
Perform ping test and report back failing IPs to
standard output.
seq_demo.pl
A demonstration of the new sequence option appearing
in MCE 1.3. Run with seq_demo.pl | sort
tbray/wf_mce1.pl, wf_mce2.pl, wf_mce3.pl
An implementation of wide finder utilizing MCE.
As fast as MMAP IO when file resides in OS FS cache.
2x ~ 3x faster when reading directly from disk.
You could also look at threads, Coro, Parallel::Iterator.