ELKI k-means clustering

I'd like to run ELKI k-means clustering from the command line.
The running time seems too short compared with R: when I ran k-means clustering in R, it took about 100 seconds.
Moreover, the running time does not change between k=5, k=10, and so on.
file.tsv has 60,000 rows and 25 columns.
START=$(date +%s)
k=5
java -jar elki.jar KDDCLIApplication \
-dbc.in "file.tsv" \
-dbc.parser NumberVectorLabelParser \
-parser.colsep "\t" \
-algorithm clustering.kmeans.KMeansLloyd \
-kmeans.k $k \
-kmeans.initialization KMeansPlusPlusInitialMeans \
-kmeans.maxiter 9999 \
-resulthandler ResultWriter -out.gzip false \
-out output/k-$k \
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
The output is "It took 5 seconds"
START=$(date +%s)
k=10
java -jar elki.jar ...
...
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
For k=10 the output is also "It took 5 seconds".
Why does the running time not change with the number of clusters? Does the code have a problem?

5 seconds is probably too short to get a reliable measurement.
Furthermore, with a larger k the result may converge in fewer iterations.
You may want to use -time to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible impact. -resulthandler DiscardResultHandler will disable output, which is also reasonable for benchmarking.
You don't need to set -kmeans.maxiter 9999. By default, ELKI will run k-means until convergence.
I believe there was an implementation weakness in k-means++ initialization that made it more costly than necessary, so initialization, parsing, and writing may contribute a lot to your total runtime.
Also try using -resulthandler LogResultStructureResultHandler -verbose to make sure the parser understood the file as intended (check the dimensionality!). With -verbose you can also check that it runs a reasonable number of iterations.
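For example, a benchmarking run along these lines (a sketch reusing the question's own command; -time, -verbose, and DiscardResultHandler are the options named above, though exact parameter spellings can vary between ELKI versions) would separate the algorithm time from parsing and output:
k=5
java -jar elki.jar KDDCLIApplication \
-dbc.in "file.tsv" \
-dbc.parser NumberVectorLabelParser \
-parser.colsep "\t" \
-algorithm clustering.kmeans.KMeansLloyd \
-kmeans.k $k \
-kmeans.initialization KMeansPlusPlusInitialMeans \
-time -verbose \
-resulthandler DiscardResultHandler
With -time, the algorithm runtime is logged on its own, so a slow parser or writer can no longer hide the number you actually care about.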


Flutter sharding test still takes the same time as serial

I am creating a parallel Flutter unit test job on GitLab CI to avoid the 60-minute run-time limit.
The initial command was:
flutter test --machine --coverage ./lib > test-report.jsonl
The process took around 59 minutes because we have a lot of unit tests.
In order to reduce the CI pipeline time, I modified the pipeline to run in parallel using Flutter's shard options and GitLab CI's parallel feature.
The command looks like this:
flutter test \
--total-shards $CI_NODE_TOTAL \
--shard-index $( expr $CI_NODE_INDEX - 1 ) \
--machine \
--coverage \
--coverage-path coverage/lcov.info-$( expr $CI_NODE_INDEX - 1 ) \
./lib \
> test-report.jsonl-$( expr $CI_NODE_INDEX - 1 )
However, all the parallel jobs still exceed the 60-minute time limit.
What did I miss, and how can I debug it?
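One quick sanity check (a debugging sketch, not a confirmed fix; CI_NODE_INDEX and CI_NODE_TOTAL are GitLab CI's predefined variables and are only set when the job definition uses the parallel: keyword) is to log the shard parameters at the start of each job:
# If CI_NODE_TOTAL is empty or 1, the job is not actually sharded and
# every job runs the full test suite.
echo "running shard $(( CI_NODE_INDEX - 1 )) of $CI_NODE_TOTAL shards"
If every job reports the same index, or the total is missing, the sharding flags are effectively no-ops and each job will take as long as the serial run.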

Reversing a hash to find something which works, but hashcat seems to have issues

I came across some unfamiliar code in a project I was working on.
It contained a function which said:
var salt = 1514691869198;
var result = hex_hmac_sha1(salt, hmac_sha1(password));
// result is: 462435F34EAD6BB7C70751D90984DADD90EED9A4
I was having some issues with hashcat though. It seems to be getting killed early because of a driver or something.
It seems that option -m 160 would be the one I want, since 160 = HMAC-SHA1 (key = $salt) in the man page.
The sha1.js file I was looking at, which gave me the code above, showed the salt as the key, which makes me think mode 160 is the most relevant.
Obviously this is a nested SHA-1, but finding some way to reverse it would be ideal.
I am aware that reversing a hash would not return the actual password, but I figured I could run a wordlist and attempt to find a hash that matches this one.
That said, I was thinking I could find a string that works. I am having trouble building the hashcat command, though, and I was not sure how to put the hash into it. I was thinking it would be along the lines of:
hashcat -m 160 462435F34EAD6BB7C70751D90984DADD90EED9A4:1514691869198 mywordlist.txt
but it seems to fail for me with the following:
* Device #1: Not a native Intel OpenCL runtime. Expect massive speed loss.
You can use --force to override, but do not report related errors.
No devices found/left.
Started: Sat Dec 30 22:52:33 2017
Stopped: Sat Dec 30 22:52:33 2017
and if I used --force it would say:
hashcat (pull/1273/head) starting...
OpenCL Platform #1: The pocl project
====================================
* Device #1: pthread-Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz, 2656/2656 MB allocatable, 1MCU
Hashes: 1 digests; 1 unique digests, 1 unique salts
Bitmaps: 16 bits, 65536 entries, 0x0000ffff mask, 262144 bytes, 5/13 rotates
Rules: 1
Applicable optimizers:
* Zero-Byte
* Not-Iterated
* Single-Hash
* Single-Salt
Watchdog: Hardware monitoring interface not found on your system.
Watchdog: Temperature abort trigger disabled.
Watchdog: Temperature retain trigger disabled.
* Device #1: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=64 -D CUDA_ARCH=0 -D VECT_SIZE=1 -D DEVICE_TYPE=2 -D DGST_R0=3 -D DGST_R1=4 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=5 -D KERN_TYPE=160 -D _unroll -cl-std=CL1.2'
* Device #1: Kernel m00160_a0.0bbec6e5.kernel not found in cache! Building may take a while...
Kernel library file /usr/share/pocl/kernel-i686-pc-linux-gnu.bc doesn't exist.
Try reading "How to use hashcat on CPU only".
Relevant parts:
Download the latest OpenCL drivers and runtimes for CPU:
https://software.intel.com/en-us/articles/opencl-drivers#latest_CPU_runtime
Latest release (16.1.1) at the time of writing.
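Once a working CPU OpenCL runtime is installed, note that hashcat expects the hash and salt joined as hash:salt with no space after the colon. A sketch using the values from the question (hash.txt is a hypothetical file name):
# mode 160 = HMAC-SHA1 (key = $salt); hash and salt are separated by a colon
echo '462435f34ead6bb7c70751d90984dadd90eed9a4:1514691869198' > hash.txt
hashcat -m 160 hash.txt mywordlist.txt
One caveat: because the original construction is nested, hex_hmac_sha1(salt, hmac_sha1(password)), mode 160 only computes HMAC-SHA1(salt, candidate), so the wordlist candidates would have to be the inner hmac_sha1(password) digests rather than the raw passwords.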

Tokenizer in moses-SMT system stuck even with 10 sentences

I was trying to build a baseline MT system. Just to check how it works, I made source (S) and target (T) language corpora of just 2,000 sentences. The very first step is to prepare the data for the Machine Translation (MT) system. In this step we first have to perform tokenization, as mentioned in the Baseline SMT guide. I used this code:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en \
> ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr \
> ~/corpus/news-commentary-v8.fr-en.tok.fr
(say S = French and T = English)
I checked after 2 hours and it was still running. I got curious, since that was not expected. Then I tried with just ten sentences. To my surprise, it has been 30 minutes and it is still running.
Did I do anything wrong?
PS: OS = Ubuntu 14.04.5 LTS, Sony ultrabook, no dual boot.
Please follow the steps below:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
cd ..
mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools
scripts/tokenizer/tokenizer.perl -l fr < ~/corpus/training/news-commentary-v8.fr-en.fr > ~/corpus/news-commentary-v8.fr-en.tok.fr
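If the tokenizer still appears to hang, a quick sanity check (a debugging sketch; the path is the one from the question) is to pipe a single sentence through it directly:
# If even this one-liner hangs, the problem lies in the tokenizer setup
# (for example a missing Perl module), not in the size of the corpus.
echo "Hello, world." | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en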

Batch Filtering with Multi-Filter throws a 'Class attribute not set' exception

We have a data set of 15k classified tweets on which we need to perform sentiment analysis, and I would like to test against a test set of 5k classified tweets. Because Weka needs the test set's header to contain the same attributes as the training set's header, I have to use batch filtering to run my classifier against this 5k test set.
However, there are several filters that I need to run my training set through, so I figured that running a MultiFilter against the training set would be a good idea. The MultiFilter works fine without the batch argument, but when I try to batch filter, I get an error from the CLI as it tries to execute the first filter within the MultiFilter:
CLI MultiFilter command with batch argument:
java weka.filters.MultiFilter -F "weka.filters.supervised.instance.Resample -B 1.0 -S 1 -Z 15.0 -no-replacement" \
-F "weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -S -stemmer weka.core.stemmers.NullStemmer -M 2 -tokenizer weka.core.tokenizers.AlphabeticTokenizer" \
-F "weka.filters.unsupervised.attribute.Reorder -R 2-last,first"\
-F "weka.filters.supervised.attribute.AttributeSelection -E \"weka.attributeSelection.InfoGainAttributeEval \" -S \"weka.attributeSelection.Ranker -T 0.0 -N -1\"" \
-F weka.filters.AllFilter \
-b -i input\Train.arff -o output\Train_b_out.arff -r input\Test.arff -s output\Test_b_out.arff
Here is the resultant error from the CLI:
weka.core.UnassignedClassException: weka.filters.supervised.instance.Resample: Class attribute not set!
at weka.core.Capabilities.test(Capabilities.java:1091)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.filters.Filter.testInputFormat(Filter.java:434)
at weka.filters.Filter.setInputFormat(Filter.java:452)
at weka.filters.SimpleFilter.setInputFormat(SimpleFilter.java:195)
at weka.filters.Filter.batchFilterFile(Filter.java:1243)
at weka.filters.Filter.runFilter(Filter.java:1319)
at weka.filters.MultiFilter.main(MultiFilter.java:425)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:265)
And here are the headers with a portion of data for both the training and test input arffs:
Training:
@RELATION classifiedTweets
@ATTRIBUTE ##sentence## string
@ATTRIBUTE ##class## {1,-1,0}
@DATA
"Conditioning be very important for curly dry hair",0
"Combine with Sunday paper coupon and",0
"Price may vary by store",0
"Oil be not really moisturizers",-1
Testing:
@RELATION classifiedTweets
@ATTRIBUTE ##sentence## string
@ATTRIBUTE ##class## {1,-1,0}
@DATA
"5",0
"I give the curl a good form and discipline",1
"I have be cowashing every day",0
"LOL",0
"TITLETITLE Walgreens Weekly and Midweek Deal",0
"And then they walk away",0
Am I doing something wrong here? I know that supervised resampling requires the class attribute to be at the bottom of the attribute list within the header, and it is, within both the test and training input files.
EDIT:
Further testing reveals that this error is not related to the batch filtering; it occurs whenever I run the supervised Resample filter from the CLI. The data I use works with every other filter I've tried in the CLI, so I don't understand why this filter is any different. Resampling the data in the GUI works fine as well.
Update:
This also happens with the SMOTE filter in place of the Resample filter.
We could not get the batch filter to work with any resampling filter. However, our workaround was simply to resample (and then randomize) the training data as step 1. From this reduced set, we ran batch filters for everything else we wanted on the test set. This seemed to work fine.
You could have used the MultiFilter along with the ClassAssigner filter to make it work:
java -classpath $jcp weka.filters.MultiFilter \
-F "weka.filters.unsupervised.attribute.ClassAssigner -C last" \
-F "weka.filters.supervised.instance.Resample -B 1.0 -S 1 -Z 66.0"

Tell Sphinx (or Thinking Sphinx) to ignore periods when indexing

I have a strange issue with Sphinx: I am trying to match things like:
L.A. Confidential
so that people can search for "LA Confidential" and still get that title, and similarly for "P.M." to match "PM", etc.
I tried putting the period (full stop character U+002E) in the ignore_chars list. This didn't make any difference.
So then I tried setting index_sp = 1. This did not solve the issue either.
According to my understanding of the documentation, either of these should have solved the issue, correct?
I wonder if it has something to do with our match mode, which is set to extended2, using Sphinx 2.0.3.
Any help would be greatly appreciated.
Edit, here is my thinking_sphinx.yml config:
Note that the period character (U+002E) is not used anywhere else in my config except in the ignore_chars line.
production:
  mem_limit: 512M
  morphology: stem_en
  wordforms: "db/sphinx/wordforms.txt"
  stopwords: "db/sphinx/stopwords.txt"
  ngram_chars: "U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, \
    U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \
    U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, \
    U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, \
    U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, \
    U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, \
    U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, \
    U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \
    U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, \
    U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, \
    U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, \
    U+A000..U+A48C, U+A492..U+A4C6"
  ngram_len: 1
  ignore_chars: "U+0027, U+2013, U+2014, U+0026, U+002E, ., &"
(huge char_set entry here for different languages, omitted.)
I ran a test locally with the following in my thinking_sphinx.yml for Thinking Sphinx v3.0.4 and it worked:
development:
  ignore_chars: U+002E
The same in sphinx.yml for Thinking Sphinx v2.0.14 worked too. I am using Sphinx 2.0.8, but I'll be a little surprised if that's the problem. It's certainly unrelated to your match mode.
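One more thing worth double-checking (an assumption about the setup, not something stated in the question): changes to ignore_chars only take effect once the configuration is regenerated and the index rebuilt. With Thinking Sphinx's standard rake tasks that is:
# Regenerate the Sphinx configuration and rebuild the index so the
# new ignore_chars setting is actually applied to the indexed data.
bundle exec rake ts:rebuild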