ELKI DBSCAN for million files - cluster-analysis

I am using dbscan for clustering points, as my points are more than 1 million I use r*-tree too.
I use ELKI in command line:
java -cp elki.jar
de.lmu.ifi.dbs.elki.application.KDDCLIApplication
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-algorithm clustering.DBSCAN
-dbc.in points1.txt
-dbscan.epsilon 20
-dbscan.minpts 10
-out results3/DBSCANeps20min10
for small files its ok but for 4 million files the error occurred:
at de.lmu.ifi.dbs.elki.database.ids.integer.DoubleIntegerArrayQuickSort.quickSort(Unknown Source)

This is a known bug in an old version of ELKI, when there are many duplicate distances.
It can be resolved by updating to a current version.

Related

Reversing a hash to find something which works, but hashcat seems to have issues

I saw some unfamiliar code on a project i was working on.
I saw a function which said:
var salt = 1514691869198;
var result hex_hmac_sha1(salt, hmac_sha1(password))
# result is: 462435F34EAD6BB7C70751D90984DADD90EED9A4
I was having some issues with hashcat though. It seems to be getting killed early because of a driver or something.
It seems that option -m160 would be the one I would want to use since 160 = HMAC-SHA1 (key = $salt) in the man page for it.
Given the sha1.js file i was looking at, which gave me the code above, it showed the salt as the key which makes me think the 160 code as the most relevant.
Obviously this is a nested sha, but trying to find something to reverse it would be ideal.
I am aware reversing a hash would not return the actual password, but I figured I could run a wordlist and attempt to find a hash which matches this one.
That being said, I was thinking I can find a string which works. I am having issues though building either the hashcat command or finding this answer in general. I was not sure how i would want to put the hash in the command. I was thinking it would be along the lines of:
hashcat -m160 462435F34EAD6BB7C70751D90984DADD90EED9A4: 1514691869198 mywordlist.txt
but it seems to fail for me with the following:
* Device #1: Not a native Intel OpenCL runtime. Expect massive speed loss.
You can use --force to override, but do not report related errors.
No devices found/left.
Started: Sat Dec 30 22:52:33 2017
Stopped: Sat Dec 30 22:52:33 2017
and if i used --force it would say:
hashcat (pull/1273/head) starting...
OpenCL Platform #1: The pocl project
====================================
* Device #1: pthread-Intel(R) Core(TM) i7-4770HQ CPU # 2.20GHz,
2656/2656 MB allocatable, 1MCU
Hashes: 1 digests; 1 unique digests, 1 unique salts
Bitmaps: 16 bits, 65536 entries, 0x0000ffff mask, 262144 bytes, 5/13
rotates
Rules: 1
Applicable optimizers:
* Zero-Byte
* Not-Iterated
* Single-Hash
* Single-Salt
Watchdog: Hardware monitoring interface not found on your system.
Watchdog: Temperature abort trigger disabled.
Watchdog: Temperature retain trigger disabled.
* Device #1: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=64 -D CUDA_ARCH=0 -D VECT_SIZE=1 -D DEVICE_TYPE=2 -D DGST_R0=3 -D DGST_R1=4 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=5 -D KERN_TYPE=160 -D _unroll -cl-std=CL1.2'
* Device #1: Kernel m00160_a0.0bbec6e5.kernel not found in cache! Building may take a while...
Kernel library file /usr/share/pocl/kernel-i686-pc-linux-gnu.bc doesn't exist.
Try reading How to use hashcat on CPU only
Relevant parts:
Download latest OpenCL Drivers and Runtimes for CPU:
https://software.intel.com/en-us/articles/opencl-drivers#latest_CPU_runtime
Latest release (16.1.1) – at time of writing

save ELKI output to a file

I want to save the cluster outputs to a file. for example I want to save cluster1 points into c1.txt and cluster2 points into c2.txt and so on.
ELKI release 0.7
java -jar elki.jar -dbc.in ./f1 -dbc.out ./dir1 -algorithm clustering.DBSCAN -dbscan.epsilon 5 -dbscan.minpts 10
but it has this error:
the following parameter were not processed: [-dbc.out,./dir1]
The command could not be correct, so how can I save clusters?
To save to a file, use
-resulthandler ResultWriter
-out folder/
to limit what files are being written, you can use -out.filter.

Matlab Distributed Server parfor can't find mex opencv files

everyone!
I'm trying to parallelize an algorithm that uses mex files from mexopencv (KNearest.m, KNearest_.mexw32).
The program is based vlfeat (vlsift.mex32) + mexopencv (KNearest.m and KNearest_.mexw32).I classify descriptors obtained from images.
All the code is located on the fileshare
\\ LAB-07 \ untitled \ DISTRIB \ (this is the program code)
\\ LAB-07 \ untitled \ + cv (mexopencv)
When I run the program with matlabpool close everything works well.
Then I make matlabpool open (2 computers on 2 cores each. ultimately 4 worker, but now I use for testing only 2 workers on the computer and run the program which)
PathDependencises from fileshare -> \LAB-07\untitled\DISTRIB\ , \LAB-07\untitled+cv
Before parfor loop I train classifier on the local machine
classifiers = cv.KNearest
classifiers.train(Descriptors',Labels','MaxK',1)
Then run parfor
descr=vlsift(img);
PredictClasses = classifiers.predict(descr');
Error
Error in ==> KNearest>KNearest.find_nearest at 173
Invalid MEX-file '\\LAB-07\untitled\+cv\private\KNearest_.mexw32':
The specified module could not be found.
That is KNearest.m finds, but no KNearest_.mexw32. Because KNearest_.mexw32 located in private folder, I changed the code KNearest.m (everywhere where it appeal KNearest_ () changed to cv.KNearest_ (). Example: this.id = сv.KNearest_ ()) and placed in a folder with KNearest_.mexw32 KNearest.m. As a result, get the same error
Immediately after matlabpool open file search on workers
pctRunOnAll which ('KNearest.m')
'KNearest.m' not found.
'KNearest.m' not found.
'KNearest.m' not found.
pctRunOnAll which ('KNearest_.mexw32')
'KNearest_.mexw32' not found.
'KNearest_.mexw32' not found.
'KNearest_.mexw32' not found.
after cd \LAB-07\untitled+cv
pctRunOnAll which ('KNearest.m')
\\LAB-07\untitled\+cv\KNearest.m
\\LAB-07\untitled\+cv\KNearest.m % cv.KNearest constructor
\\LAB-07\untitled\+cv\KNearest.m
>> pctRunOnAll which ('KNearest_.mexw32')
\\LAB-07\untitled\+cv\KNearest_.mexw32
\\LAB-07\untitled\+cv\KNearest_.mexw32
\\LAB-07\untitled\+cv\KNearest_.mexw32
I ran and FileDependecies, but the same result.
I do not know this is related or not, I display during the execution of the program classifiers
after training and before parfor
classifiers =
cv.KNearest handle
Package: cv
Properties:
id: 5
MaxK: 1
VarCount: 128
SampleCount: 9162
IsRegression: 0
Methods, Events, Superclasses
Within parfor before classifiers.predict
classifiers =
cv.KNearest handle
Package: cv
Properties:
id: 5
I tested the file cvtColor.mexw32. I left in a folder only 2 files cvtColor.mexw32 and vl_sift
parfor i=1:2
im1=imread('Copy_of_start40.png');
im_vl = im2single(rgb2gray(im1));
desc=vl_sift(im_vl);
im1 = cvtColor(im1,'RGB2GRAY');
end
The same error, and vl_sift work, cvtColor no...
If the worker machines can see the code in your shared filesystem, you should not need FileDependencies or PathDependencies at all. It looks like you're using Windows. It seems to me that the most likely problem is file permissions. MDCS workers running under a jobmanager on Windows by default run not using your own account (they run using the "LocalSystem" account I think), and so may well simply not have access to files on a shared filesystem. You could try making sure your code is world-readable.
Otherwise, you can add the files to the pool by using something like
matlabpool('addfiledependencies', {'\\LAB-07\untitled\+cv'})
Note that MATLAB interprets directories with a + in as defining "packages", not sure if this is intentional in your case.
EDIT
Ah, re-reading your original post, and your comments below - I suspect the problem is that the workers cannot see the libraries on which your MEX file depends. (That's what the "Invalid MEX-file" message is indicating). You could use http://www.dependencywalker.com/ to work out what are the dependencies of your MEX file, and make sure they're available on the workers (I think they need to be on %PATH%, or in the current directory).
Edric thanks. There was a problem in the PATH for parfor. With http://www.dependencywalker.com/ looked missing files and put them in a folder +cv. Only this method works in parfor.
But predict in parfor gives an error
PredictClasses = classifiers.predict(descr');
??? Error using ==> parallel_function at 598
Error in ==> KNearest>KNearest.find_nearest at 173
Unexpected Standard exception from MEX file.
What() is:..\..\..\src\opencv\modules\ml\src\knearest.cpp:365: error: (-2) The search
tree must be constructed first using train method
I solved this problem by calling each time within parfor train
classifiers = cv.KNearest
classifiers.train(Descriptors',Labels','MaxK',1)
But it's an ugly solution :)

finding error in command line

I am trying to run some kind of programm using command line, but I got an error.
The command line is:
quantisnp2.exe --outdir D:\output\ --config "C:\Program files\QuantiSNP\params.dat" --levels "C:\Program files\QuantiSNP\levels.dat" --sampleid CNV1 --gender female --emiters 10 --Lsettings 2000000 --doXcorrect --genotypes --gcdir D:\gc\ --input-files C:\Program files\CNV1.txt
QuantiSNP:Single-file mode input find.
QuantiSNP:Processing file: C:|Program
QuantiSNP:Local CG content directory specified. Local CG content correction will be used.
??? Error using ==>textread at 167
File not found.
Error in ==> quantisnp2 at 293
The first thing I'd be looking at is the unquoted C:\Program files\CNV1.txt at the end of the command (all your other ones are quoted).
There's a good chance that's being treated as two arguments, C:\Program and files\CNV1.txt.
You may also want to check the spelling of emiters, I'm pretty certain the correct English word would be emitters though, of course, this could be a case of the QuantiSNP developers not knowing how to spell :-)

Tesseract problem with mftraining stage

I've successfully created a box file with tesseract
now after running the unicharset_extractor
having it creating the unicharset file that looks like:
...
n 3 NULL -1
s 3 NULL 23
t 3 NULL 43
...
I've continued with this command
mftraining -U unicharset -O testlang.unicharset testlang.tr
only to get the next error
Reading testlang.tr ...
testlang has no defined properties.
Error: Illegal short name for a feature!
I've never worked with Tesseract, but it seems there is an open issue in the bug database that looks a lot like your problem : http://code.google.com/p/tesseract-ocr/issues/detail?id=385
It seems that it is related to scientific notation not being correctly supported by some functions.
On the issue page a user suggests a solution, and another one proposes a patch. You could try applying the patch to see if it helps.