How to visualize decision tree model/object in pyspark? - pyspark

Is there any way to visualize/plot a decision tree created with either the mllib or the ml library in pyspark? Also, how can I get information such as the number of records in the leaf nodes? Thanks.

First, use model.toDebugString on your random forest model to get output like this:
"RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
Tree 0 (weight 1.0):
  If (feature 0 <= 1.0)
    If (feature 10 <= 0.0)
      If (feature 3 <= 6.0)
        Predict: 0.0
      Else (feature 3 > 6.0)
        Predict: 0.0
    Else (feature 10 > 0.0)
      If (feature 12 <= 63.0)
        Predict: 0.0
      Else (feature 12 > 63.0)
        Predict: 0.0
  Else (feature 0 > 1.0)
    If (feature 13 <= 1.0)
      If (feature 3 <= 3.0)
        Predict: 0.0
      Else (feature 3 > 3.0)
        Predict: 1.0
    Else (feature 13 > 1.0)
      If (feature 7 <= 1.0)
        Predict: 0.0
      Else (feature 7 > 1.0)
        Predict: 0.0
Tree 1 (weight 1.0):
  If (feature 2 <= 1.0)
    If (feature 15 <= 0.0)
      If (feature 11 <= 0.0)
        Predict: 0.0
      Else (feature 11 > 0.0)
        Predict: 1.0
    Else (feature 15 > 0.0)
      If (feature 11 <= 0.0)
        Predict: 0.0
      Else (feature 11 > 0.0)
        Predict: 1.0
  Else (feature 2 > 1.0)
    If (feature 12 <= 31.0)
      If (feature 5 <= 0.0)
        Predict: 0.0
      Else (feature 5 > 0.0)
        Predict: 0.0
    Else (feature 12 > 31.0)
      If (feature 3 <= 4.0)
        Predict: 0.0
      Else (feature 3 > 4.0)
        Predict: 0.0
Tree 2 (weight 1.0):
  If (feature 8 <= 1.0)
    If (feature 6 <= 2.0)
      If (feature 4 <= 10875.0)
        Predict: 0.0
      Else (feature 4 > 10875.0)
        Predict: 1.0
    Else (feature 6 > 2.0)
      If (feature 1 <= 36.0)
        Predict: 0.0
      Else (feature 1 > 36.0)
        Predict: 1.0
  Else (feature 8 > 1.0)
    If (feature 5 <= 0.0)
      If (feature 4 <= 4113.0)
        Predict: 0.0
      Else (feature 4 > 4113.0)
        Predict: 1.0
    Else (feature 5 > 0.0)
      If (feature 11 <= 2.0)
        Predict: 0.0
      Else (feature 11 > 2.0)
        Predict: 0.0
Tree 3 ...
Save it to a .txt file, then use: https://github.com/tristaneljed/Decision-Tree-Visualization-Spark
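If you would rather build the plot yourself, the debug text first has to be turned into a tree structure. Below is a minimal sketch in plain Python, assuming the output contains only continuous splits of the form "feature N <= x" / "feature N > x" (categorical splits are not handled). The names parse_node and count_leaves are my own and are not part of Spark or of the linked project.

```python
import re

# Matches lines such as "If (feature 3 <= 6.0)" and "Else (feature 3 > 6.0)".
SPLIT = re.compile(r"(If|Else) \(feature (\d+) (<=|>) (\S+)\)")
LEAF = re.compile(r"Predict: (\S+)")


def parse_node(lines, i=0):
    """Parse the subtree that starts at lines[i]; return (node, next index)."""
    text = lines[i].strip()
    leaf = LEAF.fullmatch(text)
    if leaf:
        return {"predict": float(leaf.group(1))}, i + 1
    split = SPLIT.fullmatch(text)
    if split is None or split.group(1) != "If":
        raise ValueError(f"unexpected line: {text!r}")
    left, i = parse_node(lines, i + 1)   # subtree under "If"
    if i >= len(lines) or not lines[i].strip().startswith("Else"):
        raise ValueError("missing Else branch")
    right, i = parse_node(lines, i + 1)  # subtree under the matching "Else"
    node = {
        "feature": int(split.group(2)),
        "threshold": float(split.group(4)),
        "left": left,
        "right": right,
    }
    return node, i


def count_leaves(node):
    """Count the Predict nodes below (and including) this node."""
    if "predict" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Made-up sample in the same format as one tree of the debug string.
sample = """If (feature 0 <= 2.0)
 Predict: 0.0
Else (feature 0 > 2.0)
 If (feature 1 <= 12354.0)
  Predict: 1.0
 Else (feature 1 > 12354.0)
  Predict: 0.0"""

tree, _ = parse_node(sample.splitlines())
print(count_leaves(tree))
```

The parser does not rely on indentation: every "If" line is followed by its left subtree and then by the matching "Else" line, so a recursive descent is enough. The nested dictionaries can then be passed to whatever plotting tool you prefer.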

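You can get statistics for every leaf node, such as the impurity, the gain, and the array of per-class counts, from the model's data file.
The data file is written to the data/ subdirectory of the location where you save the model:
model.save(location)
modeldf = spark.read.parquet(location + "data/*")  # assumes location ends with "/"
This file contains most of the metadata needed for a decision tree, or even a random forest. You can extract the information like this (column names can differ between Spark versions, so check modeldf.printSchema() first; with the gini criterion, the impurity column holds the gini value):
import pandas as pd
noderows = modeldf.select("id", "prediction", "impurity", "impurityStats", "gain", "leftChild", "rightChild").collect()
# Leaf nodes have no children, so both child ids are negative.
df = pd.DataFrame([[rw['id'], rw['gain'], rw['impurity'], rw['impurityStats']] for rw in noderows if rw['leftChild'] < 0 and rw['rightChild'] < 0], columns=["id", "gain", "impurity", "impurityStats"])
print(df)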
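To turn the per-class array of a leaf into the number of records in that leaf, you can sum it. This is a small sketch in plain Python, assuming the array holds the (possibly weighted) number of training rows per label, which is my understanding of what Spark stores for classification trees. The function name leaf_summary and the sample counts are made up.

```python
def leaf_summary(class_counts):
    """Summarize one leaf from its per-class row counts."""
    total = sum(class_counts)
    if total == 0:
        return {"records": 0, "gini": 0.0}
    # Gini impurity: 1 minus the sum of squared class proportions.
    gini = 1.0 - sum((c / total) ** 2 for c in class_counts)
    return {"records": total, "gini": gini}


# Made-up leaf with 30 rows of label 0 and 10 rows of label 1.
print(leaf_summary([30.0, 10.0]))
```

If the model was trained with instance weights, the sum is a weighted count rather than a plain number of rows.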


How to decode .Bin that looks like this?

I have some data from the SOILMASTER software saved in .Bin format. The program can read the files, but when I open them in Notepad, this shows up:
kPa N N mm mm 5.4 -1.0 -1.0 -0.013 0.000 4.0 -1.0 -1.0 -0.012 0.000 4.0 -1.0 -1.0 -0.012 0.000 *0 4.0 -1.0 -1.0 -0.012 0.000 8# 4.3 -1.0 -1.0 -0.012 0.000 FP 4.3 -1.0 -1.0 -0.013 0.000 T` 4.1 -1.0 -1.0 -0.012 0.000 bp 4.2 -1.0 -1.0 -0.012 0.000 p€ 4.1 -1.0 -1.1 -0.012 0.000
~ 4.2 -1.0 -1.1 -0.012 0.000
Ś  4.3 -1.0 -1.0 -0.012 0.000 š° 4.0 -1.0 -1.0 -0.012 0.000
The same type of data should look like this:
mins mins kPa N N mm mm
1 00:00:00:00 0.000 5.3 -1.0 -1.0 0.012 0.000
2 00:00:00:03 0.060 4.1 -1.0 -1.0 0.012 0.000
3 00:00:00:07 0.120 4.2 -1.0 -1.0 0.016 0.000
4 00:00:00:10 0.180 4.1 -1.0 -1.0 0.019 0.000
5 00:00:00:14 0.240 4.1 -1.0 -1.0 0.021 0.000
I tried several online decoders and the encoding options in Notepad++, but nothing worked. Is there a process that can decode them? It would help a lot.
Thank you for your time.
Tom

Rounding to multiple places in Crystal Reports

I don't know if this is even possible in Crystal Reports. I have a list of rules for how to report a result depending on the range it falls in. For example, if the result (raw data) is 1.63, I have to round it to the nearest 0.1, so it becomes 1.6.
Here is the list:
0-1.0 round to nearest 0.05
1-10 round to nearest 0.1
10-40 round to nearest 1
40-100 round to nearest 5
100-400 round to nearest 10
400-1000 round to nearest 50
1000+ round to nearest 100
I thought using ceiling/floor would work but I don't know what I am doing wrong because it is asking for a boolean right after the then. This is the formula I was attempting to use. Our system uses one form of rounding so we were hoping to use the report to help with the rounding issue.
If ({PRM_SxData.nResult} in 0 to 1.0 )
then (Ceiling ({PRM_SxData.nResult}, 0.05)) and (Floor
({PRM_SxData.nResult}, 0.05))
else
IF {PRM_SxData.nResult} in 1.01 to 10
then ((Ceiling ({PRM_SxData.nResult}, 0.1)) and (Floor
({PRM_SxData.nResult}, 0.1)))
else
IF {PRM_SxData.nResult} in 10.01 to 40
then ((Ceiling ({PRM_SxData.nResult}, 1)) and (Floor ({PRM_SxData.nResult},
1)))
else
IF {PRM_SxData.nResult} in 40.01 to 100
then ((Ceiling ({PRM_SxData.nResult}, 5)) and (Floor ({PRM_SxData.nResult},
5)))
else
IF {PRM_SxData.nResult} in 100.01 to 400
then ((Ceiling ({PRM_SxData.nResult}, 10)) and (Floor ({PRM_SxData.nResult},
10)))
else
IF {PRM_SxData.nResult} in 400.01 to 1000
then ((Ceiling ({PRM_SxData.nResult}, 50)) and (Floor ({PRM_SxData.nResult},
50)))
else
IF {PRM_SxData.nResult} > 1000.01
then ((Ceiling ({PRM_SxData.nResult}, 100)) and (Floor
({PRM_SxData.nResult}, 100)))
else " "
It's messy but it is the best I can come up with.
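Try this. Each branch has to return a single number, so the two results cannot be joined with and; note that Floor rounds down to the given multiple rather than to the nearest one: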
If ({PRM_SxData.nResult} >= 0 and {PRM_SxData.nResult}<= 1.0 )
then Floor({PRM_SxData.nResult}, 0.05)
else
IF {PRM_SxData.nResult} >= 1.01 and {PRM_SxData.nResult} <= 10
then Floor ({PRM_SxData.nResult}, 0.1)
else
IF {PRM_SxData.nResult} >= 10.01 and {PRM_SxData.nResult} <= 40
then Floor ({PRM_SxData.nResult}, 1)
else
IF {PRM_SxData.nResult} >= 40.01 and {PRM_SxData.nResult} <= 100
then Floor ({PRM_SxData.nResult}, 5)
else
IF {PRM_SxData.nResult} >= 100.01 and {PRM_SxData.nResult} <= 400
then Floor ({PRM_SxData.nResult}, 10)
else
IF {PRM_SxData.nResult} >= 400.01 and {PRM_SxData.nResult} <= 1000
then Floor ({PRM_SxData.nResult}, 50)
else
IF {PRM_SxData.nResult} >= 1000.01
then Floor ({PRM_SxData.nResult}, 100)
else 0
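For comparison, here is the range table from the question written as a round-to-nearest lookup. This is a sketch in Python rather than Crystal syntax, so treat it as an illustration of the logic only. It assumes that a value sitting exactly on a boundary (for example 1.0) belongs to the lower range and that halves round up; the question does not specify either. The name round_result is made up.

```python
from decimal import Decimal, ROUND_HALF_UP

# (upper bound of the range, rounding step), taken from the list in the question.
STEPS = [
    (Decimal("1"), Decimal("0.05")),
    (Decimal("10"), Decimal("0.1")),
    (Decimal("40"), Decimal("1")),
    (Decimal("100"), Decimal("5")),
    (Decimal("400"), Decimal("10")),
    (Decimal("1000"), Decimal("50")),
]
DEFAULT_STEP = Decimal("100")  # used for results above 1000


def round_result(value):
    """Round value to the nearest multiple of the step for its range."""
    x = Decimal(str(value))
    step = next((s for upper, s in STEPS if x <= upper), DEFAULT_STEP)
    # Count the whole steps (rounding halves up), then scale back.
    return (x / step).quantize(Decimal("1"), rounding=ROUND_HALF_UP) * step


print(round_result(1.63))
print(round_result(0.63))
print(round_result(1234))
```

Decimal is used instead of float so that steps such as 0.05 and 0.1 are represented exactly. Unlike the chained comparisons against 1.01, 10.01 and so on, the lookup leaves no gap for values such as 1.005.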

Number of rows in each If/Else condition in Apache Spark Decision Tree

I have a dataset of 100 records and trained a decision tree on it.
println(model.toDebugString) outputs:
DecisionTreeModel classifier of depth 3 with 7 nodes
If (feature 0 <= 2.0)
Predict: 0.0
Else (feature 0 > 2.0)
If (feature 1 <= 12354.0)
If (feature 2 <= 14544.0)
Predict: 1.0
Else (feature 2 > 14544.0)
Predict: 0.0
Else (feature 1 > 12354.0)
Predict: 1.0
Is it possible to find out how many rows go to the If condition and how many to the Else condition?
For example, 40 rows go to If (feature 0 <= 2.0) and 60 rows go to Else (feature 0 > 2.0).
Unfortunately, there is no built-in method to compute that for now. You need to loop over the conditions, filter the DataFrame on each one, and count the result.
Example: df.filter([condition1]).count
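The filter-then-count loop can be sketched as follows. This version runs on plain Python lists so that it is self-contained; on a Spark DataFrame each list comprehension would become a filter followed by a count, as in the example above. The tree is the one from the question written out by hand as nested dictionaries, the rows are made-up feature vectors, and count_rows is a hypothetical helper name.

```python
# The tree from the question, written out by hand.
TREE = {
    "feature": 0, "threshold": 2.0,
    "left": {"predict": 0.0},
    "right": {
        "feature": 1, "threshold": 12354.0,
        "left": {
            "feature": 2, "threshold": 14544.0,
            "left": {"predict": 1.0},
            "right": {"predict": 0.0},
        },
        "right": {"predict": 1.0},
    },
}

# Made-up feature vectors, for illustration only.
ROWS = [
    [1.0, 100.0, 100.0],
    [2.0, 20000.0, 0.0],
    [3.0, 100.0, 100.0],
    [3.0, 100.0, 20000.0],
    [5.0, 20000.0, 1.0],
]


def count_rows(node, rows, label="root", counts=None):
    """Record how many rows reach each node, labelled by the condition leading to it."""
    if counts is None:
        counts = []
    counts.append((label, len(rows)))
    if "predict" not in node:
        f, t = node["feature"], node["threshold"]
        # Rows that satisfy the condition go left, the rest go right.
        count_rows(node["left"], [r for r in rows if r[f] <= t],
                   f"If (feature {f} <= {t})", counts)
        count_rows(node["right"], [r for r in rows if r[f] > t],
                   f"Else (feature {f} > {t})", counts)
    return counts


for label, n in count_rows(TREE, ROWS):
    print(n, label)
```

Each nested condition is applied only to the rows that already passed its parent, which is why the filters have to be chained rather than evaluated independently.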

How do I format the digit precision of my REPL output in Lisp?

My question is:
How can I set the precision of my REPL print output?
As an example, this simple function here:
(defun gaussian (rows cols sigma)
(let ((filter (make-array `(,rows ,cols)))
(rowOffset (/ (- rows 1) 2.0))
(colOffset (/ (- cols 1) 2.0)))
(loop for i from 0 to (- rows 1)
do (loop for j from 0 to (- cols 1)
do (setf (aref filter i j)
(gaussDistVal i j rowOffset ColOffset sigma))))
filter))
If I call (gaussian 5 5 1), my output is the following:
#2A((0.01831564 0.082085 0.13533528 0.082085 0.01831564)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.13533528 0.60653067 1.0 0.60653067 0.13533528)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.01831564 0.082085 0.13533528 0.082085 0.01831564))
Whereas I'd like to get:
#2A((0.0 0.1 0.1 0.1 0.0)
(0.0 0.4 0.6 0.4 0.1)
(0.1 0.6 1.0 0.6 0.1)
(0.0 0.4 0.6 0.4 0.1)
(0.0 0.1 0.1 0.1 0.0))
If you have the answer, could you also please tell me where these "REPL customisations" are documented?
(SBCL 1.2.11; Slime on Emacs 25)
Using the Common Lisp pretty printer
Common Lisp has an extensive pretty printer. A rarely used feature is a dispatch table for controlling how objects of a certain type are printed. See set-pprint-dispatch for how to configure this functionality.
The function format has features to output various forms of float numbers.
This example combines both:
CL-USER 32 > (set-pprint-dispatch 'float
#'(lambda (s obj)
(format s "~,1F" obj)))
NIL
CL-USER 33 > (setf *print-pretty* t)
T
CL-USER 34 > #2A((0.01831564 0.082085 0.13533528 0.082085 0.01831564)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.13533528 0.60653067 1.0 0.60653067 0.13533528)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.01831564 0.082085 0.13533528 0.082085 0.01831564))
#2A((0.0 0.1 0.1 0.1 0.0)
(0.1 0.4 0.6 0.4 0.1)
(0.1 0.6 1.0 0.6 0.1)
(0.1 0.4 0.6 0.4 0.1)
(0.0 0.1 0.1 0.1 0.0))
One may also want to use it temporarily:
CL-USER 37 > (let ((*print-pprint-dispatch* (copy-pprint-dispatch)))
(set-pprint-dispatch 'float
#'(lambda (s obj)
(format s "~,1F" obj)))
(pprint #2A((0.01831564 0.082085 0.13533528 0.082085 0.01831564)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.13533528 0.60653067 1.0 0.60653067 0.13533528)
(0.082085 0.36787945 0.60653067 0.36787945 0.082085)
(0.01831564 0.082085 0.13533528 0.082085 0.01831564))))
#2A((0.0 0.1 0.1 0.1 0.0)
(0.1 0.4 0.6 0.4 0.1)
(0.1 0.6 1.0 0.6 0.1)
(0.1 0.4 0.6 0.4 0.1)
(0.0 0.1 0.1 0.1 0.0))

Delete specific lines from a file based on another file

I have some text files in a folder named ff as follows. I need to delete the lines in these files based on another file aa.txt.
32bm.txt:
249 253 A P - 0 0 8 0, 0.0 6,-1.4 0, 0.0 2,-0.4 -0.287 25.6-102.0 -74.4 161.1 37.1 13.3 10.9
250 254 A K B Z 254 0E 77 -48,-2.5 -48,-0.3 4,-0.2 4,-0.3 -0.720 360.0 360.0 -93.4 135.2 38.1 11.1 8.1
252 !* 0 0 0 0, 0.0 0, 0.0 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 360.0 0.0 0.0 0.0
253 143 B R 0 0 96 0, 0.0 -2,-3.7 0, 0.0 2,-0.2 0.000 360.0 360.0 360.0 110.4 38.4 10.4 3.0
254 144 B Q B -Z 250 0E 62 -4,-0.3 -4,-0.2 -3,-0.1 2,-0.1 -0.347 360.0-157.5 -58.1 119.5 39.4 13.6 4.8
255 145 B T - 0 0 22 -6,-1.4 2,-0.3 -2,-0.2 -7,-0.2 -0.396 7.8-127.4 -91.5 173.9 36.3 15.7 5.4
2fok.txt:
1 361 X G 0 0 137 0, 0.0 2,-0.2 0, 0.0 3,-0.0 0.000 360.0 360.0 360.0 97.3 25.2 -16.6 -6.6
2 362 X A - 0 0 98 1,-0.0 0, 0.0 0, 0.0 0, 0.0 -0.649 360.0 -33.9-148.3 84.1 28.0 -18.6 -4.8
3 363 X R - 0 0 226 -2,-0.2 2,-0.0 1,-0.1 -1,-0.0 1.000 68.7-149.8 66.4 76.9 31.1 -16.5 -4.0
1 361 B G 0 0 137 0, 0.0 2,-0.2 0, 0.0 3,-0.0 0.000 360.0 360.0 360.0 97.3 25.2 -16.6 -6.6
2 362 B A - 0 0 98 1,-0.0 0, 0.0 0, 0.0 0, 0.0 -0.649 360.0 -33.9-148.3 84.1 28.0 -18.6 -4.8
3 363 B R - 0 0 226 -2,-0.2 2,-0.0 1,-0.1 -1,-0.0 1.000 68.7-149.8 66.4 76.9 31.1 -16.5 -4.0
aa.txt:
32bm B 143 145
2fok X 361 363
2moj B 361 367
-
-
-
For example, in the 32bm.txt, I need only the lines having B (column3) and the numbers from 143 to 145 (column2).
Desired output:
32bm.txt
253 143 B R 0 0 96 0, 0.0 -2,-3.7 0, 0.0 2,-0.2 0.000 360.0 360.0 360.0 110.4 38.4 10.4 3.0
254 144 B Q B -Z 250 0E 62 -4,-0.3 -4,-0.2 -3,-0.1 2,-0.1 -0.347 360.0-157.5 -58.1 119.5 39.4 13.6 4.8
255 145 B T - 0 0 22 -6,-1.4 2,-0.3 -2,-0.2 -7,-0.2 -0.396 7.8-127.4 -91.5 173.9 36.3 15.7 5.4
2fok.txt
1 361 X G 0 0 137 0, 0.0 2,-0.2 0, 0.0 3,-0.0 0.000 360.0 360.0 360.0 97.3 25.2 -16.6 -6.6
2 362 X A - 0 0 98 1,-0.0 0, 0.0 0, 0.0 0, 0.0 -0.649 360.0 -33.9-148.3 84.1 28.0 -18.6 -4.8
3 363 X R - 0 0 226 -2,-0.2 2,-0.0 1,-0.1 -1,-0.0 1.000 68.7-149.8 66.4 76.9 31.1 -16.5 -4.0
With awk you can do something like this:
#!/usr/bin/awk -f
NR == FNR { # If we are in the first file
low[$1,$2]=$3 # store the low value in a map
hi[$1,$2]=$4 # store the high value in another map
next # skip the remaining commands
}
# We are not in the first file
($2 >= low[FILENAME,$3]) && ($2 <= hi[FILENAME,$3])
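# The FILENAME variable holds the name of the current file.
# If the number we read is within the range, do the
# default action (which is to print the current line).
Put the script in a file named script.awk, make it executable, and run it like this (the data files are named without the .txt extension here, so FILENAME matches the first column of aa.txt):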
$ ./script.awk aa.txt 32bm 2fok
253 143 B R 0 0 96 0, 0.0 -2,-3.7 0, 0.0 2,-0.2 0.000 360.0 360.0 360.0 110.4 38.4 10.4 3.0
254 144 B Q B -Z 250 0E 62 -4,-0.3 -4,-0.2 -3,-0.1 2,-0.1 -0.347 360.0-157.5 -58.1 119.5 39.4 13.6 4.8
255 145 B T - 0 0 22 -6,-1.4 2,-0.3 -2,-0.2 -7,-0.2 -0.396 7.8-127.4 -91.5 173.9 36.3 15.7 5.4
1 361 X G 0 0 137 0, 0.0 2,-0.2 0, 0.0 3,-0.0 0.000 360.0 360.0 360.0 97.3 25.2 -16.6 -6.6
2 362 X A - 0 0 98 1,-0.0 0, 0.0 0, 0.0 0, 0.0 -0.649 360.0 -33.9-148.3 84.1 28.0 -18.6 -4.8
3 363 X R - 0 0 226 -2,-0.2 2,-0.0 1,-0.1 -1,-0.0 1.000 68.7-149.8 66.4 76.9 31.1 -16.5 -4.0
Or, if you prefer a one-liner:
awk 'NR==FNR{low[$1,$2]=$3;hi[$1,$2]=$4;next}$2>=low[FILENAME,$3]&&$2<=hi[FILENAME,$3]' aa.txt 32bm 2fok
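The same two-pass idea (load the ranges first, then filter each data file) can be sketched in Python. The function names load_ranges and keep_lines are made up, and the shortened data lines below only stand in for the real files.

```python
def load_ranges(range_lines):
    """Map (file stem, chain) -> (low, high) from lines such as '32bm B 143 145'."""
    ranges = {}
    for line in range_lines:
        parts = line.split()
        # Skip anything that is not "name chain low high" (for example "-" lines).
        if len(parts) == 4 and parts[2].isdigit() and parts[3].isdigit():
            name, chain, low, high = parts
            ranges[(name, chain)] = (int(low), int(high))
    return ranges


def keep_lines(stem, data_lines, ranges):
    """Keep the lines whose column 3 (chain) and column 2 (number) match a range."""
    kept = []
    for line in data_lines:
        cols = line.split()
        if len(cols) < 3 or not cols[1].isdigit():
            continue  # e.g. the "252 !* ..." separator line
        bounds = ranges.get((stem, cols[2]))
        if bounds and bounds[0] <= int(cols[1]) <= bounds[1]:
            kept.append(line)
    return kept


# Shortened stand-ins for aa.txt and 32bm.txt.
ranges = load_ranges(["32bm B 143 145", "2fok X 361 363", "-"])
data = [
    "250 254 A K B",
    "252 !* 0 0 0",
    "253 143 B R 0",
    "255 145 B T -",
    "256 146 B T -",
]
for line in keep_lines("32bm", data, ranges):
    print(line)
```

When reading real files, pathlib.Path(path).stem gives the name without the extension, so 32bm.txt can be matched against the 32bm entry in aa.txt without renaming anything.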