Number of rows in If/Else condition in Apache Spark Decision Tree - Scala

I have a dataset of 100 records, and I ran a decision tree on it.
On println(model.toDebugString)
the output is:
DecisionTreeModel classifier of depth 3 with 7 nodes
If (feature 0 <= 2.0)
 Predict: 0.0
Else (feature 0 > 2.0)
 If (feature 1 <= 12354.0)
  If (feature 2 <= 14544.0)
   Predict: 1.0
  Else (feature 2 > 14544.0)
   Predict: 0.0
 Else (feature 1 > 12354.0)
  Predict: 1.0
Is it possible to know how many rows go to the If condition and how many to the Else condition?
For example, 40 rows in If (feature 0 <= 2.0) and 60 rows in Else (feature 0 > 2.0).

Unfortunately there is no magical method to compute that for now. You'll need to loop over your conditions and filter, then count.
For example: df.filter([condition1]).count
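A minimal sketch of that approach, assuming the training data lives in a DataFrame df whose columns feature0 and feature1 (hypothetical names, substitute your own) correspond to feature 0 and feature 1 in the debug string:
import spark.implicits._ // for the $"colName" column syntax
// rows reaching If (feature 0 <= 2.0) vs. Else (feature 0 > 2.0)
val ifCount = df.filter($"feature0" <= 2.0).count()
val elseCount = df.filter($"feature0" > 2.0).count()
// a deeper node chains all of its ancestors' conditions, e.g. the node
// If (feature 1 <= 12354.0) that sits under Else (feature 0 > 2.0)
val innerCount = df.filter($"feature0" > 2.0 && $"feature1" <= 12354.0).count()
ifCount and elseCount should add up to the total of 100 records, and the same check applies at every deeper split.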

Related

(q/kdb+) Search column amounts

Can someone help me with the below?
nColss:1 3 4 4.5;
aa:([]amount:250000+500000*5?10;n1M:0.5*5?4;n3M:2+0.5*5?4;n4M:4+0.5*5?4;n4.5M:6+0.5*5?4);
aa:update nRng:{[l;n] (min l | l l bin n),(l l binr n & max l)}[nColss] each aa[`amount]%1000000 from aa;
aa:update nRng2:{`$("n",'string x),'"M"} each aa[`nRng] from aa;
amount  n1M n3M n4M n4.5M nRng   nRng2
---------------------------------------
250000  1.5 2   4   7     1 1f   `n1M`n1M
2250000 0.5 2   5   6.5   1 3f   `n1M`n3M
4250000 1.5 2.5 5   6     4 4.5  `n4M`n4.5M
250000  1   3.5 4.5 7.5   1 1f   `n1M`n1M
1250000 1   2.5 4   7     1 3f   `n1M`n3M
How can I generate a column nValue containing, for each row, the values of the columns specified in the nRng2 column?
Something like this:
nValue
1.5 1.5
0.5 2
5 6
1 1
1 2.5
I was trying something like
aa[aa[`nRng2]]
that generates
index value
0 (1.5 0.5 1.5 1 1;1.5 0.5 1.5 1 1)
1 (1.5 0.5 1.5 1 1;2 2 2.5 3.5 2.5)
2 (4 5 5 4.5 4;7 6.5 6 7.5 7)
3 (1.5 0.5 1.5 1 1;1.5 0.5 1.5 1 1)
4 (1.5 0.5 1.5 1 1;2 2 2.5 3.5 2.5)
then I would need to take the diagonal of this matrix, but I am stuck at that point.
I get slightly different values in the aa table when I enter your example code, but something like this seems to work:
q)aa[`nValue]:{x x`nRng2} each aa
q)aa
amount  n1M n3M n4M n4.5M nRng    nRng2       nValue
-----------------------------------------------------
4750000 0.5 2   4.5 6     4.5 4.5 n4.5M n4.5M 6 6
1250000 0   3.5 5   6     1 3     n1M n3M     0 3.5
3750000 0.5 2.5 5.5 7     3 4     n3M n4M     2.5 5.5
250000  1   3   5   6.5   1 1     n1M n1M     1 1
750000  0   3   4.5 7     1 1     n1M n1M     0 0
To give a quick explanation of what this is doing: each aa essentially passes each record of the table into the lambda function as a dictionary (a table in kdb+ is simply a list of dictionaries). Within the lambda we index into the record with nRng2 to get the column names, and then index into the dictionary again using those column names. We then assign the result using index notation to add a new column.

Create a Diagonal Matrix with specified number of rows and columns in Scala

I have an input mllib block matrix named matrix, like:
matrix : org.apache.spark.mllib.linalg.Matrix =
0.0 2.0 1.0 2.0
2.0 0.0 2.0 4.0
1.0 2.0 0.0 3.0
2.0 4.0 3.0 0.0
As per my Scala code, the diagonal entries will be zero for sure. I need the diagonal of the matrix to be 1. If I have a diagonal matrix with the diagonal values as 1, like:
diagonalMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
I can add those matrices, so the diagonal of matrix will be changed to 1:
matrix : org.apache.spark.mllib.linalg.Matrix =
1.0 2.0 1.0 2.0
2.0 1.0 2.0 4.0
1.0 2.0 1.0 3.0
2.0 4.0 3.0 1.0
We can create a diagonal matrix with a specified number of rows and columns and 1s on the diagonal, based on the answer given below. But as the number of rows and columns is too big, I need an optimized solution. Or is there any better way to set the diagonal of matrix to 1?
import org.apache.spark.mllib.linalg.DenseMatrix

val nR = 5
val nC = 5
// build the values column by column: 1.0 on the diagonal, 0.0 elsewhere
val seq = for {
  i <- 0 until nC
  j <- 0 until nR
  v = if (i == j) 1d else 0d
} yield v
val matrix = new DenseMatrix(nR, nC, seq.toArray)
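If the goal is just an identity matrix, a possibly lighter alternative is MLlib's built-in factory methods (assuming the org.apache.spark.mllib.linalg API); for a large n the sparse version stores only the n diagonal entries instead of n * n values:
import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix}

val n = 5
val eye = DenseMatrix.eye(n)          // n x n dense identity matrix
val sparseEye = SparseMatrix.speye(n) // n x n sparse identity, stores only the diagonal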

How to visualize decision tree model/object in pyspark?

Is there any way to visualize/plot a decision tree created using either the mllib or ml library in PySpark? Also, how can I get information such as the number of records in the leaf nodes? Thanks
First you need to use model.toDebugString to get output like this for your random forest model:
"RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
Tree 0 (weight 1.0):
If (feature 0 <= 3="" 10="" 1.0)="" if="" (feature="" <="0.0)" predict:="" 0.0="" else=""> 6.0)
Predict: 0.0
Else (feature 10 > 0.0)
If (feature 12 <= 12="" 63.0)="" predict:="" 0.0="" else="" (feature=""> 63.0)
Predict: 0.0
Else (feature 0 > 1.0)
If (feature 13 <= 3="" 1.0)="" if="" (feature="" <="3.0)" predict:="" 0.0="" else=""> 3.0)
Predict: 1.0
Else (feature 13 > 1.0)
If (feature 7 <= 7="" 1.0)="" predict:="" 0.0="" else="" (feature=""> 1.0)
Predict: 0.0
Tree 1 (weight 1.0):
If (feature 2 <= 11="" 15="" 1.0)="" if="" (feature="" <="0.0)" predict:="" 0.0="" else=""> 0.0)
Predict: 1.0
Else (feature 15 > 0.0)
If (feature 11 <= 11="" 0.0)="" predict:="" 0.0="" else="" (feature=""> 0.0)
Predict: 1.0
Else (feature 2 > 1.0)
If (feature 12 <= 5="" 31.0)="" if="" (feature="" <="0.0)" predict:="" 0.0="" else=""> 0.0)
Predict: 0.0
Else (feature 12 > 31.0)
If (feature 3 <= 3="" 4.0)="" predict:="" 0.0="" else="" (feature=""> 4.0)
Predict: 0.0
Tree 2 (weight 1.0):
If (feature 8 <= 4="" 6="" 1.0)="" if="" (feature="" <="2.0)" predict:="" 0.0="" else=""> 10875.0)
Predict: 1.0
Else (feature 6 > 2.0)
If (feature 1 <= 1="" 36.0)="" predict:="" 0.0="" else="" (feature=""> 36.0)
Predict: 1.0
Else (feature 8 > 1.0)
If (feature 5 <= 4="" 0.0)="" if="" (feature="" <="4113.0)" predict:="" 0.0="" else=""> 4113.0)
Predict: 1.0
Else (feature 5 > 0.0)
If (feature 11 <= 11="" 2.0)="" predict:="" 0.0="" else="" (feature=""> 2.0)
Predict: 0.0
Tree 3 ...
Save it to some .txt file, then use: https://github.com/tristaneljed/Decision-Tree-Visualization-Spark
You can get the statistics of all the leaf nodes, like impurity, gain, gini, and the array of elements classified into each label, from the model's data file.
The data file is located under data/ in the location where you save the model:
model.save(location)
modeldf = spark.read.parquet(location+"data/*")
This file contains much of the needed metadata for the decision tree, or even a random forest. You can extract all the needed information like:
import pandas as pd

noderows = modeldf.select("id", "prediction", "impurity", "gain",
                          "leftChild", "rightChild", "split").collect()
# Leaf nodes have no children (leftChild and rightChild are -1); the impurity
# column holds the gini value when gini is the chosen impurity measure
df = pd.DataFrame([[rw['id'], rw['gain'], rw['impurity'], rw['prediction']]
                   for rw in noderows
                   if rw['leftChild'] < 0 and rw['rightChild'] < 0],
                  columns=['id', 'gain', 'impurity', 'prediction'])
print(df)

Is there a way to only show the first record in a Crystal Report that does not meet a specified condition?

Say I have data formatted as follows in a Crystal Report:
Job: 1
Asm Opr LbrQty
0 10 0.0
0 10 60.0
0 10 60.0
0 20 65.0
0 30 0.0
0 30 20.0
0 30 40.0
Job: 2
Asm Opr LbrQty
0 10 60.0
0 10 60.0
0 10 75.0
0 20 0.0
0 20 165.0
0 30 0.0
0 30 20.0
0 30 40.0
0 40 60.0
1 10 60.0
1 10 60.0
1 10 75.0
1 20 0.0
1 20 165.0
1 30 0.0
1 30 20.0
1 40 0.0
1 40 60.0
I only want the report to show the first Opr within an Asm where LbrQty is NOT zero, as below:
Job: 1
Asm Opr LbrQty
0 10 60.0
0 20 65.0
0 30 20.0
Job: 2
Asm Opr LbrQty
0 10 60.0
0 20 165.0
0 30 20.0
0 40 60.0
1 10 60.0
1 20 165.0
1 30 20.0
1 40 60.0
I've attempted to use the following as my suppression formula, which works for the most part but still occasionally displays multiple records with the same Opr:
(
Previous ({OprSeq}) = ({OprSeq}) and
Previous ({JobNum}) = ({JobNum}) and
Previous ({LaborQty}) <> 0
) or
(
({LaborQty}) = 0
)
How can I change my formula to give me the behavior I require?
Try it the below way:
Create a running total with the following criteria:
In "Field to summarize" take LbrQty and take Count as the summary option.
In "Evaluate" use the option Formula and write the code below:
{LbrQty} > 0
In "Reset" use the option On change of field and choose Opr.
Now use this running total to suppress. In the suppress formula of the section, write the code below:
if {#RTotal1} = 1
then false
else true

Can I use IsNull or na in if() to check missing value?

This one is not a duplicate; I have a new question. I tried to write this:
package org.apache.spark.h2o.utils
import water.fvec.{NewChunk, Frame, Chunk}
import water._
class Miss extends MRTask[Miss] {
  override def map(c: Chunk, nc: NewChunk): Unit = {
    for (row <- 0 until c.len()) {
      if (c.atd(row) == 0) {
        nc.addNum(0)
      } else {
        nc.addNum(1)
      }
    }
  }
}
Can I use na or IsNull in if (...) to check whether or not that row is null?
Code result
A B C D E NaN
min 0
mean 0
stddev 0
max 1
missing 0
0 5.1 3.5 1.4 0.2 Iris-setosa 1
1 4.9 3 1.4 0.2 Iris-setosa 1
2 4.7 3.2 1.3 0.2 Iris-setosa 1
3 4.6 3.1 1.5 0.2 Iris-setosa 1
4 5 3.6 1.4 0.2 Iris-setosa 1
5 5.4 3.9 1.7 0.4 Iris-setosa 1
6 4.6 3.4 1.4 0.3 Iris-setosa 1
7 5 3.4 1.5 0.2 Iris-setosa 1
8 4.4 2.9 1.4 0.2 Iris-setosa 1
9 4.9 3.1 1.5 0.1 Iris-setos...
Something like this:
c.atd(row) match {
  case nan: Double if nan.isNaN => nc.addNum(0)
  case 0 => nc.addNum(0)
  case _ => nc.addNum(1)
}
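For completeness, a sketch of that match folded back into the original MRTask; it also uses Chunk.isNA, which (assuming the usual H2O Chunk API) is the direct way to test a cell for a missing value:
import water._
import water.fvec.{Chunk, NewChunk}

class Miss extends MRTask[Miss] {
  override def map(c: Chunk, nc: NewChunk): Unit = {
    for (row <- 0 until c.len()) {
      // atd returns Double.NaN for a missing cell, so isNaN (or c.isNA(row))
      // catches NAs; zeros still map to 0 as in the original code
      if (c.isNA(row) || c.atd(row) == 0) nc.addNum(0)
      else nc.addNum(1)
    }
  }
}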