Entropy and Decision Trees - classification

Suppose I have a table of customer information with attributes such as customer ID, name, date of birth, nationality, income, etc.
Each customer in the table has a unique customer ID. I know that the Gini coefficient for each Customer ID value is zero, which makes the overall Gini for Customer ID zero.
Can I assume that the entropy of Customer ID is also zero? Why or why not?

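Yes. Both Gini and entropy quantify impurity, and a value of 0 indicates perfect purity. Because every Customer ID is unique, splitting on it puts exactly one record in each partition, so each partition contains a single class with proportion 1. Gini is then 1 - 1^2 = 0 and entropy is -1 * log(1) = 0, so whenever Gini equals 0, entropy does too.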
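
As a rough check of the reasoning above, here is a minimal Python sketch that computes both measures after splitting on an attribute. The toy rows and the "churn" class label are assumptions made up for illustration; the question does not say what the target class is.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(rows, attribute, target, measure):
    """Weighted average impurity of the target after splitting on an attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(p) / total * measure(p) for p in partitions.values())


# Hypothetical toy data; "churn" is an assumed class label.
customers = [
    {"customer_id": 1, "nationality": "US", "churn": "Y"},
    {"customer_id": 2, "nationality": "US", "churn": "N"},
    {"customer_id": 3, "nationality": "DE", "churn": "Y"},
    {"customer_id": 4, "nationality": "DE", "churn": "Y"},
]

for attribute in ("customer_id", "nationality"):
    g = weighted_impurity(customers, attribute, "churn", gini)
    h = weighted_impurity(customers, attribute, "churn", entropy)
    print(attribute, g, h)
```

Splitting on customer_id gives zero for both measures because every partition holds one record, while nationality leaves some impurity under both.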

Related

Tableau: how to make a count if in a for loop?

I'm just starting out with Tableau and would like to do a count if in a for loop.
I have the following variables:
City
User
Round: takes values of either A or B
Amount
I would like to have a countif function that shows how many users received any positive amount in both round A and round B in a given city.
In my dashboard, each row represents a city, and I would like to have a column that shows the total number of users in each city that received amounts in both rounds.
Thanks!
You can go for a simple solution that works.
Create a calculated field called "Positive Rounds per User" using the below formula:
// counts the number of unique rounds that had positive amounts per user in a city
{ FIXED [User], [City]: COUNTD(IIF([Amount]>0, [Round], NULL))}
You can use the above to create another calculated field called "unique users":
// unique number of users that have 2 in "Positive Rounds per User" field
COUNTD(IIF([Positive Rounds per User]=2, [User], NULL))
You can combine the two calculations into one, but the result gets hard to read, so it is better to keep them separate.
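
To make the two-step logic easier to check outside Tableau, here is a small Python sketch of the same idea: first collect the distinct rounds with a positive amount per user and city, then count the users who have both. The sample rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (city, user, round, amount).
rows = [
    ("Paris", "u1", "A", 10.0),
    ("Paris", "u1", "B", 5.0),
    ("Paris", "u2", "A", 7.0),
    ("Paris", "u2", "B", 0.0),  # a zero amount is not positive
    ("Paris", "u5", "A", 2.0),
    ("Paris", "u5", "B", 8.0),
    ("Lyon", "u3", "A", 3.0),
    ("Lyon", "u3", "B", 4.0),
    ("Lyon", "u4", "B", 9.0),
]


def users_paid_in_both_rounds(rows):
    """Return {city: number of users with a positive amount in two distinct rounds}."""
    # Step 1: distinct rounds with a positive amount per (city, user),
    # mirroring the FIXED [User], [City] calculation.
    positive_rounds = defaultdict(set)
    for city, user, round_name, amount in rows:
        if amount > 0:
            positive_rounds[(city, user)].add(round_name)

    # Step 2: count users per city whose set of positive rounds has two entries.
    counts = {city: 0 for city, _user, _round, _amount in rows}
    for (city, _user), rounds in positive_rounds.items():
        if len(rounds) == 2:
            counts[city] += 1
    return counts


print(users_paid_in_both_rounds(rows))
```

Like the Tableau version, this relies on Round taking only the values A and B, so two distinct positive rounds means both.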

Hiding column in Spotfire CrossTable

I have one data table with various identifiers in 3 columns (called BU, Company, and Group). I created a cross table that sums the face by 2 layers: an identifier (‘Actual’ and ‘Plan’) and a reporting period (‘9/30/16’ and ‘9/30/17’). The table was easy, aside from the variance section. I am currently using the following formula to compute the variance:
SN(Sum([Face]) - Sum([Face]) OVER (ParallelPeriod([Axis.Columns])),
Sum([Face])) AS [PlanVariance]
This gives me the correct values in the Plan Variance section of the cross table for the Plan identifier, but the wrong values for the Actual identifier. (The Actual identifier under Plan Variance is equal to the Actual identifier under the Sum(Face) section.) If I remove the SN function, the Plan Variance is empty for all identifiers that have no face for a group, and it is also empty for the Actual section under Plan Variance.
Is there a way to create a cross table that would show the variance for the Plan Identifier ONLY? Can I stop the cross table from calculating the plan variance on the actual segment? Or is there a way to have the actual field hidden in the plan variance section of the final visualization?
Thanks for any help/advice you can provide!

Criteria to classify retail customers as churn Y or N

I have a retail transactions data set. Some of the attributes are CUSTID, BILL_DT, ITEM_Desc, and VALUE. I want to classify each CUSTID as churn Y or N. Should I use the number of days between the last purchase date and today as the criterion? Can I say that a customer has churned after more than 180 days without a purchase? What criteria do big retailers like Costco and Walmart use?
Thanks,
From the transaction data, you could extract a history for each customer consisting of a sequence of pairs (time elapsed since last transaction, amount spent in current transaction). If you know which customers have actually churned, you could build a predictive model using a Markov chain classifier.
https://pkghosh.wordpress.com/2015/07/06/customer-conversion-prediction-with-markov-chain-classifier/
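
As a rough illustration only, the Python sketch below shows the simple recency rule from the question (no purchase for more than 180 days) and the per-customer history of (days since previous purchase, amount) pairs described above. It is not the Markov chain classifier from the linked post. The sample transactions, the function names, and the as-of date are assumptions, and 180 days is the asker's suggested cut-off, not an industry standard.

```python
from datetime import date

# Hypothetical transactions: (CUSTID, BILL_DT, VALUE).
transactions = [
    ("C1", date(2023, 1, 5), 40.0),
    ("C1", date(2023, 11, 20), 15.0),
    ("C2", date(2023, 2, 1), 60.0),
]


def label_churn(transactions, as_of, threshold_days=180):
    """Map each CUSTID to 'Y' if its last purchase is more than threshold_days before as_of."""
    last_purchase = {}
    for cust_id, bill_dt, _value in transactions:
        if cust_id not in last_purchase or bill_dt > last_purchase[cust_id]:
            last_purchase[cust_id] = bill_dt
    return {
        cust_id: "Y" if (as_of - last).days > threshold_days else "N"
        for cust_id, last in last_purchase.items()
    }


def purchase_history(transactions):
    """Per customer, list (days since previous purchase, amount) pairs in date order.

    The first purchase has no predecessor, so it produces no pair.
    """
    by_customer = {}
    for cust_id, bill_dt, value in transactions:
        by_customer.setdefault(cust_id, []).append((bill_dt, value))
    history = {}
    for cust_id, purchases in by_customer.items():
        purchases.sort()
        history[cust_id] = [
            ((current[0] - previous[0]).days, current[1])
            for previous, current in zip(purchases, purchases[1:])
        ]
    return history


labels = label_churn(transactions, as_of=date(2023, 12, 31))
print(labels)
print(purchase_history(transactions))
```

A fixed threshold is easy to explain, but it treats frequent and occasional shoppers alike, which is one reason to model the per-customer history instead.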

2 PK samples for the same patient in Matlab Simbiology: how to calculate intra-individual variability?

Sorry, I'm new to MATLAB's SimBiology toolbox!
I'm trying to build a population pharmacokinetics model that includes intra-individual variability / residual unexplained variability.
Would anyone kindly advise how to input the data if I have two pharmacokinetics samples per patient, collected one week apart? In particular, I am not sure how to label the Group ID (i.e. patient ID) for the same patient for different PK samples (taken a week apart).
Thanks in advance:)
If you want SimBiology to use the same model parameters for both PK samples, then the measurements need to have the same ID. You just need to ensure that the time data is correctly defined so that the second

calculating percentage of average in tsql

I have a set of records associated with IDs. Any number of records may be associated with an ID, each with a currency value, and one of these values is flagged as selected. I need to calculate the average of all the associated currency values, then take a percentage between this average and the lowest value, grouped by ID. All the data needed is in one table:
input:
table x: ID, Selected, DollarAmt
output:
view y: ID, Average, Percentage
I'm having problems creating this query (view) and it's driving me nuts. Can anyone at least point me in the right direction?
Thanks all.
You can use this query:
SELECT Id,
       AVG(DollarAmt) AS Average,
       AVG(DollarAmt) / MIN(DollarAmt) AS Percentage
FROM TableX
GROUP BY Id
But I still don't understand the need for the "Selected" column in TableX.
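
To try the query above without a SQL Server instance, here is a small sketch that runs it through Python's built-in sqlite3 module. SQLite is only a stand-in for T-SQL here, and the sample rows and REAL column type are assumptions. Note that the "Percentage" column is really the ratio of the average to the minimum, and that a minimum of 0 would need separate handling.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableX (Id INTEGER, Selected INTEGER, DollarAmt REAL)")

# Hypothetical sample rows: (Id, Selected, DollarAmt).
conn.executemany(
    "INSERT INTO TableX VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 0, 20.0), (1, 0, 30.0), (2, 1, 50.0), (2, 0, 100.0)],
)

rows = conn.execute(
    """
    SELECT Id,
           AVG(DollarAmt) AS Average,
           AVG(DollarAmt) / MIN(DollarAmt) AS Percentage
    FROM TableX
    GROUP BY Id
    ORDER BY Id
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The ORDER BY is added here only to make the output order predictable; the Selected column is not used, matching the answer's query.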
Regards