How to fill missing dates in a dataset?

I have a dataset with variables like customer ID, age, language, gender_code, broker score, etc. The dataset also contains a variable called "start date", which is the date of the contract with the customer. But some of these dates are missing, so I need to fill them in with the most plausible dates. Can I use the k-nearest neighbors (KNN) model for this? Or are there other, better models to use?
Thanks in advance.
I think KNN is the best method, but I am not sure, because KNN predicts the missing values by using other predicted values.
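For what it's worth, here is a minimal sketch of KNN-based date imputation using scikit-learn's KNNImputer, with the date represented as a number of days; the file and column names are made up, and categorical variables such as language or gender_code would first need a numeric encoding:

# Minimal sketch of KNN imputation for dates (hypothetical file/columns).
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("contracts.csv", parse_dates=["start_date"])

# Represent each date as days since 1970-01-01; missing dates become NaN.
days = (df["start_date"] - pd.Timestamp("1970-01-01")).dt.days
features = df[["age", "broker_score"]].assign(start_days=days)

# Each missing value is replaced by the mean of its 5 nearest neighbours,
# found using the features that are observed for both rows.
imputed = KNNImputer(n_neighbors=5).fit_transform(features)

df["start_date_filled"] = pd.to_datetime(imputed[:, -1].round(), unit="D")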

Forecasting dates

I am trying to build a forecast for interest expense for floating debt in my company.
I have been given a set of ResetDates which help me match a given rate based on when the ResetDate is.
I have been successful in forecasting one period, but I need a much longer set of periods to satisfy my requirements.
I've tried derive nodes and nested if statements as well as filler nodes.
I am given this data to work with; I can only look at one ResetDate ahead.
Here you will find the data I used: columns A/B/C/D are what I'm given, and column E (the 5th column from left to right) is what I want to derive as my output.
I want to use 'InterestPayDate' and derive:
if it's more than 'NextReset', then add 90 days to 'NextReset' to create 'NextReset2'.
That is as far as I can get. Where my problem lies is that I want to look at NextReset2 and derive:
if 'InterestPayDate' is more than 'NextReset2', then add 90 days to 'NextReset2'; if it's less than 'NextReset2', keep the current value of 'NextReset2'.
Output should look like Column E here
Not sure if I need to dig deeper into the logical functions; in all honesty, I've just picked up SPSS and I am really trying to learn. Hopefully, you can point me in the right direction.
Thank you.
After computing the first NextReset2, you need to use a Filler node like the one below to change the value of the field.
You might need more than one identical node like this - one for each potential 90-day period by which you are looking to extend the NextReset2 date. In your sample data, you will need at least two Filler nodes to get the correct value of NextReset2 for the last of the records.
There might be a more elegant way to do it, but this will work and it's easy enough to make copies of a node and string them together like this.
Please also see a sample IBM SPSS Modeler stream showing this approach here and using your sample data.
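For clarity only (this is not Modeler syntax): the chained Filler nodes are equivalent to the following small Python loop, using the column names from the question, where each pass through the loop plays the role of one more Filler node:

# Roll NextReset2 forward in 90-day steps until it passes InterestPayDate.
from datetime import timedelta

def next_reset2(interest_pay_date, next_reset):
    nr2 = next_reset + timedelta(days=90)   # the initial Derive node
    while interest_pay_date > nr2:          # each iteration = one Filler node
        nr2 += timedelta(days=90)
    return nr2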

Qliksense: Compute median of grouped data

I'm facing an issue in QlikSense, trying to compute some statistical indicators (Percentiles, Quartiles, StdDev, Median etc.) on a dataset which is already grouped by the source.
I mean that my dataset is something similar to the following, in which, for each combination of Week and Customer Age, I have the total number of purchases:
I want to show the median of Customer Age, and due to the structure of the dataset I can't use the built-in fractile or median functions, since they would come up with something different.
Let's suppose I want to calculate the median age of people across all 3 weeks, so that I know the age of the people who account for 50% of my purchases.
To help you better understand the question, here is the histogram:
In this case, the median I want to get is 24-26 years, since 50% of the total population falls within that range.
I found a useful reference here, but I am having trouble writing this formula in QlikSense:
https://mba-lectures.com/statistics/descriptive-statistics/603/relationship-between-quartiles-decile...
Thanks a lot in advance.
[EDIT]: This is my Data Model View:
[EDIT 2]: Here is my qvf with a dataset more similar to the original one I'm using. As you can see, I can't get the correct result using your formula. In addition, I would like to use it to plot the trend of the median across weeks, but that doesn't seem to be possible (even if I use the modified version of the formula I pointed out in the comments).
If you want to calculate the median in such a scenario, you need a weighted median: basically, check which dimension value sits in the middle:
Aggr(
    If(
        (Rangesum(Above([# Purchases], 0, RowNo()))
            / Sum(TOTAL [# Purchases])) >= 0.5
        and
        (Rangesum(Above([# Purchases], 1, RowNo() - 1))
            / Sum(TOTAL [# Purchases])) < 0.5,
        [Customer Age]),
    [Customer Age])
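As a cross-check of what this expression computes, here is the same weighted-median logic as a short Python sketch (the numbers are made up): the weighted median is the first age whose cumulative purchase share reaches 50%.

# Weighted median of grouped data; ages must be sorted ascending.
ages      = [20, 22, 24, 26, 28]   # [Customer Age]
purchases = [10, 15, 30, 25, 20]   # [# Purchases]

total, cumulative = sum(purchases), 0
for age, n in zip(ages, purchases):
    before = cumulative / total     # share strictly before this age
    cumulative += n
    if cumulative / total >= 0.5 and before < 0.5:
        print("weighted median age:", age)   # prints 24
        break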

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access-logging data from a hospital about users accessing patient information, and to try to detect abnormal access behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip code) and a patient (e.g. age, sex, zip code). There are about 13-15 variables under consideration.
I was using R before and now I am using Python. I am able to use either, depending on the tools/libraries you suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data and let each category in a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more, and whether the proximity of data points indicates similar behaviors. Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started trying to slice the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem with using quartiles on data that's not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
You can use k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a Python implementation.
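For illustration, a minimal sketch of both suggestions in Python, assuming the kmodes and gower packages from PyPI; the file name, column indices, and parameter values below are hypothetical:

import numpy as np
import pandas as pd
import gower
from kmodes.kprototypes import KPrototypes

df = pd.read_csv("access_log.csv")     # hypothetical access-log extract
categorical_cols = [1, 2, 5]           # indices of the categorical columns

# Option 1: k-prototypes clusters mixed numeric/categorical data directly.
kp = KPrototypes(n_clusters=5, init="Cao", random_state=0)
labels = kp.fit_predict(df.to_numpy(), categorical=categorical_cols)

# Option 2: Gower distance plus a simple distance-based outlier score:
# the mean distance to the k nearest neighbours (large = anomalous).
D = gower.gower_matrix(df)
k = 10
knn_dist = np.sort(D, axis=1)[:, 1:k + 1]   # column 0 is the self-distance
outlier_score = knn_dist.mean(axis=1)
top_outliers = np.argsort(outlier_score)[::-1][:20]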

Running k-medoids algorithm in ELKI

I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an ARFF file (using the ARFFParser in ELKI):
The dataset has 7 dimensions; however, the clustering results that I obtain show clustering only at the level of one dimension, and only for 3 attributes, ignoring the rest, like this:
Could anyone help with how I can obtain a clustering visualization for all dimensions?
ELKI is mostly used with numerical data.
Currently, ELKI does not have a "mixed" data type, unfortunately.
The ARFF parser will split your data set into multiple relations:
a 1-dimensional numerical relation containing age
a LabelList relation storing sex and region
a 1-dimensional numerical relation containing salary
a LabelList relation storing married
a 1-dimensional numerical relation storing children
a LabelList relation storing car
Apparently it has messed up the relation labels, though. Other than that, this approach works perfectly well with ARFF data sets that consist of numerical data plus a class label, for example (the use case this parser was written for). It is well-defined, consistent behaviour, though not what you expected it to do.
The algorithm then ran on the first relation it could work with, i.e. age only.
So here is what you need to do:
Implement an efficient data type for storing mixed type data.
Modify the ARFF parser to produce a single relation of mixed type data.
Implement a distance function for this type, because the lack of a mixed type data representation means we do not have a distance to go with it either.
Choose this new distance function in k-Medoids.
Share the code, so others do not have to do this again. ;-)
Alternatively, you could write a script to encode your data as a numerical data set (see the sketch below); then it will work fine. But in my opinion, the results of one-hot encoding etc. are usually not very convincing.
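For example, a sketch of such an encoding script in Python; the file names are hypothetical, and the column names follow the attribute list above. The output is a purely numeric, space-separated file that ELKI's default numeric parser should accept:

# One-hot encode the nominal ARFF attributes into a numeric data set.
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("mixed.arff")
df = pd.DataFrame(data)

# scipy returns nominal values as byte strings; decode them first.
for col in df.select_dtypes(object):
    df[col] = df[col].str.decode("utf-8")

numeric = pd.get_dummies(df, columns=["sex", "region", "married", "car"],
                         dtype=int)
numeric.to_csv("encoded.csv", sep=" ", index=False, header=False)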

2 PK samples for the same patient in MATLAB SimBiology: how to calculate intra-individual variability?

Sorry, I'm new to MATLAB's SimBiology toolbox!
I'm trying to build a population pharmacokinetics model that includes intra-individual variability / residual unexplained variability.
Would anyone kindly advise how to input the data if I have two pharmacokinetic samples per patient, collected one week apart? In particular, I am not sure how to label the Group ID (i.e. patient ID) for the same patient for different PK samples (taken a week apart).
Thanks in advance :)
If you want SimBiology to use the same model parameters for both PK samples, then the measurements need to have the same ID. You just need to ensure that the time data is correctly defined, so that the second sample's time points reflect that it was collected one week after the first (offset by one week rather than restarting at zero).
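For example (all numbers below are made up), the grouped data for one patient could look like this, with both samples sharing the same ID and the second sample's time points offset by 168 hours:

ID   Time (hour)   Conc (mg/L)
1      0           0.0
1      2           4.1
1     24           1.3
1    168           0.0    <- second sample, one week later
1    170           3.9
1    192           1.2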