featuretools dfs vs categorical_encoding - feature-engineering

I can add categorical encoding in two different ways:
With DFS, by treating the categorical feature as a relationship and computing mean/std/skew statistics. In this case the categorical feature and the value(s) are in the same dataframe.
With the categorical_encoding sub-library and fit_transform.
The only difference I see is that the second approach offers a wider range of parameters, e.g. setting method='leave_one_out', which can be more accurate than the regular mean used with DFS.
Am I right? And does categorical_encoding use parallel processing?

You can do the categorical encoding with DFS and also stack additional primitives to create new features. The library for categorical encoding does not use parallel processing, but does provide a wider range of encoders.
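To make the difference raised in the question concrete, here is a minimal pure-Python sketch of a plain per-category mean next to a leave-one-out mean. It does not use the featuretools or categorical_encoding APIs; the function names and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict


def _sums_and_counts(categories, values):
    """Accumulate the per-category sum and row count."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, val in zip(categories, values):
        sums[cat] += val
        counts[cat] += 1
    return sums, counts


def mean_encode(categories, values):
    """Replace each category with the mean value over all rows of that category."""
    sums, counts = _sums_and_counts(categories, values)
    return [sums[cat] / counts[cat] for cat in categories]


def leave_one_out_encode(categories, values, default):
    """Like mean_encode, but each row's own value is left out of its mean."""
    sums, counts = _sums_and_counts(categories, values)
    encoded = []
    for cat, val in zip(categories, values):
        if counts[cat] > 1:
            encoded.append((sums[cat] - val) / (counts[cat] - 1))
        else:
            # A category seen only once has no other rows to average over.
            encoded.append(default)
    return encoded


# Hypothetical toy data: one categorical column and one numeric column.
categories = ["a", "a", "a", "b", "b", "c"]
values = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]

mean_encoded = mean_encode(categories, values)
loo_encoded = leave_one_out_encode(
    categories, values, default=sum(values) / len(values)
)
```

Because a row never sees its own value in the leave-one-out variant, the encoded feature leaks less of that row's value back into itself, which is the usual argument for preferring it over a plain mean when the encoded column is the prediction target.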

Related

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert a DataFrame from Spark to the type used by DL4j? Currently, when I use a DataFrame in algorithms with DL4j, I get an error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use datavec for that. I can point you at examples for that if you want. Dataframes make too many assumptions, which makes them too brittle to be used for real-world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images for example)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a dataset is just a pair of ndarrays just like numpy. If you have to use spark tools, and want to use ndarrays on the last mile only, then my advice would be to get the dataframe to match some form of schema that is purely numerical, map that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with its data pipelines and using the JVM for things it shouldn't be used for (matrix math). We took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map the dataframe rows onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(inputFloatArray), yourLabelINDArray))
That will give you a "dataset". You need a pair of ndarrays: one matching your input data and one for the label.
The label from there is relative to the kind of problem you're solving whether that be classification or regression though.

Some questions about train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state = int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state=int) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting random_state=42 and random_state=43, for example?
In scikit-learn, train_test_split splits your input data into two sets: i) train and ii) test. By default it shuffles the data before splitting, and the random_state argument seeds that shuffle.
If random_state is not set, the shuffle draws on NumPy's global random state, so you will generally get a different split each time you re-run your program. Stratification is a separate matter: it is controlled by the stratify argument and is off by default.
Suppose you want several random splits of the data so that you can measure the performance of your regression on the same data with different splits. You can use random_state to achieve this. Each random_state value gives a different pseudo-random split of your initial data, so 42 and 43 simply produce two different, equally valid splits. To keep track of performance and reproduce it later on the same data, reuse the random_state value you used before.
This is useful for techniques such as cross-validation in machine learning.
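A short sketch of the reproducibility behaviour, assuming scikit-learn and NumPy are installed. The array below is a made-up stand-in for the 1000-row, 8-column dataset from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: 1000 rows, 8 columns.
dataset = np.arange(1000 * 8).reshape(1000, 8)

# Same seed twice: the shuffle, and therefore the split, is repeated exactly.
train_a, test_a = train_test_split(dataset, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(dataset, test_size=0.2, random_state=42)

# A different seed: a different, equally valid pseudo-random split.
train_c, test_c = train_test_split(dataset, test_size=0.2, random_state=43)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)

print("same seed, same test set:", same_seed_matches)
print("different seed, same test set:", different_seed_matches)
```

Leaving random_state out altogether behaves like picking a fresh seed on every run, so the test set changes between runs unless NumPy's global seed has been fixed elsewhere in the program.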

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access logging data in the hospital about user accessing patient information and try to detect abnormal accessing behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data so that each category in a variable becomes a 0/1 column in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
With a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more. Does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and I use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance, employee role and types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem with using quartiles on data that's not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
Alternatively, you can use the k-prototypes algorithm, which clusters data with mixed numeric and categorical attributes.
Here you can find a python implementation.
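As a rough illustration of the first suggestion, the sketch below computes a Gower-style distance in pure Python and scores each record by its mean distance to its k nearest neighbours, which is one simple distance-based outlier score among many. The records, the column layout and the choice of k are invented for the example.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records.

    numeric_ranges maps the index of each numeric feature to its
    (max - min) over the dataset; all other indices are categorical.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            value_range = numeric_ranges[i]
            total += abs(x - y) / value_range if value_range > 0 else 0.0
        else:
            # Categorical features simply match (0) or differ (1).
            total += 0.0 if x == y else 1.0
    return total / len(a)


def knn_outlier_scores(rows, numeric_idx, k=2):
    """Score each row by its mean Gower distance to its k nearest neighbours."""
    ranges = {
        i: max(r[i] for r in rows) - min(r[i] for r in rows) for i in numeric_idx
    }
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(
            gower_distance(a, b, ranges) for j, b in enumerate(rows) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical access records: (patient age, employee role, department).
rows = [
    (34, "nurse", "cardiology"),
    (36, "nurse", "cardiology"),
    (35, "nurse", "cardiology"),
    (33, "nurse", "cardiology"),
    (80, "clerk", "billing"),
]

scores = knn_outlier_scores(rows, numeric_idx=[0], k=2)
most_unusual = max(range(len(rows)), key=scores.__getitem__)
```

Fields that were anonymized into random integers should be listed as categorical here, not numeric, since the size of the difference between two such codes carries no meaning. The pairwise loop is quadratic in the number of rows, so a larger log would need sampling or an indexed neighbour search.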

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text analysis model that classifies small text files as either good, bad, or neutral. I was using a Support-Vector Machine as my classifier. However, I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, and if the support for that text file is below .7 or some user-specified threshold, classify that text file as neutral. I know this isn't regarded as the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1 and for the two class case they will sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
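Outside RapidMiner, the same thresholding logic can be sketched in a few lines of Python. This is only an illustration of the rule described above, not of the operators themselves; the function name and the confidence values are made up.

```python
def classify_with_threshold(confidences, min_confidence=0.7, fallback="neutral"):
    """Return the most confident class, or the fallback if it is too uncertain.

    confidences maps each class label to the model's confidence for it.
    """
    label, confidence = max(confidences.items(), key=lambda item: item[1])
    return label if confidence >= min_confidence else fallback


# Hypothetical per-document confidences from a two-class Good/Bad model.
examples = [
    {"Good": 0.91, "Bad": 0.09},
    {"Good": 0.55, "Bad": 0.45},
    {"Good": 0.20, "Bad": 0.80},
]

predictions = [classify_with_threshold(c) for c in examples]
print(predictions)  # ['Good', 'neutral', 'Bad']
```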

SVM-pref package from Cornell university

I'm using SVM-pref (http://svmlight.joachims.org) for a binary classification problem. I don't have much experience with this package and so I seek help with the following questions:
(1) My features are all discrete/nominal. Is there a special way to represent the feature vectors, such as a way to convert the nominal values into continuous values, or do we just replace the nominal values with dummy numbers like 1, 2, 3, etc.?
(2) If the answer to the first question is that we replace nominal values with dummy numbers, should we start numbering feature values from 1, so that we have 1:1 but never 1:0, because otherwise the learner will treat a zero-valued feature as non-existent? Is that correct?
(3) How do we find the best -c value and the values for the rest of the parameters? Is it only by trial and error, or are there other approaches for deciding on these parameters?
(1) To use categorical features in SVM you must encode them using dummy variables, e.g. one-hot coding. For every level of the category, you should introduce a dimension. Something like this for a feature with levels A, B and C:
A -> [1,0,0]
B -> [0,1,0]
C -> [0,0,1]
(2) See the answer to the previous question: use one dimension per categorical level.
(3) Typically this is done by testing possible values in a cross-validation setting.
Here is also another useful and informative discussion about representing nominal features for SVM classifiers.
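To tie the one-hot advice to the input file, here is a small sketch that writes one-hot features in the sparse `<target> <feature>:<value>` line format documented for SVM-light, where feature numbers start at 1 and zero-valued features are simply left out. It assumes the package in the question reads that same format; the helper names and the toy rows are hypothetical.

```python
def build_feature_index(rows):
    """Assign a 1-based feature number to every (column, level) pair seen."""
    index = {}
    for row in rows:
        for col, level in enumerate(row):
            index.setdefault((col, level), len(index) + 1)
    return index


def to_svmlight_line(label, row, index):
    """Encode one example as '<target> <feature>:1 ...' in ascending feature order."""
    active = sorted(index[(col, level)] for col, level in enumerate(row))
    return f"{label:+d} " + " ".join(f"{feature}:1" for feature in active)


# Hypothetical training data: two nominal columns and a +1/-1 label.
rows = [("A", "x"), ("B", "x"), ("C", "y")]
labels = [1, -1, 1]

index = build_feature_index(rows)
lines = [to_svmlight_line(label, row, index) for label, row in zip(labels, rows)]
for line in lines:
    print(line)
```

With this encoding every feature that appears on a line has value 1, and an absent feature means 0, so the concern in question (2) about writing 1:0 does not come up. Levels that only appear at prediction time are not in the index and would need explicit handling.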