New Feature in Scikit-Learn Pipeline - Interaction between two existing Features - feature-engineering

I have two features in my data set: height and area. I want to create a new feature by interacting area and height using a pipeline in scikit-learn.
Can anyone please guide me on how I can achieve this?
Thanks

You can achieve this with a custom transformer implementing fit and transform methods. Optionally, you can make it inherit from scikit-learn's TransformerMixin for bullet-proofing (it gives you fit_transform for free).
from sklearn.base import TransformerMixin

class CustomTransformer(TransformerMixin):
    def fit(self, X, y=None):
        """The fit method doesn't do much here, but it is still
        required if your pipeline ever needs to be fit.
        Just returns self."""
        return self

    def transform(self, X, y=None):
        """This is where the actual transformation occurs, assuming
        you want to compute the product of your height and area features.
        """
        # Copy X to avoid mutating the original dataset
        X_ = X.copy()
        # Change new_feature and the right-hand side according to your needs
        X_["new_feature"] = X_["height"] * X_["area"]
        # Return the newly transformed dataset; it will be
        # passed to the next step of your pipeline
        return X_
You can test it with this code:
import pandas as pd
from sklearn.pipeline import Pipeline
# Instantiate fake DataSet, your Transformer and Pipeline
X = pd.DataFrame({"height": [10, 23, 34], "area": [345, 33, 45]})
custom = CustomTransformer()
pipeline = Pipeline([("heightxarea", custom)])
# Test it
pipeline.fit(X)
pipeline.transform(X)
For such simple processing it might seem like overkill, but it is good practice to put any dataset manipulation into transformers. They are more reproducible that way.
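Running the test above should add a new_feature column with the values 3450, 759 and 1530 (the element-wise products). Since the transformer behaves like any other scikit-learn step, you can also chain it with further processing and an estimator. A minimal sketch, assuming the CustomTransformer defined above; the scaler and regressor here are placeholder choices, not part of the original question:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Build a pipeline that creates the interaction feature, scales everything,
# then fits a simple model on the result.
X = pd.DataFrame({"height": [10, 23, 34], "area": [345, 33, 45]})
y = [1.0, 2.0, 3.0]
pipeline = Pipeline([
    ("heightxarea", CustomTransformer()),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X, y)
pipeline.predict(X)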

Related

Selecting a range of columns in SKlearn column transformer

I am encoding categorical data; many columns need to be selected. I have typed them in individually and it works OK, but there is obviously a more elegant way.
import numpy as np
import pandas as pd
dataset = pd.read_csv('train.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2,5,6,7,8,9,10,11,12,13,14,15,16,21,22,23,24,25,27,28,29,30,31,32,33,34,35,39,40,41,42,53,54,55,56,57,58,60,63,64,65,72,73,74,78,79])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
I have tried using (23:34) and I have tried using slice, but that does not work as it is not that data type.
Which method should I use for selecting a range of columns?
Also, what datatype is it at the point where I am selecting the columns?
I searched but was not able to find a solution for this exact question.
Finally, is this an efficient way to encode categorical data, or should I be looking at an alternative method?
Thanks!
You can use the following workaround:
ct = ColumnTransformer(
    transformers=[
        ("ordinal_enc", OrdinalEncoder(), data.loc[:, "col1":"col100"].columns)
    ])
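If you are working with positional indices on the NumPy array, as in the original question, another option is to build the index list from ranges with numpy's np.r_. A minimal sketch, assuming x is the feature array from the question; the ranges below are illustrative, not the exact set of categorical columns:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# np.r_ concatenates integer ranges, so contiguous blocks of column indices
# can be written as slices instead of being typed out one by one.
cat_cols = np.r_[2:17, 21:26, 27:36, 39:43].tolist()
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), cat_cols)],
    remainder='passthrough')
x = np.array(ct.fit_transform(x))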

Visualizing an AutoDiff MultibodyPlant in PyDrake

I am trying to build a simple multibody plant system in Drake using the basic DrakeVisualizer. However, for my use case, I also want to be able to automatically track the derivatives through the physics simulation, so I am using the AutoDiffXd version of the system:
timestep = 1e-3
builder = DiagramBuilder_[AutoDiffXd]()
plant = MultibodyPlant(timestep)
scene_graph = SceneGraph_[AutoDiffXd]()
brick_file = FindResourceOrThrow("drake/examples/manipulation_station/models/061_foam_brick.sdf")
parser = Parser(plant)
brick = parser.AddModelFromFile(brick_file, model_name="brick")
plant.Finalize()
plant_ad = plant.ToAutoDiffXd()
plant_ad.RegisterAsSourceForSceneGraph(scene_graph)
scene_graph.AddRenderer("renderer", MakeRenderEngineVtk(RenderEngineVtkParams()))
DrakeVisualizer.AddToBuilder(builder, scene_graph)
builder.AddSystem(plant_ad)
builder.AddSystem(scene_graph)
builder.Connect(plant_ad.get_geometry_poses_output_port(), scene_graph.get_source_pose_port(plant_ad.get_source_id()))
builder.Connect(scene_graph.get_query_output_port(), plant_ad.get_geometry_query_input_port())
diagram = builder.Build()
context = diagram.CreateDefaultContext()
simulator = Simulator_[AutoDiffXd](diagram, context)
simulator.AdvanceTo(2.0)
However, when I run this, I get the following error:
File "/home/craig/Repos/drake-exps/autoDiffExperiment.py", line 102, in auto_phys
DrakeVisualizer.AddToBuilder(builder, scene_graph)
TypeError: AddToBuilder(): incompatible function arguments. The following argument types are supported:
1. (builder: pydrake.systems.framework.DiagramBuilder_[float], scene_graph: drake::geometry::SceneGraph<double>, lcm: pydrake.lcm.DrakeLcmInterface = None, params: pydrake.geometry.DrakeVisualizerParams = <pydrake.geometry.DrakeVisualizerParams object at 0x7ff6274e14b0>) -> pydrake.geometry.DrakeVisualizer
2. (builder: pydrake.systems.framework.DiagramBuilder_[float], query_object_port: pydrake.systems.framework.OutputPort_[float], lcm: pydrake.lcm.DrakeLcmInterface = None, params: pydrake.geometry.DrakeVisualizerParams = <pydrake.geometry.DrakeVisualizerParams object at 0x7ff627736730>) -> pydrake.geometry.DrakeVisualizer
Invoked with: <pydrake.systems.framework.DiagramBuilder_[AutoDiffXd] object at 0x7ff65654f8f0>, <pydrake.geometry.SceneGraph_[AutoDiffXd] object at 0x7ff656562130>
From this error, it appears the DrakeVisualizer class only accepts systems which use float scalars exclusively. So I am stuck --- either I can go back to floats (but lose the autodiff differentiable simulation functionality I was after in the first place), or continue to use autodiffxd systems (but be completely unable to visualize what is going on in my simulation).
Is there a way to get both that I am missing?
Sorry for the pain and inconvenience. Your description and assessment are all spot on. Most of the visualization mechanisms are float only and, in their current state, attempts to visualize an AutoDiff diagram will fail.
You have a couple of options (neither of which is appealing):
Go with one of the outcomes you've described above (no vis or no derivatives).
Put in a Drake feature request to be able to attach a visualizer to an AutoDiff diagram.
I can come up with some hacky workarounds (though it isn't immediately clear they would even work). So, if you're desperate for derivatives and visualization, they could be explored. But, ultimately, the feature request and a formal Drake solution would be the best long-term resolution.
=====================================
Big update. As of #14569, the DrakeVisualizer class is now templated on the scalar type (item 2 in the list above). That has two implications:
You can build an AutoDiffXd-valued diagram with a visualizer in it (as in your example), or
You can create a double-valued diagram and scalar convert it (i.e., diagram.ToAutoDiffXd()) into an AutoDiffXd-valued diagram.
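A rough sketch of the second route, building the diagram with double scalars and converting it afterwards. This is only an outline under the assumption that the post-#14569 pydrake API is available; names such as AddMultibodyPlantSceneGraph and Simulator_ are taken from pydrake but should be checked against your installed version:
from pydrake.autodiffutils import AutoDiffXd
from pydrake.geometry import DrakeVisualizer
from pydrake.multibody.plant import AddMultibodyPlantSceneGraph
from pydrake.systems.analysis import Simulator_
from pydrake.systems.framework import DiagramBuilder
# Build an ordinary double-valued diagram with a visualizer attached.
builder = DiagramBuilder()
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=1e-3)
# (add models with Parser here, as in the question, before Finalize)
plant.Finalize()
DrakeVisualizer.AddToBuilder(builder, scene_graph)
diagram = builder.Build()
# Scalar-convert the whole diagram, visualizer included, to AutoDiffXd.
diagram_ad = diagram.ToAutoDiffXd()
simulator = Simulator_[AutoDiffXd](diagram_ad)
simulator.AdvanceTo(2.0)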

How to writeback to dataframe using transform_df in palantir foundry?

I created a library for updating the descriptions of the columns of an input dataset. This function takes three parameters (input_dataset, output_dataset, config file) and eventually writes back the descriptions of the output dataset. We now want to import this library across various use cases. How do we handle the cases where we are writing Spark transformations, i.e. taking inputs through transform_df, since there we can't assign the output to an output variable? In that situation, how can I call my description library function? How should I proceed in such situations in Palantir Foundry? Any suggestions?
This method isn't currently supported using the @transform_df decorator; you'll have to use the @transform decorator at the moment.
The reasoning behind this resulted from recognizing the need for broader access to metadata APIs, which the @transform decorator already allows. Thus it seemed more in line with this pattern to keep it there, since the @transform_df decorator is inherently higher-level.
You can always simply move over your transformations from...
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ....
    return df
...to...
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input")
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ....
    my_output.write_dataframe(df)
...in which only 6 lines of code need to be changed.
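If the aim is then to call the description-updating library from the @transform version, the call can go inside the decorated function, since both the input and output objects are available there. A hedged sketch; update_descriptions and the config argument are hypothetical names standing in for whatever your library actually exposes:
from transforms.api import transform, Input, Output
# from my_description_lib import update_descriptions  # hypothetical import

@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input")
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ....
    my_output.write_dataframe(df)
    # update_descriptions(my_input, my_output, config)  # hypothetical call to your library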

How do I create arbitrary parameterized layers in DiffEqFlux.jl neuralODE? Julia Julialang Flux.jl

I am able to create and optimize neural ODEs in Julia (1.3 and 1.2) using Flux.jl and DiffEqFlux.jl, but it fails in a crucially important general case.
What works:
I can train the neural net parameters if it is built out of the provided Flux.jl layers like Dense().
I can include an arbitrary function as a layer in the network chain, e.g. x -> x.*x
What fails:
However, if the arbitrary function has parameters I want to train, then Flux.train! will not adjust these parameters, causing it to fail.
I have tried making these added parameters tracked and including them in the list of parameters given to the training system, but it ignores them and they remain unvaried.
The documentation says, very cryptically, that one can use Flux.@functor on a layer to make sure its parameters get tracked. However, @functor was not included in Flux until version 0.10.0, and the only version of Flux compatible with neural ODEs in DiffEqFlux is 0.9.0.
So here's a toy example of a 2-layer neural net I want to use:
p = param([1.0])
dudt = Chain(x -> p[1]*x.*x, Dense(2,2))
ps = Flux.params(dudt)
Then I use the Flux training on this. When I do, the parameter p is not varied, but the parameters in the Dense layer are.
I have tried explicitly including it like this:
ps = Flux.Params([p, dudt])
but that has the same result and the same problem.
I think what I need to do is build a struct with an associated function that implements
x -> p[1]*x*x
and then call @functor on this. That struct could then be used in the chain.
But as I noted, the version of Flux with @functor is not compatible with any version of DiffEqFlux.
So I need a way to make Flux pay attention to my custom parameters, not just the ones in Dense().
How???
I think I get what your question is, but please clarify if I am answering the wrong question here. The issue is that p is only grabbed from a global reference and thus not differentiated during adjoints. A much better way to handle this in 2020 is to use FastChain. The FastChain interface lets you define layer functions and their parameter dependencies, so this is a nice way to make your neural network incorporate arbitrary functions with parameters. Here's what that looks like:
using DifferentialEquations
using Flux, Zygote
using DiffEqFlux
x = Float32[2.; 0.]
p = Float32[2.0]
tspan = (0.0f0,1.0f0)
mylayer(x,p) = p[1]*x
DiffEqFlux.paramlength(::typeof(mylayer)) = 1
DiffEqFlux.initial_params(::typeof(mylayer)) = rand(Float32,1)
dudt = FastChain(FastDense(2,50,tanh),FastDense(50,2),mylayer)
p = DiffEqFlux.initial_params(dudt)
function f(u,p,t)
    dudt(u,p)
end
ex_neural_ode(x,p) = solve(ODEProblem(f,x,tspan,p),Tsit5())
solve(ODEProblem(f,x,tspan,p),Tsit5())
du0,dp = Zygote.gradient((x,p)->sum(ex_neural_ode(x,p)),x,p)
where the last value of p is the one parameter for mylayer. Or you can directly use Flux:
using DifferentialEquations
using Flux, Zygote
using DiffEqFlux
x = Float32[2.; 0.]
p2 = Float32[2.0]
tspan = (0.0f0,1.0f0)
dudt = Chain(Dense(2,50,tanh),Dense(50,2))
p,re = Flux.destructure(dudt)
function f(u,p,t)
    re(p[1:end-1])(u) |> x -> p[end]*x
end
ex_neural_ode() = solve(ODEProblem(f,x,tspan,[p;p2]),Tsit5())
grads = Zygote.gradient(()->sum(ex_neural_ode()),Flux.params(x,p,p2))
grads[x]
grads[p]
grads[p2]

Transforming dates in tensorflow or tensorflow extended

I am working with Tensorflow Extended, preprocessing data and among this data are date values (e.g. values of the form 16-04-2019). I need to apply some preprocessing to this, like the difference between two dates and extracting the day, month and year from it.
For example, I could need to have the difference in days between 01-04-2019 and 16-04-2019, but this difference could also span days, months or years.
Now, just using Python scripts this is easy to do, but I am wondering if it is also possible to do this with Tensorflow? It's important for my use case to do this within Tensorflow, because the transform needs to be done in the graph format so that I can serve the model with the transformations inside the pipeline.
I am using Tensorflow 1.13.1, Tensorflow Extended and Python 2.7 for this.
Posting from a similar issue on the tft GitHub.
Here's a way to do it:
import tensorflow_addons as tfa
import tensorflow as tf

@tf.function(experimental_follow_type_hints=True)
def fn_seconds_since_1970(date_time: tf.string, date_format: str = "%Y-%m-%d %H:%M:%S %Z"):
    seconds_since_1970 = tfa.text.parse_time(date_time, date_format, output_unit='SECOND')
    seconds_since_1970 = tf.cast(seconds_since_1970, dtype=tf.int64)
    return seconds_since_1970

string_date_tensor = tf.constant("2022-04-01 11:12:13 UTC")
seconds_since_1970 = fn_seconds_since_1970(string_date_tensor)

seconds_in_hour, hours_in_day = tf.constant(3600, dtype=tf.int64), tf.constant(24, dtype=tf.int64)
hours_since_1970 = seconds_since_1970 / seconds_in_hour
hours_since_1970 = tf.cast(hours_since_1970, tf.int64)
hour_of_day = hours_since_1970 % hours_in_day
days_since_1970 = seconds_since_1970 / (seconds_in_hour * hours_in_day)
days_since_1970 = tf.cast(days_since_1970, tf.int64)
day_of_week = (days_since_1970 + 4) % 7  # Jan 1st 1970 was a Thursday (4); Sunday is 0
print(f"On {string_date_tensor.numpy().decode('utf-8')}, {seconds_since_1970} seconds had elapsed since 1970.")
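For the day-level difference asked about in the question, a minimal sketch on top of the same function; the dates from the question are rewritten here to match the format string above, which is an assumption about how they would be fed in:
# Difference in whole days between 01-04-2019 and 16-04-2019.
d1 = fn_seconds_since_1970(tf.constant("2019-04-01 00:00:00 UTC"))
d2 = fn_seconds_since_1970(tf.constant("2019-04-16 00:00:00 UTC"))
diff_in_days = (d2 - d1) // tf.constant(86400, dtype=tf.int64)  # -> 15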
My two cents on the broader underlying issue: the question here is computing time differences, and we want to do these computations on tensors. Then the question becomes "What are the units of these tensors?" This is a question of granularity. The next question is "What are the data types involved?" You likely start with a string and end with a numeric. The question after that is whether there is a "native" TensorFlow function that can do this. Enter TensorFlow Addons!
Just like we try to optimize training by doing everything as tensor operations within the graph, we similarly need to optimize "getting to the graph". I have seen the way datetime would work with Python functions here, and I would do everything I could to avoid going into Python-function land, as the code becomes complex and performance suffers as well. It's a lose-lose in my opinion.
PS - This op is not yet implemented on Windows as per this, maybe because it only returns Unix timestamps :)
I had a similar problem. The issue is caused by an if-check within TFX that doesn't take date types into account. As far as I've been able to figure out, there are two options:
Preprocess the date column and cast it to an int field (e.g. by calling toordinal() on each element) before reading it into TFX (a minimal sketch is shown after this list).
Edit the TFX function that checks types so it accounts for date-like types and casts them to ordinals on the fly.
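For the first option, a minimal pandas sketch; the column name and file paths are assumptions for illustration:
import pandas as pd
# Convert the date column to ordinal ints before the data reaches TFX.
df = pd.read_csv("train.csv", parse_dates=["date"])  # "date" is an assumed column name
df["date"] = df["date"].map(lambda d: d.toordinal())
df.to_csv("train_ordinal.csv", index=False)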
You can navigate to venv/lib/python3.7/site-packages/tfx/components/example_gen/utils.py and look for the function dict_to_example. You can add a datetime check there like so:
def dict_to_example(instance: Dict[Text, Any]) -> tf.train.Example:
  """Converts dict to tf example."""
  feature = {}
  for key, value in instance.items():
    # TODO(jyzhao): support more types.
    if isinstance(value, datetime.datetime):  # <---- Check here
      value = value.toordinal()
    if value is None:
      feature[key] = tf.train.Feature()
    ...
value will become an int, and the int will be handled and cast to a Tensorflow type later on in the function.