pyspark - how to cross validate several ML algorithms - pyspark

i want to be able to choose the best fit algorithm with it's best params .
how can i do it in one go , without creating few pipelines for each algorithm , and without doing checks in the cross validation for params that are not relevant for specific algorithm ?
i.e i want to check how logistic regression perform against randomforest.
my code is :
lr = LogisticRegression().setFamily("multinomial")
# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer,labelIndexer2, assembler, lr , labelconverter])
paramGrid = ParamGridBuilder() \
.addGrid(lr.regParam, [0.1, 0.3, 0.01]) \
.addGrid(lr.elasticNetParam, [0.1, 0.8, 0.01]) \
.addGrid(lr.maxIter, [10, 20, 25]) \
.build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(),
numFolds=2) # use 3+ folds in practice
# Train model. This also runs the indexer.
model = crossval.fit(trainingData)

I've written a quick and dirty workaround in Python/Pyspark. It is a bit primitive (it doesn't have a corresponding Scala class) and I think it lacks the save/load capabilities but it might be a starting point for your case. Eventually it might become a new functionality in Spark it would be nice to have.
The idea is to have a special pipeline stage that acts as a switch between different objects, and maintains a dictionary to refer to them with strings. The user can enable one or another by name. They can be either Estimators, Transformers or mix both - the user is responsible for keeping the coherence in the pipeline (doing things that make sense, at her own risk). The parameter with the name of the enabled stage can be included in the grid to be cross-validated.
from pyspark.ml.wrapper import JavaEstimator
from pyspark.ml.base import Estimator, Transformer, Param, Params, TypeConverters
class PipelineStageChooser(JavaEstimator):
selectedStage = Param(Params._dummy(), "selectedStage", "key of the selected stage in the dict",
typeConverter=TypeConverters.toString)
stagesDict = None
_paramMap = {}
def __init__(self, stagesDict, selectedStage):
super(PipelineStageChooser, self).__init__()
self.stagesDict = stagesDict
if selectedStage not in self.stagesDict.keys():
raise KeyError("selected stage {0} not found in stagesDict".format(selectedStage))
if isinstance(self.stagesDict[selectedStage], Transformer):
self.fittedSelectedStage = self.stagesDict[selectedStage]
for stage in stagesDict.values():
if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
raise TypeError("Cannot recognize a pipeline stage of type %s." % type(stage))
self._set(selectedStage=selectedStage)
self._java_obj = None
def fit(self, dataset, params=None):
selectedStage_str = self.getOrDefault(self.selectedStage)
if isinstance(self.stagesDict[selectedStage_str], Estimator):
return self.stagesDict[selectedStage_str].fit(dataset, params = params)
elif isinstance(self.stagesDict[selectedStage_str], Transformer):
return self.stagesDict[selectedStage_str]
Use example:
count_vectorizer = CountVectorizer() # set params
hashing_tf = HashingTF() # set params
chooser = PipelineStageChooser(stagesDict={"count_vectorizer": count_vectorizer,
"hashing_tf": hashing_tf},
selectedStage="count_vectorizer")
pipeline = Pipeline(stages = [chooser])
# Test which among CountVectorizer or HashingTF works better to create features
# Could be used as well to decide between different ML algorithms
paramGrid = ParamGridBuilder() \
.addGrid(chooser.selectedStage, ["count_vectorizer", "hashing_tf"])\
.build()

Related

How to union multiple dynamic inputs in Palantir Foundry?

I want to Union multiple datasets in Palantir Foundry, the name of the datasets are dynamic so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
#transforms_df(
my_input = Input(p),
Output(p+"_output")
)
def my_compute_function(my_input):
return my_input
if union_df is None:
union_df = my_compute_function
else:
union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should be able to work for you with some changes, this is an example of dynamic dataset with json files, your situation would maybe be only a little different. Here is a generalized way you could be doing dynamic json input datasets that should be adaptable to any type of dynamic input file type or internal to foundry dataset that you can specify. This generic example is working on a set of json files uploaded to a dataset node in the platform. This should be fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging
def transform_generator():
transforms = []
transf_dict = {## enter your dynamic mappings here ##}
for value in transf_dict:
#transform(
out=Output(' path to your output here '.format(val=value)),
inpt=Input(" path to input here ".format(val=value)),
)
def update_set(ctx, inpt, out):
spark = ctx.spark_session
sc = spark.sparkContext
filesystem = list(inpt.filesystem().ls())
file_dates = []
for files in filesystem:
with inpt.filesystem().open(files.path) as fi:
data = json.load(fi)
file_dates.append(data)
logging.info('info logs:')
logging.info(file_dates)
json_object = json.dumps(file_dates)
df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
df_2 = df_2.withColumn('upload_date', F.current_date())
df_2.drop_duplicates()
out.write_dataframe(df_2)
transforms.append(update_logs)
return transforms
TRANSFORMS = transform_generator()
So this question breaks down in two questions.
How to handle transforms with programatic input paths
To handle transforms with programatic inputs, it is important to remember two things:
1st - Transforms will determine your inputs and outputs at CI time. Which means that you can have python code that generates transforms, but you cannot read paths from a dataset, they need to be hardcoded into your python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. Meaning that you can't have an increment or special logic to generate different paths whenever the dataset builds.
With these two premises, like in your example or #jeremy-david-gamet 's (ty for the reply, gave you a +1) you can have python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
#transforms_df(
my_input = Input(path),
Output(f"{path}_output")
)
def my_compute_function(my_input):
return my_input
However to union them you'll need a second transform to execute the union, you'll need to pass multiple inputs, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))
#transforms_df(*all_args)
def my_compute_function(*args):
input_dfs = []
for arg in args:
# there are other arguments like ctx in the args list, so we need to check for type. You can also use kwargs for more determinism.
if isinstance(arg, pyspark.sql.DataFrame):
input_dfs.append(arg)
# now that you have your dfs in a list you can union them
# Note I didn't test this code, but it should be something like this
...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
cols1 = df1.columns
cols2 = df2.columns
total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
def expr(mycols, allcols):
def processCols(colname):
if colname in mycols:
return colname
else:
return lit(None).alias(colname)
cols = map(processCols, allcols)
return list(cols)
appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
return appended
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce
datasets = [
'dataset1',
'dataset2',
'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
**{'output': Output('output/folder/path/unioned_dataset')},
**inputs
}
#transform_df(**kwargs)
def my_compute_function(**inputs):
unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)

I have an issue with regex extract with multiple matches

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ" . This string is part of a column X in spark dataframe. Though I am able to test my regex code to extract 60 ML and 0.5 ML in regex validator, I am not able to extract it using regexp_extract as it targets only 1st matches. Hence I am getting only 60 ML.
Can you suggest me the best way of doing it using UDF ?
Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')
# Define the function you want to return
def extract(s)
all_matches = re.findall(r'\d+(?:.\d+)? ML', s)
return all_matches
# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))
# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit over native DataFrame operations. After thinking about it a little more, here is another way to do it without using a UDF. The general idea is replace all the text that isn't what you want with commas, then split on comma to create your array of final values. If you only want the numbers you can update the regex's to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))

How to perform cross validation without using paramgrid Builder in pyspark?

I want to do a sklearn kind of cross validation in pyspark without using ParamGrid Builder.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(regParam=0.1,elasticNet=0.2,maxIter=100)
crossval = CrossValidator(estimator=lr,
evaluator=BinaryClassificationEvaluator(),
numFolds=2)
Is it possible to perform the cross validation in this way without using paramGrid Builder? My usecase is I want to pass the parameters into the linear regression class as arguments but not as paramGrid object.
One easy solution would be to only provide the parameters you want to use in your ParamGrid:
paramGrid = ParamGridBuilder() \
.addGrid(lr.regParam, [0.1]) \
.addGrid(lr.elasticNet, [0.2]) \
.addGrid(lr.maxIter, [100])
.build()
crossval = CrossValidator(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=2)
You can always code your own version of K-fold, splitting your dataset in K parts with :
fold1, fold2 = df.randomSplit([0.5,0.5])
folds = [fold1, fold2]
res = []
for fold in folds:
train, test = fold.randomSplit([0.80,0.20])
model.train(train)
res.append(model.evaluate(test))
do_what_you_want(res)

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
print(value[1])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np
class Defered(np.ndarray):
"""
An array class that deferrs calculations applied to it, only
calculating them when an index is requested
"""
def __new__(cls, arr):
arr = np.asanyarray(arr).view(cls)
arr.toApply = []
return arr
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
## Convert all arguments to ndarray, otherwise arguments
# of type Defered will cause infinite recursion
# also store self as None, to be replaced later on
newinputs = []
for i in inputs:
if i is self:
newinputs.append(None)
elif isinstance(i, np.ndarray):
newinputs.append(i.view(np.ndarray))
else:
newinputs.append(i)
## Store function to apply and necessary arguments
self.toApply.append((ufunc, method, newinputs, kwargs))
return self
def __getitem__(self, idx):
## Get index and convert to regular array
sub = self.view(np.ndarray).__getitem__(idx)
## Apply stored actions
for ufunc, method, inputs, kwargs in self.toApply:
inputs = [i if i is not None else sub for i in inputs]
sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance percentile and median aren't based on ufuncs, and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or applies an index to substantial amounts the entire array will be loaded.
This is just how python works. By default numpy operations return a new array, so b never exists as a memmap - it is created when + is called on a.
There's a couple of ways to work around this. The simplest is to do all operations in place,
a += 1
This requires loading the memory mapped array for reading and writing,
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mmap+mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)

updating subset of parameters in dynet

Is there a way to update a subset of parameters in dynet? For instance in the following toy example, first update h1, then h2:
model = ParameterCollection()
h1 = model.add_parameters((hidden_units, dims))
h2 = model.add_parameters((hidden_units, dims))
...
for x in trainset:
...
loss.scalar_value()
loss.backward()
trainer.update(h1)
renew_cg()
for x in trainset:
...
loss.scalar_value()
loss.backward()
trainer.update(h2)
renew_cg()
I know that update_subset interface exists for this and works based on the given parameter indexes. But then it is not documented anywhere how we can get the parameter indexes in dynet Python.
A solution is to use the flag update = False when creating expressions for parameters (including lookup parameters):
import dynet as dy
import numpy as np
model = dy.Model()
pW = model.add_parameters((2, 4))
pb = model.add_parameters(2)
trainer = dy.SimpleSGDTrainer(model)
def step(update_b):
dy.renew_cg()
x = dy.inputTensor(np.ones(4))
W = pW.expr()
# update b?
b = pb.expr(update = update_b)
loss = dy.pickneglogsoftmax(W * x + b, 0)
loss.backward()
trainer.update()
# dy.renew_cg()
print(pb.as_array())
print(pW.as_array())
step(True)
print(pb.as_array()) # b updated
print(pW.as_array())
step(False)
print(pb.as_array()) # b not updated
print(pW.as_array())
For update_subset, I would guess that the indices are the integers suffixed at the end of parameter names (.name()).
In the doc, we are supposed to use a get_index function.
Another option is: dy.nobackprop() which prevents the gradient to propagate beyond a certain node in the graph.
And yet another option is to zero the gradient of the parameter that do not need to be updated (.scale_gradient(0)).
These methods are equivalent to zeroing the gradient before the update. So, the parameter will still be updated if the optimizer uses its momentum from previous training steps (MomentumSGDTrainer, AdamTrainer, ...).