Passing Argument to a Generator to build a tf.data.Dataset - tf.keras

I am trying to build a TensorFlow dataset from a generator. I have a list of tuples called some_list, where each tuple has an integer and some text.
When I do not pass some_list as an argument to the generator, the code works fine:
import tensorflow as tf
import random
import numpy as np
some_list=[(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
           (5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]
def text_gen1():
    random.shuffle(some_list)
    size=len(some_list)
    i=0
    while True:
        yield some_list[i][0],some_list[i][1]
        i+=1
        if i>size:
            i=0
            random.shuffle(some_list)
#Not passing any argument
tf_dataset1 = tf.data.Dataset.from_generator(text_gen1,output_types=(tf.int32,tf.string),
                                             output_shapes = ((),()))
for count_batch in tf_dataset1.repeat().batch(3).take(2):
    print(count_batch)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([7, 1, 2])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Seven', b'One', b'Two'], dtype=object)>)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([3, 5, 4])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Three', b'Five', b'Four'], dtype=object)>)
However, when I try to pass some_list as an argument, the code fails
def text_gen2(file_list):
    random.shuffle(file_list)
    size=len(file_list)
    i=0
    while True:
        yield file_list[i][0],file_list[i][1]
        i+=1
        if i>size:
            i=0
            random.shuffle(file_list)
tf_dataset2 = tf.data.Dataset.from_generator(text_gen2,args=[some_list],output_types=
                                             (tf.int32,tf.string),output_shapes = ((),()))
for count_batch in tf_dataset2.repeat().batch(3).take(2):
    print(count_batch)
ValueError: Can't convert Python sequence with mixed types to Tensor.
I noticed that when I try to pass a list of integers as an argument, the code works. However, a list of tuples seems to make it crash. Can someone shed some light on this?

The problem is exactly what the error says: you cannot have heterogeneous data types (int and str) in the same tf.Tensor. I made a few changes and came up with the code below.
Separate your some_list into two lists using zip(), i.e. int_list and str_list, and make your generator function accept those two lists.
I don't understand why you're manually shuffling things within the generator. You can do it in a cleaner way using tf.data.Dataset.shuffle().
import tensorflow as tf
import random
import numpy as np
some_list=[(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
           (5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]
def text_gen2(int_list, str_list):
    for x, y in zip(int_list, str_list):
        yield x, y
tf_dataset2 = tf.data.Dataset.from_generator(
    text_gen2,
    args=list(zip(*some_list)),
    output_types=(tf.int32,tf.string),output_shapes = ((),())
)
i = 0
for count_batch in tf_dataset2.repeat().batch(4).shuffle(buffer_size=6):
    print(count_batch)
    i += 1
    if i > 10:
        break
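If the whole list fits comfortably in memory, a generator may not even be needed. The following is a minimal sketch (not part of the original answer, and assuming the data fits in memory) that builds an equivalent dataset with tf.data.Dataset.from_tensor_slices and lets tf.data handle the shuffling:
import tensorflow as tf
# split the (int, text) pairs into two homogeneous sequences
ints, texts = zip(*some_list)
tf_dataset3 = tf.data.Dataset.from_tensor_slices((list(ints), list(texts)))
tf_dataset3 = tf_dataset3.shuffle(buffer_size=len(some_list)).repeat().batch(3)
for count_batch in tf_dataset3.take(2):
    print(count_batch)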

Related

Why am I getting 'isinstance': Cannot determine Numba type?

I am new with Numba. I am trying to accelerate a pretty complicated solver. However, I keep getting an error such as
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend) Untyped global name 'isinstance': Cannot determine Numba type of <class 'builtin_function_or_method'>
I wrote a small example to reproduce the same error:
import numba
import numpy as np
from numba import types
from numpy import zeros_like, isfinite
from numpy.linalg import solve
from numpy.random import uniform
@numba.njit(parallel=True)
def foo(A_, b_, M1=None, M2=None):
    x_ = zeros_like(b_)
    r = b_ - A_.dot(x_)
    flag = 1
    if isinstance(M1, types.NoneType):  # Error here
        y = r
    else:
        y = solve(M1, r)
    if not isfinite(y).any():
        flag = 2
    if isinstance(M2, types.NoneType):
        z = y
    else:
        z = solve(M2, y)
    if not isfinite(z).any():
        flag = 2
    return z, flag
N = 10
tmp = np.random.rand(N, N)
A = np.dot(tmp, tmp.T)
x = np.zeros((N, 1), dtype=np.float64)
b = np.vstack([uniform(0.0, 1.0) for i in range(N)])
X_1, info = foo(A, b)
Also if I change the decorator to generated_jit() I get the following error:
r = b_ - A_.dot(x_)
AttributeError: 'Array' object has no attribute 'dot'
Numba compiles the function and requires every variable to be statically typed. This means each variable has exactly one type: a variable cannot be both NoneType and something else, as opposed to CPython, which is dynamically typed. Dynamic typing is also a major source of CPython's slowdown. Thus, using isinstance in nopython-JITed Numba functions does not make much sense; in fact, this built-in function is not supported.
That being said, Numba supports optional arguments by specifying optional(ArgumentType) in the signature (note that the resulting type of the variable is optional(ArgumentType), not ArgumentType nor NoneType). You can then test whether the argument is set using if yourArgument is None:. I do not know what the types of M1 and M2 are in your code, but they need to be declared explicitly in the signature as optional arguments.
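As an illustration only, here is a minimal sketch of such a signature. The concrete types are assumptions (a contiguous 1-D float64 vector r and an optional contiguous 2-D float64 matrix M1), not something the question specifies, and np.linalg.solve inside nopython code requires SciPy for the LAPACK bindings:
import numpy as np
from numba import njit, types
# Assumed types: adjust to the actual shapes/dtypes of your data.
sig = types.float64[:](types.float64[::1], types.optional(types.float64[:, ::1]))
@njit(sig)
def precondition(r, M1):
    # In nopython mode, test an optional argument with "is None", not isinstance().
    if M1 is None:
        return r.copy()
    return np.linalg.solve(M1, r)
Both precondition(r, None) and precondition(r, M) then compile against the same signature.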

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names statically in transform_df(). Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
    @transforms_df(
        my_input = Input(p),
        Output(p+"_output")
    )
    def my_compute_function(my_input):
        return my_input
    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should work for you with some changes. This is an example of a dynamic dataset built from JSON files; your situation may be only a little different. It is a generalized way of handling dynamic JSON input datasets and should be adaptable to any dynamic input file type, or to any dataset internal to Foundry that you can specify. The generic example works on a set of JSON files uploaded to a dataset node in the platform and is fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging
def transform_generator():
    transforms = []
    transf_dict = {}  ## enter your dynamic mappings here ##
    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext
            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                    file_dates.append(data)
            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)
        transforms.append(update_set)
    return transforms
TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms will determine your inputs and outputs at CI time. This means that you can have python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. This means that you can't have an increment or special logic to generate different paths whenever the dataset builds.
With these two premises, like in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
    @transforms_df(
        my_input = Input(path),
        Output(f"{path}_output")
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform to execute the union. You'll need to pass multiple inputs to it, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))
@transforms_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to
        # check the type. You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)
    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
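As a small usage note (the dataframe names below are placeholders, not from the linked answer), customUnion can be folded over any number of dataframes:
from functools import reduce
# df_a, df_b, df_c are placeholder Spark DataFrames with differing schemas
unioned = reduce(customUnion, [df_a, df_b, df_c])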
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce
datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}
@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
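Combining the two ideas, the reduce used in the transform above can pass allowMissingColumns=True so that the dynamically collected inputs may have drifting schemas. A sketch, assuming Spark 3.1+:
from functools import reduce
def union_all(dataframes):
    # Union an iterable of Spark DataFrames whose schemas may differ;
    # columns missing on one side are filled with nulls (Spark 3.1+).
    return reduce(
        lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True),
        dataframes
    )
Inside the transform, unioned_df = union_all(inputs.values()) would replace the plain unionByName reduce.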

How to write a flexible multiple exponential fit

I'd like to write a more or less universal fit function for the general function
$f(t) = \sum_i a_i \exp(-t/\tau_i)$
for some data I have.
Below is an example code for a biexponential function, but I would like to be able to fit a monoexponential or a triexponential function with the smallest code adaptations possible.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
t = np.linspace(0, 10, 100)
a_1 = 1
a_2 = 1
tau_1 = 5
tau_2 = 1
data = 1*np.exp(-t/5) + 1*np.exp(-t/1)
data += 0.2 * np.random.normal(size=t.size)
def func(t, a_1, tau_1, a_2, tau_2):  # plus more exponential functions
    return a_1*np.exp(-t/tau_1)+a_2*np.exp(-t/tau_2)
popt, pcov = curve_fit(func, t, data)
print(popt)
plt.plot(t, data, label="data")
plt.plot(t, func(t, *popt), label="fit")
plt.legend()
plt.show()
In principle I thought of redefining the function to a general form
def func(t, a, tau):  # with a and tau as lists
    tmp = 0
    for i in range(len(a)):
        tmp += a[i]*np.exp(-t/tau[i])
    return tmp
and passing the arguments to curve_fit in the form of lists or tuples. However I get a TypeError as shown below.
TypeError: func() takes 4 positional arguments but 7 were given
Is there any way to rewrite the code so that the input parameters passed to curve_fit alone "determine" the degree of the multiexponential function? So that passing
a = (1)
results in a monoexponential function whereas passing
a = (1, 2, 3)
results in a triexponential function?
Regards
Yes, that can be done easily with NumPy broadcasting:
def func(t, a, taus):  # plus more exponential functions
    a = np.array(a)[:, None]
    taus = np.array(taus)[:, None]
    return (a*np.exp(-t/taus)).sum(axis=0)
func accepts two lists, converts them into 2-D np.arrays, computes a matrix with all the exponentials and then sums it up. Example:
t=np.arange(100).astype(float)
out=func(t,[1,2],[0.3,4])
plt.plot(out)
Keep in mind a and taus must be the same length, so sanitize your inputs as you see fit. Or you could also directly pass np.arrays instead of lists.
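Note that func now takes sequences, while curve_fit expects scalar parameters, so to actually fit with it you still need a thin wrapper whose parameter count is fixed by the initial guess. A minimal sketch (make_model and n_exp are illustrative names, not part of the original answer):
from scipy.optimize import curve_fit
import numpy as np
def make_model(n_exp):
    # Wrap func so curve_fit sees 2*n_exp scalar parameters:
    # the first n_exp are amplitudes, the rest are time constants.
    def model(t, *params):
        return func(t, params[:n_exp], params[n_exp:])
    return model
n_exp = 3                # triexponential fit
p0 = np.ones(2 * n_exp)  # p0 fixes the number of fit parameters
popt, pcov = curve_fit(make_model(n_exp), t, data, p0=p0)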

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
    print(value[1])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b + 1 has to be read completely and converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np
class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    calculating them when an index is requested
    """
    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        arr.toApply = []
        return arr
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ## Convert all arguments to ndarray, otherwise arguments
        # of type Defered will cause infinite recursion
        # also store self as None, to be replaced later on
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)
        ## Store function to apply and necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self
    def __getitem__(self, idx):
        ## Get index and convert to regular array
        sub = self.view(np.ndarray).__getitem__(idx)
        ## Apply stored actions
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
        return sub
This will fail if modifications are made to it that don't use NumPy's universal functions. For instance, percentile and median aren't based on ufuncs, and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or that applies an index to a substantial part of it, the entire array will be loaded.
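A hypothetical usage sketch (not part of the original answer) of how the wrapper above is meant to be used:
a = np.load("memmap.npy", mmap_mode='r')
d = Defered(a)
d = d + 1    # the addition is recorded, not evaluated yet
print(d[1])  # only now is element 1 read from disk and the +1 applied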
This is just how Python works. By default, NumPy operations return a new array, so b never exists as a memmap; it is created when + is called on a.
There are a couple of ways to work around this. The simplest is to do all operations in place:
a += 1
This requires opening the memory-mapped array for reading and writing:
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped:
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)
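Putting those pieces together, a hedged end-to-end sketch (file names are illustrative, and note that np.memmap writes a raw buffer without the .npy header):
import numpy as np
a = np.load("a.npy", mmap_mode='r')                              # read-only memmap
b = np.memmap("b.dat", mode='w+', dtype=a.dtype, shape=a.shape)  # writable memmap
np.add(a, 2, out=b)  # the result is written directly into the mapped file
b.flush()            # make sure everything reaches disk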

Including time as an explicit variable in constraint in a Pyomo Model

I am using Pyomo to model a semi-batch reaction.
Consider an ODE system that describes a semi-batch reactor where one of the reactants is fed at a given volume flow for t1 units of time; the reaction goes on until t_end, and obviously t1 < t_end.
To specify the stop in the flow, I can use a conditional rule (assume t1 = 3.5*60):
def _vol_flow_in_schedule(mod,t):
    if t<=3.5*60:
        return mod.vol_flow_in[t] == (12.3/1000)/(3.5*60)
    else:
        return mod.vol_flow_in[t] == 0
m1.vol_flow_in_schedule = Constraint(m1.time,rule=_vol_flow_in_schedule)
which will create a discontinuity (and then my model does not converge). What I want to do is use a sigmoidal function that will transition the flow to zero without a discontinuity.
To implement the sigmoidal though I need to refer to the time variable itself.
The below MATLAB code gives me the result I want:
t=[0:1:500];
acc=2; %Acceleration parameter, higher values yields sharper change.
time_of_step=3.5*60;
init_value = (12.3/1000)/(3.5*60);
end_value = 0;
sigmoidal=(init_value+(end_value-init_value)/2)...
+((end_value-init_value)/2)*atan((t-time_of_step)*acc)/atan(max(t));
This implementation, however, needs the time variable explicitly in the function. How can I access the time variable inside the Pyomo rule? I tried the below, but I get a "Cannot treat the scalar component 't_of_step' as an array" error:
m1.init_value = Param(initialize = (12.3/1000)/(3.5*60))
m1.end_value = Param(initialize = 0)
m1.t_of_step = Param(initialize = 210)
m1.acc = Param(initialize = 5)
.
.
def _vol_flow_sigmoidal (mod,t):
    return mod.vol_flow_in[t] == (mod.init_value+(mod.end_value-mod.init_value)/2)+((mod.end_value-mod.init_value)/2)*atan((t-mod.t_of_step)*mod.acc)/atan(1500)
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
Hopefully I've described clearly what I am after. Any hints are most welcome.
Thanks!
Sal
How are you declaring the m1.time index?
My guess is that you are using a NumPy array to initialize the m1.time index. There is a known problem in Pyomo (see Issue #31) where the NumPy operator overloading and the Pyomo operator overloading end up fighting with each other (basically, NumPy gets fooled into thinking Pyomo scalars are actually indexed and attempts to treat them like arrays).
I was able to reproduce the error with the following complete example:
# pyomo 4.4.1
from pyomo.environ import *
import numpy as np
m1 = ConcreteModel()
m1.time = Set(initialize=np.array([0,100,200,300,400,500]))
m1.vol_flow_in = Var(m1.time)
m1.init_value = Param(initialize = (12.3/1000)/(3.5*60))
m1.end_value = Param(initialize = 0)
m1.t_of_step = Param(initialize = 210)
m1.acc = Param(initialize = 5)
def _vol_flow_sigmoidal (mod,t):
    return mod.vol_flow_in[t] == (mod.init_value+(mod.end_value-mod.init_value)/2)\
        +((mod.end_value-mod.init_value)/2)*atan((t-mod.t_of_step)*mod.acc)/atan(1500)
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
There are two alternatives that do work, both based on avoiding NumPy arrays when initializing Pyomo Sets. You can either avoid NumPy entirely:
m1.time = Set(initialize=[0,100,200,300,400,500])
or explicitly cast the NumPy array to a list:
timeArray = np.array([0,100,200,300,400,500])
m1.time = Set(initialize=timeArray.tolist())
Finally, for completeness, two other notes:
This also applies to initializing ContinuousSet objects in pyomo.dae
You will see the same behavior even if you avoid the explicit Pyomo Set declaration. That is, the following will also generate the error:
m1.time = np.array([0,100,200,300,400,500])
# ...
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
This is because Pyomo will quietly create the Set object for you behind the scenes as m1.vol_flow_sigmoidal_index and then use that Set to index the Constraint.