Use own dataset in Denoising Diffusion Implicit Models - tf.keras

Can you tell me how I can create and use my own dataset in this code?
import tensorflow as tf
import tensorflow_datasets as tfds

def prepare_dataset(split):
    # the validation dataset is shuffled as well, because data order matters
    # for the KID estimation
    return (
        tfds.load(dataset_name, split=split, shuffle_files=True)
        .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )
# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
It always redirects me to use TensorFlow Datasets. The code is from the Keras notebook 'Denoising Diffusion Implicit Models'.
I tried following the Keras guide on how to create your own dataset, but I'm not sure how to implement it in this code. I also tried this way of creating your own dataset, but I don't know whether it can be plugged into the code above.
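One way to adapt this (a sketch under my own assumptions about folder layout, image format, and constant values, not the notebook's official answer) is to keep the same tf.data pipeline but feed it from your own image files instead of tfds.load:

import tensorflow as tf

image_size = 64            # assumed target resolution
batch_size = 64
dataset_repetitions = 5

def preprocess_image(path):
    # read, decode, resize, and scale one image to [0, 1]
    image = tf.io.read_file(path)
    image = tf.image.decode_png(image, channels=3)
    image = tf.image.resize(image, (image_size, image_size), antialias=True)
    return tf.cast(image, tf.float32) / 255.0

def prepare_dataset(file_pattern):
    # same pipeline as the notebook, but starting from a glob of local files
    return (
        tf.data.Dataset.list_files(file_pattern, shuffle=True)
        .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )

train_dataset = prepare_dataset("my_images/train/*.png")
val_dataset = prepare_dataset("my_images/val/*.png")

Alternatively, tf.keras.utils.image_dataset_from_directory can build a batched dataset from a folder directly; you would then map the same scaling over it and apply the repeat and shuffle steps yourself.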

Related

How to get the prediction of a model in PySpark

I have developed a clustering model using PySpark, and I want to predict the class of a single vector. Here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.config("spark.sql.warehouse.dir",
                                    "file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V and I want to predict which class it belongs to.
I tried a method that I found in this link http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession, not a SparkContext.
I see that you have covered the most basic steps of model creation. What you still need is to apply your k-means model to the vector you want to cluster (just as you did with model.transform(df_kmeans) above) and then read off the prediction; that is, redo the same work done in that line, but on the new vector of features V. To understand this better, I invite you to read this posted answer on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to the use of SparkSession rather than SparkContext, as those are just entry points to the Spark APIs; you can also access a SparkContext through a SparkSession, since the two were unified in Spark 2.0. The PySpark k-means works like the scikit-learn one; the only difference is the set of predefined functions in the Spark Python API (PySpark).
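For example, a minimal sketch of that idea (V is assumed to be a plain Python list of feature values; everything else reuses the objects defined in your code above):

# wrap the new feature vector in a one-row DataFrame with the same columns
new_df = spark.createDataFrame([tuple(V)], FEATURES_COL)
# redo the same assembling step that was used for training
new_features = vecAssembler.transform(new_df)
# the fitted model then assigns the cluster
pred = model.transform(new_features).select('prediction').first()[0]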
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.

Pretrained DenseNet/VGG16/ResNet50 + GP does not train on CIFAR10 data

I'm trying to train a hybrid model with a GP on top of a pre-trained CNN (DenseNet, VGG, and ResNet) with CIFAR10 data, mimicking the ex2 function in the GPflow documentation. But the test accuracy is always between 0.1 and 0.2, which generally means random guessing (the Wilson+2016 paper shows that a hybrid model for CIFAR10 data should reach an accuracy of 0.7). Could anyone give me a hint about what could be wrong?
I've tried the same code with simpler CNN models (2 conv layers or 4 conv layers) and both give reasonable results. I've tried different Keras applications: DenseNet121, VGG16, and ResNet50; none works. I've also tried freezing the weights in the pre-trained models; still not working.
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

def cnn_dn(output_dim):
    base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(32, 32, 3))
    bout = base_model.output
    fcl = GlobalAveragePooling2D()(bout)
    # for layer in base_model.layers:
    #     layer.trainable = False
    output = Dense(output_dim, activation='relu')(fcl)
    md = Model(inputs=base_model.input, outputs=output)
    return md
# add GP on top; reference: the ex2() function in
# https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/tailor/gp_nn.ipynb
# needs to slightly change the build-graph part because Keras variable
# sharing is not the same as in TensorFlow
# ......
## build graph
with tf.variable_scope('cnn'):
    md = cnn_dn(gp_dim)
    f_X = tf.cast(md(X), dtype=float_type)
    f_Xtest = tf.cast(md(Xtest), dtype=float_type)
# ......
## predict
res = np.argmax(sess.run(my, feed_dict={Xtest: xts}), 1).reshape(yts.shape)
correct = res == yts.astype(int)
print(np.average(correct.astype(float)))
I finally figured out that the solution is to train for more iterations. In the original code, I used just 50 iterations, as in the ex2() function for MNIST data, and that is not enough for a more complicated network and the CIFAR10 data. Adjusting some hyper-parameters (e.g. the learning rate and the activation function) also helps.
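As a rough illustration of the fix (a sketch only: opt_step, X, Y, and the batch variables stand in for the notebook's own names, which are assumptions on my part), the change amounts to running far more optimization steps than the 50 used for MNIST:

## train (sketch): raise the iteration count well beyond ex2()'s 50
iterations = 5000  # instead of 50; tune together with the learning rate
for i in range(iterations):
    sess.run(opt_step, feed_dict={X: xtr_batch, Y: ytr_batch})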

Is there any plan to implement complex survey design within Spark's MLLIB package?

I'm working to implement a logistic regression in PySpark that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for complex survey designs involving clusters, strata, and sample weights.
There are some avenues out there for at least getting the model into Python: for example, I was able to get a close match of both the coefficients and the standard errors using the statsmodels package, based on this research project on GitHub.
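For reference, a minimal sketch of what a weighted logit looks like in statsmodels (pdf is assumed to be a pandas DataFrame; the column names 'df' and 'wgt_panel' mirror the Spark code below, and treating the weights as freq_weights is my assumption — this matches a plain weighted logit, not surveylogistic's design-based variance):

import statsmodels.api as sm

# binomial GLM with the default logit link, plus sampling weights
exog = sm.add_constant(pdf[X_features_list])
glm = sm.GLM(pdf['df'], exog,
             family=sm.families.Binomial(),
             freq_weights=pdf['wgt_panel'])
result = glm.fit()
print(result.summary())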
However, my data is big and so I'd like to take advantage of Spark's distributed capabilities through the MLLIB package. For example, the current setup to run the logit in Spark is:
import pyspark.ml.feature as ft
from pyspark.ml import Pipeline
from pyspark.ml.regression import GeneralizedLinearRegression

featuresCreator = ft.VectorAssembler(
    inputCols=X_features_list,
    outputCol="features")
glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit",
                                           maxIter=25, regParam=0,
                                           labelCol='df',
                                           weightCol='wgt_panel')
pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(encoded_df_nonan)
The "weightcol" works for just simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.

Fitting Spark ML Kmeans for subsets or groups of data

I've got a Dataset where each row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a k-means model per class with Spark MLlib. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it's how Spark does it in the fit method of the OneVsRest classifier: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
import org.apache.spark.ml.clustering.KMeans

val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect
val kmeans = new KMeans().setK(5)
val models = classes.par.map { cls =>  // `class` is a reserved word in Scala, so rename the binder
  val training_data = unpacked_data.filter($"label" === cls)
  val model = kmeans.fit(training_data)
  (cls, model)
}
It seems that the KMeans fit method needs the data to be a Dataset with one row per vector, which suggests normalizing/exploding the data, but what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row, and/or group on the label, so that I can use just those points without explicitly filtering the entire dataset for every class I want to build a model for?
PS- I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.

Using hidden activations in loss function

I want to create a custom loss function for a double-input double-output model in Keras that:
minimizes the reconstruction error of two autoencoders;
maximizes the correlation of the bottleneck features of the autoencoders.
For this I need to pass to the loss function:
both inputs;
both outputs / reconstructions;
output of intermediate layers for both (hidden activations).
I know I can pass both inputs and outputs to Model, but am struggling to find a way to pass the hidden activations.
I could create two new Models that output the intermediate layers and pass those to the loss, like:
intermediate_layer_model1 = Model(input=input1, output=autoencoder.get_layer('encoded1').output)
intermediate_layer_model2 = Model(input=input2, output=autoencoder.get_layer('encoded2').output)
autoencoder.compile(optimizer='adadelta', loss=loss(intermediate_layer_model1, intermediate_layer_model2))
But still, I would need to find a way to match the y_true in loss to the correct intermediate model.
What is the right way to approach this?
Edit
Here's an approach that I think should work. Simplified:
from keras.layers import Input, Dense, merge
from keras.models import Model

# autoencoder 1
input1 = Input(shape=(input_dim,))
encoded1 = Dense(encoding_dim, activation='relu', name='encoded1')(input1)
decoded1 = Dense(input_dim, activation='sigmoid', name='decoded1')(encoded1)

# autoencoder 2
input2 = Input(shape=(input_dim,))
encoded2 = Dense(encoding_dim, activation='relu', name='encoded2')(input2)
decoded2 = Dense(input_dim, activation='sigmoid', name='decoded2')(encoded2)

# merge the two encodings (Keras 1 merge API)
merge_layer = merge([encoded1, encoded2], mode='concat', name='merge', concat_axis=1)

model = Model(input=[input1, input2], output=[decoded1, decoded2, merge_layer])
model.compile(optimizer='rmsprop', loss={
    'decoded1': 'binary_crossentropy',
    'decoded2': 'binary_crossentropy',
    'merge': correlation,
})
Then in correlation I can split y_pred and do the calculations.
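For instance, a minimal sketch of such a correlation loss (my own illustration, not from the post; it assumes y_pred is the concatenation of the two encodings, each of width encoding_dim):

from keras import backend as K

def correlation(y_true, y_pred):
    # split the concatenated encodings back into their two halves
    h1 = y_pred[:, :encoding_dim]
    h2 = y_pred[:, encoding_dim:]
    # center each half across the batch
    h1 = h1 - K.mean(h1, axis=0)
    h2 = h2 - K.mean(h2, axis=0)
    # negative mean Pearson correlation over the feature dimensions,
    # so that minimizing the loss maximizes the correlation
    num = K.sum(h1 * h2, axis=0)
    den = K.sqrt(K.sum(K.square(h1), axis=0) * K.sum(K.square(h2), axis=0)) + K.epsilon()
    return -K.mean(num / den)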
How about:
Defining a single model with multiple outputs (be sure that you name the coding and reconstruction layers properly):
duo_model = Model(input=input, output=[coding_layer, reconstruction_layer])
Compiling your model with two different losses (or even performing a loss reweighting):
duo_model.compile(optimizer='rmsprop',
                  loss={'coding_layer': correlation_loss,
                        'reconstruction_layer': 'mse'})
Taking your final models as:
encoder = Model(input=input, output=[coding_layer])
autoencoder = Model(input=input, output=[reconstruction_layer])
After proper compilation this should do the job.
When it comes to defining a proper correlation loss function, there are two cases:
When the coding layer and your output layer have the same dimension, you can easily use the predefined cosine_proximity function from the Keras library.
When the coding layer has a different dimensionality, you should first find an embedding of the coding vector and the reconstruction vector into the same space, and then compute the correlation there. Remember that this embedding should either be a Keras layer/function or a Theano/TensorFlow operation (depending on which backend you are using). Of course, you can compute both the embedding and the correlation as part of one loss function. A sketch of this case follows below.
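A hedged sketch of the second case (layer and variable names follow the compile example above and are my own assumptions): embed the coding into the reconstruction's space with an extra Dense layer inside the model, then reuse the predefined cosine proximity as the correlation term:

# project the coding into the input space with an extra Dense layer
projected = Dense(input_dim, name='projected_coding')(coding_layer)
duo_model = Model(input=input, output=[projected, reconstruction_layer])
# cosine_proximity then acts as the correlation term between the projected
# coding and whatever target you feed for that output at fit time
duo_model.compile(optimizer='rmsprop',
                  loss={'projected_coding': 'cosine_proximity',
                        'reconstruction_layer': 'mse'})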