Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks - pyspark

I'm using Azure Databricks + Hyperopt + MLflow for hyperparameter tuning on a small dataset. The job appears to run, and I get output in MLflow, but it ends with the following error message:
Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks
Here is my code, with some information redacted:
from pyspark.sql import SparkSession
# spark session initialization
spark = (SparkSession.builder.getOrCreate())
sc = spark.sparkContext
# Data Processing
import pandas as pd
import numpy as np
# Hyperparameter Tuning
from hyperopt import fmin, tpe, hp, anneal, Trials, space_eval, SparkTrials, STATUS_OK
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
# Modeling
from sklearn.ensemble import RandomForestClassifier
# cleaning
import gc
# tracking
import mlflow
# track runtime
from datetime import date, datetime
# notebook settings \ variable settings
n_splits = #
n_repeats = #
max_evals = #
dfL = pd.read_csv("/my/data/loc/mydata.csv")
x_train = dfL[['f1','f2','f3']]
y_train = dfL['target']
def define_model(params):
    model = RandomForestClassifier(n_estimators=int(params['n_estimators']),
                                   class_weight={0:params['class_0_weight'], 1:params['class_1_weight']})
    return model
space = {'n_estimators': hp.quniform('n_estimators', #, #, #),
'criterion': hp.choice('#', ['#','#']),
'max_depth': hp.quniform('max_depth', #, #, #),
'min_samples_split': hp.quniform('min_samples_split', #, #, #),
'min_samples_leaf': hp.quniform('min_samples_leaf', #, #, #),
'min_weight_fraction_leaf': hp.quniform('min_weight_fraction_leaf', #, #, #),
'max_features': hp.quniform('max_features', #, #, #),
'min_impurity_decrease': hp.quniform('min_impurity_decrease', #, #, #),
'bootstrap': hp.choice('bootstrap', [#,#]),
'random_state': hp.quniform('random_state', #, #, #),
'class_0_weight': hp.choice('class_0_weight', [#,#,#]),
'class_1_weight': hp.choice('class_1_weight', [#,#,#])}
# define hyperopt objective
def objective(params, n_splits=n_splits, n_repeats=n_repeats):
    # define model
    model = define_model(params)
    # get cv splits
    kfold = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1331)
    # define and run sklearn cv scorer
    scores = cross_val_score(model, x_train, y_train, cv=kfold, scoring='roc_auc')
    score = scores.mean()
    return {'loss': score*(-1), 'status': STATUS_OK}
spark_trials = SparkTrials(parallelism=36, spark_session=spark)
with mlflow.start_run():
    best = fmin(objective, space, algo=tpe.suggest, trials=spark_trials, max_evals=max_evals)
At the end I get:
100%|██████████| 200/200 [1:35:28<00:00, 100.49s/trial, best loss: -0.9584565527065526]
Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks
Exception: 'MLFLOW_RUN_ID'
Total Trials: 200: 200 succeeded, 0 failed, 0 cancelled.
My Azure Databricks cluster is:
6.6 ML (includes Apache Spark 2.4.5, Scala 2.11)
min 9 max 18 nodes
Am I doing something wrong or is this a bug?

This message is a known (but harmless) issue that has been fixed in MLR 7.0 (Databricks Runtime 7.0 ML). I ran the same code on a DBR 7.0 ML cluster and it works there without the message.
The message is caused by the explicit start_run() call. You don't need it: with SparkTrials, a run is started for you automatically.
So the code still works without start_run(); SparkTrials should run and log the trials for you.
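A minimal sketch of what this answer describes, assuming the same objective and space as in the question: fmin is called directly, without the surrounding mlflow.start_run() block. The helper names to_hyperopt_result and run_search are made up for illustration, and Hyperopt is imported inside run_search so that the scoring helper can be tried without a cluster.

```python
def to_hyperopt_result(scores):
    """Convert CV scores (higher is better) into the dict Hyperopt expects.

    Hyperopt minimizes 'loss', so the mean score is negated, as in the
    question's objective. "ok" is the value of hyperopt.STATUS_OK.
    """
    mean_score = sum(scores) / len(scores)
    return {"loss": -mean_score, "status": "ok"}


def run_search(objective, space, max_evals, parallelism):
    """Run fmin with SparkTrials and no explicit mlflow.start_run()."""
    # Imported lazily: these need Hyperopt and a Spark cluster to be available.
    from hyperopt import fmin, tpe, SparkTrials

    trials = SparkTrials(parallelism=parallelism)
    # Per the answer above, SparkTrials takes care of the MLflow run on Databricks.
    return fmin(objective, space, algo=tpe.suggest,
                trials=trials, max_evals=max_evals)
```

On the cluster this would be called as run_search(objective, space, max_evals, 36), with the values from the question.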


Spark vs scikit-learn

I use PySpark for traffic classification with a decision tree model and measure the time required to train it: 2 min 17 s. I then perform the same task with scikit-learn, where training takes 1 min 19 s. Why is Spark slower, given that it is supposed to perform the task in a distributed way?
This is the code for pyspark:
df = (spark.read.format("csv")\
.option('header', 'true')\
.option("inferSchema", "true")\
.load("D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv"))
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 10)
pModel = dt.fit(trainDF)
And in scikit-learn:
import warnings
path = 'D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv'
df= pd.read_csv(path)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
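For a comparison like this it helps to time only the fit call, in the same way for both libraries. Below is a minimal sketch of such a timer using only the standard library; the name timed and the stand-in workload are assumptions for illustration. Note also that the two snippets above do not use the same settings (maxDepth = 10 is set only on the Spark side), so the two timings are not directly comparable as posted.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example", timings):
    # Stand-in workload; in the question this would be dt.fit(trainDF)
    # or model.fit(X_train, y_train).
    sum(range(100_000))

print("example took {:.4f} s".format(timings["example"]))
```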

CuPy error when pushing / popping pycuda context

I am using TensorRT to perform inference with CUDA. I'd like to use CuPy to preprocess some images that I'll feed to the TensorRT engine. The preprocessing function, called my_function, works fine as long as TensorRT is not run between different calls of my_function (see code below). Specifically, the issue is not strictly related to TensorRT itself but to the fact that TensorRT inference must be wrapped by push and pop operations on the pycuda context.
In the code below, the last call to my_function raises the following error:
File "/home/ubuntu/myfile.py", line 188, in _pre_process_cuda
img = ndimage.zoom(img, scaling_factor)
File "/home/ubuntu/.local/lib/python3.6/site-packages/cupyx/scipy/ndimage/interpolation.py", line 482, in zoom
kern(input, zoom, output)
File "cupy/core/_kernel.pyx", line 822, in cupy.core._kernel.ElementwiseKernel.__call__
File "cupy/cuda/function.pyx", line 196, in cupy.cuda.function.Function.linear_launch
File "cupy/cuda/function.pyx", line 164, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 299, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
Note: the following code does not include the entire TensorRT inference code. In fact, simply pushing and popping a pycuda context is enough to trigger the error.
import numpy as np
import cv2
import time
from PIL import Image
import requests
from io import BytesIO
from matplotlib import pyplot as plt
import cupy as cp
from cupyx.scipy import ndimage
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
def my_function(numpy_frame):
    dtype = 'float32'
    img = cp.array(numpy_frame, dtype='float32')
    # print(img)
    img = ndimage.zoom(img, (0.5, 0.5, 3))
    img = (cp.array(2, dtype=dtype) / cp.array(255, dtype=dtype)) * img - cp.array(1, dtype=dtype)
    img = img.transpose((2, 0, 1))
    img = img.ravel()
    return img
# load image
url = "https://www.pexels.com/photo/109919/download/?search_query=&tracking_id=411xe21veam"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img = np.array(img)
# initialize tensorrt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
cfx = cuda.Device(0).make_context()
my_function(img) # ok
my_function(img) # ok
# ----- TENSORRT ---------
# .... tensorrt inference....
# ----- TENSORRT ---------
my_function(img) # <---- error
I even tried to do it other ways, but unfortunately with the same result:
my_function(img) # ok
my_function(img) # error
There were multiple contexts open. For instance, it seems that all of the following open a context:
import pycuda.autoinit
So if you run the three commands above, a single cfx.pop() won't be enough. You will need to call cfx.pop() three times to pop all the contexts.
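To illustrate why one pop is not enough, here is a toy model of a context stack in plain Python. It is only a sketch of the idea: ToyContextStack and the context names are made up, and none of this is the pycuda API.

```python
class ToyContextStack:
    """Toy stand-in for a per-thread context stack; NOT the pycuda API."""

    def __init__(self):
        self._stack = []

    def push(self, name):
        self._stack.append(name)

    def pop(self):
        return self._stack.pop()

    @property
    def current(self):
        # The context on top of the stack is the active one.
        return self._stack[-1] if self._stack else None


stack = ToyContextStack()
for name in ("first", "second", "third"):  # three hypothetical context creations
    stack.push(name)

stack.pop()
after_one_pop = stack.current      # "second": another context is still active
stack.pop()
stack.pop()
after_three_pops = stack.current   # None: every context has been popped
```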

How to limit FPGrowth itemsets to just 2 or 3

I am running the FPGrowth algorithm with PySpark in Python 3.6 in a Jupyter notebook. When I try to save the association rules, the generated output is huge, so I want to limit the number of consequents. Here is the code I have tried. I also changed the Spark context parameters.
Maximum Pattern Length fpGrowth (Apache) PySpark
from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
conf = SparkConf().setAppName("App")
conf = (conf.setMaster('local[*]')
.set('spark.executor.memory', '100G')
.set('spark.driver.memory', '400G')
.set('spark.driver.maxResultSize', '200G'))
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
R = Row('ID', 'items')
df=spark.createDataFrame([R(i, x) for i, x in enumerate(lol)])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.7, minConfidence=0.9)
model = fpGrowth.fit(df)
ar=model.associationRules.where(size(col('antecedent')) == 2).where(size(col('consequent')) == 1)
It gives an error:
TypeError Traceback (most recent call last)
<ipython-input-1-f90c7a9f11ae> in <module>
---> 73 ar=model.associationRules.where(size(col('antecedent')) ==
2).where(size(col('consequent')) == 1)
TypeError: 'str' object is not callable
Can someone help me solve this issue?
Here lol is a list of lists of transactions: [['a','b'],['c','a','e']....]
Python: 3.6.5
Windows 10
The discussion above, together with the linked question "'str' object is not callable TypeError", helped me resolve the problem. Accessing the functions through the module alias works:
import pyspark.sql.functions as func
model.associationRules.where(func.size(func.col('antecedent')) == 1).where(func.size(func.col('consequent')) == 1).show()
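The fix above is consistent with one of the imported names (col or size) having been reassigned to a string somewhere earlier in the notebook, though the question does not show where, so that is an assumption. The sketch below reproduces the same error with the standard library only, with math.floor standing in for the PySpark functions.

```python
from math import floor   # stands in for `from pyspark.sql.functions import col, size`
import math as m         # stands in for `import pyspark.sql.functions as func`

floor = "some string"    # an accidental reassignment shadows the imported function

try:
    floor(2.7)
except TypeError as exc:
    message = str(exc)   # the same kind of error as in the question

result = m.floor(2.7)    # the module-qualified name is unaffected by the shadowing
```

Because m.floor is looked up on the module each time, a later variable named floor cannot break it, which is why the func.size / func.col form works.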

AWS Glue job failing with OOM exception when changing column names

I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result in Athena later, so I want to replace '.' with '_' in the column names and lowercase them. To do this transformation, I convert the DynamicFrame into a Spark DataFrame. I've also seen another SO question where it turned out there is a reported problem with the AWS Glue rename field transform, so I've stayed away from that.
I've tried a couple of things, including adding a load limit size of 50 MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, and using Spark SQL to change the names; nothing has worked. I'm fairly certain that it's this transformation that is failing, because I've put some print statements in and the print right after the completion of this transformation never shows up. I used a UDF at one point but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far: I've not seen it get past retrieving the column names. Here's the code I've been using. I've been changing the actual name transformation as described above, but the rest of it has stayed pretty much the same.
I've seen some people try the spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead settings. I've used those and set them to the most AWS Glue will let you, but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# ========================================================
def lower_and_pythonize(s=None):
    if s is not None:
        return s.replace('.', '_').lower()
    return None
# pyspark implementation of renaming
# exprs = [
# regexp_replace(lower(trim(col(c))),'\.' , '_').alias(c) if t == "string" else col(c)
# for (c, t) in data_frame.dtypes
# ]
# ========================================================
# ========================================================
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ["<root-directory-i'm using>"] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
'groupFiles': 'inPartition',
'groupSize': group_size,
'recurse': True,
'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
num_paritions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
    good_key = lower_and_pythonize(s=key)
    data_frame = dfc.select(key).toDF()
    # lowercase all the names and remove '.'
    print('Removing . and _ from names for {} frame...'.format(key))
    df_fix_names_time = time.time()
    print('Repartitioning data frame...')
    print('Changing names...')
    for old_name in data_frame.schema.names:
        data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
    print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))
    # if a conflict of types appears, make it 2 columns
    # https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
    print('Fixing any type conflicts for {} frame...'.format(key))
    df_resolve_time = time.time()
    resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
    print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))
    # check if key exists in s3. if not make one
    out_connect = copy.deepcopy(output_connection_opts)
    out_connect['path'] = out_connect['path'] + '/' + str(good_key)
    try:
        s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
            # object doesn't exist
            s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
    ## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
    print('Writing {} frame to S3...'.format(key))
    df_writing_time = time.time()
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
    out_connect = None
    datasink4 = None
    print('Done. Time to complete: {}'.format(time.time() - df_writing_time))
print('Done. Time to complete: {}s'.format(time.time() - writting_time))
Here is the error I'm getting:
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job:
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
Log Contents:
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error, and I've tried a fair number of their suggestions with no success. Any help you can provide would be much appreciated. Let me know if you need any more information. Thanks
Prabhakar's comment reminded me that I have tried the memory worker type in AWS Glue and it still failed. As stated above, I have also tried raising the memoryOverhead from 5 to 12, but to no avail. Neither of these made the job complete successfully.
To make debugging easier, I replaced the column renaming code above with the following:
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
    print('Name number {}. name being changed: {}'.format(name_counter, old_name))
    data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    name_counter += 1
And I got the following output:
Removing . and _ from names for root frame...
Repartitioning data frame...
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
Update 2
Glue recently added more verbose logs and I found this:
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something that trivial would take up that much memory?
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error
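The renaming rule itself can be checked on plain strings, independent of Spark. Below is a minimal sketch, assuming the same rule as lower_and_pythonize above; build_new_names and the sample column names are made up for illustration. It also guards against two columns collapsing to the same name after lowercasing. This does not explain or fix the OOM; it only separates the name logic from the DataFrame work.

```python
from collections import Counter


def lower_and_pythonize(s):
    """Same rule as in the question: '.' becomes '_' and the name is lowercased."""
    return s.replace('.', '_').lower()


def build_new_names(old_names):
    """Return the renamed columns, failing loudly if two names would collide."""
    new_names = [lower_and_pythonize(n) for n in old_names]
    dupes = sorted(n for n, count in Counter(new_names).items() if count > 1)
    if dupes:
        raise ValueError("renaming produces duplicate columns: {}".format(dupes))
    return new_names


# Hypothetical column names of the kind a relationalized frame might have.
names = build_new_names(["Id", "user.Name", "user.address.City"])
# With a Spark DataFrame, the whole list could then be applied in one call:
#     data_frame = data_frame.toDF(*names)
```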

Callback function ModelCheckpoint causes error in Keras

I get this error when I use the ModelCheckpoint callback.
I read in a GitHub issue that the solution would be to use model.get_weights, but I am implicitly only storing weights already, since I only store the weights of the best model.
Keras only seems to save weights using h5, which makes me wonder: is there any other way to store them using the Keras API? If so, how? If not, how do I store them?
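One way to store weights without going through HDF5, sketched here under the assumption that the model exposes get_weights() and set_weights() as Keras models do: pickle the list that get_weights() returns. The function names save_weights_pickle and load_weights_pickle are made up for illustration and are not part of the Keras API.

```python
import pickle


def save_weights_pickle(model, path):
    """Write model.get_weights() (a list of NumPy arrays in Keras) to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model.get_weights(), f, protocol=pickle.HIGHEST_PROTOCOL)


def load_weights_pickle(model, path):
    """Read the weights back and hand them to model.set_weights()."""
    with open(path, "rb") as f:
        model.set_weights(pickle.load(f))
```

These helpers could be called from a custom callback's on_epoch_end in place of ModelCheckpoint; the model must be rebuilt with the same architecture before loading.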
Made an example to recreate the problem:
import glob, os
import sys
from os import listdir
from os.path import isfile, join
import numpy as np
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from keras.utils import np_utils
from keras import metrics
import keras
from keras import backend as K
from keras.models import Sequential
from keras.optimizers import SGD, Adam
from keras.layers.core import Dense, Activation, Lambda, Reshape,Flatten
from keras.layers import Conv1D,Conv2D,MaxPooling2D, MaxPooling1D, Reshape
#from keras.utils.visualize_util import plot
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.merge import Concatenate, Add
import h5py
import random
import tensorflow as tf
import math
from keras.callbacks import CSVLogger
from keras.callbacks import ModelCheckpoint
# the script reads sys.argv[1] through sys.argv[5], so six entries are needed
if len(sys.argv) < 6:
    print "Missing Arguments!"
    print "python keras_convolutional_feature_extraction.py <workspace> <totale_frames> <fbank-dim> <window-height> <batch_size>"
    print "Example:"
    print "python keras_convolutional_feature_extraction.py deltas 15 40 5 100"
total_frames = int(sys.argv[2])
total_frames_with_deltas = total_frames*3
dim = int(sys.argv[3])
window_height = int(sys.argv[4])
inserted_batch_size = int(sys.argv[5])
stride = 1
splits = ((dim - window_height)+1)/stride
#input_train_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_input"
#output_train_data ="/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_output"
#input_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_input"
#output_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_output"
#train_files =[f for f in listdir(input_train_data) if isfile(join(input_train_data, f))]
#test_files =[f for f in listdir(input_test_data) if isfile(join(input_test_data, f))]
#print len(train_files)
print "hallo"
def train_generator():
    while True:
        # input = random.choice(train_files)
        # h5f = h5py.File(input_train_data+'/'+input, 'r')
        # train_input = h5f['train_input'][:]
        # train_output = h5f['train_output'][:]
        # h5f.close()
        train_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
        train_list_list = []
        train_input = train_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
        train_input_list = np.split(train_input,splits*total_frames_with_deltas,axis=1)
        for i in range(len(train_input_list)):
            train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,window_height,3)
        #for i in range(len(train_input_list)):
        #    train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
        train_output = np.random.randint(5, size = (1,total_frames,5))
        middle = int(math.ceil(total_frames/2))
        train_output = train_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
        #print train_output.shape
        #print len(train_input_list)
        #print train_input_list[0].shape
        yield (train_input_list, train_output)
print "hallo"
def test_generator():
    while True:
        # input = random.choice(test_files)
        # h5f = h5py.File(input_test_data+'/'+input, 'r')
        # test_input = h5f['test_input'][:]
        # test_output = h5f['test_output'][:]
        # h5f.close()
        test_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
        test_input = test_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
        test_input_list = np.split(test_input,splits*total_frames_with_deltas,axis=1)
        #test_input_list = np.split(test_input,45,axis=3)
        for i in range(len(test_input_list)):
            test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,window_height,3)
        #for i in range(len(test_input_list)):
        #    test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
        test_output = np.random.randint(5, size = (1,total_frames,5))
        middle = int(math.ceil(total_frames/2))
        test_output = test_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
        yield (test_input_list, test_output)
print "hallo"
def fws():
#print "Inside"
# Params:
# batch , lr, decay , momentum, epochs
#Input shape: (batch_size,40,45,3)
#output shape: (1,15,50)
# number of unit in conv_feature_map = splitd
model_output = []
list_of_input = [Input(shape=(8,3)) for i in range(splits*total_frames_with_deltas)]
output = []
skip = total_frames_with_deltas
for steps in range(total_frames_with_deltas):
conv = Conv1D(filters = 100, kernel_size = 8)
column = 0
for _ in range(splits):
#print "column " + str(column) + "steps: " + str(steps)
column = column + 1
#print len(output)
#print splits*total_frames_with_deltas
conv = []
for section in range(splits):
column = 0
skip = splits
temp = []
for _ in range(total_frames_with_deltas):
column = column + 1
#print len(conv)
output_conc = Concatenate()(conv)
#print output_conc.get_shape
output_conv = Reshape((splits, -1))(output_conc)
#print output_conv.get_shape
pooled = MaxPooling1D(pool_size = 6, strides = 2)(output_conv)
reshape = Reshape((1,-1))(pooled)
dense1 = Dense(units = 1024, activation = 'relu', name = "dense_1")(reshape)
#dense2 = Dense(units = 1024, activation = 'relu', name = "dense_2")(dense1)
dense3 = Dense(units = 1024, activation = 'relu', name = "dense_3")(dense1)
final = Dense(units = 5, activation = 'relu', name = "final")(dense3)
model = Model(inputs = list_of_input , outputs = final)
sgd = SGD(lr=0.1, decay=1e-1, momentum=0.9, nesterov=True)
model.compile(loss="categorical_crossentropy", optimizer=sgd , metrics = ['accuracy'])
print "compiled"
model_yaml = model.to_yaml()
with open("model.yaml", "w") as yaml_file:
print "Model saved!"
log= CSVLogger('/home/carl/kaldi-trunk/dnn/experimental/yesno_cnn_50_training_total_frames_'+str(total_frames)+"_dim_"+str(dim)+"_window_height_"+str(window_height)+".csv")
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_weights_only=True, mode='max')
print "log"
#plot_model(model, to_file='model.png')
print "Fit"
hist_current = model.fit_generator(train_generator(),
epochs = 10000,
verbose = 1,
validation_data = test_generator(),
pickle_safe = True,
workers = 4,
callbacks = [log,checkpoint])
Execute the script with: python name_of_script.py yesno 50 40 8 1
which gives me the following full traceback:
carl#ca-ThinkPad-T420s:~/Dropbox$ python mini.py yesno 50 40 8 1
Using TensorFlow backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Model saved!
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:2252: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
Epoch 1/10000
2017-05-26 13:01:45.851125: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851345: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851392: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
443/444 [============================>.] - ETA: 4s - loss: 100.1266 - acc: 0.3138Epoch 00000: saving model to yesno_cnn_50_training_total_frames_50_dim_40_window_height_8weights-improvement-00-0.48.hdf5
Traceback (most recent call last):
File "mini.py", line 205, in <module>
File "mini.py", line 203, in fws
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1933, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 411, in on_epoch_end
self.model.save_weights(filepath, overwrite=True)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2503, in save_weights
save_weights_to_hdf5_group(f, self.layers)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 93, in __setitem__
self.create(name, data=value, dtype=base.guess_dtype(value))
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 183, in create
attr = h5a.create(self._id, self._e(tempname), htype, space)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5a.pyx", line 47, in h5py.h5a.create (/tmp/pip-4rPeHA-build/h5py/h5a.c:1904)
RuntimeError: Unable to create attribute (Object header message is too large)
If you look at the amount of data Keras is trying to save under the layer_names attribute (inside the output HDF5 file being created), you will find that it takes more than 64kB.
np.asarray([layer.name.encode('utf8') for layer in model.layers]).nbytes
>> 77100
I quote from https://support.hdfgroup.org/HDF5/faq/limits.html:
Is there an object header limit and how does that affect HDF5 ?
There is a limit (in HDF5-1.8) of the object header, which is 64 KB.
The datatype for a dataset is stored in the object header, so there is
therefore a limit on the size of the datatype that you can have. (See
The code above was (almost entirely) copied from the traceback:
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
I used NumPy's asarray to get the figure quickly; h5py presumably arrives at a similar figure (see https://github.com/h5py/h5py/blob/master/h5py/_hl/attrs.py#L102 if you want the exact value).
Anyway, either you will need to implement your own methods for saving/loading of the weights (or use existing workarounds), or you need to give a really short name to ALL the layers inside your model :), something like this:
list_of_input = [Input(shape=(8,3), name=('i%x' % i)) for i in range(splits*total_frames_with_deltas)]
conv = Conv1D(filters = 100, kernel_size = 8, name='cv%x' % steps)
conv.append(Add(name='add%x' % section)(temp))
output_conc = Concatenate(name='ct')(conv)
output_conv = Reshape((splits, -1), name='rs1')(output_conc)
pooled = MaxPooling1D(pool_size = 6, strides = 2, name='pl')(output_conv)
reshape = Reshape((1,-1), name='rs2')(pooled)
dense1 = Dense(units = 1024, activation = 'relu', name = "d1")(reshape)
dense2 = Dense(units = 1024, activation = 'relu', name = "d2")(dense1)
dense3 = Dense(units = 1024, activation = 'relu', name = "d3")(dense1)
final = Dense(units = 5, activation = 'relu', name = "fl")(dense3)
Don't forget to rename every layer: the layer names are converted to a fixed-width NumPy string array, so when the attribute is saved each name occupies as many bytes as the longest name in the model.
After renaming the layers as proposed above (the names then take almost 26kB), the model is saved successfully. Hope this elaborate answer helps someone.
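To see why a single long name inflates the whole attribute, here is a small sketch. The layer names and counts are made up for illustration and are not taken from the model in the question; `names_nbytes` is just a helper name I chose. It repeats the `np.asarray(...).nbytes` measurement from above on two lists of names:

```python
import numpy as np

HDF5_HEADER_LIMIT = 64 * 1024  # object header limit quoted above (HDF5 1.8)

def names_nbytes(names):
    """Bytes used by the fixed-width byte-string array built from `names`."""
    return np.asarray([n.encode('utf8') for n in names]).nbytes

# Hypothetical names: 2000 short ones, then the same list plus one long name.
short_names = ['i%x' % i for i in range(2000)]
long_name = 'a_rather_long_descriptive_layer_name_0001'

only_short = names_nbytes(short_names)
with_long = names_nbytes(short_names + [long_name])

# Every entry is padded to the width of the longest name,
# so one long name is enough to push the total over the limit.
print(only_short, with_long, with_long > HDF5_HEADER_LIMIT)
```

This is why renaming only some of the layers may not be enough.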
Update: I have just made a PR to Keras which should fix the issue without implementing any custom loading/saving methods, see 7508
A simple solution, albeit possibly not the most elegant, could be to run a while loop with epochs = 1.
Get the weights at the end of every epoch, together with the accuracy and the loss.
Save the weights to file 1 using model.get_weights().
If the accuracy is greater than at the previous epoch (i.e. the previous loop iteration), also store the weights in a different file (file 2).
Run the loop again, loading the weights from file 1.
Break out of the loop with a manual early-stopping check, so that it stops if the loss does not improve for a certain number of iterations.
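The steps above could be sketched roughly as follows. Everything here is an assumption for illustration: `train_epoch_by_epoch`, `run_one_epoch` and `save_weights` are hypothetical names, not Keras API. `run_one_epoch(model)` is expected to train for exactly one epoch and return `(loss, accuracy)`, and `save_weights(model, path)` is expected to write `model.get_weights()` to `path` in whatever format you prefer. The sketch compares against the best accuracy so far rather than only the previous epoch, and it does not reload file 1 on each pass because the model keeps its weights while the process is alive; reloading only matters after a restart.

```python
def train_epoch_by_epoch(model, run_one_epoch, save_weights,
                         last_path="last_weights", best_path="best_weights",
                         patience=5, max_epochs=1000):
    """Train one epoch at a time with manual checkpointing and early stopping."""
    best_acc = float("-inf")
    best_loss = float("inf")
    stale_epochs = 0   # consecutive epochs without a better loss
    epochs_run = 0

    for _ in range(max_epochs):
        loss, acc = run_one_epoch(model)
        epochs_run += 1

        # File 1: always holds the most recent weights.
        save_weights(model, last_path)

        # File 2: only overwritten when the accuracy improves.
        if acc > best_acc:
            best_acc = acc
            save_weights(model, best_path)

        # Manual early stopping on the loss.
        if loss < best_loss:
            best_loss = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break

    return best_acc, epochs_run
```

With Keras, `run_one_epoch` could wrap a call to `model.fit(..., epochs=1)` and read the loss and accuracy from the returned History object.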
You can use get_weights() together with numpy.save.
It's not the best solution, because it will save several files, but it actually works.
The drawback is that the optimizer's current state is not saved. You can perhaps work around that by using a smaller learning rate after loading.
Custom callback using numpy.save:
import numpy as np

storedLoss = float('inf')  # best loss seen so far

def myCallback(epoch, logs):
    global storedLoss
    # do your comparisons here using the "logs" var.
    if logs['loss'] < storedLoss:
        storedLoss = logs['loss']
        for i in range(len(model.layers)):
            WandB = model.layers[i].get_weights()
            if len(WandB) > 0:  # necessary because some layers have no weights
                # np.save appends ".npy" to the file name
                np.save("W" + "-" + str(i), WandB[0], allow_pickle=False)
                if len(WandB) > 1:  # some layers may have no bias
                    np.save("B" + "-" + str(i), WandB[1], allow_pickle=False)
    # remember that get and set weights use a list: [weights, biases]
The logs var is a dictionary of named metrics, such as "loss" and, if you used it, the accuracy (reported as "acc" in the training output above).
You can store the losses within the callback in a global var, and compare if each loss is better or worse than the last.
When fitting, use the lambda callback:
from keras.callbacks import LambdaCallback
model.fit(..., callbacks=[LambdaCallback(on_epoch_end=myCallback)])
In the example above, I used the LambdaCallback, which has more possibilities than just on_epoch_end.
For loading, do a similar loop:
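# you have to create the model first and then set the layers
def loadModel(model):
    for i in range(len(model.layers)):
        WandBForCheck = model.layers[i].get_weights()
        if len(WandBForCheck) > 0:  # necessary because some layers have no weights
            # np.save appended ".npy" to the names used when saving
            loaded = [np.load("W-" + str(i) + ".npy")]
            if len(WandBForCheck) > 1:  # only layers with a bias have a "B" file
                loaded.append(np.load("B-" + str(i) + ".npy"))
            # like the saving loop, this assumes at most [weights, biases] per layer
            model.layers[i].set_weights(loaded)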
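If the number of files bothers you, the same idea can be sketched with a single archive. This is only an illustration under assumptions: `save_layer_weights` and `load_layer_weights` are made-up helper names, the key scheme `layer<i>_w<j>` is arbitrary, and `model` only needs a `layers` list whose items provide `get_weights()` and `set_weights()`, as Keras layers do. The optimizer state is still not saved.

```python
import numpy as np

def save_layer_weights(model, path):
    """Write every layer's weight arrays into one .npz archive.

    `path` should end in ".npz"; otherwise np.savez appends that suffix.
    """
    arrays = {}
    for i, layer in enumerate(model.layers):
        for j, w in enumerate(layer.get_weights()):
            arrays["layer%d_w%d" % (i, j)] = w
    np.savez(path, **arrays)

def load_layer_weights(model, path):
    """Restore weights written by save_layer_weights into an already built model."""
    with np.load(path) as data:
        for i, layer in enumerate(model.layers):
            n = len(layer.get_weights())
            if n > 0:  # layers without weights have nothing to restore
                layer.set_weights(
                    [data["layer%d_w%d" % (i, j)] for j in range(n)])
```

Because the keys are built per layer and per array, layers with no bias (or with more than two arrays) need no special case.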
See follow-up at https://github.com/fchollet/keras/issues/6766 and https://github.com/farizrahman4u/keras-contrib/pull/90.
I saw the YAML, and the root cause is probably that you have so many Inputs. A few Inputs with many dimensions are preferable to many Inputs, especially if you can use scanning and batch operations to do everything efficiently.
Now, ignoring that entirely, here is how you can save and load your model if it has too much stuff to save as JSON efficiently:
You can pass save_weights_only=True. That won't save optimizer weights, so isn't a great solution.
Just put together a PR for saving model weights and optimizer weights but not configuration. When you want to load, first instantiate and compile the model as you did when you were going to train it, then use load_all_weights to load the model and optimizer weights into that model. I'll try to merge it soon so you can use it from the master branch.
You could use it something like this:
from keras.callbacks import LambdaCallback
from keras_contrib.utils.save_load_utils import save_all_weights, load_all_weights
# do some stuff to create and compile model
# use `save_all_weights` as a callback to checkpoint your model and optimizer weights
model.fit(..., callbacks=[LambdaCallback(on_epoch_end=lambda epoch, logs: save_all_weights(model, "checkpoint-{:05d}.h5".format(epoch))])
# use `load_all_weights` to load model and optimizer weights into an existing model
# if not compiled (no `model.optimizer`), this will just load model weights
load_all_weights(model, 'checkpoint-1337.h5')
So I don't endorse the model, but if you want to get it to save and load anyways this should probably work for you.
As a side note, if you want to save weights in a different format, something like this would work.
pickle.dump([K.get_value(w) for w in model.weights], open( "save.p", "wb" ) )
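For completeness, here is a rough sketch of the round trip, including the loading side. The helper names `save_weights_pickle` and `load_weights_pickle` are mine, not part of Keras. The sketch saves `model.get_weights()` instead of calling `K.get_value` on `model.weights`, on the assumption that this keeps the order identical to what `model.set_weights()` expects. It needs nothing beyond those two methods on the model, and as with the other workarounds the optimizer state is not included.

```python
import pickle

def save_weights_pickle(model, path):
    """Pickle the list returned by model.get_weights()."""
    with open(path, "wb") as f:
        pickle.dump(model.get_weights(), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_weights_pickle(model, path):
    """Load a pickled weight list into a model that was already built."""
    with open(path, "rb") as f:
        model.set_weights(pickle.load(f))
```

The model has to be created (and compiled, if you want to keep training) with the same architecture before calling `load_weights_pickle`.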
Your model architecture must be too large to be saved.
Do not use the ModelCheckpoint callback. Instead, once the training ends, save the weights with pickle.
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")