ruamel.yaml edit document before dump - ruamel.yaml

I have a YAML file which has documents of this sort
%YAML 1.2
---
!some_tag
name: xyz
constants:
state: abc
After reading in the documents, before dumping using ruamel.yaml.YAML().dump, I want to remove this part of the document
%YAML 1.2
---
My output file should have just these sections of the document
!some_tag
name: xyz
constants:
state: abc
How can this be done?

If you have an explicit tag with version the .version attribute on the
YAML() instance gets set. You can either instantiate a new instance for dumping
or just set that attribute to None:
import sys
import pathlib
import ruamel.yaml
inp = pathlib.Path('input.yaml')
yaml = ruamel.yaml.YAML()
data = yaml.load(inp)
print(f'>>> version info {yaml.version}\n')
yaml.version = None
yaml.dump(data, sys.stdout)
which gives:
>>> version info (1, 2)
!some_tag
name: xyz
constants:
state: abc

Related

CuPy error when pushing / popping pycuda context

I am using tensorRT to perform inference with CUDA. I'd like to use CuPy to preprocess some images that I'll feed to the tensorRT engine. The preprocessing function, called my_function, works fine as long as tensorRT is not run between different calls of the my_function method (see code below). Specifically, the issue is not strictly related by tensorRT but by the fact that tensorRT inference requires to be wrapped by push and pop operations of the pycuda context.
With respect to the following code, the last execution of my_function will raise the following error:
File "/home/ubuntu/myfile.py", line 188, in _pre_process_cuda
img = ndimage.zoom(img, scaling_factor)
File "/home/ubuntu/.local/lib/python3.6/site-packages/cupyx/scipy/ndimage/interpolation.py", line 482, in zoom
kern(input, zoom, output)
File "cupy/core/_kernel.pyx", line 822, in cupy.core._kernel.ElementwiseKernel.__call__
File "cupy/cuda/function.pyx", line 196, in cupy.cuda.function.Function.linear_launch
File "cupy/cuda/function.pyx", line 164, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 299, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
Note: in the following code I haven't reported the entire tensorRT inference code. In fact, simply pushing and popping a pycuda context generates the error
Code:
import numpy as np
import cv2
import time
from PIL import Image
import requests
from io import BytesIO
from matplotlib import pyplot as plt
import cupy as cp
from cupyx.scipy import ndimage
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
def my_function(numpy_frame):
dtype = 'float32'
img = cp.array(numpy_frame, dtype='float32')
# print(img)
img = ndimage.zoom(img, (0.5, 0.5, 3))
img = (cp.array(2, dtype=dtype) / cp.array(255, dtype=dtype)) * img - cp.array(1, dtype=dtype)
img = img.transpose((2, 0, 1))
img = img.ravel()
return img
# load image
url = "https://www.pexels.com/photo/109919/download/?search_query=&tracking_id=411xe21veam"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img = np.array(img)
# initialize tensorrt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
cfx = cuda.Device(0).make_context()
my_function(img) # ok
my_function(img) # ok
# ----- TENSORRT ---------
cfx.push()
# .... tensorrt inference....
cfx.pop()
# ----- TENSORRT ---------
my_function(img) # <---- error
I even tried to do it other ways, but unfortunately with the same result:
cfx.push()
my_function(img) # ok
cfx.pop()
cfx.push()
my_function(img) # error
cfx.pop()
#admin: if you can think of a better name for this question feel free to edit it :)
There were multiple contexts open. For instance, it seems that all of the following open a context:
import pycuda.autoinit
cfx.cuda.Device(0).make_context()
cfx.push()
So if you run the three command above, then simply running one cfx.pop() won't be enough. You will need to run cfx.pop() three times to pop all the contexts.

Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks

I'm using Azure Databricks + Hyperopt + MLflow for some hyperparameter tuning on a small dataset. Seem like the job is running, and I get output in MLflow, but the job ends with the following error message:
Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks
Here is my code code with some information redacted:
from pyspark.sql import SparkSession
# spark session initialization
spark = (SparkSession.builder.getOrCreate())
sc = spark.sparkContext
# Data Processing
import pandas as pd
import numpy as np
# Hyperparameter Tuning
from hyperopt import fmin, tpe, hp, anneal, Trials, space_eval, SparkTrials, STATUS_OK
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
# Modeling
from sklearn.ensemble import RandomForestClassifier
# cleaning
import gc
# tracking
import mlflow
# track runtime
from datetime import date, datetime
mlflow.set_experiment('/user/myname/myexp')
# notebook settings \ variable settings
n_splits = #
n_repeats = #
max_evals = #
dfL = pd.read_csv("/my/data/loc/mydata.csv")
x_train = dfL[['f1','f2','f3']]
y_train = dfL['target']
def define_model(params):
model = RandomForestClassifier(n_estimators=int(params['n_estimators']),
criterion=params['criterion'],
max_depth=int(params['max_depth']),
min_samples_split=params['min_samples_split'],
min_samples_leaf=params['min_samples_leaf'],
min_weight_fraction_leaf=params['min_weight_fraction_leaf'],
max_features=params['max_features'],
max_leaf_nodes=None,
min_impurity_decrease=params['min_impurity_decrease'],
min_impurity_split=None,
bootstrap=params['bootstrap'],
oob_score=False,
n_jobs=-1,
random_state=int(params['random_state']),
verbose=0,
warm_start=False,
class_weight={0:params['class_0_weight'], 1:params['class_1_weight']})
return model
space = {'n_estimators': hp.quniform('n_estimators', #, #, #),
'criterion': hp.choice('#', ['#','#']),
'max_depth': hp.quniform('max_depth', #, #, #),
'min_samples_split': hp.quniform('min_samples_split', #, #, #),
'min_samples_leaf': hp.quniform('min_samples_leaf', #, #, #),
'min_weight_fraction_leaf': hp.quniform('min_weight_fraction_leaf', #, #, #),
'max_features': hp.quniform('max_features', #, #, #),
'min_impurity_decrease': hp.quniform('min_impurity_decrease', #, #, #),
'bootstrap': hp.choice('bootstrap', [#,#]),
'random_state': hp.quniform('random_state', #, #, #),
'class_0_weight': hp.choice('class_0_weight', [#,#,#]),
'class_1_weight': hp.choice('class_1_weight', [#,#,#])}
# define hyperopt objective
def objective(params, n_splits=n_splits, n_repeats=n_repeats):
# define model
model = define_model(params)
# get cv splits
kfold = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1331)
# define and run sklearn cv scorer
scores = cross_val_score(model, x_train, y_train, cv=kfold, scoring='roc_auc')
score = scores.mean()
return {'loss': score*(-1), 'status': STATUS_OK}
spark_trials = SparkTrials(parallelism=36, spark_session=spark)
with mlflow.start_run():
best = fmin(objective, space, algo=tpe.suggest, trials=spark_trials, max_evals=max_evals)
and then at the end I get..
100%|██████████| 200/200 [1:35:28<00:00, 100.49s/trial, best loss: -0.9584565527065526]
Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks
Exception: 'MLFLOW_RUN_ID'
Total Trials: 200: 200 succeeded, 0 failed, 0 cancelled.
My Azure Databricks cluster is..
6.6 ML (includes Apache Spark 2.4.5, Scala 2.11)
Standard_DS3_v2
min 9 max 18 nodes
Am I doing something wrong or is this a bug?
This message is a known (but harmless) issue and has been fixed for MLR 7.0. I have tried executing on the DBR 7.0 ML cluster and it's working.
You don’t need start_run(); a run is started for you automatically with SparkTrials. The error is because of this only.
So with SparkTrials, it still works without start_run(); SparkTrials should automatically run and log for you.

AWS Glue job failing with OOM exception when changing column names

I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result of this in Athena later so I want to change the names of the columns from having '.' to '_' and lower case them. When I do this transformation, I change the DynamicFrame into a spark dataframe and have been doing it this way. I've also seen a problem in another SO question where it turned out there is a reported problem with AWS Glue rename field transform so I've stayed away from that.
I've tried a couple things, including adding a load limit size to 50MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, using sparksql to change it and nothing has worked. I'm fairly certain that its this transformation that failing because I've put some print statements in and the print that I have right after the completion of this transformation never shows up. I used a UDF at one point but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed() but it never gets this far because I've not seen it get past retrieving the column names. Here's the code I've been using. I've been changing the actual name transformation as I said above, but the rest of it has stayed pretty much the same.
I've seen some people try and use the spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead. I've used those and set them to the most AWS Glue will let you but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# UTILITY FUNCTIONS
# ========================================================
def lower_and_pythonize(s=None):
if s is not None:
return s.replace('.', '_').lower()
else:
return None
# pyspark implementation of renaming
# exprs = [
# regexp_replace(lower(trim(col(c))),'\.' , '_').alias(c) if t == "string" else col(c)
# for (c, t) in data_frame.dtypes
# ]
# ========================================================
# END UTILITY FUNCTIONS
# ========================================================
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ['<root-directory-i'm using'] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
'groupFiles': 'inPartition',
'groupSize': group_size,
'recurse': True,
'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
print(sc._conf.get('spark.executor.cores'))
num_paritions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
good_key = lower_and_pythonize(s=key)
data_frame = dfc.select(key).toDF()
# lowercase all the names and remove '.'
print('Removing . and _ from names for {} frame...'.format(key))
df_fix_names_time = time.time()
print('Repartitioning data frame...')
data_frame.repartition(num_paritions)
print('Done.')
#
print('Changing names...')
for old_name in data_frame.schema.names:
data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
print('Done.')
#
df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))
# if a conflict of types appears, make it 2 columns
# https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
print('Fixing any type conficts for {} frame...'.format(key))
df_resolve_time = time.time()
resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))
# check if key exists in s3. if not make one
out_connect = copy.deepcopy(output_connection_opts)
out_connect['path'] = out_connect['path'] + '/' + str(good_key)
try:
s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
except botocore.exceptions.ClientError as e:
if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
# object doesn't exist
s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
else:
print(e)
## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
print('Writing {} frame to S3...'.format(key))
df_writing_time = time.time()
datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
out_connect = None
datasink4 = None
print('Done. Time to complete: {}'.format(time.time() - df_writing_time))
print('Done. Time to complete: {}s'.format(time.time() - writting_time))
job.commit()
Here is the error I'm getting
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job
LogType:stdout
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
LogLength:487
Log Contents:
4
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error I'm seeing and I've tried a fair bit of them with no success. Any help you can provide would b e much appreciated. Let me know if you need any more information. Thanks
Edit
Prabhakar's comment reminded me that I have tried the memory worker type in AWS Glue and it still failed. As stated above, I have tried raising the amount of memory in the memoryOverhead from 5 to 12, but to avail. Neither of these made the job complete successfully
Update
I put in the following code for column name change instead of the above code for easier debugging
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
print('Name number {}. name being changed: {}'.format(name_counter, old_name))
data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
name_counter += 1
print('Done.')
And I got the following output
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
Update 2
Glue recently added more verbose logs and I found this
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something that trivial would take up that much memory?
Update
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error

How to read kaggle zip file dataset in the databricks

I want to read the zip file dataset from the kaggle but I am unable to read that dataset:
import urllib
urllib.request.urlretrieve("https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants/downloads/zomato-bangalore-restaurants.zip", "/tmp/zomato-bangalore-restaurants.zip")
then I run shell scripting to extracting the file:
%sh
unzip /tmp/zomato-bangalore-restaurants.zip
tail -n +2 zomato-bangalore-restaurants.csv > temp.csv
rm zomato-bangalore-restaurants.csv
Then I got an error:
Archive: /tmp/zomato-bangalore-restaurants.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of /tmp/zomato-bangalore-restaurants.zip or
/tmp/zomato-bangalore-restaurants.zip.zip, and cannot find /tmp/zomato-bangalore-restaurants.zip.ZIP, period.
tail: cannot open 'zomato-bangalore-restaurants.csv' for reading: No such file or directory
rm: cannot remove 'zomato-bangalore-restaurants.csv': No such file or directory
Note: Attempt to download a file from Kaggle is blocked because you are not logged in yet.
Here is the script to download all the competition data sets.
from requests import get, post
from os import mkdir, remove
from os.path import exists
from shutil import rmtree
import zipfile
def purge_all_downloads(db_full_path):
# Removes all the downloaded datasets
if exists(db_full_path): rmtree(db_full_path)
def datasets_are_available_locally(db_full_path, datasets):
# Returns True only if all the competition datasets are available locally in Databricks CE
if not exists(db_full_path): return False
for df in datasets:
# Assumes all the datasets end with '.csv' extention
if not exists(db_full_path + df + '.csv'): return False
return True
def remove_zip_files(db_full_path, datasets):
for df in datasets:
remove(db_full_path + df + '.csv.zip')
def unzip(db_full_path, datasets):
for df in datasets:
with zipfile.ZipFile(db_full_path + df + '.csv.zip', 'r') as zf:
zf.extractall(db_full_path)
remove_zip_files(db_full_path, datasets)
def download_datasets(competition, db_full_path, datasets, username, password):
# Downloads the competition datasets if not availible locally
if datasets_are_available_locally(db_full_path, datasets):
print 'All the competition datasets have been downloaded, extraced and are ready for you !'
return
purge_all_downloads(db_full_path)
mkdir(db_full_path)
kaggle_info = {'UserName': username, 'Password': password}
for df in datasets:
url = (
'https://www.kaggle.com/account/login?ReturnUrl=' +
'/c/' + competition + '/download/'+ df + '.csv.zip'
)
request = post(url, data=kaggle_info, stream=True)
# write data to local file
with open(db_full_path + df + '.csv.zip', "w") as f:
for chunk in request.iter_content(chunk_size = 512 * 1024):
if chunk: f.write(chunk)
# extract competition data
unzip(db_full_path, datasets)
print('done !')
For more details, refer "Download the competition data sets directly".
Hope this helps.

How to add meta_data to Pandas dataframe?

I use Pandas dataframe heavily. And need to attach some data to the dataframe, for example to record the birth time of the dataframe, the additional description of the dataframe etc.
I just can't find reserved fields of dataframe class to keep the data.
So I change the core\frame.py file to add a line _reserved_slot = {} to solve my issue. I post the question here is just want to know is it OK to do so ? Or is there better way to attach meta-data to dataframe/column/row etc?
#----------------------------------------------------------------------
# DataFrame class
class DataFrame(NDFrame):
_auto_consolidate = True
_verbose_info = True
_het_axis = 1
_col_klass = Series
_AXIS_NUMBERS = {
'index': 0,
'columns': 1
}
_reserved_slot = {} # Add by bigbug to keep extra data for dataframe
_AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT : (Add demo msg for witingkuo's way)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
A B C D E
0 0.5890 -0.7683 -1.9752 0.7745 0.8019
1 1.1835 0.0873 0.3492 0.7749 1.1318
2 0.7476 0.4116 0.3427 -0.1355 1.8557
3 1.2738 0.7225 -0.8639 -0.7190 -0.2598
4 -0.3644 -0.4676 0.0837 0.1685 0.8199
5 0.4621 -0.2965 0.7061 -1.3920 0.6838
6 -0.4135 -0.4991 0.7277 -0.6099 1.8606
7 -1.0804 -0.3456 0.8979 0.3319 -1.1907
8 -0.3892 1.2319 -0.4735 0.8516 1.2431
9 -1.0527 0.9307 0.2740 -0.6909 0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\frame.py", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
>>>
This is not supported right now. See https://github.com/pydata/pandas/issues/2485. The reason is the propogation of these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.
Your _reserved_slot will become a class variable. That might not work if you want to assign different value to different DataFrame. Probably you can assign what you want to the instance directly.
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
I think a decent workaround is putting your datafame into a dictionary with your metadata as other keys. So if you have a dataframe with cashflows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
You can create your dictionary with additional metadata and put the dataframe there
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then use it as out[df]