How do I save TFDV statistics in the correct format for them to be loaded back in? - tensorflow-data-validation

It is puzzling to me that there is a tfdv.load_statistics() function, but no corresponding tfdv.write_statistics() function. How do I go about saving the statistics, and then loading them again?
e.g.
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)
# how do I save?
# load back for later use
saved_stats = tfdv.load_statistics('saved_stats.stats')
I can save the string representation to a file, but this is not the format that load_statistics expects.
with open('saved_stats.stats', 'w') as o:
    o.write(str(stats))
Pointers anyone?

Have you tried this: tfdv.utils.stats_util.write_stats_text?

In the current tfdv version 1.3.0 there are the following methods that can be used:
load_stats_text
write_stats_text
Example:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)
stats_path = "my-stats-file.stats"
# saving
tfdv.write_stats_text(stats, stats_path)
# loading
stats = tfdv.load_stats_text(stats_path)

Okay, figured out this hacky way to do it.
df = ... # create pandas df
from tensorflow_metadata.proto.v0 import statistics_pb2
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)
# save it
with open('saved_stats.stats', 'wb') as o:
    o.write(stats.SerializeToString())
# load back for later use
with open('saved_stats.stats', 'rb') as i:
    loaded_stats = statistics_pb2.DatasetFeatureStatisticsList.FromString(i.read())

There's a function called tfdv.load_stats_binary that you can use to solve this problem.
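For completeness, a minimal sketch of pairing the binary serialization from the hacky answer above with tfdv.load_stats_binary (assuming a tfdv version that exposes it, and reusing the df from the question; the file name is just an example):
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)
# write the proto in its binary wire format (same bytes as SerializeToString above)
with open('saved_stats.stats', 'wb') as f:
    f.write(stats.SerializeToString())
# load_stats_binary reads a binary-serialized DatasetFeatureStatisticsList
loaded_stats = tfdv.load_stats_binary('saved_stats.stats')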

Related

How to pickle an object and transfer it between files in Python?

I have a class that I want to share between different files or computers. I can pickle and load it from the same jupyter notebook. However, I cannot load it from a different machine or a different notebook. I have tried the following.
Initialize and save from a jupyter notebook
# Simplified class definition
class MyClass:
    def __init__(self, name):
        self.name = name
        self.dataToIndex = {}
        self.index = 0

    def addData(self, dataReceived):
        self.dataToIndex[dataReceived] = self.index
        self.index += 1

# initialize
my_dataset = MyClass("My_test_dataset")
# add some data
my_dataset.addData("One")
my_dataset.addData("Two")
# check data
my_dataset.dataToIndex
# save from one jupyter notebook
import pickle
with open("path.obj", "wb") as inp:
    pickle.dump(my_dataset, inp, pickle.HIGHEST_PROTOCOL)
# Read from the same jupyter notebook
with open("path.obj", 'rb') as inp:
    transferred = pickle.load(inp)
# Output looks good
transferred.dataToIndex
{'One': 0, 'Two': 1}
Read from the new jupyter notebook
import pickle
with open("path.obj", 'rb') as inp:
transferred = pickle.load(inp)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-44a1a18ebb1b> in <module>
2 # Read from the same jupyter notebook
3 with open("path.obj", 'rb') as inp:
----> 4 transferred = pickle.load(inp)
AttributeError: Can't get attribute 'MyClass' on <module '__main__'>
Now, I want to be able to load it into a different python script or in a different jupyter notebook.
I have checked this: Saving an Object (Data persistence)
and
https://www.stefaanlippens.net/python-pickling-and-dealing-with-attributeerror-module-object-has-no-attribute-thing.html
but could not figure out a solution. Any help is appreciated.
There are tons of helpful posts on this. However, most of them fall under how to save from within the class, etc. For a future beginner like me, I just want to provide a simple solution that worked for me.
Take the class out of the notebook and put it in a script like my_class_def.py:
class MyClass:
    def __init__(self, name):
        self.name = name
        self.dataToIndex = {}
        self.index = 0

    def addData(self, dataReceived):
        self.dataToIndex[dataReceived] = self.index
        self.index += 1
Then use pickle to save it as before.
In the different (new) script, import the class first using
from my_class_def import MyClass
and then load the pickled file.
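Putting that together, a minimal sketch of the new script (assuming path.obj and my_class_def.py sit next to it):
# new_script.py
from my_class_def import MyClass  # makes MyClass resolvable during unpickling
import pickle

with open("path.obj", "rb") as inp:
    transferred = pickle.load(inp)
print(transferred.dataToIndex)  # {'One': 0, 'Two': 1}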

How to pass a dataframe read from excel to another variable in spark-scala?

I have a dataframe var cache: DataFrame = _. On the initial run I set cache = existingDF, where existingDF is read from an Excel file using crealytics.spark.excel.
But on subsequent runs, when existingDF picks up an updated Excel file, it should become cache = cache.union(existingDF).
Yet I seem to get only existingDF inside cache. In short, whenever I use cache it seems to re-read the Excel file. How do I avoid this? The issue is not there when reading the file as csv. (It was there when I used .persist on the csv read, but went away when I removed .persist.)
More Simply:
var a: DataFrame = _
while (true) {
  val b = spark.read.format("com.crealytics.spark.excel")...
  if (Option(a).isEmpty) {
    a = b
  } else if (a != b) {
    a = b.union(a)
  }
}
The variable a is always getting updated along with b, so it never becomes different from b. How do I avoid this?

How do I implement a locustfile where each locust takes a unique value from a csv file for its task?

from locust import HttpLocust, TaskSet, task

class ExampleTask(TaskSet):
    csvfile = open('failed.csv', 'r')
    data = csvfile.readlines()
    bakdata = list(data)

    @task
    def fun(self):
        try:
            value = self.data.pop().split(',')
            print('------This is the value {}'.format(value[0]))
        except IndexError:
            self.data = list(self.bakdata)

class ExampleUser(HttpLocust):
    host = 'https://www.google.com'
    task_set = ExampleTask
Following is my csv file:
516,True,success
517,True,success
518,True,success
519,True,success
520,True,success
521,True,success
522,True,success
523,True,success
524,True,success
525,True,success
526,True,success
527,True,success
528,True,success
529,True,success
530,True,success
531,True,success
532,True,success
533,True,success
534,True,success
535,True,success
536,True,success
537,True,success
538,True,success
539,True,success
540,True,success
541,True,success
542,True,success
543,True,success
544,True,success
545,True,success
546,True,success
547,True,success
548,True,success
549,True,success
550,True,success
551,True,success
552,True,success
553,True,success
554,True,success
555,True,success
556,True,success
557,True,success
558,True,success
559,True,success
Here, after the csv file ends, locust does not take unique values; it takes the same value for all of the simulated users.
I'm not 100% sure, but I think your problem is this line:
self.data = list(self.bakdata)
This will give each User instance a different copy of the list.
It should work if you change it to:
ExampleTask.data = list(self.bakdata)
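A minimal plain-Python sketch (no locust involved) of why that one-word change matters: assigning to self.data shadows the shared class attribute with a per-instance copy, while assigning to ExampleTask.data updates the attribute all instances read.
class Shared:
    data = [1, 2, 3]

a, b = Shared(), Shared()
a.data = [9]         # instance attribute: only 'a' sees this
print(b.data)        # [1, 2, 3] -- 'b' still reads the class attribute
Shared.data = [7]    # class attribute: every instance without a shadow sees it
print(b.data)        # [7]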
Or you can use locust-plugins' CSVReader; see the example here:
https://github.com/SvenskaSpel/locust-plugins/blob/master/examples/csvreader_ex.py
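As a rough illustration of what such a reader does (a hand-rolled stand-in, not the actual locust-plugins API): a single iterator shared by all users, so each call hands out the next, unique row and wraps around at end of file.
import csv

class SharedCSVReader:
    # one shared iterator: every caller gets the next (unique) row
    def __init__(self, file_name):
        self.file_name = file_name
        self.reader = csv.reader(open(file_name, 'r'))

    def __next__(self):
        try:
            return next(self.reader)
        except StopIteration:
            # wrap around when the file is exhausted
            self.reader = csv.reader(open(self.file_name, 'r'))
            return next(self.reader)

failed_reader = SharedCSVReader('failed.csv')
row = next(failed_reader)  # e.g. ['516', 'True', 'success']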

How do I print a dictionary vs a defaultdict based dictionary as a yaml file using ruamel.yaml?

Please refer to the trivial block of code shown below. My goal is to use defaultdict to come up with a relatively simple dictionary, and then print the result out as a yaml file.
When I manually define the dictionary, it seems to work just fine and the YAML is displayed exactly the way I want it, but when I use defaultdict to come up with a dictionary, I get an error message that I am unable to decipher.
When I print the dictionary as JSON, it prints the exact same output. What am I missing?
import sys, ruamel.yaml
import json
from collections import defaultdict

def dict_maker():
    return defaultdict(dict_maker)

S = ruamel.yaml.scalarstring.DoubleQuotedScalarString
app = "someapp"
d = {'beats': {'name': S(app), 'udp_address': S('239.1.1.1:10101')}}
foo = dict_maker()
foo["beats"]["name"] = S(app)
foo["beats"]["udp_address"] = S("239.1.1.1:10101")
print "Regular dictionary"
print json.dumps(d, indent=4)
print "defaultdict dictionary"
print json.dumps(foo, indent=4)
print "dictionary as a yaml\n"
ruamel.yaml.dump(d, sys.stdout, Dumper=ruamel.yaml.RoundTripDumper)
print "defaultdict dictionary as a yaml\n"
ruamel.yaml.dump(foo, sys.stdout, Dumper=ruamel.yaml.RoundTripDumper)
Error Message
raise RepresenterError("cannot represent an object: %s" % data)
ruamel.yaml.representer.RepresenterError: cannot represent an object: defaultdict(<function dict_maker at 0x7f1253725a28>, {'beats': defaultdict(<function dict_maker at 0x7f1253725a28>, {'name': u'someapp', 'udp_address': u'239.1.1.1:10101'})})
You seem to be using the word "dictionary" when referring to a Python dict. There is however no such thing as a "defaultdict based dictionary"; that would imply that foo after
foo = dict_maker()
would be a dict, and of course it is not: foo is a defaultdict, which is dict based (i.e. exactly the other way around from what you write).
That JSON dumps this is not surprising, as it cannot do more than stupidly dump the key-value pairs as if it were a dict. But when you try to load that JSON back, you see how useless this is, as you cannot continue working with it (at least not in the way expected):
import sys
import json
from collections import defaultdict
import io

def dict_maker():
    return defaultdict(dict_maker)

app = "someapp"
foo = dict_maker()
foo["beats"]["name"] = app
foo["beats"]["udp_address"] = "239.1.1.1:10101"
io = io.StringIO()
json.dump(foo, io, indent=4)
io.seek(0)
bar = json.load(io)
bar['otherapp']['name'] = 'some_alt_app'
print(bar['beats']['udp_address'])
The above throws: KeyError: 'otherapp'. And that is because JSON doesn't keep all the information needed.
However, if you use the unsafe YAML dumper, then ruamel.yaml can dump and load this fine:
import sys
from ruamel.yaml import YAML
from collections import defaultdict
import io

def dict_maker():
    return defaultdict(dict_maker)

app = "someapp"
yaml = YAML(typ='unsafe')
foo = dict_maker()
foo["beats"]["name"] = app
foo["beats"]["udp_address"] = "239.1.1.1:10101"
io = io.StringIO()
yaml.dump(foo, io)
io.seek(0)
print(io.getvalue())
bar = yaml.load(io)
bar['otherapp']['name'] = 'some_alt_app'
print(bar['beats']['udp_address'])
this doesn't throw an error, as bar is again a defaultdict with dict_maker as the function it defaults to. The above prints
239.1.1.1:10101
as you would expect.
That the RoundTripDumper/Loader doesn't support this out-of-the-box is because it is based on the SafeDumper/Loader, which cannot dump/load arbitrary Python instances like defaultdict and its dict_maker function reference. Enabling that would make the loading unsafe.
So if you need to use the RoundTripDumper, you should add a representer for defaultdict or a subclass thereof (and possibly one for dict_maker as well). To be able to load that back, you need constructor(s) as well. How to do that is described in the documentation (Dumping Python classes).
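A minimal sketch of the dump side of that (my own example, not taken from the documentation; it represents the defaultdict as a plain mapping, so loading it back yields an ordinary dict and the default-factory behaviour is lost):
import sys
from collections import defaultdict
from ruamel.yaml import YAML

def dict_maker():
    return defaultdict(dict_maker)

def repr_defaultdict(representer, data):
    # dump the defaultdict as an ordinary YAML mapping
    return representer.represent_dict(dict(data))

yaml = YAML()  # round-trip by default
yaml.representer.add_representer(defaultdict, repr_defaultdict)

foo = dict_maker()
foo["beats"]["name"] = "someapp"
yaml.dump(foo, sys.stdout)  # beats:\n  name: someapp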

Resource Not Found Exception in pyglet

I'm using Python 2.6.6 and pyglet 1.1.4. In my "Erosion" folder, I have "Erosion.py" and a folder named "Images." Inside images, there are .png images. One image is named "Guard.png."
In "Erosion.py" there is a segment that goes like so:
pyglet.resource.path = ['Images']
pyglet.resource.reindex()
self.image = pyglet.resource.image('%s%s' % (character, '.png'))
When I run this, I am given
File "C:\Python26\lib\site-packages\pyglet\resource.py", line 394, in file raise ResourceNotFoundException(name)
ResourceNotFoundException: Resource "Guard.png" was not found on the path. Ensure that the filename has the correct captialisation.
I have tried changing the path to ['./Images'] and ['../Images']. I've also tried removing the path and the reindex call and putting Erosion.py and Guard.png in the same folder.
This is what I do to be able to load resources:
import pyglet
from pyglet import gl

pyglet.resource.path = ['C:\\Users\\myname\\Downloads\\temp']
pyglet.resource.reindex()
pic = pyglet.resource.image('1.jpg')
texture = pic.get_texture()
gl.glTexParameteri(gl.GL_TEXTURE_2D, gl.GL_TEXTURE_MAG_FILTER, gl.GL_NEAREST)
texture.width = 960
texture.height = 720
texture.blit(0, 0, 0)
Try this:
pyglet.image.load('path/to/image.png')
If the relative path is not working, you can try the absolute path using the os module.
import pyglet
import os
working_dir = os.path.dirname(os.path.realpath(__file__))
pyglet.resource.path = [os.path.join(working_dir,'Images')]
pyglet.resource.reindex()
image = pyglet.resource.image('character.png')
It's better to use the os.path.join method instead of writing the path as a string, for better cross-platform support.
I get a problem like this using pyglet and PyScripter (the text editor). In order for the file to be found, I have to restart the editor before running the program.
This might be a problem with PyScripter, however.
It is worth looking into: http://pyglet.readthedocs.io/en/pyglet-1.3-maintenance/programming_guide/resources.html
You should include all the directories in:
pyglet.resource.path = [...]
Then pass ONLY the file name when you call:
pyglet.resource.image("ball_color.jpeg")
Try:
import os
import pyglet

working_dir = os.path.dirname(os.path.realpath(__file__))
image_dir = os.path.join(working_dir, 'images')
sound_dir = os.path.join(working_dir, 'sound')

print working_dir
print "Images are in: " + image_dir
print "Sounds are in: " + sound_dir

pyglet.resource.path = [working_dir, image_dir, sound_dir]
pyglet.resource.reindex()

try:
    window = pyglet.window.Window()
    image = pyglet.resource.image("ball_color.jpeg")

    @window.event
    def on_draw():
        window.clear()
        image.blit(0, 0)

    pyglet.app.run()
finally:
    window.close()
The use of try and finally is not necessary, but it is recommended. You might end up with a lot of open windows and pyglet background processes if you don't close the window when there are errors.
pyglet 1.4.4 format: HUD_img = pyglet.resource.image('ui\ui_gray_block_filled.png')
pyglet 1.4.8 format: HUD_img = pyglet.resource.image('ui/ui_gray_block_filled.png')
I was using pyglet 1.4.4 and was able to use the first format.
After upgrading to pyglet 1.4.8 I got the error: pyglet.resource.ResourceNotFoundException: Resource "your_resource_path" was not found on the path.
Switching to the second format with forward slashes solved the issue for me. Not sure why this change was made...I'm sure there was a good reason for it.