Generate data statistics with GenerateStatistics from tf.Example generated from a beam pipeline - apache-beam

I am trying to generate data statistics from a Beam pipeline with TensorFlow Data Validation. From what I can see online, there are two ways to do this: the StatsGen component and the GenerateStatistics transform. My pipeline looks something like this:
import apache_beam as beam
from tfx.components.example_gen import utils
from tensorflow_data_validation import GenerateStatistics

with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'Step2: ToTFExample' >> beam.Map(utils.dict_to_example)
        | "Step3: GenerateStatistics" >> GenerateStatistics(tfdv_options)
    )
But the above gives the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got <class 'tensorflow.core.example.example_pb2.Example'>
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
It seems that GenerateStatistics from tensorflow_data_validation wants RecordBatch as input.
I got utils.dict_to_example from here. I was hoping that GenerateStatistics would accept tf.Example as input, the way StatsGen does, but it doesn't. What am I missing here? How can I get my pipeline to work?
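For reference, the pattern documented in the TFDV guide decodes serialized tf.Example bytes into Arrow RecordBatches before GenerateStatistics. A minimal sketch along those lines, assuming a TFDV version that exposes tfdv.DecodeTFExample and reusing gsc_input_file, tfdv_options and pipeline_options from the code above:
import apache_beam as beam
import tensorflow_data_validation as tfdv
from tfx.components.example_gen import utils

with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'ToTFExample' >> beam.Map(utils.dict_to_example)
        # GenerateStatistics expects pyarrow.RecordBatch, so serialize the
        # Example protos and let TFDV's decoder batch them into RecordBatches.
        | 'SerializeExamples' >> beam.Map(lambda ex: ex.SerializeToString())
        | 'DecodeToRecordBatch' >> tfdv.DecodeTFExample()
        | 'GenerateStatistics' >> tfdv.GenerateStatistics(tfdv_options)
    )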

Related

How can we fix this AttributeError: 'bytes' object has no attribute 'map'?

I'm trying to run this code on an AWS EMR instance to extract features from images with a transfer-learning model, but I get this error: AttributeError: 'bytes' object has no attribute 'map'
input = np.stack(content_series.map(preprocess))
preds = model.predict(input)
Here is the whole code:
def preprocess(content):
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)

def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    input = np.stack(content_series.map(preprocess))
    preds = model.predict(input)
    output = [p.flatten() for p in preds]
    return pd.Series(output)

@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).
    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                                is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches. This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
Do you have any suggestions? Thanks.
I tried changing the instance configuration.
The error is occurring because the content_series.map method is trying to apply the preprocess function to each element of content_series, which is assumed to be a pandas Series. However, the error message indicates that content_series is a 'bytes' object, which does not have a 'map' attribute.
It appears that the content_series_iter input to the featurize_udf function is an iterator over batches of image data, where each batch is a pandas Series of image data. To resolve the error, you'll need to modify the featurize_udf function to convert the 'bytes' objects in each batch to pandas Series before applying the preprocess function:
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    model = model_fn()
    for content_series in content_series_iter:
        content_series = pd.Series(content_series.tolist())
        yield featurize_series(model, content_series)
This should resolve the AttributeError and allow you to run the code successfully on the AWS EMR instance.

'_InvalidUnpickledPCollection' object has no attribute 'windowing'

I am trying to create a pipeline with Apache Beam, where the first step is to center the input time series around 0 by taking the average of the input PCollection and then subtracting the average from each element with a Map. However, running the script below gives me the following error:
'_InvalidUnpickledPCollection' object has no attribute 'windowing'
import apache_beam as beam
import numpy as np
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = np.array(range(1024), dtype=float)  # time series is made up of floats

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        avg = input | "Average" >> beam.CombineGlobally(beam.combiners.MeanCombineFn())
        centered = input | "Center" >> beam.Map(lambda x: x - beam.pvalue.AsSingleton(avg))
        test_pl.run()

run_test()
Writing your lambda as a dedicated mapping function fixes the error, although, for reasons unknown to me, the output seems to be duplicated (at least when I test this on play.beam.apache.org):
import apache_beam as beam
import numpy as np
from apache_beam.testing.test_pipeline import TestPipeline

def centering(element, average):
    return element - average

raw_input = np.array(range(1024), dtype=float)  # time series is made up of floats

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        avg = input | "Average" >> beam.CombineGlobally(beam.combiners.MeanCombineFn())
        centered = input | "Center" >> beam.Map(centering, beam.pvalue.AsSingleton(avg))
        centered | beam.Map(print)
        test_pl.run()

run_test()
As gorilla_glue pointed out, the reason is that the side input has to be passed as an explicit argument, either to a dedicated function as in the code above or to the lambda itself, i.e. lambda x, side_input: ....
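For completeness, a minimal sketch of the lambda variant with an explicit side-input parameter, reusing input and avg from the snippet above:
# The side input is declared as an explicit second parameter of the lambda;
# beam.Map passes beam.pvalue.AsSingleton(avg) into it at runtime.
centered = input | "Center" >> beam.Map(
    lambda x, average: x - average,
    beam.pvalue.AsSingleton(avg),
)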

apache beam rows to tfrecord in order to GenerateStatistics

I have built a pipeline that reads some data, does some manipulations and creates some Apache Beam Row objects (steps 1 and 2 in the code below). I would then like to generate statistics and write them to a file. I can leverage the TensorFlow Data Validation library for that, but tfdv.GenerateStatistics expects a pyarrow.lib.RecordBatch and not a Row object. I am aware that apache_beam.io.tfrecordio.WriteToTFRecord can write a PCollection to a file as TFRecords; however, is there a way to do it without writing to a file? Ideally a step 3 that would transform a Row object into a TFRecord.
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | 'Step 3: transform Rows to TFRecord' >> ...
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
Without a step 3, my pipeline would generate the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got Row(attribute_1=Any, attribute_2=Any, ..., attribute_n=Any)
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
UPDATE:
I have been working on implementing step 3 of the pipeline above. GenerateStatistics takes pyarrow RecordBatch objects as input, so I tried to build them from the dictionaries I get from the previous steps. I came up with the following:
class DictToPyarrow(beam.DoFn):
    def __init__(self, schema: pyarrow.Schema, *unused_args, **unused_kwargs):
        super().__init__(*unused_args, **unused_kwargs)
        self.schema = schema

    def process(self, pydict):
        dict_list = dict()
        for key, val in pydict.items():
            dict_list[key] = [val]
        table = pyarrow.Table.from_pydict(dict_list, schema=self.schema)
        batches = table.to_batches()
        yield batches[0]

with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | "STEP 3: To pyarrow" >> beam.ParDo(DictToPyarrow(schema))
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
But I get the following error:
TypeError: Expected feature column to be a (Large)List<primitive|struct> or null, but feature my_feature_1 was int32. [while running 'GenerateStatistics/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/ToTopKTuples']
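The error suggests that GenerateStatistics expects every feature column of the RecordBatch to be list-valued (one list of values per example) rather than a plain primitive column. A minimal sketch of what the conversion could look like with a list-typed schema, using hypothetical feature names and types:
import pyarrow

# Hypothetical features: each Arrow column is a list type, so every row holds
# a (possibly single-element) list of values instead of a bare int32/float.
schema = pyarrow.schema([
    pyarrow.field('my_feature_1', pyarrow.list_(pyarrow.int64())),
    pyarrow.field('my_feature_2', pyarrow.list_(pyarrow.float32())),
])

pydict = {'my_feature_1': 3, 'my_feature_2': 0.5}
# Wrap each value twice: the outer list is the rows, the inner list is the
# values of that feature for one example.
table = pyarrow.Table.from_pydict(
    {key: [[val]] for key, val in pydict.items()}, schema=schema)
record_batch = table.to_batches()[0]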
UPDATE 2: I was able to create tf.Example objects from the beam pipeline using the tfx example_gen utils, thanks to this. Something like:
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'Step2: ToTFExample' >> beam.Map(utils.dict_to_example)
        | "Step3: GenerateStatistics" >> GenerateStatistics(tfdv_options)
    )
But I get the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got <class 'tensorflow.core.example.example_pb2.Example'>
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
It seems that GenerateStatistics from tensorflow_data_validation wants RecordBatch as input.
Unfortunately, there is no other way to do it without writing to a file. This process is needed because machine learning frameworks consume training data as a sequence of examples, and these file formats for ML training should have easily consumable layouts with no impedance mismatch with the storage platform or the programming language used to read/write the files.
Some advantages of using TFRecords are:
- Best performance when training your model.
- Binary data uses less space on disk, takes less time to copy and can be read much more efficiently from disk.
- It makes it easier to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library, especially for datasets that are too large.
You can see more documentation about TFRecord.
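A minimal sketch of that file-based route, assuming steps 1 and 2 produce tf.Example protos as in the question's UPDATE 2; tfrecord_output is a placeholder path, STATS_OUT_FILE is the output file from the question, and TFDV's generate_statistics_from_tfrecord utility is used for the second stage:
import apache_beam as beam
import tensorflow_data_validation as tfdv

tfrecord_output = 'gs://my-bucket/examples/data.tfrecord'  # placeholder path

# First, materialize the examples as TFRecord files.
with beam.Pipeline(options=pipeline_options) as pipeline:
    _ = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: To tf.Example' >> ...
        | 'Serialize' >> beam.Map(lambda ex: ex.SerializeToString())
        | 'WriteTFRecord' >> beam.io.tfrecordio.WriteToTFRecord(tfrecord_output)
    )

# Then let TFDV read the TFRecord files back and compute/write the statistics.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location=tfrecord_output + '-*',
    output_path=STATS_OUT_FILE,
)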

spark.read.format('libsvm') not working with python

I am learning PySpark and encountered a problem that I can't fix. I followed this video and copied code from the PySpark documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). I created a Spark DataFrame before this, by the way. When I run this code in a Jupyter notebook it keeps giving me a Java error, while the guy in the video did exactly the same thing and didn't get this error. Can someone help me resolve this issue, please?
Thanks a lot!
I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
You can use this custom function to read a libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector

def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turn it to a pyspark dataframe.
    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create dataframe.
    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]
    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])
    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]
    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can try to load via:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
sc is the Spark context and df is the resulting data frame.
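Note that MLUtils.loadLibSVMFile produces legacy mllib vectors; if the goal is the DataFrame-based pyspark.ml LinearRegression, the vector column likely needs converting first. A small sketch, assuming the df from the snippet above:
from pyspark.mllib.util import MLUtils

# Convert the legacy mllib vector column(s) to the pyspark.ml.linalg type
# expected by pyspark.ml estimators such as LinearRegression.
ml_df = MLUtils.convertVectorColumnsToML(df)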

PySpark: Error "Cannot pickle standard input" on function map

I'm trying to learn to use PySpark.
I'm using spark-2.2.0 with Python 3.
I'm now facing a problem and I can't figure out where it comes from.
My project is to adapt an algorithm written by a data scientist so that it runs distributed. The code below is what I have to use to extract features from images, and I have to adapt it to extract the features with PySpark.
import json
import sys
# Dependencies can be installed by running:
# pip install keras tensorflow h5py pillow
# Run script as:
# ./extract-features.py images/*.jpg
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
def main():
    # Load model VGG16 as described in https://arxiv.org/abs/1409.1556
    # This is going to take some time...
    base_model = VGG16(weights='imagenet')
    # Model will produce the output of the 'fc2' layer which is the penultimate neural network layer
    # (see the paper above for more details)
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    # For each image, extract the representation
    for image_path in sys.argv[1:]:
        features = extract_features(model, image_path)
        with open(image_path + ".json", "w") as out:
            json.dump(features, out)

def extract_features(model, image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.tolist()[0]

if __name__ == "__main__":
    main()
I have written the beginning of the code:
rdd = sc.binaryFiles(PathImages)
base_model = VGG16(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
rdd2 = rdd.map(lambda x : (x[0], extract_features(model, x[0][5:])))
rdd2.collect()[0]
When I try to extract the features, there is an error:
~/Code/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py in
save_file(self, obj)
623 return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
624 if obj is sys.stdin:
--> 625 raise pickle.PicklingError("Cannot pickle standard input")
626 if hasattr(obj, 'isatty') and obj.isatty():
627 raise pickle.PicklingError("Cannot pickle files that map to tty objects")
PicklingError: Cannot pickle standard input
I tried multiple things and here is my first result. I know that the error comes from the line below in the method extract_features:
features = model.predict(x)
When I try to run this line outside of a map function or PySpark, it works fine.
I think the problem comes from the "model" object and its serialization with PySpark.
Maybe I am not distributing this the right way with PySpark; if you have any clue to help me, I will take it.
Thanks in advance.
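A hedged sketch of one common workaround: build the model inside each partition so that it never has to be pickled on the driver (this reuses VGG16, Model, extract_features and PathImages from the question):
def featurize_partition(partition):
    # The model is created on the executor, inside the partition, so Spark
    # never needs to pickle the Keras/TensorFlow objects on the driver.
    base_model = VGG16(weights='imagenet')
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    for path, _content in partition:
        yield path, extract_features(model, path[5:])

rdd = sc.binaryFiles(PathImages)
rdd2 = rdd.mapPartitions(featurize_partition)
print(rdd2.collect()[0])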