apache beam rows to tfrecord in order to GenerateStatistics - apache-beam

I have built a pipeline that reads some data, does some manipulation, and creates some Apache Beam Row objects (Steps 1 and 2 in the code below). I would then like to generate statistics and write them to file. I can leverage the TensorFlow Data Validation library for that, but tfdv GenerateStatistics expects a pyarrow.lib.RecordBatch and not a Row object. I am aware that apache_beam.io.tfrecordio.WriteToTFRecord can write a PCollection to file as TFRecords; however, is there a way to do it without writing to file? Ideally a Step 3 that would transform a Row object to TFRecord.
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | 'Step 3: Transform Rows to TFRecord' >> ...
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
Without a step 3, my pipeline would generate the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got Row(attribute_1=Any, attribute_2=Any, ..., attribute_n=Any)
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
UPDATE:
I have worked on implementing Step 3 in the pipeline above. GenerateStatistics takes pyarrow RecordBatch objects as input, so I tried to build one from the dictionary I get from the previous steps. I came up with the following:
class DictToPyarrow(beam.DoFn):
    def __init__(self, schema: pyarrow.Schema, *unused_args, **unused_kwargs):
        super().__init__(*unused_args, **unused_kwargs)
        self.schema = schema

    def process(self, pydict):
        # Wrap every value in a single-element list and build a one-row table.
        dict_list = dict()
        for key, val in pydict.items():
            dict_list[key] = [val]
        table = pyarrow.Table.from_pydict(dict_list, schema=self.schema)
        batches = table.to_batches()
        yield batches[0]
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | 'Step 3: To pyarrow' >> beam.ParDo(DictToPyarrow(schema))
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
But I get the following error:
TypeError: Expected feature column to be a (Large)List<primitive|struct> or null, but feature my_feature_1 was int32. [while running 'GenerateStatistics/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/ToTopKTuples']
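The error suggests that TFDV expects every Arrow column to be list-valued (one list of feature values per example) rather than a plain primitive column such as int32. A minimal sketch of building a RecordBatch that way follows; the feature names and types here are assumptions for illustration only:
import pyarrow as pa

# Assumed schema for illustration: every feature is list<primitive>,
# holding one (possibly single-element) list per example.
schema = pa.schema([
    ("my_feature_1", pa.list_(pa.int64())),
    ("my_feature_2", pa.list_(pa.float64())),
])

def dict_to_record_batch(pydict):
    # Wrap each scalar in a one-element list so the resulting column type is
    # list<primitive> instead of a plain int32/float column.
    arrays = [pa.array([[pydict[name]]], type=schema.field(name).type)
              for name in schema.names]
    return pa.RecordBatch.from_arrays(arrays, names=schema.names)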
UPDATE 2: I was able to create tf.Example objects from the Beam pipeline using tfx's ExampleGen utils, thanks to this. Something like
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'Step 2: ToTFExample' >> beam.Map(utils.dict_to_example)
        | 'Step 3: GenerateStatistics' >> GenerateStatistics(tfdv_options)
    )
But I get the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got <class 'tensorflow.core.example.example_pb2.Example'>
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
It seems that GenerateStatistics from tensorflow_data_validation wants a RecordBatch.

Unfortunately, there is no other way to do it without writing to a file. We need this step because machine learning frameworks consume training data as a sequence of examples, so file formats for ML training should have easily consumable layouts, with no impedance mismatch with the storage platform or the programming language used to read/write the files.
Some advantages of using TFRecords are:
Best performance when training your model.
Binary data uses less space on disk, takes less time to copy, and can be read much more efficiently from disk.
It makes it easier to combine multiple datasets, and it integrates seamlessly with the data import and preprocessing functionality provided by the library, especially for datasets that are too large to fit in memory.
You can find more documentation about TFRecord in the TensorFlow docs.
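A rough sketch of that file-based route is below, reusing the names from the question (pipeline_options, gsc_input_file, utils.dict_to_example); OUTPUT_PREFIX is a placeholder output path, and tfdv.generate_statistics_from_tfrecord then reads the serialized tf.Example records back from the written files:
import apache_beam as beam
import tensorflow_data_validation as tfdv

# Sketch: write serialized tf.Example protos as TFRecord files, then let TFDV
# read them back and compute the statistics.
with beam.Pipeline(options=pipeline_options) as pipeline:
    _ = (
        pipeline
        | 'Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'ToTFExample' >> beam.Map(utils.dict_to_example)
        | 'Serialize' >> beam.Map(lambda example: example.SerializeToString())
        | 'WriteTFRecord' >> beam.io.tfrecordio.WriteToTFRecord(
            OUTPUT_PREFIX, file_name_suffix='.tfrecord')
    )

# Afterwards, generate the statistics directly from the TFRecord files.
stats = tfdv.generate_statistics_from_tfrecord(data_location=OUTPUT_PREFIX + '*')
tfdv.write_stats_text(stats, STATS_OUT_FILE)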

Related

'_InvalidUnpickledPCollection' object has no attribute 'windowing'

I am trying to create a pipeline with Apache Beam, where the first step is to center the input time series around 0 by taking the average of the input PCollection and then subtracting the average from each element with a Map. However, running the script below gives me the following error:
'_InvalidUnpickledPCollection' object has no attribute 'windowing'
import apache_beam as beam
import numpy as np
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = np.array(range(1024), dtype=float)  # time series is made up of floats

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        avg = input | "Average" >> beam.CombineGlobally(beam.combiners.MeanCombineFn())
        centered = input | "Center" >> beam.Map(lambda x: x - beam.pvalue.AsSingleton(avg))
        test_pl.run()

run_test()
Writing your lambda as a dedicated mapping function fixes the error; however, for reasons unknown to me, the output seems to be duplicated (at least when I test this on play.beam.apache.org):
import apache_beam as beam
import numpy as np
from apache_beam.testing.test_pipeline import TestPipeline

def centering(element, average):
    return element - average

raw_input = np.array(range(1024), dtype=float)  # time series is made up of floats

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        avg = input | "Average" >> beam.CombineGlobally(beam.combiners.MeanCombineFn())
        centered = input | "Center" >> beam.Map(centering, beam.pvalue.AsSingleton(avg))
        centered | beam.Map(print)
        test_pl.run()

run_test()
As gorilla_glue pointed out, the reason is that we need to provide an explicit side-input parameter (either by using a dedicated function as in my code above, or within the lambda itself, i.e., lambda x, side_input: ...).
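In lambda form that would look roughly like this, with the second parameter receiving the singleton side input:
centered = input | "Center" >> beam.Map(
    lambda x, avg_value: x - avg_value,   # avg_value is filled from the side input
    beam.pvalue.AsSingleton(avg))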

Generate data statistics with GenerateStatistics from tf.Example generated from a beam pipeline

I am trying to generate data statistics from a Beam pipeline with TensorFlow Data Validation. From what I can see online, there are two ways to do this: the StatisticsGen component and the GenerateStatistics transform. My pipeline looks something like
import apache_beam as beam
from tfx.components.example_gen import utils
from tensorflow_data_validation import GenerateStatistics

with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'Step 2: ToTFExample' >> beam.Map(utils.dict_to_example)
        | 'Step 3: GenerateStatistics' >> GenerateStatistics(tfdv_options)
    )
But the above gives the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got <class 'tensorflow.core.example.example_pb2.Example'>
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
It seems that GenerateStatistics from tensorflow_data_validation wants a RecordBatch.
I got utils.dict_to_example from here. I was hoping that GenerateStatistics would accept tf.Example inputs, like StatisticsGen does, but it doesn't. What am I missing here? How could I get my pipeline to work?

Key value in apache beam

My output in Apache Beam is
(['key'], {'id': name})
Expected:
('key', {'id': name})
How do I do this transformation using Map in Apache Beam to get the expected output?
Here is a test pipeline with a lambda function that reformats your tuple using Map:
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with TestPipeline(options=options) as p:
    bad_record = (['key'], {'id': 'name'})
    records = (
        p
        | beam.Create([bad_record])
        | beam.Map(lambda e: (e[0][0], e[1]))
        | beam.Map(print)
    )
The output is:
('key', {'id': 'name'})
I'm guessing that your [key] is probably being yielded instead of returned in a Map earlier in the pipeline. If you fix it there you won't need this step.
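For example, a hypothetical earlier step that builds the tuple with a plain string key (instead of wrapping it in a list) makes the reformatting Map unnecessary:
records = (
    p
    | beam.Create([('key', 'name')])
    | beam.Map(lambda kv: (kv[0], {'id': kv[1]}))  # already ('key', {'id': 'name'})
    | beam.Map(print)
)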

Matrix Multiplication A^T * A in PySpark

I asked a similar question yesterday - Matrix Multiplication between two RDD[Array[Double]] in Spark - however I've decided to shift to PySpark to do this. I've made some progress loading and reformatting the data - Pyspark map from RDD of strings to RDD of list of doubles - however the matrix multiplication is difficult. Let me share my progress first:
matrix1.txt
1.2 3.4 2.3
2.3 1.1 1.5
3.3 1.8 4.5
5.3 2.2 4.5
9.3 8.1 0.3
4.5 4.3 2.1
It's difficult to share files, but this is what my matrix1.txt file looks like: a space-delimited text file containing the values of a matrix. Next is the code:
# do the imports for pyspark and numpy
from pyspark import SparkConf, SparkContext
import numpy as np

# loadmatrix is a helper function used to read matrix1.txt and format
# from RDD of strings to RDD of list of floats
def loadmatrix(sc):
    data = sc.textFile("matrix1.txt").map(lambda line: line.split(' ')).map(lambda line: [float(x) for x in line])
    return(data)

# this is the function I am struggling with, it should take a line of the
# matrix (formatted as list of floats), compute an outer product with itself
def AtransposeA(line):
    # pseudocode for this would be...
    # outerprod = compute line * line^transpose
    # return(outerprod)

# here is the main body of my file
if __name__ == "__main__":
    # create the conf, sc objects, then use loadmatrix to read data
    conf = SparkConf().setAppName('SVD').setMaster('local')
    sc = SparkContext(conf=conf)
    mymatrix = loadmatrix(sc)

    # this is pseudocode for calling AtransposeA
    ATA = mymatrix.map(lambda line: AtransposeA(line)).reduce(elementwise add all the outerproducts)

    # the SVD of ATA is computed below
    U, S, V = np.linalg.svd(ATA)

    # ...
My approach is as follows: to do the matrix multiplication A^T * A, I create a function that computes outer products of the rows of A. The element-wise sum of all of those outer products is the product I want. I then call AtransposeA() in a map, so that it is applied to each row of the matrix, and finally I use reduce() to add up the resulting matrices.
I'm struggling with how the AtransposeA function should look. How can I compute an outer product like this in PySpark? Thanks in advance for the help!
First, consider why you want to use Spark for this. It sounds like all your data fits in memory, in which case you can use numpy and pandas in a very straight-forward way.
If your data isn't structured so that rows are independent, then it probably can't be parallelized by sending groups of rows to different nodes, which is the whole point of using Spark.
Having said that... here is some pyspark (2.1.1) code that I think does what you want.
# imports needed for F.sum and reduce below
from functools import reduce
from pyspark.sql import functions as F

# read the matrix file
df = spark.read.csv("matrix1.txt", sep=" ", inferSchema=True)
df.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|1.2|3.4|2.3|
|2.3|1.1|1.5|
|3.3|1.8|4.5|
|5.3|2.2|4.5|
|9.3|8.1|0.3|
|4.5|4.3|2.1|
+---+---+---+

# do the sum of the multiplication that we want, and get
# one data frame for each column
colDFs = []
for c2 in df.columns:
    colDFs.append( df.select( [ F.sum(df[c1]*df[c2]).alias("op_{0}".format(i)) for i,c1 in enumerate(df.columns) ] ) )

# now union those separate data frames to build the "matrix"
mtxDF = reduce(lambda a, b: a.select(a.columns).union(b.select(a.columns)), colDFs)
mtxDF.show()
+------------------+------------------+------------------+
| op_0| op_1| op_2|
+------------------+------------------+------------------+
| 152.45|118.88999999999999| 57.15|
|118.88999999999999|104.94999999999999| 38.93|
| 57.15| 38.93|52.540000000000006|
+------------------+------------------+------------------+
This seems to be the same result that you get from numpy.
import numpy
a = numpy.genfromtxt("matrix1.txt")
numpy.dot(a.T, a)
array([[ 152.45,  118.89,   57.15],
       [ 118.89,  104.95,   38.93],
       [  57.15,   38.93,   52.54]])
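For completeness, the outer-product-and-reduce approach sketched in the question could be written roughly like this on the RDD returned by loadmatrix (a sketch along the lines of the question's pseudocode, not a tested solution):
import numpy as np

def AtransposeA(line):
    # outer product of one row with itself; `line` is a list of floats
    v = np.array(line)
    return np.outer(v, v)

# the element-wise sum of all per-row outer products gives A^T * A
ATA = mymatrix.map(AtransposeA).reduce(lambda m1, m2: m1 + m2)
U, S, V = np.linalg.svd(ATA)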

Insert a key using Ruamel

I am using the Ruamel Python library to programmatically edit human-edited YAML files. The source files have keys that are sorted alphabetically.
I'm not sure if this is a basic Python question or a Ruamel question, but all the methods I have tried to sort Ruamel's OrderedDict structure are failing for me.
I am quite confused, for instance, about why the following code, based on this recipe, isn't working:
import ruamel.yaml
import collections

def read_file(f):
    with open(f, 'r') as _f:
        return ruamel.yaml.round_trip_load(
            _f.read(),
            preserve_quotes=True
        )

def write_file(f, data):
    with open(f, 'w') as _f:
        _f.write(ruamel.yaml.dump(
            data,
            Dumper=ruamel.yaml.RoundTripDumper,
            explicit_start=True,
            width=1024
        ))

data = read_file('in.yaml')
data = collections.OrderedDict(sorted(data.items(), key=lambda t: t[0]))
write_file('out.yaml', data)
But given this input file:
---
bananas: 1
apples: 2
The following output file is produced:
--- !!omap
- apples: 2
- bananas: 1
I.e. it's turned my file into a YAML ordered map.
Is there an easy way to do this? Also, can I simply insert into the data structure somehow?
If you round-trip a mapping in ruamel.yaml¹, it doesn't get represented as a collections.OrderedDict(); it gets represented as a ruamel.yaml.comments.CommentedMap(). The latter can be a subclass of collections.OrderedDict(), depending on which version of Python you are working with (e.g. in Python 2 it uses the much faster C implementation from ruamel.ordereddict).
The representer doesn't automatically interpret "normal" ordered dictionaries (whether from collections or ruamel.ordereddict) as special in round_trip_dump mode. But if you drop collections and use a CommentedMap instead:
import ruamel.yaml

def read_file(f):
    with open(f, 'r') as _f:
        return ruamel.yaml.round_trip_load(
            _f.read(),
            preserve_quotes=True
        )

def write_file(f, data):
    with open(f, 'w') as _f:
        ruamel.yaml.dump(
            data,
            stream=_f,
            Dumper=ruamel.yaml.RoundTripDumper,
            explicit_start=True,
            width=1024
        )

data = read_file('in.yaml')
data = ruamel.yaml.comments.CommentedMap(sorted(data.items(), key=lambda t: t[0]))
write_file('out.yaml', data)
your out.yaml will be:
---
apples: 2
bananas: 1
Please note that I also removed an inefficiency in your write_file routine. If you don't specify a stream, all data is first streamed to a StringIO instance (in memory) and then returned; you then wrote that result to the stream with _f.write(). It is much more efficient to write directly to the stream.
As for your final question: yes you can insert using:
data.insert(1, 'apricot', 3)
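With the sorted data from above, that would give, roughly:
---
apples: 2
apricot: 3
bananas: 1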
¹ Disclaimer: I am the author of both ruamel.yaml as well as ruamel.ordereddict.