Key value in Apache Beam

My output in Apache Beam is:
(['key'], {'id': name})
Expected:
('key', {'id': name})
How do I write a transformation using Map in Apache Beam to get the expected output?

Here is a test pipeline with a lambda function that reformats your tuple using Map:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with TestPipeline(options=options) as p:
    bad_record = (['key'], {'id': 'name'})
    records = (
        p
        | beam.Create([bad_record])
        | beam.Map(lambda e: (e[0][0], e[1]))
        | beam.Map(print)
    )
The output is:
('key', {'id': 'name'})
I'm guessing that your ['key'] is probably being yielded instead of returned in a Map earlier in the pipeline. If you fix it there, you won't need this step.

Related

Transforming `PCollection` with many elements into a single element

I am trying to convert a PCollection, that has many elements, into a PCollection that has one element. Basically, I want to go from:
[1,2,3,4,5,6]
to:
[[1,2,3,4,5,6]]
so that I can work with the entire PCollection in a DoFn.
I've tried CombineGlobally(lambda x: x), but only a portion of the elements gets combined into an array at a time, giving me the following result:
[1,2,3,4,5,6] -> [[1,2],[3,4],[5,6]]
Or something to that effect.
This is the relevant portion of the script I'm trying to run:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = range(1024)

def run_test():
    # The with-block runs the pipeline on exit, so no explicit run() call is needed.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)

        def combine(x):
            print(x)
            return x

        (
            input
            | "Global aggregation" >> beam.CombineGlobally(combine)
        )

run_test()
I figured out a pretty painless way to do this, which I missed in the docs:
The more general way to combine elements, and the most flexible, is
with a class that inherits from CombineFn.
CombineFn.create_accumulator(): This creates an empty accumulator. For
example, an empty accumulator for a sum would be 0, while an empty
accumulator for a product (multiplication) would be 1.
CombineFn.add_input(): Called once per element. Takes an accumulator
and an input element, combines them and returns the updated
accumulator.
CombineFn.merge_accumulators(): Multiple accumulators could be
processed in parallel, so this function helps merging them into a
single accumulator.
CombineFn.extract_output(): It allows you to do additional calculations
before extracting a result.
I suppose supplying a lambda function that simply passes its argument through to the "vanilla" CombineGlobally wouldn't do what I initially expected. That functionality has to be specified by me (although I still think it's weird this isn't built into the API).
You can find more about subclassing CombineFn here, which I found very helpful:
A CombineFn specifies how multiple values in all or part of a
PCollection can be merged into a single value—essentially providing
the same kind of information as the arguments to the Python “reduce”
builtin (except for the input argument, which is an instance of
CombineFnProcessContext). The combining process proceeds as follows:
Input values are partitioned into one or more batches.
For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of
zero values.
For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value,
once all of the accumulators have had all of the input values in their
batches added to them. This operation is invoked repeatedly, until
there is only one accumulator value left.
The extract_output operation is invoked on the final accumulator to get the output value. Note: If this CombineFn is used with a transform
that has defaults, apply will be called with an empty list at
expansion time to get the default value.
So, by subclassing CombineFn, I wrote this simple implementation, Aggregated, that does exactly what I want:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = range(1024)

class Aggregated(beam.CombineFn):
    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for a in accumulators:
            for item in a:
                merged.append(item)
        return merged

    def extract_output(self, accumulator):
        return accumulator

def run_test():
    # The with-block runs the pipeline on exit, so no explicit run() call is needed.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(Aggregated())
            | "print" >> beam.Map(print)
        )

run_test()
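For what it's worth, Beam also ships a built-in combiner that appears to do the same thing, beam.combiners.ToList(); here is a minimal sketch, assuming it is available in your Beam version:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

with TestPipeline() as test_pl:
    (
        test_pl
        # Same input as above.
        | "Create" >> beam.Create(range(1024))
        # Built-in combiner that collapses the PCollection into a single list.
        | "To single list" >> beam.combiners.ToList()
        | "Print" >> beam.Map(print)
    )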
You can also accomplish what you want with side inputs, e.g.
with beam.Pipeline() as p:
    pcoll = ...
    (p
     # Create a PCollection with a single element.
     | beam.Create([None])
     # This will process the singleton exactly once, with the entirety
     # of pcoll passed in as a second argument as a list.
     | beam.Map(
         lambda _, pcoll_as_side: ...consume pcoll_as_side here...,
         pcoll_as_side=beam.pvalue.AsList(pcoll)))
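For reference, here is a self-contained sketch of that side-input pattern (the labels and element values are illustrative): the singleton element makes the lambda run exactly once, and the whole other PCollection arrives as an ordinary Python list.
import apache_beam as beam

with beam.Pipeline() as p:
    pcoll = p | "Numbers" >> beam.Create(range(6))
    (
        p
        | "Singleton" >> beam.Create([None])
        # Runs once; all of pcoll is passed in via the AsList side input.
        | "ConsumeAll" >> beam.Map(
            lambda _, all_elements: sorted(all_elements),
            all_elements=beam.pvalue.AsList(pcoll))
        | "Print" >> beam.Map(print)
    )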

Apache Beam Rows to TFRecord in order to GenerateStatistics

I have built a pipeline that reads some data, does some manipulations, and creates Apache Beam Row objects (steps 1 and 2 in the code below). I would then like to generate statistics and write them to a file. I can leverage the TensorFlow Data Validation library for that, but tfdv GenerateStatistics expects a pyarrow.lib.RecordBatch, not a Row object. I am aware that apache_beam.io.tfrecordio.WriteToTFRecord can write a PCollection to file as TFRecord; however, is there a way to do it without writing to a file? Ideally a step 3 that would transform a Row object into a TFRecord.
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | 'Step 3: Transform Rows to TFRecord' >> ...
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
Without a step 3, my pipeline would generate the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got Row(attribute_1=Any, attribute_2=Any, ..., attribute_n=Any)
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
UPDATE:
I have been working on implementing step 3 of the pipeline above. GenerateStatistics takes pyarrow RecordBatch objects as input, so I tried to build one from the dictionary I get from the previous steps. I came up with the following:
import apache_beam as beam
import pyarrow

class DictToPyarrow(beam.DoFn):
    def __init__(self, schema: pyarrow.Schema, *unused_args, **unused_kwargs):
        super().__init__(*unused_args, **unused_kwargs)
        self.schema = schema

    def process(self, pydict):
        # Wrap each value in a list so it becomes a one-row column.
        dict_list = dict()
        for key, val in pydict.items():
            dict_list[key] = [val]
        table = pyarrow.Table.from_pydict(dict_list, schema=self.schema)
        batches = table.to_batches()
        yield batches[0]
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> ...
        | 'Step 2: Do some transformations' >> ...  # Here the resulting objects are beam Rows
        | 'Step 3: To pyarrow' >> beam.ParDo(DictToPyarrow(schema))
        | 'Step 4: GenerateStatistics' >> tfdv.GenerateStatistics()
        | 'Step 5: WriteStatsOutput' >> WriteStatisticsToTFRecord(STATS_OUT_FILE)
    )
But I get the following error:
TypeError: Expected feature column to be a (Large)List<primitive|struct> or null, but feature my_feature_1 was int32. [while running 'GenerateStatistics/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/ToTopKTuples']
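Reading the error, I assume GenerateStatistics wants every feature column to be list-valued per row, so the pyarrow schema presumably needs list types rather than plain primitives. A minimal pyarrow-only sketch of that shape (feature names other than my_feature_1 are made up):
import pyarrow

schema = pyarrow.schema([
    ("my_feature_1", pyarrow.list_(pyarrow.int64())),
    ("my_feature_2", pyarrow.list_(pyarrow.string())),
])

# One row, with every value wrapped in a list (mirroring dict_list[key] = [val]).
table = pyarrow.Table.from_pydict(
    {"my_feature_1": [[42]], "my_feature_2": [["abc"]]}, schema=schema)
print(table.to_batches()[0])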
UPDATE 2: I was able to create tfx ExampleGen objects from the Beam pipeline, thanks to this. Something like:
with beam.Pipeline(options=pipeline_options) as pipeline:
    result = (
        pipeline
        | 'Step 1: Read data' >> beam.io.ReadFromParquet(gsc_input_file)
        | 'Step 2: ToTFExample' >> beam.Map(utils.dict_to_example)
        | 'Step 3: GenerateStatistics' >> GenerateStatistics(tfdv_options)
    )
But I get the following error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GenerateStatistics: expected <class 'pyarrow.lib.RecordBatch'>, got <class 'tensorflow.core.example.example_pb2.Example'>
Full type hint:
IOTypeHints[inputs=((<class 'pyarrow.lib.RecordBatch'>,), {}), outputs=((<class 'tensorflow_metadata.proto.v0.statistics_pb2.DatasetFeatureStatisticsList'>,), {})]
strip_pcoll_output()
It seems that GenerateStatistics from tensorflow_data_validation wants RecordBatch objects.
Unfortunately, there is no other way to do it without writing to a file. This process is needed because machine learning frameworks consume training data as a sequence of examples, and these file formats for ML training should have easily consumable layouts, with no impedance mismatch with the storage platform or the programming language used to read/write the files.
Some advantages of using TFRecords are:
Best performance when training your model.
Binary data uses less space on disk, takes less time to copy, and can be read much more efficiently from disk.
It makes it easier to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library, especially for datasets that are too large.
You can see more documentation about TFRecord.
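As a rough sketch of that file-based route (the record fields, output path, and the bytes-only encoding are illustrative assumptions, not your actual schema):
import apache_beam as beam
import tensorflow as tf

def to_example(d):
    # Encode every value as a bytes feature; a real pipeline would choose
    # int64/float/bytes per field.
    feats = {
        k: tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[str(v).encode("utf-8")]))
        for k, v in d.items()
    }
    return tf.train.Example(features=tf.train.Features(feature=feats))

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"id": 1, "name": "a"}])
        | beam.Map(lambda d: to_example(d).SerializeToString())
        | beam.io.tfrecordio.WriteToTFRecord(
            "/tmp/examples", file_name_suffix=".tfrecord")
    )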

Create multiple rows of fixed length from a dataframe column in PySpark

My input is a PySpark dataframe with only one column, DETAIL_REC.
detail_df.show()
DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345
detail_df.printSchema()
root
|-- DETAIL_REC: string (nullable = true)
Every 11 characters should become a new row of the dataframe, so that the downstream process can consume it.
The expected output should be multiple rows in the dataframe (no blank lines after each record):
DETAIL_REC
==============
ABC12345678
ABC98765543
ABC98762345
If you have Spark 2.4+, we can make use of higher-order functions to do it like below:
from pyspark.sql import functions as F

n = 11
output = df.withColumn(
    "SubstrCol",
    F.explode(F.expr(f"""filter(
        transform(
            sequence(0, length(DETAIL_REC), {n}),
            x -> substring(DETAIL_REC, x + 1, {n})),
        y -> y <> '')""")),
)
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
Logic used:
First generate a sequence of integers starting from 0 up to the length of the string, in steps of 11 (n).
Using transform, iterate through this sequence and keep taking substrings from the original string (this keeps changing the start position); the intermediate arrays are illustrated right after this list.
Filter out any blank strings from the resulting array and explode it.
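To see those intermediate arrays, a quick selectExpr on the same single-column dataframe (rebuilt here for illustration; spark is the active SparkSession) shows what sequence and transform produce before the filter step:
df = spark.createDataFrame(
    [("ABC12345678ABC98765543ABC98762345",)], ["DETAIL_REC"])

df.selectExpr(
    "sequence(0, length(DETAIL_REC), 11) AS starts",
    "transform(sequence(0, length(DETAIL_REC), 11), "
    "x -> substring(DETAIL_REC, x + 1, 11)) AS chunks",
).show(truncate=False)
# starts -> [0, 11, 22, 33]
# chunks -> [ABC12345678, ABC98765543, ABC98762345, ]  (the trailing empty
# string is what filter(..., y -> y <> '') removes)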
For lower versions of Spark, use a UDF with textwrap or other functions, as addressed here:
from pyspark.sql import functions as F, types as T
from textwrap import wrap

n = 11
myudf = F.udf(lambda x: wrap(x, n), T.ArrayType(T.StringType()))
output = df.withColumn("SubstrCol", F.explode(myudf("DETAIL_REC")))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
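For reference, this is what wrap does with the sample value at width 11 (plain Python, no Spark needed):
from textwrap import wrap

# wrap splits the long, space-free string into chunks of at most 11 characters.
print(wrap("ABC12345678ABC98765543ABC98762345", 11))
# ['ABC12345678', 'ABC98765543', 'ABC98762345']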

How to use filter dynamically in Scala?

I have raw log file lines, about 1 TB in total, as below.
Test X1 SET WARN CATALOG MAP1,MAP2
INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3
I read the log file using Spark with Scala and create an RDD of the log lines.
I need to filter only those lines which contain:
1. SET
2. INFO
3. CATALOG
I wrote the filter like this:
val filterRdd = rdd.filter(f => f.contains("SET")).filter(f => f.contains("INFO")).filter(f => f.contains("CATALOG"))
Can we do the same if these parameters are assigned to a list, and filter dynamically based on that instead of writing so many lines? In this example I use only three restriction keywords, but in reality there are up to 15. Can we do it dynamically?
Something like this could work when you require all words to appear in a line:
val words = Seq("SET", "INFO", "CATALOG")
val filterRdd = rdd.filter(f => words.forall(w => f.contains(w)))
and if you want any:
val filterRdd = rdd.filter(f => words.exists(w => f.contains(w)))

Scala: oneSet.diff(otherSet) not working

I have a function that finds new columns to add to a Cassandra table:
val inputSet: Set[String] = inputColumns.map {
  cht => cht.stringLabel.toLowerCase()
}.distinct.toSet

logger.debug("\n\ninputSet\n" + inputSet.mkString(", "))

val extantSet: Set[String] = extantColumns.map {
  e => e._1.toLowerCase()
}.toSet

logger.debug("\n\nextantSet\n" + inputSet.mkString(" * "))

inputSet.diff(extantSet)
I want the values that are ONLY in the input set. I will then create columns in the Cassandra table.
The returned set (i.e., inputSet.diff(extantSet)), however, includes columns that are in both sets.
From my log files:
inputSet
incident, funnel, v_re-evaluate, adj_in-person, accident, v_create,....
extantSet
incident * funnel * v_re-evaluate * adj_in-person * accident *
v_create.....
returned set:
funnel | v_re-evaluate | adj_in-person | v_explain | v_devise | dmepos
|....
Which in the end throws
com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name adj_in-person because it conflicts with an existing column
What have I done wrong?
Any help would be deeply appreciated.
This is what I have tried, which gives me the following output:
object ABC extends App {
  val x = List("A", "B", "c", "d", "e", "a", "b").map(_.toLowerCase)
  val y = List("a", "b", "C").map(_.toLowerCase)
  println(s"${x diff y} List diff")
  println(s"${x.toSet diff y.toSet} Set diff")
}
Output:
List(d, e, a, b) List diff
Set(e, d) Set diff
I think you are looking for the set difference.
As you can see, when we take the diff of the two lists we still get the duplicate occurrences a and b in the answer, but after the .toSet operation the diff gets rid of the duplicates, so this should work for you too.