I have a YAML file that is usually edited by a human, but recently I have needed it to be edited by an automated task as well. I am using ruamel.yaml version 0.17.16-1 from the Ubuntu repo. I have mostly figured out how to get the output to look like the input, with one exception. When there is a comment in my YAML right before an array, the first element is mis-formatted. If I remove the comment, the formatting is correct. It also doesn't matter whether the comment is left-justified or indented like in the example. Most likely I have something misconfigured, so if anyone could point it out to me I'd be very grateful. This has been driving me nuts for a few days now.
import sys
import ruamel.yaml

yaml_str = """\
top_level:
  # comment
  -
    key1: "1"
    key4: "4"
  -
    key2: "2"
    key5: "5"
"""

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=4, offset=2)
yaml.compact(seq_map=False)
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
Output:
top_level:
# comment
-
key1: '1' key4: '4'
-
key2: '2'
key5: '5'
ruamel.yaml normally attaches a comment to the node preceding it,
so what you have is not a comment before a sequence (YAML has sequences, not arrays), but a comment
between a key and its value. Those can be problematic, but the main problem here
seems to be the use of yaml.compact() in combination with the comment:
import sys
import ruamel.yaml

yaml_str = """\
top_level:
  # comment
  -
    key1: "1"
    key4: "4"
  -
    key2: "2"
    key5: "5"
"""

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=4, offset=2)
yaml.preserve_quotes = True  # this way you keep double quotes from the input
# yaml.compact(seq_map = False)
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
which gives:
top_level:
  # comment
  - key1: "1"
    key4: "4"
  - key2: "2"
    key5: "5"
If the block sequence indicator (-) really has to be on a line of its own, you can
postprocess the output fairly easily with the transform parameter of dump().
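As a rough sketch of such a postprocessing step: the callable given to transform receives the dumped text as a string and returns the modified string. The function name dash_on_own_line and the regular expression are my own assumptions for illustration, not something ruamel.yaml provides.

```python
import re

# Matches a line such as '  - key1: "1"': group 1 is the indent before the
# dash, group 2 the spaces after it, group 3 the 'key: value' remainder.
_SEQ_MAP_LINE = re.compile(r'^( *)-( +)(\S.*:(?: .*)?)$', re.MULTILINE)


def dash_on_own_line(text):
    """Put the dash of each sequence item that starts a mapping on its own line."""
    def split(match):
        indent, gap, rest = match.groups()
        # Keep the key in the column it had after the dash, so it still
        # lines up with the other keys of the same mapping.
        key_indent = indent + ' ' * (1 + len(gap))
        return '{}-\n{}{}'.format(indent, key_indent, rest)
    return _SEQ_MAP_LINE.sub(split, text)


dumped = (
    'top_level:\n'
    '  # comment\n'
    '  - key1: "1"\n'
    '    key4: "4"\n'
    '  - key2: "2"\n'
    '    key5: "5"\n'
)
print(dash_on_own_line(dumped), end='')
```

It could then be passed as yaml.dump(data, sys.stdout, transform=dash_on_own_line). The regular expression only looks at lines that start with a dash followed by a key, so it would also rewrite such lines if they occur inside a block scalar; treat it as a starting point rather than a general solution.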
You should be using a Python virtual environment, and never work in the system
Python space. That allows you to install newer versions of all packages than the
system uses.
Ansible version: 2.9
I would like to merge lists and dicts of which I only know the name prefix; the suffix can be anything (i.e. list_*).
E.g.:
list_1:
- 1
list_a:
- 2
list_Z:
- 3
I want to merge list_* into, e.g., mylist, resulting in:
mylist:
- 1
- 2
- 3
Same for dicts with recursion. Any ideas?
If I have multiple references and write them to a YAML file using ruamel.yaml from Python, I get:
<<: [*name-name, *help-name]
but instead I would prefer to have
<<: *name-name
<<: *help-name
Is there an option to achieve this while writing to the file?
UPDATE
descriptions:
  - &description-one-ref
    description: >
helptexts:
  - &help-one
    help_text: |
questions:
  - &question-one
    title: "title test"
    reference: "question-one-ref"
    field: "ChoiceField"
    choices:
      - "Yes"
      - "No"
    required: true
    <<: *description-one-ref
    <<: *help-one
    riskvalue_max: 10
    calculations:
      - conditions:
          - comparator: "equal"
            value: "Yes"
        actions:
          - riskvalue: 0
      - conditions:
          - comparator: "equal"
            value: "No"
        actions:
          - riskvalue: 10
Currently I'm reading such a file, modifying specific values in Python, and then writing it back. When writing, the merge references come out as a list and not as separate << lines as outlined above.
The workflow is as follows:
import ruamel.yaml

yaml = ruamel.yaml.YAML()
with open('test.yaml') as f:
    data = yaml.load(f)  # the with block closes f, no explicit close() needed

# set a new title on every entry under 'questions'
for q in data.get('questions', []):
    q['title'] = "my new title"

with open('new_file.yaml', 'w') as g:
    yaml.dump(data, g)
No, there is no such option, as it would lead to an invalid YAML file.
The << is a mapping key, for which the value is interpreted
specially, assuming the parser implements the language-independent
merge key specification. And a mapping key must be unique
according to the YAML specification:
The content of a mapping node is an unordered set of key: value node
pairs, with the restriction that each of the keys is unique.
That ruamel.yaml (< 0.15.75) doesn't throw an error on such a
duplicate key is a bug. On duplicate normal keys, ruamel.yaml
does throw an error. The bug is inherited from PyYAML (which does not
conform to the specification, and does not throw an error even on
duplicate normal keys).
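The list form <<: [*name-name, *help-name] already expresses the merge of several mappings. As a minimal sketch in plain Python of how a parser following the merge key specification resolves it (the function name apply_merge and the sample data are made up for illustration; this is not ruamel.yaml code): mappings earlier in the list win over later ones, and keys written in the mapping itself win over all merged ones.

```python
def apply_merge(own_items, merge_sources):
    """Resolve a mapping that contains `<<: [source1, source2, ...]`."""
    result = {}
    # Walk the sources back to front, so that a key from an earlier
    # source overwrites the same key from a later source.
    for source in reversed(merge_sources):
        result.update(source)
    # Keys given explicitly in the mapping override anything merged in.
    result.update(own_items)
    return result


description = {'description': 'some text', 'shared': 'from description'}
help_one = {'help_text': 'some help', 'shared': 'from help'}

question = apply_merge({'title': 'title test'}, [description, help_one])
print(question)
```

So a single << key with a list value carries the same information as the two separate << lines you would like to write, without violating the uniqueness of mapping keys.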
However with a little pre- and post-processing what you want to do can
be easily achieved. The trick is to make the YAML valid before parsing
by making the offending duplicate << keys unique (but recognisable)
and then, when writing the YAML back to file, substituting these
unique keys by <<: * again. In the following the first occurrence of
<<: * is replaced by [<<, 0]:, the second by [<<, 1]: etc.
The * needs to be part of the substitution, as there are no anchors in
the document for those aliases.
import sys
import subprocess
import ruamel.yaml

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.indent(sequence=4, offset=2)


class DoubleMergeKeyEnabler(object):
    def __init__(self):
        self.pat = '<<: '  # could be at the root level mapping, so no leading space
        self.r_pat = '[<<, {}]: '  # probably not using sequences as keys
        self.pat_nr = -1

    def convert(self, doc):
        while self.pat in doc:
            self.pat_nr += 1
            doc = doc.replace(self.pat, self.r_pat.format(self.pat_nr), 1)
        return doc

    def revert(self, doc):
        while self.pat_nr >= 0:
            doc = doc.replace(self.r_pat.format(self.pat_nr), self.pat, 1)
            self.pat_nr -= 1
        return doc


dmke = DoubleMergeKeyEnabler()
with open('test.yaml') as fp:
    # we don't do this line by line, that would not work well on flow style mappings
    orgdoc = fp.read()
    doc = dmke.convert(orgdoc)

data = yaml.load(doc)
data['questions'][0].anchor.always_dump = True
#######################################
# >>>> do your thing on data here <<< #
#######################################

with open('output.yaml', 'w') as fp:
    yaml.dump(data, fp, transform=dmke.revert)

res = subprocess.check_output(['diff', '-u', 'test.yaml', 'output.yaml']).decode('utf-8')
print('diff says:', res)
which gives:
diff says:
which means the files are the same on round-trip (as long as you don't
change anything before dumping).
Setting preserve_quotes and calling indent() on the YAML instance are necessary to
preserve your superfluous quotes and to keep the indentation, respectively.
Since the anchor question-one has no alias, you need to enable dumping it explicitly by
setting always_dump on that attribute to True. If necessary you can recursively
walk over data and set anchor.always_dump = True when .anchor.value is not None.
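A minimal sketch of such a recursive walk follows. The function name always_dump_anchors is my own, and the code only relies on the .anchor.value and .anchor.always_dump attributes mentioned above and on the loaded mappings and sequences behaving like dict and list; nodes without an anchor attribute are skipped.

```python
def always_dump_anchors(node, _seen=None):
    """Set anchor.always_dump = True on every anchored node below `node`."""
    if _seen is None:
        _seen = set()
    if id(node) in _seen:  # aliases can make the same node reachable twice
        return
    _seen.add(id(node))

    anchor = getattr(node, 'anchor', None)
    if anchor is not None and getattr(anchor, 'value', None) is not None:
        anchor.always_dump = True

    # Descend into mapping values and sequence items.
    if isinstance(node, dict):
        for value in node.values():
            always_dump_anchors(value, _seen)
    elif isinstance(node, list):
        for item in node:
            always_dump_anchors(item, _seen)
```

Calling always_dump_anchors(data) after yaml.load() would then replace the single assignment on data['questions'][0] in the program above.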
I have a dictionary with a few lists, each containing a number of strings.
Example List:
hosts = ['199.168.1.100:1000', '199.168.1.101:1000']
When I try to print this out using ruamel.yaml, the elements show up as
hosts:
- 199.168.1.100:1000
- 199.168.1.101:1000
I want the results to be
hosts:
- '199.168.1.100:1000'
- '199.168.1.101:1000'
So I traversed the list and created a new list with each element being a ruamel.yaml SingleQuotedScalarString:
S = ruamel.yaml.scalarstring.SingleQuotedScalarString
new_list = []
for e in hosts:
    new_list.append(S(e))
hosts = new_list
When I print this out, I still end up printing the "hosts" list without any quotes. What am I doing wrong here?
In the following I assume you mean dumping to YAML when you say printing.
Your approach is in principle correct, as using the "global"
yaml.default_style = "'"
would also get the key hosts quoted, and that is not what you
want. Maybe you are not assigning the new list back to the actual data structure that
you are dumping, because hosts is just the value of a key-value pair in what you
are dumping.
The following:
import sys
import ruamel.yaml
S = ruamel.yaml.scalarstring.SingleQuotedScalarString
yaml = ruamel.yaml.YAML()
data = dict(hosts = [S(x) for x in ['199.168.1.100:1000', '199.168.1.101:1000']])
yaml.dump(data, sys.stdout)
will give what you want without problem:
hosts:
- '199.168.1.100:1000'
- '199.168.1.101:1000'
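The suspected mistake can be sketched as follows. This is only a guess at what the original program looked like; the dictionary name data is an assumption, and the fallback class S is a stand-in so that the snippet also runs where ruamel.yaml is not installed.

```python
try:
    from ruamel.yaml.scalarstring import SingleQuotedScalarString as S
except ImportError:
    class S(str):  # stand-in only, so the sketch runs without ruamel.yaml
        pass

data = {'hosts': ['199.168.1.100:1000', '199.168.1.101:1000']}

hosts = data['hosts']
hosts = [S(e) for e in hosts]  # rebinds the local name only, data is untouched
still_plain = all(type(e) is str for e in data['hosts'])

data['hosts'] = hosts  # store the new list back into what gets dumped
now_quoted = all(isinstance(e, S) for e in data['hosts'])

print(still_plain, now_quoted)
```

Only after the last assignment does the structure passed to yaml.dump() actually contain the single-quoted scalar strings.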
I'm using the TensorFlow Transform library to convert the data and save the transformation. It used to work just fine, but now I'm running into the issue below.
I keep getting the same error:
'BeamDatasetMetadata' object has no attribute 'schema' [while running
'AnalyzeAndTransformDataset/TransformDataset/ConvertAndUnbatch']
Is anyone familiar with the error above, and how can we resolve it?
My transform function looks like this:
# ### Transformation Function
def transform_data(train_data_file, test_data_file, working_dir):
    """Transform the data and write out as a TFRecord of Example protos.

    Read in the data using the CSV reader, and transform it using a
    preprocessing pipeline that scales numeric data and converts categorical data
    from strings to int64 values indices, by creating a vocabulary for each
    category.

    Args:
      train_data_file: File containing training data
      test_data_file: File containing test data
      working_dir: Directory to write transformed data and metadata to
    """

    def preprocessing_fn(inputs):
        """Preprocess input columns into transformed columns."""
        outputs = {}

        # Scale numeric columns to have range [0, 1].
        for key in NUMERIC_FEATURE_KEYS:
            outputs[key] = tft.scale_to_0_1(inputs[key])

        # For all categorical columns except the label column, we use
        # tft.string_to_int which computes the set of unique values and uses this
        # to convert the strings to indices.
        for key in CATEGORICAL_FEATURE_KEYS:
            tft.uniques(inputs[key], vocab_filename=key)

        """ We would use the lookup table when the label is a string value
        In our case here Creative_id = 0/1 so we can directly assign output as is
        """
        outputs[LABEL_KEY] = inputs[LABEL_KEY]
        return outputs

    # The "with" block will create a pipeline, and run that pipeline at the exit
    # of the block.
    with beam.Pipeline() as pipeline:
        with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
            # Create a coder to read the data with the schema. To do this we
            # need to list all columns in order since the schema doesn't specify the
            # order of columns in the csv.
            ordered_columns = [
                'app_category', 'connection_type', 'creative_id', 'day_of_week',
                'device_size', 'geo', 'hour_of_day', 'num_of_connects',
                'num_of_conversions', 'opt_bid', 'os_version'
            ]
            converter = csv_coder.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)

            # Read in raw data and convert using CSV converter. Note that we apply
            # some Beam transformations here, which will not be encoded in the TF
            # graph since we don't do the from within tf.Transform's methods
            # (AnalyzeDataset, TransformDataset etc.). These transformations are just
            # to get data into a format that the CSV converter can read, in particular
            # removing empty lines and removing spaces after commas.
            raw_data = (
                pipeline
                | 'ReadTrainData' >> textio.ReadFromText(train_data_file)
                | 'FilterTrainData' >> beam.Filter(
                    lambda line: line and line != 'app_category,connection_type,creative_id,day_of_week,device_size,geo,hour_of_day,num_of_connects,num_of_conversions,opt_bid,os_version')
                | 'FixCommasTrainData' >> beam.Map(
                    lambda line: line.replace(', ', ','))
                | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))

            # Combine data and schema into a dataset tuple. Note that we already used
            # the schema to read the CSV data, but we also need it to interpret
            # raw_data.
            raw_dataset = (raw_data, RAW_DATA_METADATA)
            transformed_dataset, transform_fn = (
                raw_dataset | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))
            transformed_data, transformed_metadata = transformed_dataset
            transformed_data_coder = example_proto_coder.ExampleProtoCoder(transformed_metadata.schema)

            _ = (
                transformed_data
                | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)
                | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
                    os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))

            # Now apply transform function to test data. In this case we also remove
            # the header line from the CSV file and the trailing period at the end of
            # each line.
            raw_test_data = (
                pipeline
                | 'ReadTestData' >> textio.ReadFromText(test_data_file, skip_header_lines=1)
                | 'FixCommasTestData' >> beam.Map(
                    lambda line: line.replace(', ', ','))
                | 'DecodeTestData' >> beam.Map(converter.decode))

            raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)
            transformed_test_dataset = ((raw_test_dataset, transform_fn) | beam_impl.TransformDataset())
            # Don't need transformed data schema, it's the same as before.
            transformed_test_data, _ = transformed_test_dataset

            _ = (
                transformed_test_data
                | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)
                | 'WriteTestData' >> tfrecordio.WriteToTFRecord(
                    os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))

            _ = (
                transform_fn
                | 'WriteTransformFn' >>
                transform_fn_io.WriteTransformFn(working_dir))
Output of:
pip show tensorflow-transform apache-beam
Name: tensorflow-transform
Version: 0.4.0
Summary: A library for data preprocessing with TensorFlow
Home-page: UNKNOWN
Author: Google Inc.
Author-email: tf-transform-feedback#google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: six, apache-beam, protobuf
---
Name: apache-beam
Version: 2.4.0
Summary: Apache Beam SDK for Python
Home-page: https://beam.apache.org
Author: Apache Software Foundation
Author-email: dev#beam.apache.org
License: Apache License, Version 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: oauth2client, httplib2, mock, crcmod, grpcio, futures, pyvcf, avro, typing, pyyaml, dill, six, hdfs, protobuf
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The above issue doesn't seem to occur all the time, though! It looks like there is some conflict with other packages.
I can't tell for sure, but this line looks incomplete:
```
converter = csv_coder.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)
```
A possible way to do it:
```
INPUT_SCHEMA = dataset_schema.from_feature_spec({
    'label': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'id': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'date': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'random': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'name': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'tweet': tf.FixedLenFeature(shape=[], dtype=tf.string),
})
```
```
converter_input = coders.CsvCoder(
    ['label', 'id', 'date', 'random', 'name', 'tweet'],
    INPUT_SCHEMA,
    delimiter=delimiter)
```
Then, for the transform step, which seems to be where your actual problem is, here is an example as well:
```
# The schema has to be defined before the metadata that wraps it.
TRANSFORM_INPUT_SCHEMA = dataset_schema.from_feature_spec({
    'id': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'label': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'tweet': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'answer_to_nbr': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'nbr_of_tags': tf.FixedLenFeature(shape=[], dtype=tf.float32),
})
input_metadata = dataset_metadata.DatasetMetadata(schema=TRANSFORM_INPUT_SCHEMA)

train_dataset = (train_dataset, input_metadata)
transformed_dataset, transform_fn = (
    train_dataset
    | 'AnalyzeAndTransform' >> beam_impl.AnalyzeAndTransformDataset(
        preprocessing_fn))
```
Hope it helps you :) If you post a link to your GitHub repo, I could look at the full code and see if I can help! Good luck!
Look at this repo for help https://github.com/Fematich/tftransform-demo