How do I print a dictionary vs a defaultdict based dictionary as a yaml file using ruamel.yaml?

Please refer to the trivial block of code shown below. My goal is to use defaultdict to build a relatively simple dictionary, and then print the result out as a YAML file.
When I manually define the dictionary, it works just fine and the YAML is displayed exactly the way I want it, but when I use defaultdict to build the dictionary, I get an error message that I unfortunately cannot decipher.
When I print the dictionary as JSON, both versions print the exact same output. What am I missing?
import sys,ruamel.yaml
import json
from collections import defaultdict
def dict_maker():
    return defaultdict(dict_maker)
S = ruamel.yaml.scalarstring.DoubleQuotedScalarString
app = "someapp"
d = {'beats':{'name':S(app), 'udp_address':S('239.1.1.1:10101')}}
foo = dict_maker()
foo["beats"]["name"] = S(app)
foo["beats"]["udp_address"] = S("239.1.1.1:10101")
print "Regular dictionary"
print json.dumps(d, indent=4)
print "defaultdict dictionary"
print json.dumps(foo, indent=4)
print "dictionary as a yaml\n"
ruamel.yaml.dump(d, sys.stdout, Dumper=ruamel.yaml.RoundTripDumper)
print "defaultdict dictionary as a yaml\n"
ruamel.yaml.dump(foo, sys.stdout, Dumper=ruamel.yaml.RoundTripDumper)
Error Message
raise RepresenterError("cannot represent an object: %s" % data)
ruamel.yaml.representer.RepresenterError: cannot represent an object: defaultdict(<function dict_maker at 0x7f1253725a28>, {'beats': defaultdict(<function dict_maker at 0x7f1253725a28>, {'name': u'someapp', 'udp_address': u'239.1.1.1:10101'})})

You seem to be using the word "dictionary" when referring to a Python dict. There is, however, no such thing as a "defaultdict based dictionary"; that would imply that foo after
foo = dict_maker()
would be a dict, and of course it is not: foo is a defaultdict, which is dict based (i.e. exactly the other way around from what you write).
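As a quick illustration (not part of the original post), the interpreter confirms that defaultdict is a subclass of dict:

>>> from collections import defaultdict
>>> isinstance(defaultdict(dict), dict)
True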
That JSON dumps this is not surprising, as it cannot do more than stupidly dump the key-value pairs as if it were a dict. But when you try to load that JSON back, you see how useless this is, as you cannot continue working with it (at least not in the way you expect):
import sys
import json
from collections import defaultdict
import io

def dict_maker():
    return defaultdict(dict_maker)

app = "someapp"
foo = dict_maker()
foo["beats"]["name"] = app
foo["beats"]["udp_address"] = "239.1.1.1:10101"

buf = io.StringIO()
json.dump(foo, buf, indent=4)
buf.seek(0)
bar = json.load(buf)
bar['otherapp']['name'] = 'some_alt_app'
print(bar['beats']['udp_address'])
The above throws: KeyError: 'otherapp'. And that is because JSON doesn't keep all the information needed.
However, if you use the unsafe YAML dumper, then ruamel.yaml can dump and load this fine:
import sys
from ruamel.yaml import YAML
from collections import defaultdict
import io

def dict_maker():
    return defaultdict(dict_maker)

app = "someapp"
yaml = YAML(typ='unsafe')
foo = dict_maker()
foo["beats"]["name"] = app
foo["beats"]["udp_address"] = "239.1.1.1:10101"

buf = io.StringIO()
yaml.dump(foo, buf)
buf.seek(0)
print(buf.getvalue())
bar = yaml.load(buf)
bar['otherapp']['name'] = 'some_alt_app'
print(bar['beats']['udp_address'])
This doesn't throw an error, as bar is again a defaultdict with dict_maker as its default factory. The above prints
239.1.1.1:10101
as you would expect.
That the RoundTripDumper/Loader doesn't support this out of the box is because it is based on the SafeDumper/Loader, which cannot dump/load arbitrary Python instances such as defaultdict and its dict_maker function reference. Enabling that would make loading unsafe.
So if you need to use the RoundTripDumper, you should add a representer for defaultdict or a subclass thereof (and possibly one for dict_maker as well). To be able to load that back, you need constructor(s) as well. How to do that is described in the documentation (Dumping Python classes).
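As a minimal sketch (my own wiring, not code taken from the documentation), you could register a representer that dumps any defaultdict as a plain mapping; note that loading the result back then gives an ordinary mapping rather than a defaultdict unless you also register a constructor:

import sys
from collections import defaultdict
from ruamel.yaml import YAML

def dict_maker():
    return defaultdict(dict_maker)

yaml = YAML()  # round-trip dumper/loader by default

def represent_defaultdict(representer, data):
    # represent the defaultdict exactly like a normal dict
    return representer.represent_dict(dict(data))

yaml.representer.add_representer(defaultdict, represent_defaultdict)

foo = dict_maker()
foo["beats"]["name"] = "someapp"
foo["beats"]["udp_address"] = "239.1.1.1:10101"
yaml.dump(foo, sys.stdout)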

Related

CSV feeders for Gatling 3

I am using Gatling 3. I have a CSV feeder with just one column titled accountIds. I need to pass this in the body of my POST request as JSON. I have tried a lot of different syntax variations, but nothing seems to work. I also cannot print what is actually being sent in the body. It works if I remove the "${accountIds}" placeholder and use an actual value instead. Below is my code:
val searchFeeder = csv("C://data/accountids.csv").random
val scn1 = scenario("Scenario 1")
.feed(searchFeeder)
.exec(http("Search")
.post("/v3/accounts/")
.body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)
setUp(scn1.inject(atOnceUsers(10)).protocols(httpConf))
Have you enabled trace level in logback.xml to see the details of the POST request?
Also, can you confirm that the location you have mentioned, "C://data/accountids.csv", is the right one? Generally, the data folder resides in the project directory, and within the project you can access the data file as:
val searchFeeder = csv("data/stack.csv").random
I just created a sample script and enabled logging; I am able to see that the accountId is getting passed:
package basicpackage
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.scenario.Simulation
class StackFeeder extends Simulation {

  val httpConf = http.baseUrl("http://example.com")
  val searchFeeder = csv("data/stack.csv").random

  val scn1 = scenario("Scenario 1")
    .feed(searchFeeder)
    .exec(http("Search")
      .post("/v3/accounts/")
      .body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)

  setUp(scn1.inject(atOnceUsers(1)).protocols(httpConf))
}
(Screenshot: csv file location within the project.)

Checking to see if record exists in MongoDB before Scrapy inserts

As the title implies, I'm running a Scrapy spider and storing results in MongoDB. Everything is running smoothly, except when I re-run the spider, it adds everything again, and I don't want the duplicates. My pipelines.py file looks like this:
import logging
import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log
class MongoPipeline(object):

    collection_name = 'openings'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        ## pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        if self.db.openings.find({' quote_text': item['quote_text']}) == True:
            pass
        else:
            self.db[self.collection_name].insert(dict(item))
            logging.debug("Post added to MongoDB")
        return item
My spider looks like this:
import scrapy
from ..items import QuotesItem
class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        items = QuotesItem()
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            author = quote.xpath('.//*[@class="author"]//text()').extract_first()
            quote_text = quote.xpath('.//span[@class="text"]//text()').extract_first()
            items['author'] = author
            items['quote_text'] = quote_text
            yield items
The current syntax is obviously wrong, but is there a small fix to the for loop that would correct it? Should I be running this loop in the spider instead? I was also looking at upsert but was having trouble understanding how to use it effectively. Any help would be great.
Looks like you have a leading space here: self.db.openings.find({' quote_text': item['quote_text']}). I suppose it should just be 'quote_text'?
You should use is True instead of == True. This is the reason it adds everything again.
I would suggest using find_one instead of find; it will be more efficient.
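For instance, a minimal sketch of that check inside process_item (my wording, using the collection and field names from the question; find_one returns None when no matching document exists):

def process_item(self, item, spider):
    ## only insert when no document with the same quote text exists yet
    if self.db[self.collection_name].find_one({'quote_text': item['quote_text']}) is None:
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    return item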
Using upsert instead is indeed a good idea, but the logic will be slightly different: you will update the data if the item already exists, and insert it when it doesn't exist (instead of doing nothing when the item already exists). The syntax should look something like this: self.db[self.collection_name].update({'quote_text': item['quote_text']}, dict(item), upsert=True)
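As a sketch, the same idea with pymongo's update_one and a $set update document (an assumption on my part, since collection.update() is deprecated in recent pymongo versions):

def process_item(self, item, spider):
    ## insert the item if it is new, otherwise refresh the stored copy
    self.db[self.collection_name].update_one(
        {'quote_text': item['quote_text']},
        {'$set': dict(item)},
        upsert=True,
    )
    return item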
Steps:
if the collection is empty: write the item to the collection
if not empty and the item already exists: pass
else (collection not empty and the item doesn't exist): write the item to the collection
Code:
def process_item(self, item, spider):
    ## how to handle each post
    # empty
    if len(list(self.db[self.collection_name].find({}))) == 0:
        self.db[self.collection_name].insert_one(dict(item))
    # not empty
    elif item in list(self.db[self.collection_name].find(item, {"_id": 0})):
        print("item exist")
        pass
    else:
        print("new item")
        # print("here is item", item)
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    return item

How do I work with a Scala process interactively?

I'm writing a bot in Scala for a game that uses text input and output. So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process. So I want to give a function access to the inputStreams and the outputStream simultaneously.
This doesn't seem to fit into any of the factories in scala.sys.process.BasicIO or the constructor for scala.sys.process.ProcessIO (three functions, each of which has access to only one stream).
Here's how I'm doing it at the moment.
private var rogue_input: OutputStream = _
private var rogue_output: InputStream = _
private var rogue_error: InputStream = _
Process("python3 /home/robin/IdeaProjects/Rogomatic/python/rogue.py --rogomatic").run(
  new ProcessIO(rogue_input = _, rogue_output = _, rogue_error = _)
)

try {
  private val rogue_scanner = new Scanner(rogue_output)
  private val rogue_writer = new PrintWriter(rogue_input, true)
  // Play the game
} finally {
  rogue_input.close()
  rogue_output.close()
  rogue_error.close()
}
This works, but it doesn't feel very Scala-like. Is there a more idiomatic way to do this?
So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process.
In general, this is traditionally solved by expect. There exist libraries and tools inspired by expect for various languages, including for Scala: https://github.com/Lasering/scala-expect.
The README of the project gives various examples. While I don't know exactly what your rogue.py expects in terms of stdin/stdout interactions, here's a quick "hello world" example showing how you could interact with a Python interpreter (using the Ammonite REPL, which has convenient library-importing capabilities):
import $ivy.`work.martins.simon::scala-expect:6.0.0`
import work.martins.simon.expect.core._
import work.martins.simon.expect.core.actions._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val timeout = 5 seconds
val e = new Expect("python3 -i -", defaultValue = "?")(
  new ExpectBlock(
    new StringWhen(">>> ")(
      Sendln("""print("hello, world")""")
    )
  ),
  new ExpectBlock(
    new RegexWhen("""(.*)\n>>> """.r)(
      ReturningWithRegex(_.group(1).toString)
    )
  )
)
e.run(timeout).onComplete(println)
The code above "expects" >>> to appear on stdout; when it finds that, it sends print("hello, world") followed by a newline. From then on, it reads and returns everything until the next prompt (>>>) using a regex.
Amongst other debug information, the above should result in Success(hello, world) being printed to your console.
The library has various other styles, and there may also exist other similar libraries out there. My main point is that an expect-inspired library is likely what you're looking for.

subprocess.run managing optional stdin and stdout

In Python >= 3.5 we can pass optional stdin, stdout and stderr arguments to subprocess.run(),
per the docs:
Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer),
an existing file object, and None. PIPE indicates that a new pipe to the child
should be created
I want to support passing through (at least) None or existing file objects whilst managing resources pythonically.
How should I manage the optional file resources in something like:
import subprocess
def wraps_subprocess(args=['ls', '-l'], stdin=None, stdout=None):
    # ... do important stuff
    subprocess.run(args=args, stdin=stdin, stdout=stdout)
A custom contextmanager (idea taken from this answer) seems to work:
import contextlib
@contextlib.contextmanager
def awesome_open(path_or_file_or_none, mode='rb'):
    if isinstance(path_or_file_or_none, str):
        file_ = needs_closed = open(path_or_file_or_none, mode)
    else:
        file_ = path_or_file_or_none
        needs_closed = None
    try:
        yield file_
    finally:
        if needs_closed:
            needs_closed.close()
which would be used like
import subprocess
def wraps_subprocess(args=['ls', '-l'], stdin=None, stdout=None):
    # ... do important stuff
    with awesome_open(stdin, mode='rb') as fin, awesome_open(stdout, mode='wb') as fout:
        subprocess.run(args=args, stdin=fin, stdout=fout)
But I think there is probably a better way.
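One possible alternative, sketched here under the same assumptions (paths, file objects, or None may be passed), is contextlib.ExitStack from the standard library, which registers for cleanup only the files this function opens itself:

import contextlib
import subprocess

def wraps_subprocess(args=['ls', '-l'], stdin=None, stdout=None):
    # ... do important stuff
    with contextlib.ExitStack() as stack:
        # open (and register for closing) only arguments given as paths;
        # None and already-open file objects are passed through untouched
        if isinstance(stdin, str):
            stdin = stack.enter_context(open(stdin, 'rb'))
        if isinstance(stdout, str):
            stdout = stack.enter_context(open(stdout, 'wb'))
        subprocess.run(args=args, stdin=stdin, stdout=stdout)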

How do I generate binary RFC822-style headers in Python 3.2?

How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
    Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).
Your data is not a valid RFC 2822 header, which I suspect is what misleads you. It's a Unicode string, but RFC 2822 headers are ASCII only. To include non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
    Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
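Putting that together (a small sketch along the same lines, assuming the snowman example above), you build the encoded header value first and then let the Generator flatten it:

from email.header import Header
from email.parser import Parser
from email.generator import Generator
from io import StringIO

encoded = Header('\N{SNOWMAN}', 'utf8').encode()  # '=?utf8?b?4piD?='
message = Parser().parse(StringIO("Key: %s\r\n\r\n" % encoded))

out = StringIO()
Generator(out, maxheaderlen=0).flatten(message)
print(out.getvalue())  # Key: =?utf8?b?4piD?=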
To handle files that have a Key: Value format the email module is the wrong solution. Handling such files is easy enough without the email module, and you will not have to work around the restrictions of RFC 2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys

if sys.version_info > (3,):
    def u(s): return s
else:
    def u(s): return s.decode('unicode-escape')

def parse(infile):
    res = {}
    payload = ''
    for line in infile:
        key, value = line.strip().split(': ', 1)
        if key in res:
            raise ValueError(u("Key {0} appears twice").format(key))
        res[key] = value
    return res

def generate(outfile, data):
    for key in data:
        outfile.write(u("{0}: {1}\n").format(key, data[key]))

if __name__ == "__main__":
    # Ensure roundtripping:
    data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
    with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
        generate(outfile, data)
    with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
        res = parse(infile)
    assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.
A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?
data = "Key: \N{SNOWMAN}\r\n\r\n" # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)
with open("/tmp/rfc882test", "wb") as out:
    BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. To consume the headers as decoded text without being able to write them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.