I have successfully compiled several MIBs into JSON using PySMI with JsonCodeGen and CallbackWriter (which uploads the parsed JSON to cloud storage). Now I am trying to build an index using freshly compiled JSON MIBs in combination with already-compiled JSON files.
From the documentation, it looks like I need to pass all of these files to the mibCompiler.compile() function, even though most of them have already been compiled, so that I can run mibCompiler.buildIndex() after compiling.
From what I understand, I need a searcher to exclude the already-compiled JSON MIBs... is this the case? All I see in the current code are PyFileSearcher, StubSearcher, and AnyFileSearcher. I'm not sure what to do from here to ignore my JSON files.
I'm also not sure buildIndex() will even accept JSON files as input, so I'm hoping this is the right approach.
Thanks in advance!
Actually, no! The present-day PySMI compiler can only parse ASN.1 MIBs; it will fail on JSON input.
Probably the simplest solution would be to just load the JSON MIBs and the existing JSON index into Python as dicts and walk the dicts, updating one another. Here is the code that builds the JSON index dict out of some internal objects (which carry pieces of MIB data).
From the PySMI perspective, the best course of action would probably be to introduce a JSON MIB compiler which would turn a JSON MIB into an abstract syntax tree from which a JSON MIB index could be built...
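For illustration, here is a rough sketch of that dict-merge idea (the file names and the "mibs" index key are assumptions, not the layout PySMI produces):

import json
import os

# Illustrative file names; the real index layout depends on how yours was built.
INDEX_FILE = "index.json"
NEW_MIB_FILES = ["IF-MIB.json", "TCP-MIB.json"]

with open(INDEX_FILE) as f:
    index = json.load(f)

for path in NEW_MIB_FILES:
    with open(path) as f:
        mib = json.load(f)
    mib_name = os.path.splitext(os.path.basename(path))[0]
    # Fold whatever your index tracks per MIB (here: its top-level symbol
    # names) into the existing index dict, overwriting any stale entry.
    index.setdefault("mibs", {})[mib_name] = sorted(mib.keys())

with open(INDEX_FILE, "w") as f:
    json.dump(index, f, indent=2)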
First of all, apologies if this is a stupid question. I'm new to unit tests so I'm struggling a bit here.
I'm working on an app that queries an API, receives a JSON response and then processes that response to produce a series of complex data structures. Many of these data structures are daily time series, which means each of my functions produces a list (List<Datapoint>) containing hundreds of datapoint objects.
What I'm trying to test is that, for a given API response, each function produces the output it should.
For the input of each test I have already grabbed a sample, real JSON response from the API, and I've stored it inside a test_data folder within my root test folder.
However, for the expect part... how can I obtain a sample output from my function and store it somewhere in my test_data folder?
It would be straightforward if the output of my function were a string, but in this case we're talking about a list with hundreds of custom objects containing different values inside them. The only way to create those objects is through the function itself.
I tried running the debugger to check the value of the output at runtime, which I can do... but that doesn't help me copy it or store it anywhere as code.
Should I try to print the full contents of the output to a string at runtime and store that string? I don't think this would work, as all I see in the console is a bunch of Instance Of output when I call functionOutput.toString()... I would probably need to recursively print each of the variables inside those objects.
Please tell me I'm being stupid and there's a simpler way to do this :)
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, this doesn't work well in practice, since I'm seeing slower and slower run times, often with one task taking tens of hours to run:
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I'm then able to extract properties and return a DataFrame.
I'm using a UDF to map the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
This is highly discouraged due to the engineering time required and the unscalable architecture that results. It doesn't take advantage of distributed compute, since you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
⚠ You can use your own parser library not made for Spark, such as the XML Python library mentioned in the question
While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a performance ceiling because it does not use the low-level Spark functionality that is only exposed when writing a Spark library.
✅ You can use a Spark-native raw file parser
This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
In general, you should also avoid the .udf method, since it is usually a stand-in for functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems would be string manipulations of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these things already exist in the Spark API and will be orders of magnitude faster than your own code.
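For instance, here is a quick PySpark sketch of that string case (given some DataFrame df with a raw_value column; the names are illustrative); the built-in split and trim functions do the work without a UDF:

from pyspark.sql import functions as F

# Native column expressions: split on a delimiter and trim whitespace
# without serializing every row out to a Python UDF and back.
df = df.withColumn("parts", F.split(F.col("raw_value"), ",")) \
       .withColumn("clean_value", F.trim(F.col("raw_value")))

# The UDF equivalent does the same work, but orders of magnitude slower:
# trim_udf = F.udf(lambda s: s.strip() if s is not None else None)
# df = df.withColumn("clean_value", trim_udf(F.col("raw_value")))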
Design
Our design is going to use the following:
Low-level Spark-optimized file parsing done via the Databricks XML Parser
Test-driven raw file parsing as explained here
Wire the Parser
First, we need to add the .jar to the spark_session available inside Transforms. Thanks to recent improvements, this configuration lets you use the .jar both in Preview/Test and at full build time; previously it would have required a full build, but that's no longer the case.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
Enable the pytest plugin
Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.
After refreshing Code Assist, we now have the low-level functionality available to parse our .xml files; now we need to test it!
Testing the Parser
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Parse each .xml file with the spark-xml reader and union the results
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
    <field1>
        my_value
    </field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
We now have:
A distributed-compute, low-level .xml parser that is highly scalable
A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
I have a really complicated system which uses multiple languages and frameworks (Java, Python, Scala, Bash). In each module I need to retrieve configuration values which are similar and change frequently. Currently I'm maintaining multiple conf files, which hold lots of duplicates.
I wonder if there is an out-of-the-box REST API which can retrieve variables on demand from a remote location.
All I've managed to find so far are ways to load the entire file from a remote source, which is only half a solution for me:
YAML.parse(open('https://link_to_file/file.yaml'))
My goal, which I have failed to find a lead on, is to make a direct call:
MyRemoteAPI.get("level1.level2.x")
P.S. YAML is not a mandatory solution for me; I'm open to suggestions.
I don't know about an out-of-the-box API, but it's fairly trivial to build. Make a service that will read the YAML file and traverse to the appropriate key. For example, using a dynamic language like Ruby (+ Rails), you could do something like:
def value
  # Splat the dotted path into dig's argument list: "a.b.c" -> dig('a', 'b', 'c')
  config = YAML.load_file '/local/path/to/config.yaml'
  render plain: config.dig(*params[:key].split('.'))
end
dig essentially traverses a structure and safely returns nil if a key isn't found, so this returns the value at the "leaf" of the requested path.
You might also want to cache the structure in memory to prevent constantly reading from the file, e.g. you could do something like @@config ||= YAML.parse(open('https://link_to_file/file.yaml')) or config = Rails.cache.fetch('config', expires_in: 1.hour) { ... }. And/or cache the API's HTTP response.
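For comparison, here's a minimal sketch of the same service in Python (assuming Flask and PyYAML are available; the path and route are made up):

import yaml
from flask import Flask, abort

app = Flask(__name__)
CONFIG_PATH = "/local/path/to/config.yaml"
_config = None  # naive in-process cache; add invalidation or a TTL as needed


def load_config():
    global _config
    if _config is None:
        with open(CONFIG_PATH) as fh:
            _config = yaml.safe_load(fh)
    return _config


@app.route("/config/<key>")
def get_value(key):
    # Walk the dotted path, e.g. GET /config/level1.level2.x
    node = load_config()
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            abort(404)
        node = node[part]
    return str(node)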
I recently encountered a very large mission-critical project where all the configuration files were defined using textual protobuf definitions. The configuration files are meant to be human readable and editable.
For example
message ServerSettings {
  required int32 port = 3022;
  optional string name = "mywebserver";
}
Personally I found this humorous.
But is it in fact a reasonable keep-it-simple technique, or clearly moronic?!
In other words, are there REAL, ACTUAL problems with this?
If that is the text proto format, then... whatever, I guess. If it works, then it is as reasonable as any other serialization format.
If that is meant to be a proto schema, then it is illegal (the value after the = is meant to be the field number).
JSON or XML might be more typical, but as long as it works it isn't "moronic". So the ultimate question is: does it work?
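To make the distinction concrete (message and field names taken from the question; the rest is illustrative): in a schema, the number after = must be a field tag, so a valid .proto would look like

message ServerSettings {
  required int32 port = 1;
  optional string name = 2;
}

whereas the values themselves belong in a text-format config for that message:

port: 3022
name: "mywebserver"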
I think it's quite clever. I am guessing they pass it through protoc --encode to generate a binary, which is what is actually parsed (a sketch of that command follows the pros and cons below).
Pros:
1. Code is generated to parse configuration
2. Type validation
3. More robust configuration files compared to a key/value format, as it supports structs, unions, maps and arrays
4. The configuration data is now serializable meaning it can be easily exposed to an RPC or IPC interface.
Cons:
1. The syntax can be a little verbose for maps/arrays.
2. It requires protoc to be installed on the target, as well as libprotobuf.so, which can be a concern if you are on a system with tight memory limits.
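As a rough sketch of the workflow guessed at above (file names are illustrative): protoc reads the text-format message on stdin and writes the binary encoding to stdout, validating types along the way.

protoc --proto_path=. --encode=ServerSettings server_settings.proto \
    < server_settings.cfg > server_settings.bin

The program would then parse server_settings.bin with its generated ServerSettings bindings.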
For my work, I sometimes have to deal with logfiles from a binary protocol (the logfiles contain hexdumps of the messages). I want to write a Perl script that can interpret the binary data for me and print the contents in a more friendly format.
I have a (machine readable) description of the protocol messages in a proprietary format, and I have (mostly) figured out how to parse that format (the parts I can't fully understand are not related to my goal, so I can just ignore them), so I can convert the description into a data structure for use in my script.
Because the protocol description only rarely changes, it seems wasteful to re-parse it each time I want to analyse a logfile; on the other hand, if the description does change, or if I accidentally throw away my pre-parsed form of it, then I would like my script to automatically trigger a re-parse of the description.
What is the best way to realise this?
Assuming that the protocol description lives in a file accessible to the script, have a function that reads in the parsed data and caches the parsed results in an intermediate file. The logic is very simple, but the steps below look verbose because I tried to write out the full spec; in reality it should take fewer than 10 lines of Perl code (a sketch of the same logic follows the steps).
1. Check whether the intermediate cache file exists. If it does not (or cannot be read), skip to the proprietary parsing step (#4).
2. If you can read the intermediate cache file, read in its "protocol description timestamp" field (described below). Then find the modification time of the "protocol description" file via stat() and compare. If the modification time of the "protocol description" file is greater than the cache file's stored timestamp, skip to the proprietary parsing step (#4).
3. Else (i.e. the "protocol description" file has not been modified since the cache was written), read the intermediate cache file data via Storable or Data::Dumper. End.
4. If you need to re-parse because of the logic in #1 or #2, read in the "protocol description" file and parse it into your data structure.
5. Then create a hash with 2 keys: "protocol_description_timestamp" (with the value being the modification time of the protocol description file derived from the stat call) and a second key "data", with the value being a reference to the data structure you just produced as a result of parsing.
6. Then save that hash into the intermediate cache file using Storable or Data::Dumper or any other method of your choice for storing Perl data structures.
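Here's the same logic sketched in Python for illustration (a Perl version would use stat and Storable the same way; the file names are made up):

import os
import pickle

DESC_FILE = "protocol_description.txt"    # illustrative names
CACHE_FILE = "protocol_description.cache"


def load_description(parse_proprietary_format):
    # Return the parsed description, re-parsing only when it has changed.
    desc_mtime = os.stat(DESC_FILE).st_mtime

    # Steps 1-3: reuse the cache if it exists, is readable, and is not
    # older than the protocol description.
    try:
        with open(CACHE_FILE, "rb") as fh:
            cached = pickle.load(fh)
        if cached["protocol_description_timestamp"] >= desc_mtime:
            return cached["data"]
    except (OSError, pickle.UnpicklingError, KeyError):
        pass  # missing or unreadable cache: fall through and re-parse

    # Steps 4-6: re-parse the proprietary format and rewrite the cache.
    data = parse_proprietary_format(DESC_FILE)
    with open(CACHE_FILE, "wb") as fh:
        pickle.dump({"protocol_description_timestamp": desc_mtime,
                     "data": data}, fh)
    return data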
You can use a Makefile for this. Make the serialized data structure a Makefile target that depends on the protocol description.
When Make notices that the protocol description was updated more recently than the generated data file, it will run the commands you specify to recreate your data.