Consider this ReST text from the Xarray documentation:
Once you've manipulated a Dask array, you can still write a dataset
too big to fit into memory back to disk by using
:py:meth:~xarray.Dataset.to_netcdf in the usual way.
.. ipython:: python
ds.to_netcdf('manipulated-example-data.nc')
By setting the compute argument to False,
:py:meth:~xarray.Dataset.to_netcdf will return a Dask delayed object
that can be computed later.
When we build the documentation, this snippet actually creates the file manipulated-example-data.nc. We obviously don't want this file in the repository. I don't know much about the ipython directive in ReST but there must be a standard way to handle this, rather than just adding this stuff to the .gitignore. Anyone know?
Related
I have a set of .xml documents that I want to parse.
I previously have tried to parse them using methods that take the file contents and dump them into a single cell, however I've noticed this doesn't work in practice since I'm seeing slower and slower run times, often with one task taking tens of hours to run:
The first transform of mine takes the .xml contents and puts it into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. This document I'm then able to extract properties from and return a DataFrame.
I'm using a UDF to conduct the process of mapping the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
This is highly discouraged whenever possible due to the engineering time and unscalable architecture. It doesn't take advantage of distributed compute when you do this as you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
⚠ You can use your own parser library not made for Spark, such as the XML Python library mentioned in the question
While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a limit of performance because it does not take advantage of low-level Spark functionality only exposed when writing a Spark library.
✅ You can use a Spark-native raw file parser
This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
In general, you should also avoid using the .udf method as it likely is being used instead of good functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems would be string manipulations of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these things already exist in the Spark API and will be orders of magnitude faster than your own code.
Design
Our design is going to use the following:
Low-level Spark-optimized file parsing done via the Databricks XML Parser
Test-driven raw file parsing as explained here
Wire the Parser
First, we need to add the .jar to our spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, will allow you to use the .jar in both Preview/Test and at full build time. Previously, this would have required a full build but not so now.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
Enable the pytest plugin
Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
repositories {
// some other things
}
dependencies {
classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
}
}
apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'
dependencies {
condaJars "com.databricks:spark-xml_2.13:0.14.0"
}
// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'
// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh
After refreshing Code Assist, we now have low-level functionality available to parse our .xml files, now we need to test it!
Testing the Parser
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many
def read_files(spark_session, paths):
parsed_dfs = []
for file_name in paths:
parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
parsed_dfs += [parsed_df]
output_df = union_many(*parsed_dfs, how="wide")
return output_df
#transform(
the_output=Output("my.awesome.output"),
the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
session = ctx.spark_session
input_filesystem = the_input.filesystem()
hadoop_path = input_filesystem.hadoop_path
files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
output_df = read_files(session, files)
the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
<field1>
my_value
</field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename
def test_parse_xml(spark_session):
file_path = resource_filename(__name__, "sample.xml")
parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
assert parsed_df.count() == 1
assert set(parsed_df.columns) == {"field1"}
We now have:
A distributed-compute, low-level .xml parser that is highly scalable
A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
I need to write documentation for a parser that was developed in Kaitai. Given a .ksy file, is there any way to produce "pretty" views of the tree?
There is a two year old fork of ksc that supports GraphViz output but the resulting output is pretty hard to work with.
(https://www.reddit.com/r/dataisbeautiful/comments/4zhpvh/binary_data_formats_network_packets_archives/)
I can easily determine what the nodes are but getting their immediate parent would add very useful context.
Thank you.
-David
Please define what exactly do you expect from a "pretty tree".
GraphViz support is available in master and stable releases for a long time (as -t graphviz), and is very well supported — basically every ksy in official repo is accompanied nowadays with a chart: for example, http://formats.kaitai.io/lzh/index.html
If you want to have a tree of values (as opposed to "tree of data types"), we actually have ksdump, which allows you to dump arbitary data file using arbitrary .ksy in a YAML/JSON/XML tree of values. Will it work for you?
I have a really complicated system which use multiple languages and frameworks (Java Python Scala Bash). In each module I need to retrieve configuration values which are similar and change frequently. Currently I'm maintaining multiple conf files which holds lots of duplicates.
I wonder if there is out of the box RestAPI which can retrieve variables by demand from remote location.
All I manage to find by now are ways to load the entire file from remote source which is half a solution from me:
YAML.parse(open('https://link_to_file/file.yaml'))
My goal, which I fail to find a lead to it, is to make a direct call.
MyRemoteAPI.get("level1.level2.x")
P.S
YAML is not mandatory solution for me, I'm Open for suggestions.
I don't know about an out-of-the-box API, but it's fairly trivial to build. Make a service that will read the YAML file and traverse to the appropriate key. e.g. using a dynamic language like Ruby (+Rails), you could do something like
def value
config = YAML.load_file '/local/path/to/config.yaml'
render plain: config.dig(params[:key].split('.'))
end
dig essentially traverses a structure and safely returns nil if a key isn't found, so this returns the value at the "leaf" of the requested path.
You might also want to cache the structure in memory to prevent constantly reading from the file, e.g. could do something like ##config ||= YAML.parse(open('https://link_to_file/file.yaml')) or config = Rails.cache.fetch('config', expire_in: 1.hour) { ... }. And/or cache the API's HTTP response.
Fairly simply question, but I cannot find an answer via Google or on Stack.
I have a use-case where it would be highly-preferable to simply read a .m MATLAB script from a URL.
How should I do this correctly?
<disclaimer>
Clearly only do this with files you have complete control of (and/or find a solution with better validation). This is a "dangerous" method as there is no check that you're not about to run a file which, for example, copies your entire harddrive to Bob's computer before corrupting it all. Bob and Alice might be spending the whole evening laughing at your embarrassing holiday photos.
Treat this more as a proof of concept than a how-to, it addresses your problem but by no means should be used in production code.
</disclaimer>
You'll need to use eval to evaluate code. Every time I mention eval I feel compelled to point out it's not recommended, in particular in this case because you could be evaluating whatever random code is living in that file on the web! In this case your only alternative is to save the file locally and call run.
Then you can use
eval(urlread('http://myscript.m'))
or, since urlread is not recommended (from the docs), you can use webread and specify that the output should be text in the options
eval(webread('http://myscript.m', weboptions('ContentType', 'text')))
Using webread appears to be really slow, not sure why when it's the recommended function. In light of this, urlread might be preferable.
There is a note in the webread docs which suggests you wouldn't even need to specify the weboptions
If a web service returns a MATLAB® file with a .m extension, the function returns its content as a character vector.
Although you suggested that webread returned a uint8 variable which didn't work.
If you'd like to save a file from a URL then run it, you can use websave and str2func like so:
fcnName = 'newscript'; % Name for the function/script file
websave([fcnName '.m'], 'http://myscript.m'); % Download and save it
fcn = str2func(fcnName); % Create function handle
fcn(); % Evaluate function/script
It should of course go without saying that you want to be really sure you can trust the source of the file, otherwise you're gonna have a bad time.
Let us say that I have a Matlab function and I change its signature (i.e. add parameter). As Matlab does not 'compile' is there an easy way to determine which other functions do not use the right signature (i.e. submits the additional parameter). I do not want to determine this at runtime (i.e. get an error message) or have to do text searches. Hope this makes sense. Any feedback would be very much appreciated. Many thanks.
If I understand you correctly, you want to change a function's signature and find all functions/scripts/classes that call it in the "old" way, and change it to the "new" way.
You also indicated you don't want to do it at runtime, or do text searches, but there is no way to detect "incorrect" calls at "parse-time", so I'm afraid these demands leave no option at all to detect old function calls...
What I would do in that case is temporarily add a few lines to the new function:
function myFunc(param1, param2, newParam) % <-- the NEW signature
if nargin == 2
clc, error('old call detected.'); end
and then run the main script/function/whatever in which this function resides. You'll get one error for each time something calls the function incorrectly, along with the error stack in the Matlab command window.
It is then a matter of clicking on the link in the bottom of the error stack, correct the function call, and repeat from the top until no more errors occur.
Don't forget to remove these lines when you're done, or better, replace the word error with warning just to capture anything that was missed.
Better yet: if you're on linux, a text search would be a matter of
$ grep -l 'myFunc(.*,.*); *.m'
which will list all the files having the "incorrect" call. That's not too difficult I'd say...You can probably do a similar thing with the standard windows search, but I can't test that right now.
This is more or less what the dependency report was invented for. Using that tool, you can find what functions/scripts call your altered function. Then it is just a question of manually inspecting every occurrence.
However, I'd advise to make your changes to the function signature such that backwards compatibility is maintained. You can do so by specifying default values for new parameters and/or issuing a warning in those scenarios. That way, your code will run, and you will get run-time hints of deprecated code (which is more or less a necessary evil in interpreted/dynamic languages).
For many dynamic languages (and MATLAB specifically) it is generally impossible to fully inspect the code without the interpreter executing the code. Just imagine the following piece of code:
x = magic(10);
In general, you'd say that the magic function is called. However, magic could map to a totally different function. This could be done in ways that are invisible to a static analysis tool (such as the dependency report): e.g. eval('magic = 1:100;');.
The only way is to go through your whole code base, either inspecting every occurrence manually (which can be found easily with a text search) or by running a test that fully covers your code base.
edit:
There is however a way to access intermediate outputs of the MATLAB parser. This can be accessed using the undocumented and unsupported mtree function (which can be called like this: t = mtree(file, '-file'); for every file in your code base). Using the resulting structure you might be able to find calls with a certain amount of parameters.