I have a set of .xml documents that I want to parse.
I have previously tried to parse them by dumping the file contents into a single cell, but I've noticed this doesn't work well in practice: run times keep getting slower, often with a single task taking tens of hours to run.
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I'm then able to extract properties and return a DataFrame.
I'm using a UDF to map the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
This is highly discouraged due to the engineering time it takes and the unscalable architecture it produces. It doesn't take advantage of distributed compute, since you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
⚠ You can use your own parser library not made for Spark, such as the Python xml library mentioned in the question.
While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a limit of performance because it does not take advantage of low-level Spark functionality only exposed when writing a Spark library.
✅ You can use a Spark-native raw file parser
This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
In general, you should also avoid the .udf method, as it is usually a sign that functionality already available in the Spark API is being reimplemented by hand. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code.
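As an illustrative sketch (the DataFrame df and the column raw_value below are hypothetical, not from the question), compare a UDF-based string cleanup with the equivalent native Spark functions:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF version: the lambda runs row by row in a Python worker, with
# serialization overhead between the JVM and Python for every value.
first_part_udf = F.udf(lambda s: s.strip().split(",")[0] if s else None, StringType())
df_slow = df.withColumn("first_part", first_part_udf(F.col("raw_value")))

# Native version: trim and split run inside the JVM and can be optimized
# by Catalyst together with the rest of the query plan.
df_fast = df.withColumn("first_part", F.split(F.trim(F.col("raw_value")), ",").getItem(0))
Both produce the same result here; only the native version scales with the rest of your Spark job.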
Design
Our design is going to use the following:
Low-level Spark-optimized file parsing done via the Databricks XML Parser
Test-driven raw file parsing as explained here
Wire the Parser
First, we need to make the .jar available to the spark_session used inside Transforms. Thanks to recent improvements, this configuration, once set up, lets you use the .jar both in Preview/Test and at full build time. Previously this required a full build, but that's no longer the case.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
Enable the pytest plugin
Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.
After refreshing Code Assist, we now have the low-level functionality available to parse our .xml files; now we need to test it!
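As a quick, hypothetical smoke test (the path below is made up; any small .xml file whose rows are wrapped in <tag> elements will do), you can confirm the 'xml' format is now available to your Spark session:
# 'spark' is whichever SparkSession you have at hand, e.g. ctx.spark_session inside a transform.
df = spark.read.format("xml").options(rowTag="tag").load("/path/to/sample.xml")
df.printSchema()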
Testing the Parser
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Parse each raw .xml file with the spark-xml reader and union the results.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
An example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
    <field1>
        my_value
    </field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
We now have:
A distributed-compute, low-level .xml parser that is highly scalable
A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
Related
Consider this ReST text from the Xarray documentation:
Once you've manipulated a Dask array, you can still write a dataset
too big to fit into memory back to disk by using
:py:meth:`~xarray.Dataset.to_netcdf` in the usual way.

.. ipython:: python

    ds.to_netcdf('manipulated-example-data.nc')

By setting the compute argument to False,
:py:meth:`~xarray.Dataset.to_netcdf` will return a Dask delayed object
that can be computed later.
When we build the documentation, this snippet actually creates the file manipulated-example-data.nc. We obviously don't want this file in the repository. I don't know much about the ipython directive in ReST but there must be a standard way to handle this, rather than just adding this stuff to the .gitignore. Anyone know?
Let's consider a scenario: we have to run a performance test for a "create an account" API, which takes an auth token as a header/path param and input data such as the user account name. So for the above scenario, to run a performance test for POST http://baseUrl/auth_param/create/input_data, we have 2 feature files:
1. One feature file (e.g. generateAuth.feature) which will have the auth token
2. A second feature file (createAccount.feature) which takes the auth token and input data as parameters
Here is my simulation class,
class <MyClass> extends Simulation {

  before {
    println("Simulation is about to start!")
  }

  val generateAuthTest = scenario("generateAuth").exec(karateFeature("classpath:path/generateAuth.feature"))
  val createAccountTest = scenario("test").exec(karateFeature("classpath:path/createAccount.feature"))

  setUp(
    createAccountTest.inject(rampUsers(1) over (10 seconds))
  ).maxDuration(1 minutes)

  after {
    println("Simulation is finished!")
  }
}
Here, can I read the auth token from generateAuth.feature, which is an input for createAccount.feature, so that I can pass it as a parameter?
Please suggest how to pass parameters to createAccount.feature when calling it in the karateFeature method.
Let me put a requirement here:
let's say we have some feature files for CRUD operations on particular data. Here is how I would write the functional scenario:
I will create a new feature file to write the scenario and
just use the CRUD files to test a SINGLE flow.
Now if I go for performance test cases on the individual operations, I feel there are 2 ways:
1. Create 4 new performance test feature files (one for each CRUD method) and call the CRUD feature files in the respective test feature files. Finally, we just call the test feature files in the respective Gatling simulation classes. (In this case, I will end up creating more test feature files as well as simulation classes for performance, which I want to avoid.)
2. Just call the CRUD files in the respective Gatling simulation classes and pass the required parameters to them. (In this case, we just need to create 4 simulation classes and run them on the basis of the operation, like create, read, delete and so on.)
Here I just wanted to know about the 2nd way of performance testing: is it achievable or not in Karate, and if yes, please let me know how.
Summary: I think it's achievable using a 3rd (extra) feature file for each individual use case, but I do not want to make an extra feature file for each case, so that I can avoid maintenance work and can take advantage of re-using the existing feature files from functional tests in performance tests.
Just use the normal Karate concepts such as karate-config.js.
You can easily switch environments by setting the karate.env system property.
For example:
mvn test -DargLine="-Dkarate.env=e2e"
EDIT: After you edited your question, it is clear you have a SINGLE flow you want to test. Please use a SINGLE feature. I suggest you move the generateAuth into the Background of the feature. Also refer to the docs on callSingle() for advanced options.
If you are expecting 2 feature files to magically share data, that is not possible, and it is not needed if you structure your tests correctly.
If you really really need this, please create a Java singleton and access it from each feature. Totally don't recommend this though.
EDIT: In Karate 0.9.0 onwards, you can call a single scenario within a feature if it has a tag:
classpath:animals/cats/create.feature#sometagname
I have a really complicated system which uses multiple languages and frameworks (Java, Python, Scala, Bash). In each module I need to retrieve configuration values which are similar and change frequently. Currently I'm maintaining multiple conf files which hold lots of duplicates.
I wonder if there is an out-of-the-box REST API which can retrieve variables on demand from a remote location.
All I have managed to find so far are ways to load the entire file from a remote source, which is only half a solution for me:
YAML.parse(open('https://link_to_file/file.yaml'))
My goal, which I have failed to find a lead on, is to make a direct call:
MyRemoteAPI.get("level1.level2.x")
P.S. YAML is not a mandatory solution for me; I'm open to suggestions.
I don't know about an out-of-the-box API, but it's fairly trivial to build. Make a service that will read the YAML file and traverse to the appropriate key. For example, using a dynamic language like Ruby (+Rails), you could do something like:
def value
  config = YAML.load_file '/local/path/to/config.yaml'
  render plain: config.dig(*params[:key].split('.'))
end
dig essentially traverses a structure and safely returns nil if a key isn't found, so this returns the value at the "leaf" of the requested path.
You might also want to cache the structure in memory to prevent constantly reading from the file, e.g. you could do something like @@config ||= YAML.parse(open('https://link_to_file/file.yaml')) or config = Rails.cache.fetch('config', expires_in: 1.hour) { ... }. And/or cache the API's HTTP response.
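If you would rather keep the service in Python (already one of the languages in your stack), here is a minimal sketch of the same idea. It assumes Flask and PyYAML are installed, and the file path and the /config route are hypothetical:
from functools import reduce

import yaml
from flask import Flask, abort

app = Flask(__name__)


def load_config():
    # Re-read on every request for simplicity; cache this in production.
    with open("/local/path/to/config.yaml") as f:
        return yaml.safe_load(f)


@app.route("/config/<path:key>")
def get_value(key):
    # Accept dotted keys such as level1.level2.x and walk the parsed structure.
    config = load_config()
    try:
        value = reduce(lambda node, part: node[part], key.split("."), config)
    except (KeyError, TypeError):
        abort(404)
    return str(value)
A client could then do an HTTP GET on /config/level1.level2.x and get back just the value it needs, which is essentially the MyRemoteAPI.get("level1.level2.x") call you described.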
I have a script which creates a new referenceId each time it's executed. I used
.check(regex("orders.(.*?)\"").saveAs("referenceId")))
to extract the referenceId. Now, how can I write/append it to a file without impacting the script even if I run it as a load test?
I used session in .exec to write my value into a file. Here it is:
.exec(session => {
  // The session key must match the one used in saveAs("referenceId")
  scala.tools.nsc.io.File("../user-files/data/referenceId.csv")
    .appendAll(session("referenceId").as[String] + "\n")
  session
})
Your solution works, but...
First of all, do not use anything (if you don't have to) from the scala.tools.nsc.io package. It is an internal package used only by the Scala compiler; it is not a public API included in the Scala runtime library (see the official Scaladoc). More about the topic here. Scala does not have its own abstraction for writing to a file, hence one needs to use the normal java.io.File & co.
Secondly, opening a file on each execution can (but may not) slow down your load test. It strongly depends on the rate at which you are making requests. At higher rates you can experience contention when more concurrent executions try to write to the same file. The simplest solution is to write to different files, but then you can still run out of the maximum possible number of open files. Another solution is to use a shared java.io.FileOutputStream or java.io.FileWriter for the desired target file with proper synchronisation (it will be accessed from various threads), which is still blocking IO. Yet another solution is to use the Java NIO API to write to a shared file via a Channel (non-blocking) or an OutputStream (not sure if non-blocking).
Of course the solutions differ in difficulty of implementation.
I want to call the API of uima-text-segmenter https://code.google.com/p/uima-text-segmenter/source/browse/trunk/INSTALL?r=22 to run an example.
But I don't know how to call the API...
The readme says:
With the DocumentAnalyzer, run the following descriptor
`desc/textSegmenter/wst-snowball-C99-JTextTilingAAE.xml` by taking the
uima-examples data as input.
Could anyone give me some code which could be run directly in a main function, for example?
Thanks a lot!
Long answer:
The link describes how you would set up the application from within the Eclipse UIMA environment. This sort of set-up is typically targeted at subject matter specialists with little or no coding experience. It allows them to work (relatively fast) with UIMA in a declarative way: all data structures and analysis engines (computing blocks within UIMA) are declared in XML (with a GUI on top of it), after which the framework takes care of the rest. In this scenario, you would typically run a UIMA pipeline using a run configuration from within Eclipse (or the included UIMA pipeline runner application). Luckily, UIMA allows you to do exactly the same from code, but I would recommend using UIMAFit (http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e137) for this purpose instead of plain UIMA, as it bundles lots of useful things and coding shortcuts.
Short answer:
Using UIMAFit, you can call Factory methods that create CollectionReader (read input), AnalysisEngine (process input) and Consumer objects (write/do other stuff) from (third-party provided) XML files. Use these methods to construct your pipeline and the SimplePipeline class to run it. To extract the data you need, you would manipulate the CAS object (containing your data) in a Consumer object, possibly with a callback. You could also do this in an Analysis Engine object. I recommend using DKPro's FeaturePathFactory (https://code.google.com/p/dkpro-core-asl/source/browse/de.tudarmstadt.ukp.dkpro.core-asl/trunk/de.tudarmstadt.ukp.dkpro.core.api.featurepath-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/featurepath/FeaturePathFactory.java?spec=svn1811&r=1811) to quickly access the feature you are after.
Code examples:
http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e137 contains examples, but they all go in the opposite direction (class objects are used in the factory methods instead of XML files; XML is generated from these classes). Take a look at the UIMAFit API to find the method you need, for example creating an AnalysisEngineDescription from XML: http://uima.apache.org/d/uimafit-current/api/org/apache/uima/fit/factory/AnalysisEngineFactory.html#createEngineDescriptionFromPath-java.lang.String-java.lang.Object...-