Large XML with selective parsing


We are building a kind of staging application that receives large XML files (ISO 20022 messages) with many elements defined in them. We simply store these XML documents in the database as XMLType and send them to a downstream system for further processing.
There is a GUI where we need to display some of those XML elements to users, allow them to update some of the fields, and store the result in the database again as a new XML message.
We are trying to find the most efficient implementation stack with respect to performance and memory.
One idea is to identify the XML elements that must be displayed in the UI and define them as meta fields with XPath expressions, so that we avoid parsing the entire XML.
We would appreciate any ideas on processing large XML when only certain elements need to be viewed and updated.

My experience is that using XML data types in a common RDBMS is OK, but not great. I found that native XML DBMSs, such as MarkLogic or eXist-db, work much better for ISO 20022.
If you want to continue with an RDBMS XML type, then use XQuery to pull the items you want. Oracle calls this XMLQuery. Microsoft has a query function for XQuery.
Since XQuery is based on XPath, yes, using XPath is a good way to achieve what you want.
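To illustrate the "meta fields with XPath" idea outside the database, here is a minimal sketch in Python using the standard library's ElementTree, which supports a limited XPath subset. The namespace URI, the element names and the META_FIELDS map are all made up for the example and are not taken from a real ISO 20022 schema. Note that this version still parses the whole message in memory; inside the database, the same paths would be passed to the XQuery function instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and field map; a real message would use its own
# namespace URI and element paths.
NS_URI = "urn:example:msg"
NS = {"m": NS_URI}
META_FIELDS = {
    "message_id": "./m:GrpHdr/m:MsgId",
    "amount": "./m:Tx/m:Amt",
}

# Keep the default namespace (no ns0: prefixes) when serializing again.
ET.register_namespace("", NS_URI)

SAMPLE = """<Document xmlns="urn:example:msg">
  <GrpHdr><MsgId>ABC123</MsgId></GrpHdr>
  <Tx><Amt Ccy="EUR">100.50</Amt></Tx>
</Document>"""


def read_fields(xml_text, fields=META_FIELDS):
    """Return only the configured fields; missing elements map to None."""
    root = ET.fromstring(xml_text)
    result = {}
    for name, path in fields.items():
        node = root.find(path, NS)
        result[name] = node.text if node is not None else None
    return result


def update_field(xml_text, path, new_value):
    """Set the text of one element and return the message as a new string."""
    root = ET.fromstring(xml_text)
    node = root.find(path, NS)
    if node is None:
        raise KeyError(path)
    node.text = new_value
    return ET.tostring(root, encoding="unicode")
```

The GUI would call read_fields to populate the form and update_field for each edited value, then store the returned string as the new message.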

Related

How best to store HTML alongside strings in Cloud Storage

I have a collection of data, and each item consists of a chunk of HTML and a few strings, for example:
html: <div>html...</div>, name string: html chunk 1, date string: 01-01-1999, location string: London, UK. I would like to store this information together as a single cloud storage object. Specifically, I am using Google Cloud Storage. There are two ways I can think of doing this. One is to store the strings as custom metadata, and the HTML as the actual file contents. The other is to store all the information as a JSON file, with the HTML as a base64-encoded string.
I want to avoid a situation where, after having stored a lot of data, I find there is some limitation to the approach I am using. What is the proper way to do this? Is either of these approaches bad practice? Assuming there is no problem with either, I would go with the JSON approach because it is easier to pass around all the data together as a file.

There isn't a specific right way to do what you're talking about. There are potential pitfalls and performance criteria, but they depend on what you're doing with the data and why. Do you ever need access to the metadata for queries? You won't be able to do that efficiently if you pack everything into one variable as a JSON object. What are you parsing the data with later? Does it have built-in support for JSON? Does it support something else? Is speed a consideration? Is cloud storage space a consideration? Does a user have the ability to input the HTML, and could they potentially perform some sort of attack? How do you use the data when you retrieve it? How stable is the format of the data? You could use JSON, Protocol Buffers, packed binary blobs in a length | value format, base64 with a delimiter, or zip files turned into binary blobs. Do what suits your application and allows a clean, structured design that you can test and maintain.
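For the JSON option from the question, a minimal sketch in Python could look like the following. The key names (html_b64, name, date, location) are made up for the example, and uploading the resulting string to the bucket is left out.

```python
import base64
import json


def pack_record(html, name, date, location):
    """Bundle the HTML and its strings into one JSON document."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return json.dumps(
        {"html_b64": encoded, "name": name, "date": date, "location": location}
    )


def unpack_record(blob):
    """Reverse pack_record and return a dict with the decoded HTML."""
    record = json.loads(blob)
    record["html"] = base64.b64decode(record.pop("html_b64")).decode("utf-8")
    return record
```

Base64 makes the HTML roughly a third larger. JSON string escaping already copes with markup, so storing the HTML as a plain JSON string is also possible if the extra size matters.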

How to build Scala report projects

Is there a common standard to follow for building a Scala-based report engine from scratch? Data will be sourced from HDFS, filtered, formatted and emailed. Please share any experience or hurdles to expect.

I used to produce such reports as PDF, HTML and XLSX.
We used Elasticsearch, but here was the general workflow:
get filtered data from storage into Scala (no real trouble, just make sure your filters are well tested)
fill the holes to get consistent data: think about missing points, crazy timezones...
format (we used an XSLT processor to produce the email HTML; it is really specific, and email size is limited, so aim for ~15 MB as an absolute maximum)
if the file is too big, store it somewhere and send the link instead
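The "fill the holes" and "too big" steps can be sketched roughly as follows. This is an illustration in Python rather than the Scala code we actually ran. It assumes hourly data points with timezone-aware timestamps and a start time on the hour. The function names and the exact byte limit are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed ceiling, based on the rough 15 MB (15 Mo) figure above.
MAX_EMAIL_BYTES = 15 * 1024 * 1024


def fill_hourly_gaps(points, start, end, default=0):
    """Return one (hour, value) pair per hour in [start, end), in UTC.

    points is an iterable of (aware datetime, value); hours without a
    point get the default value.
    """
    by_hour = {}
    for ts, value in points:
        # Normalize every timestamp to UTC and truncate it to the hour.
        hour = ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        by_hour[hour] = value
    filled = []
    current = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    while current < end_utc:
        filled.append((current, by_hour.get(current, default)))
        current += timedelta(hours=1)
    return filled


def delivery_mode(report_bytes, limit=MAX_EMAIL_BYTES):
    """Attach small reports; for large ones, send a link instead."""
    return "attach" if len(report_bytes) <= limit else "link"
```

Normalizing everything to UTC before bucketing is what keeps points from different timezones from landing in the wrong hour.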

DATASTAGE capabilities

I'm a Linux programmer.
I am used to writing code to get things done: Java, Perl, PHP, C.
I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data, line by line.
I want to know whether DataStage can work on files that are not table/CSV-like. Can it load
data into data structures and run functions on them, or is it limited to working
on only one line at a time?
Thank you for any information you can give on the capabilities of DataStage.

IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to applying only single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language, BASIC, that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (using the DSExecute function, the Before Job property, the After Job property or the Command stage).
Please check the IBM Information Center for comprehensive documentation on BASIC programming.
You could also check the DSXchange forums for DataStage-specific topics.

Yes, it can. As Razvan said, you can join, aggregate and split. It can use loops and external scripts, and it can also handle XML.
My advice is that if you have large quantities of data to work on, then DataStage is your friend; if the data you have to load is not very big, then it will be easier to use Java, C, or any programming language that you know.

You can use all kinds of functions and conversions to manipulate the data. DataStage is mainly used for its ease of use when handling huge amounts of data from a data mart or data warehouse.
The main process in DataStage is ETL: Extraction, Transformation, Loading.
Where a programmer needs 100 lines of code to connect to some database, here we can do it with one click.
Anything can be done here, even C or C++ coding in a routine activity.

If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as are produced by COBOL, the answer is yes.
All using in-built functionality (e.g. Hierarchical Data stage, Complex Flat File stage). Review the DataStage palette to find other examples.

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like descriptive field names (the ones I like the most) can take up a lot of space in memory for big collections. I don't like the idea of giving them short and cryptic names to save memory, nor do I like the idea of translating field names to shortened ones somewhere in the application.
Is there a way to tell MongoDB not to store every field name as text?

For now, the only thing you can do is vote and wait for SERVER-863 to be resolved. After almost a year of discussion, the status of this issue has been changed to planned but not scheduled...
The workaround is to use document mapping libraries like Spring Data Document or Morphia (in the Java world) and work with nicely named objects. But the underlying database names are still cryptic.

If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I can't provide any more guidance on which ODMs might fit your needs.
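As a language-neutral illustration of doing this in a data access layer, here is a minimal sketch in Python. The field names and the FIELD_MAP table are invented for the example, and it only renames top-level keys; nested documents would need a recursive version.

```python
# Hypothetical mapping from descriptive names to short stored names.
FIELD_MAP = {
    "customer_name": "cn",
    "shipping_address": "sa",
    "order_total": "ot",
}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}


def to_storage(doc):
    """Rename descriptive keys to short ones just before writing."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}


def from_storage(doc):
    """Rename short keys back to descriptive ones just after reading."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}
```

Keys without a mapping (such as _id) pass through unchanged, so the rest of the application only ever sees the descriptive names.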

Load and perform search on large amount of data

I need a suggestion on how to work with a large amount of data on iPhone. Let's say I have an XML file with ~120k text records. I need to search this data. The solution I have tried is to use Core Data to store the information in sorted order in caches, and then use binary search, which works fast. But the problem is building these caches. On first launch, the application takes about 15-25 seconds to build them. Maybe I need to use a different approach to search the data?
Thanks in advance.

If you're using an XML file with the requirement that you can't cache, then you're not going to succeed unless you somehow carefully format your XML file to have useful data traversal properties -- but then you may as well use a binary file that's more useful unless you have some very esoteric requirements.
Really what you want is one of the typical indexing algorithms (on disk hash, B-tree, etc) from the get-go.
However...
If you have to read in and parse your XML text file, then you can skirt using a typical big and slow generic XML parser and write a fast hackish version since most of the data records you'll need to recognize are probably formatted the same way over and over. Nothing special, just find where the relevant data fields start, grab the data until it ends, move on to the next data field.
Honestly, 120k of text isn't very much; it sounds like whatever XML parser you're using is just slow. (I use this trick all the time for autogenerated XML data that just represents things like tables or simple data records -- my own parser is faster than any generic XML parser.)
This is probably the solution you actually want, since you sound fairly attached to the XML file format. It won't be as error-proof as a generic XML parser if you're not careful, but it will eat that 120KB file up like nobody's business. And it's entry-level CS work: read in a file with certain specific formatting and grab the data values from it. Regexps are your friend if you have access to them.
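To show the shape of such a hand-rolled parser, here is a minimal sketch in Python (on the iPhone you would write the same thing with the platform's own string or regex tools). The record layout with record, id and text elements is invented for the example. The sketch assumes every record is formatted exactly that way, with no attributes, CDATA sections or entity escapes.

```python
import re

# Assumed layout: <record><id>N</id><text>...</text></record>, repeated.
RECORD_RE = re.compile(
    r"<record>\s*<id>(\d+)</id>\s*<text>(.*?)</text>\s*</record>",
    re.DOTALL,
)

SAMPLE = (
    "<records><record><id>1</id><text>alpha</text></record>\n"
    "<record>\n  <id>2</id>\n  <text>beta gamma</text>\n</record></records>"
)


def parse_records(xml_text):
    """Return (id, text) pairs for every record matching the fixed layout."""
    return [(int(m.group(1)), m.group(2)) for m in RECORD_RE.finditer(xml_text)]
```

Anything that deviates from the expected layout is silently skipped, which is the "not as error-proof" trade-off mentioned above.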

Try storing the data and doing your searches in the cloud (using a database stored on a server somewhere), unless you specifically need ALL of the information on the device.

