XSLT transform with MSXML doesn't use proper encoding - encoding

I'm using IXMLDOMDocument::transformNode from MSXML 3.0 to apply XSLT transforms. Each of the transforms has an xsl:output directive that specifies UTF-8 as the encoding. For example,
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
...
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
xmlns:math="http://exslt.org/math"
extension-element-prefixes="str math">
<xsl:output encoding="UTF-8" indent="yes" method="xml" />
...
</xsl:stylesheet>
Yet the transformed result is always UTF-16 (and the encoding attribute says UTF-16).
<?xml version="1.0" encoding="UTF-16"?>
Is this a bug in MSXML?
For various reasons, I'd really like to have UTF-8. Is there a workaround? Or do I have to convert the transformed result to UTF-8 myself and patch up the encoding attribute?
Update: I've worked around the problem by accepting the UTF-16 encoding and prepending a byte-order mark, which satisfies the downstream users of the transformed result, but I'd still be interested in how to get UTF-8 output.

You're probably sending the output to a DOM tree or to a character stream, not to a byte stream. If that's the case then it's not MSXML that's doing the encoding, and whatever does do the final encoding has no knowledge of the xsl:output directive (or indeed, of XSLT).
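If the result has already come back as a string, as it does from transformNode, the fallback mentioned in the question (re-encoding and patching the declaration yourself) could look roughly like the Python sketch below. The helper name to_utf8_bytes and the sample document are made up for illustration. It only rewrites the encoding pseudo-attribute and assumes the XML declaration, if present, sits at the very start of the string.

```python
import re


def to_utf8_bytes(xml_text: str) -> bytes:
    """Patch the declared encoding of an already-decoded XML string and encode it as UTF-8."""
    patched = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r'\g<1>\g<2>UTF-8\g<2>',
        xml_text,
        count=1,
    )
    return patched.encode("utf-8")


# Hypothetical transform result, as a string whose declaration still says UTF-16.
result = '<?xml version="1.0" encoding="UTF-16"?><Urmel><eins>Käse</eins></Urmel>'
data = to_utf8_bytes(result)
print(data[:40])
```

The string itself has no byte encoding; the encoding is only chosen at the moment it is turned into bytes, which is why the declaration has to be patched separately.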

Supplementing what Michael Kay said (which is spot on, of course), here's a JScript example of how to transform to a stream, using the XSLT serialization in the process:
// command line args
var args = WScript.Arguments;
if (args.length != 3) {
    WScript.Echo("usage: cscript msxsl.js in.xml ss.xsl out.xml");
    WScript.Quit();
}
var xmlFile = args(0);
var xslFile = args(1);
var resFile = args(2);
// DOM objects
var xsl = new ActiveXObject("MSXML2.DOMDOCUMENT.6.0");
var xml = xsl.cloneNode(false);
// source document
xml.validateOnParse = false;
xml.async = false;
xml.load(xmlFile);
if (xml.parseError.errorCode != 0)
    WScript.Echo("XML Parse Error : " + xml.parseError.reason);
// stylesheet document
xsl.validateOnParse = false;
xsl.async = false;
xsl.resolveExternals = true;
//xsl.setProperty("AllowDocumentFunction", true);
//xsl.setProperty("ProhibitDTD", false);
//xsl.setProperty("AllowXsltScript", true);
xsl.load(xslFile);
if (xsl.parseError.errorCode != 0)
    WScript.Echo("XSL Parse Error : " + xsl.parseError.reason);
// output object, a stream
var stream = WScript.createObject("ADODB.Stream");
stream.open();
stream.type = 1;
xml.transformNodeToObject( xsl, stream );
stream.saveToFile( resFile );
stream.close();
You may test using this input:
<Urmel>
<eins>Käse</eins>
<deux>café</deux>
<tre>supplì</tre>
</Urmel>
And this stylesheet:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="UTF-8"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
I think it'll be easy for you to adapt the JScript example to C++.

As you noted, BSTRs are all UTF-16. However, I think Michael Ludwig might be on to something here. Have you tried using this method?
HRESULT IXMLDOMDocument::transformNodeToObject(
IXMLDOMNode *stylesheet,
VARIANT outputObject);
You should be able to just use CreateStreamOnHGlobal, stash the resultant IStream ptr into a VARIANT, and pass that in as the outputObject parameter. Theoretically. I haven't actually tried this, though :)

Related

Convert XML like String to PySpark Dataframe

I'm using azure.storage.queue's receive_messages() function in Databricks to pull messages from an Azure queue. The response looks like XML but it is really just a string:
<?xml version="1.0" encoding="utf-16"?>
<root>
<col1>123</col1>
<col2>1</col2>
<col3>Unknown</col3>
<col4>Dog</col4>
<col5>Owner</col5>
<col6>-1</col6>
<col7>Owner</col7>
<col8></col8>
</root>
When I write the response to a list, it looks like:
'<root>\r\n <col1>123</col1>\r\n <col2>1</col2>\r\n <col3>Unknown</col3>\r\n <col4>Dog</col4>\r\n <col5>Owner</col5>\r\n <col6>-1</col6>\r\n <col7>Owner</col7>\r\n <col8></col8>\r\n</root>'
I know that I can split on \r\n with something like:
l = [x.strip().split(' ') for x in a[0].split('\r\n')]
l
This gives:
[['<root>'],
['<col1>123</col1>'],
['<col2>1</col2>'],
['<col3>Unknown</col3>'],
['<col4>Dog</col4>'],
['<col5>Owner</col5>'],
['<col6>-1</col6>'],
['<col7>Owner</col7>'],
['<col8></col8>'],
['</root>']]
I'm not sure if this is the best route and I don't want to hard code each value into the spark dataframe, because I need to iterate through all messages in the queue. Looking for a solution that converts each 'col' into a header and then grabs the value between 'tags'.
Here is an answer:
from bs4 import BeautifulSoup

data = []
for message in response:
    #print(message.content)
    soup = BeautifulSoup(message.content, "xml")
    c = soup.find_all('col1')
    c1 = soup.find_all('col2')
    c2 = soup.find_all('col3')
    c3 = soup.find_all('col4')
    c4 = soup.find_all('col5')
    c5 = soup.find_all('col6')
    c6 = soup.find_all('col7')
    c7 = soup.find_all('col8')
    # Build one row per occurrence of col1 in the message
    for i in range(0, len(c)):
        rows = [c[i].get_text(),
                c1[i].get_text(),
                c2[i].get_text(),
                c3[i].get_text(),
                c4[i].get_text(),
                c5[i].get_text(),
                c6[i].get_text(),
                c7[i].get_text()]
        data.append(rows)
#print(data)
out_df = spark.createDataFrame(data, schema=['c', 'c1', 'c2', 'c3', 'c4',
                                             'c5', 'c6', 'c7'])
This was faster, but requires the response to always be in the same order, which mine is.
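from lxml import etree  # assumed; the answer does not show where etree comes from

data = []
for message in response:
    #print(message.content)
    root = etree.fromstring(message.content.encode('utf-16'))
    arr = []
    # Collect the text of each child element in document order
    for child in root:
        r = child.text
        arr.append(r)
    data.append(arr)
out_df = spark.createDataFrame(data, schema=['c', 'c1', 'c2', 'c3', 'c4',
                                             'c5', 'c6', 'c7'])
display(out_df)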
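To avoid hard-coding col1 through col8, the tag names themselves can serve as the column headers. The following is a minimal sketch using only the standard library's xml.etree.ElementTree (a swap from the BeautifulSoup and etree calls above). The helper message_to_row is hypothetical and the sample string is abbreviated from the question; the Spark call is left as a comment because it needs a live session.

```python
import xml.etree.ElementTree as ET


def message_to_row(xml_text):
    """Map each child tag of the root element to its text (None when empty)."""
    # Encode to UTF-16 bytes first, as the answer above does, so that a
    # utf-16 encoding declaration in the message matches the actual bytes.
    root = ET.fromstring(xml_text.encode("utf-16"))
    return {child.tag: child.text for child in root}


sample = '<root>\r\n <col1>123</col1>\r\n <col2>1</col2>\r\n <col8></col8>\r\n</root>'
rows = [message_to_row(sample)]

# Header order follows the first message; later messages are read by tag name.
header = list(rows[0])
data = [[row.get(name) for name in header] for row in rows]
print(header, data)

# out_df = spark.createDataFrame(data, schema=header)  # assumes a SparkSession named spark
```

Because values are looked up by tag name, this does not depend on every message listing its elements in the same order.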

Mirth Add Segment HL7 V3

I am new to Mirth/JavaScript. I have a project where I need to add a segment to an incoming HL7 v3 XML file. I have tried the following JavaScript in the destination transformer:
tmp = msg.copy();
tmp.createSegment('templateId', ClinicalDocument, 1);
tmp.ClinicalDocument['templateId'][1]['@root'] = "2.16.840.1.113883.10.20.22.1.1";
This generates an error.
Also, I need to place this new segment before the existing templateId segment.
Currently this is what we receive –
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:mif="urn:hl7-org:v3/mif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:sdtc="urn:hl7-org:sdtc">
<realmCode code="US" />
<typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040" />
<templateId root="2.16.840.1.113883.10.20.22.1.2" extension="2015-08-01" />
We want to add
Transformed output desired -
Any help on how to accomplish this will be greatly appreciated.
Thank You
As I understand your requirement, you need to add
<templateId root="2.16.840.1.113883.10.20.22.1.1" extension="2015-08-01" />
immediately before the existing templateId:
<templateId root="2.16.840.1.113883.10.20.22.1.2" extension="2015-08-01" />
If that's the requirement, this Java code will work:
public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, TransformerException {
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    // Source XML
    File fXmlFile = new File("C:\\Labs\\POC\\Import_Export\\TEST.xml");
    Document doc = dBuilder.parse(fXmlFile);
    // Get the element after which the new templateId has to be added
    Node nodeName = doc.getElementsByTagName("typeId").item(0);
    // Create the new templateId element
    Element newTemplateID = doc.createElement("templateId");
    newTemplateID.setAttribute("root", "2.16.840.1.113883.10.20.22.1.1");
    newTemplateID.setAttribute("extension", "2015-08-01");
    // Insert it right after typeId, i.e. before typeId's next sibling
    nodeName.getParentNode().insertBefore(newTemplateID, nodeName.getNextSibling());
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    DOMSource source = new DOMSource(doc);
    // Destination XML
    StreamResult result = new StreamResult(new File("C:\\Labs\\POC\\Import_Export\\TESTNew.xml"));
    transformer.transform(source, result);
}
You will get XML like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:mif="urn:hl7-org:v3/mif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:sdtc="urn:hl7-org:sdtc">
<realmCode code="US"/>
<typeId extension="POCD_HD000040" root="2.16.840.1.113883.1.3"/>
<templateId extension="2015-08-01" root="2.16.840.1.113883.10.20.22.1.1"/>
<templateId extension="2015-08-01" root="2.16.840.1.113883.10.20.22.1.2"/>
</ClinicalDocument>
Refactor the above code for Mirth and you will get the same output. Otherwise, if you simply want to append the extra tag at the end of the XML, you can use this code in a Mirth script:
var addTemplateId = new XML("<templateId></templateId>");
addTemplateId['@root'] = '2.16.840.1.113883.10.20.22.1.1';
addTemplateId['@extension'] = '2015-08-01';
var newValue = msg.appendChild(addTemplateId);
msg = newValue;
This adds the new tag after the existing child elements of the message, which means your output will look like this:
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:mif="urn:hl7-org:v3/mif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:sdtc="urn:hl7-org:sdtc">
<realmCode code="US"/>
<typeId extension="POCD_HD000040" root="2.16.840.1.113883.1.3"/>
<templateId extension="2015-08-01" root="2.16.840.1.113883.10.20.22.1.2"/>
<templateId extension="2015-08-01" root="2.16.840.1.113883.10.20.22.1.1"/>
</ClinicalDocument>
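The positional insert from the Java answer, which places the new templateId immediately ahead of the existing one, can be illustrated outside Mirth with Python's standard xml.etree.ElementTree. This is only a sketch of the DOM operation, not Mirth transformer code. The document is trimmed from the question, and the insert position is taken from the index of the existing templateId within its parent.

```python
import xml.etree.ElementTree as ET

NS = "urn:hl7-org:v3"
ET.register_namespace("", NS)  # keep the default namespace unprefixed on output

doc = ET.fromstring(
    '<ClinicalDocument xmlns="urn:hl7-org:v3">'
    '<realmCode code="US"/>'
    '<typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>'
    '<templateId root="2.16.840.1.113883.10.20.22.1.2" extension="2015-08-01"/>'
    '</ClinicalDocument>'
)

# Locate the existing templateId and build the new one in the same namespace.
existing = doc.find(f"{{{NS}}}templateId")
new_template = ET.Element(
    f"{{{NS}}}templateId",
    {"root": "2.16.840.1.113883.10.20.22.1.1", "extension": "2015-08-01"},
)

# Insert at the existing element's position, which pushes it one slot down.
doc.insert(list(doc).index(existing), new_template)

roots = [e.get("root") for e in doc.findall(f"{{{NS}}}templateId")]
print(roots)
```

Creating the element in the document's namespace matters here; an unqualified templateId would be serialized with an empty xmlns attribute and would not be a valid HL7 v3 element.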

How to export xml or json from typo3

I need to export content from a TYPO3 site to a web app. I am considering using XML or JSON, but I haven't figured out how to do it.
I'm new to TYPO3 development, so I would like to know if someone has suggestions on how to do this.
Regards,
This highly depends on your requirements ;)
As a starting point you can use a new page type and disable all header codes to generate XML, e.g.:
xml = PAGE
xml {
  typeNum = 123
  config {
    disableAllHeaderCode = 1
    xhtml_cleaning = none
    admPanel = 0
    metaCharset = utf-8
    additionalHeaders = Content-Type:text/xml;charset=utf-8
  }
  10 = COA
  10 {
    wrap = <?xml version="1.0" encoding="UTF-8" standalone="yes" ?><your_root_tag>|</your_root_tag>
    # add code here to generate xml content
    10 = ...
  }
}
If you browse to http://example.com/index.php?type=123 you'll get the XML content.
But if things get more complex, writing an extension may be the better approach.

Problem while configuring the file Delimeter("\t") in app.config(C#3.0)

In my app.config file I made the setting like the following
<add key = "Delimeter" value ="\t"/>
Now while accessing the above from the program by using the below code
string delimiter = ConfigurationManager.AppSettings["Delimeter"].ToString();
StreamWriter streamWriter = null;
streamWriter = new StreamWriter(fs);
streamWriter.BaseStream.Seek(0, SeekOrigin.End);
Enumerable
    .Range(0, outData.Length)
    .ToList().ForEach(i => streamWriter.Write(outData[i].ToString() + delimiter));
streamWriter.WriteLine();
streamWriter.Flush();
I am getting the output as
18804\t20100326\t5.59975381254617\t
18804\t20100326\t1.82599797249479\t
But if I directly use "\t" in the delimeter variable I am getting the correct output
18804 20100326 5.59975381254617
18804 20100326 1.82599797249479
I found that while I am specifying the "\t" in the config file, and while reading it into
the delimeter variable, it is becoming "\\t" which is the problem.
I even tried with but with no luck.
I am using C#3.0.
Need help
You need to use the XML entity that represents tab, which I believe is &#x9;, rather than the C# representation (which is "\t" as you already know).
<add key="Delimeter" value="&#x9;"/>
Or you could always just take the easy way out:
// allow for <add key="Delimeter" value="\t"/>
if (delimiter == @"\t")
    delimiter = "\t";
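The difference between the two spellings can be checked with any XML parser. The Python sketch below uses xml.etree.ElementTree as a stand-in for the .NET configuration reader, on the assumption that both follow the XML rules for character references: backslash escapes mean nothing to XML, while the character reference for U+0009 is resolved to a real tab.

```python
import xml.etree.ElementTree as ET

# "\\t" in this Python source is a literal backslash followed by the letter t,
# which is exactly what value="\t" contains in the config file.
escaped = ET.fromstring('<add key="Delimeter" value="\\t"/>').get("value")

# &#x9; is the XML character reference for the tab character.
reference = ET.fromstring('<add key="Delimeter" value="&#x9;"/>').get("value")

print(repr(escaped), len(escaped))
print(repr(reference), len(reference))
```

A literal tab typed directly into the attribute would not be a safe alternative, since XML attribute-value normalization turns it into a space.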

Parsing an XML string containing "&#x20;" (which must be preserved)

I have code that is passed a string containing XML. This XML may contain one or more instances of &#x20; (a character reference for the space character). I have a requirement that these references should not be resolved (i.e. they should not be replaced with an actual space character).
Is there any way for me to achieve this?
Basically, given a string containing the XML:
<pattern value="[A-Z0-9&#x20;]" />
I do not want it to be converted to:
<pattern value="[A-Z0-9 ]" />
(What I am actually trying to achieve is to simply take an XML string and write it to a "pretty-printed" file. This is having the side-effect of resolving occurrences of &#x20; in the string to a single space character, which need to be preserved. The reason for this requirement is that the written XML document must conform to an externally-defined specification.)
I have tried creating a sub-class of XmlTextReader to read from the XML string and overriding the ResolveEntity() method, but this isn't called. I have also tried assigning a custom XmlResolver.
I have also tried, as suggested, to "double encode". Unfortunately, this has not had the desired effect, as the &amp; is not decoded by the parser. Here is the code I used:
string schemaText = @"...<pattern value=""[A-Z0-9&amp;#x20;]"" />...";
XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.Indent = true;
writerSettings.NewLineChars = Environment.NewLine;
writerSettings.Encoding = Encoding.Unicode;
writerSettings.CloseOutput = true;
writerSettings.OmitXmlDeclaration = false;
writerSettings.IndentChars = "\t";
StringBuilder writtenSchema = new StringBuilder();
using ( StringReader sr = new StringReader( schemaText ) )
using ( XmlReader reader = XmlReader.Create( sr ) )
using ( TextWriter tr = new StringWriter( writtenSchema ) )
using ( XmlWriter writer = XmlWriter.Create( tr, writerSettings ) )
{
XPathDocument doc = new XPathDocument( reader );
XPathNavigator nav = doc.CreateNavigator();
nav.WriteSubtree( writer );
}
The written XML ends up with:
<pattern value="[A-Z0-9&amp;#x20;]" />
If you want it to be preserved, you need to double-encode it: &amp;#x20;. The XML reader will translate entities; that's more or less how XML works.
<pattern value="[A-Z0-9&amp;#x20;]" />
What I did above was replace "&" with "&amp;", thereby escaping the ampersand.
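What a parser does with the single- and double-encoded forms can be seen in a short Python sketch. It uses xml.etree.ElementTree rather than the .NET classes from the question, so it only illustrates general XML parsing behaviour and is not a fix for the XmlWriter code.

```python
import xml.etree.ElementTree as ET

single = ET.fromstring('<pattern value="[A-Z0-9&#x20;]" />')
double = ET.fromstring('<pattern value="[A-Z0-9&amp;#x20;]" />')

# The character reference is resolved to a real space during parsing.
single_value = single.get("value")

# The escaped ampersand is resolved to "&", leaving the text "&#x20;" behind.
double_value = double.get("value")

# On output the serializer must escape that "&" again.
rewritten = ET.tostring(double, encoding="unicode")

print(repr(single_value))
print(repr(double_value))
print(rewritten)
```

This matches the behaviour reported in the question: a parse-and-write round trip carries the double-encoded form through unchanged, so the ampersand would still have to be unescaped once after writing to end up with the original reference in the file.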