Parsing an XML string containing " " (which must be preserved) - .net-2.0

I have code that is passed a string containing XML. This XML may contain one or more instances of (an entity reference for the blank space character). I have a requirement that these references should not be resolved (i.e. they should not be replaced with an actual space character).
Is there any way for me to achieve this?
Basically, given a string containing the XML:
<pattern value="[A-Z0-9 ]" />
I do not want it to be converted to:
<pattern value="[A-Z0-9 ]" />
(What I am actually trying to achieve is to simply take an XML string and write it to a "pretty-printed" file. This is having the side-effect of resolving occurrences of in the string to a single space character, which need to be preserved. The reason for this requirement is that the written XML document must conform to an externally-defined specification.)
I have tried creating a sub-class of XmlTextReader to read from the XML string and overriding the ResolveEntity() method, but this isn't called. I have also tried assigning a custom XmlResolver.
I have also tried, as suggested, to "double encode". Unfortunately, this has not had the desired effect, as the & is not decoded by the parser. Here is the code I used:
string schemaText = #"...<pattern value=""[A-Z0-9&#x20;]"" />...";
XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.Indent = true;
writerSettings.NewLineChars = Environment.NewLine;
writerSettings.Encoding = Encoding.Unicode;
writerSettings.CloseOutput = true;
writerSettings.OmitXmlDeclaration = false;
writerSettings.IndentChars = "\t";
StringBuilder writtenSchema = new StringBuilder();
using ( StringReader sr = new StringReader( schemaText ) )
using ( XmlReader reader = XmlReader.Create( sr ) )
using ( TextWriter tr = new StringWriter( writtenSchema ) )
using ( XmlWriter writer = XmlWriter.Create( tr, writerSettings ) )
{
XPathDocument doc = new XPathDocument( reader );
XPathNavigator nav = doc.CreateNavigator();
nav.WriteSubtree( writer );
}
The written XML ends up with:
<pattern value="[A-Z0-9&#x20;]" />

If you want it to be preserved, you need to double-encode it: &#x20;. The XML-reader will translate entities, that's more or less how XML works.

<pattern value="[A-Z0-9&#x20;]" />
What I did above is replaced "&" with "&" thereby escaping the ampersand.

Related

How to cut a string from the end in UIPATH

I have this string: "C:\Procesos\rrhh\CorteDocumentos\Cortados\10001662-1_20060301_29_1_20190301.pdf" and im trying to get this part : "20190301". The problem is the lenght is not always the same. It would be:
"9001662-1_20060301_4_1_20190301".
I've tried this: item.ToString.Substring(66,8), but it doesn't work sometimes.
What can I do?.
This is a code example of what I said in my comment.
Sub Main()
Dim strFileName As String = ""
Dim di As New DirectoryInfo("C:\Users\Maniac\Desktop\test")
Dim aryFi As FileInfo() = di.GetFiles("*.pdf")
Dim fi As FileInfo
For Each fi In aryFi
Dim arrname() As String
arrname = Split(Path.GetFileNameWithoutExtension(fi.Name), "_")
strFileName = arrname(arrname.Count - 1)
Console.WriteLine(strFileName)
Next
End Sub
You could achieve this using a simple regular expressions, which has the added benefit of including pattern validation.
If you need to get exactly eight numbers from the end of file name (and after an underscore), you can use this pattern:
_(\d{8})\.pdf
And then this VB.NET line:
Regex.Match(fileName, "_(\d{8})\.pdf").Groups(1).Value
It's important to mention that Regex is by default case sensitive, so to prevent from being in a situations where "pdf" is matched and "PDF" is not, the patter can be adjusted like this:
(?i)_(\d{8})\.pdf
You can than use it directly in any expression window:
PS: You should also ensure that System.Text.RegularExpressions reference is in the Imports:
You can achieve it by this way as well :)
Path.GetFileNameWithoutExtension(Str1).Split("_"c).Last
Path.GetFileNameWithoutExtension
Returns the file name of the specified path string without the extension.
so with your String it will return to you - 10001662-1_20060301_29_1_20190301
then Split above String i.e. 10001662-1_20060301_29_1_20190301 based on _ and will return an array of string.
Last
It will return you the last element of an array returned by Split..
Regards..!!
AKsh

How to create virtual XML for ZUGFeRD Invoices

I try to create a PDF/A-3b file which contains an embedded XML-File to be ZUGFeRD conform. I use Perl and PDFLib for this purpose. The PDFLib Documentation out there is just for Java and PHP. Creating the PDF works fine, but the XML part is my problem.
So how can i create a pvf from xml and join this to my pdf?
This is what PDFLib recommends in Java:
// Place XML stream in a virtual PVF file
String pvf_name = "/pvf/ZUGFeRD-invoice.xml";
byte[] xml_bytes = xml_string.getBytes("UTF-8");
p.create_pvf(pvf_name, xml_bytes, "");
// Create file attachment (asset) from PVF file
int xml_asset = p.load_asset("Attachment", pvf_name,
"mimetype=text/xml description={ZUGFeRD invoice in XML format} "
+ "relationship=Alternative documentattachment=true");
// Associate file attachment with the document
p.end_document("associatedfiles={" + xml_asset + "}");
So I thought, take the example and fit it to perl:
my $xmldata = read_file($xmlfile, binmode => ':utf8'); #I use example xml at the moment
my $pvf_xml = "/pvf/ZUGFeRD-invoice.xml";
PDF_create_pvf($pdf, $pvf_xml, $xmldata, ""); #because no OOP i need to call it this way (works with all other PDF Functions)
my $xml_invoice = PDF_load_asset("Attachment", $pvf_xml, "mimetype=text/xml "
."description={Rechnungsdaten im Zugferd-Xml-Format} "
."relationship=Alternative documentattachment=true");
PDF_end_document($pdf, "associatedfiles={".$xml_invoice."}");
In PHP examples it's also not needed to convert to ByteArray after reading xml. Further tried it with unpack but don't seem to be the problem.
If I call my script I'm just getting:
Usage: load_asset(type, filename, optlist); at signatur_test.pl line
41.
I think the problem is that pvf_xml isn't created the line before.
Anyone did this before and no how to solve this?
Arg, i was just missing the PDF-Handle in the load_asset method:
my $xml_invoice = PDF_load_asset($pdf, "Attachment", $pvf_xml, "mimetype=text/xml "
."description={Rechnungsdaten im Zugferd-Xml-Format} "
."relationship=Alternative documentattachment=true");
This way it works.

cgi.parse_multipart function throws TypeError in Python 3

I'm trying to make an exercise from Udacity's Full Stack Foundations course. I have the do_POST method inside my subclass from BaseHTTPRequestHandler, basically I want to get a post value named message submitted with a multipart form, this is the code for the method:
def do_POST(self):
try:
if self.path.endswith("/Hello"):
self.send_response(200)
self.send_header('Content-type', 'text/html')
self.end_headers
ctype, pdict = cgi.parse_header(self.headers['content-type'])
if ctype == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, pdict)
messagecontent = fields.get('message')
output = ""
output += "<html><body>"
output += "<h2>Ok, how about this?</h2>"
output += "<h1>{}</h1>".format(messagecontent)
output += "<form method='POST' enctype='multipart/form-data' action='/Hello'>"
output += "<h2>What would you like to say?</h2>"
output += "<input name='message' type='text'/><br/><input type='submit' value='Submit'/>"
output += "</form></body></html>"
self.wfile.write(output.encode('utf-8'))
print(output)
return
except:
self.send_error(404, "{}".format(sys.exc_info()[0]))
print(sys.exc_info() )
The problem is that the cgi.parse_multipart(self.rfile, pdict) is throwing an exception: TypeError: can't concat bytes to str, the implementation was provided in the videos for the course, but they're using Python 2.7 and I'm using python 3, I've looked for a solution all afternoon but I could not find anything useful, what would be the correct way to read data passed from a multipart form in python 3?
I've came across here to solve the same problem like you have.
I found a silly solution for that.
I just convert 'boundary' item in the dictionary from string to bytes with an encoding option.
ctype, pdict = cgi.parse_header(self.headers['content-type'])
pdict['boundary'] = bytes(pdict['boundary'], "utf-8")
if ctype == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, pdict)
In my case, It seems work properly.
To change the tutor's code to work for Python 3 there are three error messages you'll have to combat:
If you get these error messages
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
AttributeError: 'HTTPMessage' object has no attribute 'getheader'
or
boundary = pdict['boundary'].decode('ascii')
AttributeError: 'str' object has no attribute 'decode'
or
headers['Content-Length'] = pdict['CONTENT-LENGTH']
KeyError: 'CONTENT-LENGTH'
when running
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
if c_type == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, p_dict)
message_content = fields.get('message')
this applies to you.
Solution
First of all change the first line to accommodate Python 3:
- c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
+ c_type, p_dict = cgi.parse_header(self.headers.get('Content-Type'))
Secondly, to fix the error of 'str' object not having any attribute 'decode', it's because of the change of strings being turned into unicode strings as of Python 3, instead of being equivalent to byte strings as in Python 3, so add this line just under the above one:
p_dict['boundary'] = bytes(p_dict['boundary'], "utf-8")
Thirdly, to fix the error of not having 'CONTENT-LENGTH' in pdict just add these lines before the if statement:
content_len = int(self.headers.get('Content-length'))
p_dict['CONTENT-LENGTH'] = content_len
Full solution on my Github:
https://github.com/rSkogeby/web-server
I am doing the same course and was running into the same problem. Instead of getting it to work with cgi I am now using the parse library. This was shown in the same course just a few lessons earlier.
from urllib.parse import parse_qs
length = int(self.headers.get('Content-length', 0))
body = self.rfile.read(length).decode()
params = parse_qs(body)
messagecontent = params["message"][0]
And you have to get rid of the enctype='multipart/form-data' in your form.
In my case I used cgi.FieldStorage to extract file and name instead of cgi.parse_multipart
form = cgi.FieldStorage(
fp=self.rfile,
headers=self.headers,
environ={'REQUEST_METHOD':'POST',
'CONTENT_TYPE':self.headers['Content-Type'],
})
print('File', form['file'].file.read())
print('Name', form['name'].value)
Another hack solution is to edit the source of the cgi module.
At the very beginning of the parse_multipart (around the 226th line):
Change the usage of the boundary to str(boundary)
...
boundary = b""
if 'boundary' in pdict:
boundary = pdict['boundary']
if not valid_boundary(boundary):
raise ValueError('Invalid boundary in multipart form: %r'
% (boundary,))
nextpart = b"--" + str(boundary)
lastpart = b"--" + str(boundary) + b"--"
...

Open xml encoding

Im using Open XML SDK to read and write information to custom xml parts in a Word document.
Im serializing a xml structure into a class , modifying the data and serializing it back into the document.
Example code
DocumentFormat.OpenXml.Packaging.CustomXmlPart myPart = GetCustomXmlPart(mainPart, AITNamespaces.SiteDataNameSpace);
StreamReader myStrR = new System.IO.StreamReader(myPart.GetStream());
string myXML = myStrR.ReadToEnd();
sitedata = ObjectXMLSerializer<WDSiteData>.LoadString(myXML);
Site selSite = _applicationService.GetApplicationData().Sites.Find(item => item.Id == siteId);
sitedata.SiteId= selSite.Id;
sitedata.SiteName = "ööööaaaaaåååå"; // selSite.SiteName;
myXML = ObjectXMLSerializer<WDSiteData>.GetXMLString(sitedata);
myPart = GetCustomXmlPart(wordProcDocument.MainDocumentPart, AITNamespaces.SiteDataNameSpace);
using (StreamWriter sw = new StreamWriter(myPart.GetStream(FileMode.Create)))
{
sw.Write(myXML);
}
My problem is that the national characters become encoded and the text "ööööaaaaaåååå" is displayed in the word document as "????????aaaaa????????"
The actual encoding is done
myXML = ObjectXMLSerializer<WDSiteData>.GetXMLString(sitedata);
Anyone has a clue on how to go about handling national characters this way.
For whom it concerns , my trouble was that in my serializer the encoding was fixed
return ASCIIEncoding.ASCII.GetString(memStream.ToArray());
When i changed to
return ASCIIEncoding.UTF8.GetString(memStream.ToArray());
all is well.

Problem while configuring the file Delimeter("\t") in app.config(C#3.0)

In my app.config file I made the setting like the following
<add key = "Delimeter" value ="\t"/>
Now while accessing the above from the program by using the below code
string delimeter = ConfigurationManager.AppSettings["FileDelimeter"].ToString();
StreamWriter streamWriter = null;
streamWriter = new StreamWriter(fs);
streamWriter.BaseStream.Seek(0, SeekOrigin.End);
Enumerable
.Range(0, outData.Length)
.ToList().ForEach(i => streamWriter.Write(outData[i].ToString() + delimiter));
streamWriter.WriteLine();
streamWriter.Flush();
I am getting the output as
18804\t20100326\t5.59975381254617\t
18804\t20100326\t1.82599797249479\t
But if I directly use "\t" in the delimeter variable I am getting the correct output
18804 20100326 5.59975381254617
18804 20100326 1.82599797249479
I found that while I am specifying the "\t" in the config file, and while reading it into
the delimeter variable, it is becoming "\\t" which is the problem.
I even tried with but with no luck.
I am using C#3.0.
Need help
You need to use the XML entity that represents tab, which I believe is rather than the C# representation (which is "\t" as you already know).
<add key="Delimeter" value=" "/>
Or you could always just take the easy way out:
// allow for <add key="Delimeter" value="\t"/>
if (delimiter == #"\t")
delimiter = "\t";