Filter (phrase) out specific tags in document - filtering

I'm on OS X, and I was wondering if there is a piece of software out there that allows me to extract specific tags and their values from a document.
I have an XML file that contains path information within a tag like this:
<pathurl>file://localhost/disk1/pahttofile.mov</pathurl>
I need to extract only these tags and the path info.
How would I do this without having to find and copy-paste a million times?
THX!
Karel.

In Python:
import xml.etree.ElementTree as ET

def getpathurls(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    # './/pathurl' matches <pathurl> elements at any depth, not just direct children of the root
    return [path.text for path in root.findall('.//pathurl')]
Depending on how your file is formatted, this may get you the results you're looking for.
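For example, to dump the paths to the terminal (the filename here is just a placeholder):
for url in getpathurls('sequence.xml'):
    print(url)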
(I know this isn't exactly what you're looking for, but this is the only way to answer the question on-topic to StackOverflow)


Migrating from itext2 to itext7

Years ago, I wrote a small app in itext2 to gather reports on a weekly basis and concatenate them into one PDF. The app used com.lowagie.text.pdf.PdfCopy to copy and merge the PDFs. And it worked fine. Performed exactly as expected.
A few weeks ago I looked into migrating the application to itext7. To that end, I used the copyPagesTo method of com.itextpdf.kernel.pdf.PdfDocument. When run on the same file set, this produces warnings like:
WARN PdfNameTree - Name "section.1" already exists in the name tree; old value will be replaced by the new one.
When I click on the link to "section.1" in the first document of the merged PDF, I am taken to "section.1" of the last document. Not what I expected, and not what happens when using the itext2 app. In the PDFs produced by itext2, if I click on the link to "section.1" of the first document in the combined PDF, I am taken to section 1 of the first document.
There is a hint in the Javadocs for copyPagesTo saying:
If outlines destination names are the same in different documents, all such outlines will lead to a single location in the resultant document. In this case iText will log a warning. This can be avoided by renaming destinations names in the source document.
There is, however, no explanation of how this should be done. I find it odd that this should be necessary in itext7 when it wasn't in itext2.
Is there a simple way to get around this problem?
I've also tried the Sejda desktop app and it produces correct results, but I would prefer to automate the process through a batch script.
My guess is iText 2 didn't even know it might be a problem.
If iText can't deduplicate destination names, the procedure is roughly:
Follow /Catalog -> /Names -> /Dests in each document to find the destination name tree.
Deduplicate the names by adding suffixes. Remember that a name with a suffix added might still collide with an existing name in the same or another document, so be careful! (There's a small sketch of this bookkeeping at the end of this answer.)
Now you can rewrite the destination name trees. Since you have only used suffixes, you can do this in place - the lexicographic ordering of the names is unaltered so the search tree structure is not broken.
Now, rewrite destination links in each PDF for the new names. For example any dictionary entry with key /Dest, or any /D in a /GoTo action.
Now, after all this preprocessing, the files will merge without name clashes.
(I know all this because I've just implemented it for my own PDF software. It's slightly hairy stuff, but not intractable.)
If you like, I can provide a development version of cpdf with this functionality for you to test.
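For what it's worth, the suffixing itself is the only fiddly bookkeeping. Here is a minimal, library-free sketch of just that step in Python (this is not cpdf's or iText's code, only an illustration; the PDF-level reading and rewriting in the other steps still needs your PDF toolkit):
def dedupe_destination_names(docs):
    # docs: one list of destination names per source document.
    # Returns one dict per document, mapping old name -> new collision-free name.
    reserved = set()
    for names in docs:
        reserved.update(names)  # reserve every original name up front
    assigned = set()
    renamings = []
    for names in docs:
        mapping = {}
        for name in names:
            candidate = name
            counter = 1
            # The first document to use a name keeps it; later uses get ".1", ".2", ...
            # A suffixed candidate must not clash with any assigned or original name.
            while candidate in assigned or (candidate != name and candidate in reserved):
                candidate = "%s.%d" % (name, counter)
                counter += 1
            assigned.add(candidate)
            mapping[name] = candidate
        renamings.append(mapping)
    return renamings

# Example: two documents that both define "section.1"
print(dedupe_destination_names([["section.1", "section.2"], ["section.1"]]))
# [{'section.1': 'section.1', 'section.2': 'section.2'}, {'section.1': 'section.1.1'}]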

exiftool write xmp tag

I am trying to write a new value for an XMP tag using exiftool, but for some reason the tag is not being recognized.
Reading the field works:
exiftool -PropertyId /Users/user/test.jpg
Property Id : 17934
But trying to write a value for the PropertyId tag does not work. I also tried using -xmp:PropertyId, but I get the same result:
exiftool -PropertyId=12345 /Users/user/test.jpg
Warning: Tag 'PropertyId' is not defined
Nothing to do.
Exporting the metadata shows that the field is there (I only copied the XMP section):
exiftool -xmp -b -a /Users/user/test.jpg > data.xmp
...
<rdf:Description rdf:about=''
xmlns:xmp='http://ns.adobe.com/xap/1.0/'>
<xmp:Brand>Brand Name</xmp:Brand>
<xmp:CreateDate>2015-07-08T11:45:21</xmp:CreateDate>
<xmp:CreatorTool>CreatorTool</xmp:CreatorTool>
<xmp:FacilityName>The Restaurant Name</xmp:FacilityName>
<xmp:MetadataDate>2015-09-14T13:12:51-06:00</xmp:MetadataDate>
<xmp:ModifyDate>2015-09-14T13:12:51-06:00</xmp:ModifyDate>
<xmp:PropertyId>00000</xmp:PropertyId>
<xmp:PropertyName>Property Name</xmp:PropertyName>
<xmp:ShootDate>2016-03-12</xmp:ShootDate>
</rdf:Description>...
Am I missing something? Test file is here: test.jpg
Exiftool cannot edit metadata it does not have a definition for, as is the case here. In fact, your example XMP shows a lot of tags that claim to be part of the "xap" group but are not actually part of that (very old) standard, including Brand, FacilityName, PropertyName, and ShootDate. You'll find that none of those are directly editable by exiftool, and probably not by any other program except the one that originally wrote them.
If you want exiftool to be able to write those tags, you'll need to create definitions for those tags. See the ExifTool Example Config file for details.
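As a rough, untested illustration of what that looks like (the tag list below is just lifted from your XMP dump; check the Example Config for the exact options you need), a user-defined block in an ExifTool config file could be along these lines:
%Image::ExifTool::UserDefined = (
    # Add the unknown tags to the XMP-xmp namespace so they match the
    # xmlns:xmp='http://ns.adobe.com/xap/1.0/' block in the file.
    'Image::ExifTool::XMP::xmp' => {
        PropertyId   => { Writable => 'string' },
        PropertyName => { Writable => 'string' },
        Brand        => { Writable => 'string' },
        FacilityName => { Writable => 'string' },
        ShootDate    => { Writable => 'string' },
    },
);
1;  # end
Saved as ~/.ExifTool_config (or passed with the -config option), something like exiftool -xmp:PropertyId=12345 /Users/user/test.jpg should then stop reporting the tag as undefined.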
Also take note that, as I said, "xap" is a very old standard and has long since been replaced. Exiftool will update the tags it does know to the newer standard. For details, see the XMP xmp tags entry.

Novacode LineChart type

I have code that creates a Novacode.LineChart. The chart type shown by default is this one:
But I don't want this type of chart; I want it without points, like this:
This is the code where I create the chart:
LineChart c = new LineChart();
c.AddLegend(ChartLegendPosition.Bottom, false);
c.Grouping = Grouping.Stacked;
Does anyone know how I can hide those points and show only the lines? Thanks to everyone!
Your question showed up while I was searching for the exact same feature. It's probably a bit late, but I hope it will be useful for other people in need of this feature.
My so-called answer is no more than a few lines of dirty and unmanageable hack, so unless you are in dire need, I don't recommend following this route.
I also don't know whether this is an approved approach here, but I prefer to write the solution out step by step, so it may help you grasp the concept and use better methods.
Once I realized that I was unable to use DocX to create a line chart without markers using the currently provided API, I wanted to know what the differences were between the actual and desired output. So I saved a copy of the .docx file with the line chart after manually editing the chart to the expected result.
Before and after the edit
As you may already know, a .docx file is a container format, essentially made up of a few different folders and files; you can open it with a .zip archive extractor. I used 7-Zip for this task and found the chart file at /word/charts/chart1.xml. This may differ depending on the file, but you can easily figure it out.
Comparing both chart1.xml files, the difference was that the file without the markers had an extra XML tag with an additional attribute:
<c:marker>
<c:symbol val="none" />
</c:marker>
I had to somehow add this segment of XML to the chart, so I built the following on top of the example code provided by DocX. You can follow along from: DocX/ChartSample.cs at master
This is where the fun begins. Easy part first.
using System.Xml;
using System.Xml.Linq;
using Xceed.Words.NET;
// Create a line chart.
var line_chart = new LineChart();
// Create the data.
var PlaceholderData = ChartData.GenerateRandomDataForLinechart();
// Create and add series
var Series_1 = new Series("Your random chart with placeholder data");
Series_1.Bind(PlaceholderData, "X-Axis", "Y-Axis");
line_chart.AddSeries(Series_1);
// Create a new XmlDocument object and clone the actual chart XML
XmlDocument XMLWithNewTags = new XmlDocument();
XMLWithNewTags.LoadXml(line_chart.Xml.ToString());
I used the XPath Visualizer Tool to determine the XPath query, which is important to get right, because you can't just add the marker tag somewhere and expect it to work. Why do I mention this? Because I appended the marker tag on a random line and expected it to work. Naive.
// Set a namespace manager with the proper XPath location and alias
XmlNamespaceManager NSMngr = new XmlNamespaceManager(XMLWithNewTags.NameTable);
string XPathQuery = "/c:chartSpace/c:chart/c:plotArea/c:lineChart/c:ser";
string xmlns = "http://schemas.openxmlformats.org/drawingml/2006/chart";
NSMngr.AddNamespace("c", xmlns);
XmlNode NewNode = XMLWithNewTags.SelectSingleNode(XPathQuery, NSMngr);
Now create the necessary tags in the newly created XmlDocument object, using the specified namespace:
XmlElement Symbol = XMLWithNewTags.CreateElement("c", "symbol", xmlns);
Symbol.SetAttribute("val", "none");
XmlElement Marker = XMLWithNewTags.CreateElement("c", "marker", xmlns);
Marker.AppendChild(Symbol);
NewNode.AppendChild(Marker);
Now we should copy the latest changes back into the actual XML object. But oops: understandably, it's defined with a private setter, so it's effectively read-only. This is where I thought, "Okay, I've fiddled enough with this. I'd better find another library", but then decided to press on, because reasons.
I downloaded the DocX repo, changed this line to
get; set;
recompiled, copied Xceed.Words.NET.dll to both the projectfolder/packages and projectfolder/projectname/bin/Debug folders, and finally the last few lines were:
// Copy the contents of latest changes to actual XML object
line_chart.Xml = XDocument.Parse(XMLWithNewTags.InnerXml);
// Insert chart into document
document.InsertChart(line_chart);
// Save this document to disk.
document.Save();
Is it worth it? I'm not sure, but I learned a few things while working on it. There are probably lots of bad programming practices in this answer, so please tell me if you spot one. Sorry for meh English.

Search string formatting in Eloqua API

I'm using the Eloqua REST API in an integration with another product, and I want to implement a file browser. As part of this, I want to get a list of the folders inside another folder. The API documents here say that a search string can be appended, but they don't give any clues as to the format of the search string. I've tried various things, but so far I'm just getting empty results. An example is here:
/API/rest/1.0/assets/email/folders?search=folderId+%3D+250
I've tried with and without the +'s, with and without URL-encoding the = sign, and various combinations of quote marks, but so far nothing.
I believe what you want is a slightly different endpoint, e.g.:
/API/rest/1.0/assets/email/folder/250/contents
which would provide a list of the folders contained within folder 250.
If you wanted to search for a given folder name, you would use:
/API/rest/1.0/assets/email/folders?search=foldername
Hope that helps!
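If it helps, here is a quick sketch of calling that endpoint from Python with the requests library (the base URL, the credentials, and the 'elements' key in the response are assumptions to check against your own Eloqua instance):
import requests

BASE = "https://secure.p01.eloqua.com"        # placeholder pod URL
AUTH = ("MyCompany\\my.user", "my-password")  # Eloqua basic auth is "site\user"

# List everything contained in email folder 250
resp = requests.get(BASE + "/API/rest/1.0/assets/email/folder/250/contents", auth=AUTH)
resp.raise_for_status()
for item in resp.json().get("elements", []):
    print(item.get("id"), item.get("name"), item.get("type"))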

Is there a simple way to extract content from a webpage?

Our build software generates a webpage when the build fails, and lists the users who've committed since the last build. I'd like to have a way to parse the page for members of my team. For example:
Commit
18e1bc67b7e3123987daf8c219a4fbe2003de4
by bob.dole</b><pre>1112233- Description on header is not carried forward to BD doc after PCPROJBILL is ran<br></pre></div></td></tr><tr><td width="16"><img title="The file was modified" height="16" alt="The file was modified" width="16" src="/static/fbfd5d7f/images/16x16/document_edit.png" /></td><td><a>pcbatch/projbill.cpp</a></td></tr><tr class="pane"><td colspan="2" class="changeset"><a name="detail54"></a><div class="changeset-message"><b>
So the script would take a URL as input, search the page for 'bob.dole', and output all of the details associated with him (commit hash, pre-data, etc.) to a file.
Could someone give me an idea of the easiest way to accomplish this? I was thinking of using Perl, but I'm not sure if there's something more straightforward.
If I got your question correctly, you want to fetch the webpage content and parse it to find the user name. If that is the case, I would use PHP.
Use file_get_contents("your_website"); this will return the page as a string for you to parse.
Then you can use strpos() to find the indices of substrings. That will later help you extract the user name and surrounding details with the substr() function.
Hope it helps.
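The same idea sketched in Python, for comparison (the URL is made up, and the 'by bob.dole' marker and 500-character window are just assumptions about the page layout):
import urllib.request

def snippets_for_user(url, user, span=500):
    # Fetch the build page as text (plays the role of PHP's file_get_contents()).
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    marker = "by " + user
    found = []
    start = 0
    while True:
        idx = html.find(marker, start)      # like strpos()
        if idx == -1:
            break
        found.append(html[idx:idx + span])  # like substr(): grab some context after the match
        start = idx + len(marker)
    return found

for snippet in snippets_for_user("http://buildserver/last-failed-build.html", "bob.dole"):
    print(snippet)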
The Perl module you are looking for that helps you search based on nodes is Mojo::DOM.