Apache FOP generated PDFs printing strangely from Adobe Reader; OK on screen

I'm not sure how this can be a FOP issue, but I've never seen it with PDFs from any other source, so I've tried to investigate further.
Our application creates PDFs via xsl-fo, using FOP. This has worked great for a couple of years -- occasionally a user will have trouble printing a specific document, and see a very particular type of corruption, wherein most characters are "incremented". That is to say, 1 becomes 2, M becomes N, period becomes a slash, and the word invoice becomes the mildly amusing "jowpjdf". The document displays fine (typically in Adobe Reader). We've generally worked around it, but now an even odder case presents itself.
A new addition to our application generates two substantially similar PDFs with FOP, then concatenates them using Perl's PDF::Reuse to grab the files from the filesystem and create a new document, which is then sent to the user by email. The user opens the document fine in Reader, hits print, and something new happens: page 1 prints perfectly, but page 2 is corrupt in exactly the manner described above.
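For reference, the concatenation step amounts to something like the following sketch (file names are invented and error handling is omitted; prFile/prDoc/prEnd are PDF::Reuse's documented calls):

#!/usr/bin/perl
# Sketch of the concatenation step described above.
use strict;
use warnings;
use PDF::Reuse;

prFile('combined.pdf');       # start the output document
prDoc('statement-part1.pdf'); # pull in the first FOP-generated PDF
prDoc('statement-part2.pdf'); # append the second one
prEnd();                      # write and close combined.pdf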
If it was a consistent print driver issue, I'd expect to see both pages corrupted. If it was a FOP issue, likewise. If it was a PDF::Reuse issue, I'd expect to see more fundamental breakage, and this breakage is not new since we started concatenating documents. I'm at a loss where to investigate next.
Has anyone seen similar corruption in PDFs, especially when generating using Apache FOP?
tl;dr PDFs created using FOP sometimes print with every character shifted by 1, e.g. A->B, 3->4

Can I have VS Code skip opening previous workspaces one time only?

I use many VS Code workspaces throughout the day. Most of them are backed by directories on NFS-mounted drives, which are only mounted while I'm VPN'd in to my employer's network. Opening VS Code while not VPN'd in will cause all of my windows to close, leaving me with blank/empty workspaces, and then I have to set them all back up again in the morning. It only takes a few minutes to do, but I'm lazy and it's not neat; I like things neat. I know that I can start VS Code without any workspaces using the -n option, which is great, but then the next time I start up the editor for real (i.e. for work purposes), all of my workspaces need to be reopened again (see previous statement re: I'm lazy and I like things neat).
Is there a way to indicate that I want to start VS Code without any project just this one time, and then the next time I start I want all of my old workspaces to reopen as normal? Alternately, does anyone know where the state information is stored and how to edit it? I have no qualms about saving it off and then restoring it after I'm done.
Absent any miracle solution, I've at least found the correct file to manipulate: the storage.json file, which on macOS is found at:
~/Library/Application Support/Code/storage.json
I wrote a Perl script to do the manipulation. When I want to go "offline", it reads in the JSON file, loops through the opened windows, identifies the ones I don't want, removes them using jq, and then launches VS Code. When I'm ready to go back "online", it reads a backup of the original file, looks for the windows it previously removed, adds them back in (also using jq), and then launches VS Code.
The Perl script is a bit too rough around the edges to be posted publicly, but people might find the jq helpful. To delete, you identify the windows to be removed as (zero-based) indexes in the array, and then delete them with the following:
jq '. | del(.windowsState.openedWindows[1,2,5])' '/Users/me/backups/online-storage.json' >'/Users/me/Library/Application Support/Code/storage.json'
If you want to add them back in at some point, you extract the full JSON bits from the backup file, and then use the following command to append them to the back of the array:
jq '.windowsState.openedWindows += [{"backupPath":"...",...,"workspaceIdentifier": {...}}, {"backupPath":"...",...,"workspaceIdentifier": {...}}, {"backupPath":"...",...,"workspaceIdentifier": {...}}]' '/Users/me/backups/offline-storage.json' >'/Users/me/Library/Application Support/Code/storage.json'
The inserted JSON is elided for clarity; you'll want to include the full JSON strings, of course. I don't know what significance the ordering has, so pulling them out of the middle of the array and appending them to the end of the array will likely have some consequence; it's not significant for my purposes, but YMMV.
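For the curious, here is a stripped-down sketch of the kind of wrapper involved (this is not the original script; the paths and window indexes are examples, and for simplicity it restores the backup wholesale rather than re-appending individual windows as described above):

#!/usr/bin/perl
# Toggle VS Code between "offline" (no NFS-backed workspaces) and
# "online" (everything restored).
use strict;
use warnings;

my $storage = "$ENV{HOME}/Library/Application Support/Code/storage.json";
my $backup  = "$ENV{HOME}/backups/online-storage.json";
my $mode    = shift // 'offline';

if ($mode eq 'offline') {
    # Keep a pristine copy, then drop the unwanted windows by index.
    system('cp', $storage, $backup) == 0 or die "backup failed: $?";
    system("jq 'del(.windowsState.openedWindows[1,2,5])' '$backup' > '$storage'") == 0
        or die "jq failed: $?";
} else {
    # Going back online: restore the saved state wholesale.
    system('cp', $backup, $storage) == 0 or die "restore failed: $?";
}

system('code');  # assumes the `code` launcher is on PATH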

VSCode: activeTextEditor encoding

Is there any way to get the current document's encoding (the one shown in the bottom bar) from my extension code?
Something like vscode.window.activeTextEditor.encoding
This does not appear to be possible.
Since it's nearly impossible to prove a negative, the rest of this answer documents what I explored.
The string "encoding" does not appear (in this sense) anywhere in the API docs nor in the index.d.ts file it is derived from. (With VSCode 1.37.1, current as of writing.)
I dug into the vscode sources to see if there might be a clever solution, but came up empty. The code that executes when the encoding is changed by the user is in editorStatus.ts, class ChangeEncodingAction. This makes its way to textFileEditorModel.ts, function updatePreferredEncoding, which sets preferredEncoding. That field controls what happens when the file is saved, and is used to populate the status indicator, but doesn't go anywhere else I can find.
Reading the status indicator itself does not appear possible since the API allows extensions to create new indicators with window.createStatusBarItem but not enumerate existing ones. And directly accessing the DOM is not possible.
I also came up empty searching through VSCode issues related to encoding, both open and closed, but only skimmed the most recent ~100 closed issue titles.
Alternatives
My main suggestion at this point would be to file an enhancement request on the VSCode GitHub repository.
It should also be possible to do something with reflection, but of course that would be fragile.
Finally, the encoding controls how the document in memory (a sequence of characters) maps to a file on disk (a sequence of bytes). Depending on what you're trying to do, it might work to speculatively encode the document in several encodings and compare each to what is on disk (so long as the file is not dirty).
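If it helps, here is that principle sketched in Perl, purely for illustration outside the extension API (the candidate encodings, and reading the document text from a second file, are assumptions of the sketch):

#!/usr/bin/perl
# Given a file path and the document text (here read from a second file),
# report which candidate encodings reproduce the on-disk bytes exactly.
use strict;
use warnings;
use Encode qw(encode);

my ($path, $text_path) = @ARGV;

open my $fh, '<:raw', $path or die "open $path: $!";
my $disk_bytes = do { local $/; <$fh> };
close $fh;

open my $tf, '<:encoding(UTF-8)', $text_path or die "open $text_path: $!";
my $doc_text = do { local $/; <$tf> };
close $tf;

# Speculatively encode the text and compare byte-for-byte.
for my $enc ('UTF-8', 'UTF-16LE', 'ISO-8859-1', 'windows-1252') {
    my $bytes = eval { encode($enc, $doc_text, Encode::FB_CROAK) };
    next unless defined $bytes;
    print "consistent with $enc\n" if $bytes eq $disk_bytes;
}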

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I liked is that it auto-detects that the first row in my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers, instead auto-assigning column1, column2, and so on.
What has changed, and why is this the case? I have checked the auto-detect box and use UTF-8.
While the auto-detect option is usually pretty good, there are times it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, null bytes), or when multiple different styles of newline delimiter are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and some data rows against the original source in a plaintext editor.
When all else fails, you can try a CSV validator, but in my experience they tend to be incredibly opinionated about the file's formatting options, so depending on the system generating the CSV, a validator could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible the process was simply skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

tSendMail - New Line Trouble

I am trying to create an email with some job status information, which I wish to put across multiple lines. However, whatever I do, I get the output on one line. I have changed the MIME type to HTML and used "\n", "\r", "\r\n", and the String object's newline. Nothing seems to work.
I did notice that these characters get consumed, even though the outcome isn't as expected: I don't see them in the email body, which suggests the text processor accepts them, it just doesn't process them the way it should. Am I seeing a bug in the component?
I am on Talend Open Studio 7.0.1, on an Ubuntu 16.04.4 VM, on a Windows 10 system (if that helps).
HTML <br> works.
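For example, with the MIME type set to HTML, a message body like this (the text is purely illustrative) comes out on three lines:

Job finished at 10:42<br>
Status: OK<br>
Rows rejected: 0<br>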
I tried it earlier, but it looks like I didn't structure my HTML tags well, so it failed. I redid it from scratch and got it right.
Guess what: the more you try, the more you learn. :)

Prevent Word 2010 from saving o:gfxdata base64 or uuencoded VML?

I am working with .docx files containing several drawing canvases with images inserted and some lines and arrows drawn in Word 2010. I am using 2010 format with no compatibility mode.
Word inserts an o:gfxdata attribute into each v:shape and v:group element and fills it with ASCII-encoded something. From what I have read, it may be a copy of the VML describing the v:shape or v:group. Perhaps I just don't know what to look for, but I cannot determine what this data is for, as its removal has no apparent effect on my ability to read or edit the document in Word 2003, 2007, or 2010.
It does swell document.xml to almost twice the (apparently) necessary size. This considerably slows OpenTBS's processing, so I would like to remove it, if possible. Does anyone know of a way to tell Word 2010 to quit saving this extra data? Or what it is for? I have really struggled to find any documentation on it beyond this post.
Edit:
Here is a sample .docx. The document.xml is ~141KB, and OpenTBS takes an average of 10.35 seconds to create a file that includes this as a subtemplate 21 times. If I remove all of the o:gfxdata attributes, the file size is reduced to ~37KB and OpenTBS takes only 2.99 seconds to produce the same file.
Edit 2:
After further investigation, it appears the removal of o:gfxdata may cause Word 2003 with an older Compatibility Pack installed to object to the file with the following error:
"This is a pre-release version of the Compatibility Pack and can open
pre-release Office 2007 files only. Do you want to check for a newer
version of the Compatibility Pack?"
I have been able to open the file by installing a newer Compatibility Pack, though it prompts the user about the incompatibility and converts the file in order to open it. This does not damage my file, but it is something to look out for.
The o:gfxdata attribute is poorly documented on the web.
According to your investigations, it appears to be some kind of extra compatibility information.
You can delete those attributes in your template using OpenTBS.
The cleaning can be done once on your template without any merging, and the cleaned template then saved as a new template; alternatively, you can perform the cleaning each time you open the template.
Cleaning the DOCX file:
// Repeatedly find the next start tag carrying an o:gfxdata attribute,
// blank out the attribute's value, then strip the now-empty attribute
// from the template source. The loop ends when no such tag remains.
while ($x = clsTbsXmlLoc::FindStartTagHavingAtt($TBS->Source, 'o:gfxdata', 0)) {
    $x->ReplaceAtt('o:gfxdata', '');
    $TBS->Source = str_replace(' o:gfxdata=""', '', $TBS->Source);
}
Note that the class clsTbsXmlLoc is provided with OpenTBS but is undocumented.
The code should work since OpenTBS 1.8.0 (which is currently in beta).
I've also noticed that once the o:gfxdata attributes are deleted, they do not immediately come back when you edit the docx.