iText PdfSmartCopy is creating duplicate fonts

I am using iText (5.5.12) PdfSmartCopy to merge two files that have embedded, unsubsetted fonts. Both files were generated on the same machine, so I know they refer to the same font, and I hoped the merged result would contain only a single copy of it.
However, I am finding that the merged result has the font embedded twice.
Here is the code I am using:
String[] srcs = ...
Document document = new Document();
PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(result));
document.open();
for (int i = 0; i < srcs.length; i++) {
    PdfReader reader = new PdfReader(srcs[i]);
    copy.addDocument(reader);
    copy.freeReader(reader);
    reader.close();
}
document.close();
This is the output of pdffonts on the relevant files:
Input file 1:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRomanPSMT                    CID TrueType      Identity-H       yes no  yes     14  0
Input file 2:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRomanPSMT                    CID TrueType      Identity-H       yes no  yes     11  0
Output file:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRomanPSMT                    CID TrueType      Identity-H       yes no  yes      3  0
TimesNewRomanPSMT                    CID TrueType      Identity-H       yes no  yes     25  0

Contrary to your assumption that you have
"two files that have embedded, unsubsetted fonts",
the fonts are subsetted, and differently so.
From file1.pdf:
From file2.pdf:
As you can see, there are numerous differences: there is a non-empty glyph for "1" in file 1 but not in file 2, vice versa for "2", and so on.
Thus, these fonts are not identical and PdfSmartCopy correctly did not replace one by the other.
I assume that pdffonts did not recognize them as subsetted because they are not properly marked as subset fonts: their names don't have the required subset tags, and they don't have the optional CharSet listing of the character names defined in a font subset. So not only are the fonts subsetted, the subsetting was also done incorrectly.
Thus, don't blame pdffonts for your incorrect assumptions but instead the PDF generator which created the input files.
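To illustrate the naming convention mentioned above: a subset font's name is expected to start with a tag of six uppercase letters followed by a plus sign. The following is only a rough sketch of such a check in plain Python; the helper names has_subset_tag and base_font_name are made up for this example and are not part of iText or pdffonts.

```python
import re

# A subset tag is six uppercase ASCII letters followed by "+",
# for example "ABCDEF+TimesNewRomanPSMT".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def has_subset_tag(font_name: str) -> bool:
    """Return True if the font name starts with a subset tag."""
    return _SUBSET_TAG.match(font_name) is not None


def base_font_name(font_name: str) -> str:
    """Strip the subset tag (six letters plus '+'), if there is one."""
    return font_name[7:] if has_subset_tag(font_name) else font_name


if __name__ == "__main__":
    # "ABCDEF+..." is an invented example of a tagged name.
    for name in ("TimesNewRomanPSMT", "ABCDEF+TimesNewRomanPSMT"):
        print(name, has_subset_tag(name), base_font_name(name))
```

The names in the question carry no such tag, which fits pdffonts reporting "no" in the sub column. A name-based check like this says nothing about the actual font program, so two fonts with the same untagged name can still contain different glyphs.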

Related

read data from text file and display in cols and rows

I'm new to this forum and need help. I have a text file with node names and software names in one long column.
I'd like to know how to read in each node name and its software names. Each node name should become a column header, with its software names listed underneath. The code should loop through the file, reading software names until it reaches the next node name, then start a new column and repeat until EOF.
Any idea how to do that?
My text file has something like this:
Node Name: ServerName #1
software name: abc
software name: def
software name: ghi
software name: etc...
on...
and on...
and on...
Node Name: ServerName #2
software name: xyz
and on...
and and on...
plus more...
etc...
next Node Name: ServerName #3
etc...
Expected Output:
Node Name: ServerName #1-------Node Name: ServerName #2------Node Name: ServerName #3
software name: abc-------------------software name: xyz-------------------etc...
software name: def-------------------and on...
software name: ghi-------------------and and on...
software name: etc.------------------plus more...
on...--------------------------------------etc...
and on...
and on...
Hope this sample data helps to clarify my explanation above.
Thanks in advance,

For VBA:
To get text from a text file you can use Open. If your text file has each entry on a new line, you can read from the text file directly into an array, line by line.
Once you have your array, you can search for the headers and pass the values over from this one-dimensional array into a two-dimensional array.
Since I don't know what your file looks like, this is just a suggestion of how it might be done.
Example of how to use Open:
Function ReadFrom(ByVal FilePath As String) As String
    Dim TextFile As Integer, TextBody As String, Data As String
    TextFile = FreeFile 'VBA uses a number as a reference for files, FreeFile returns the next available number
    Open FilePath For Input Access Read As #TextFile
    Do Until EOF(TextFile)
        Line Input #TextFile, Data
        TextBody = TextBody & Data & vbCrLf 'Line Input strips the line break, so add it back
    Loop
    Close #TextFile
    ReadFrom = TextBody
End Function
Example with the text array included:
Function ReadFrom(ByVal FilePath As String, ByVal AsArray As Boolean) As Variant
    Dim TextFile As Integer, TextBody As String, Data As String, TextArray() As String, LineCount As Long
    ReDim TextArray(1 To 100000) As String 'initial size, grown below if the file has more lines
    TextFile = FreeFile 'VBA uses a number as a reference for files, FreeFile returns the next available number
    Open FilePath For Input Access Read As #TextFile
    Do Until EOF(TextFile)
        Line Input #TextFile, Data
        If AsArray Then
            LineCount = LineCount + 1
            If LineCount > UBound(TextArray) Then ReDim Preserve TextArray(1 To UBound(TextArray) * 2)
            TextArray(LineCount) = Data
        Else
            TextBody = TextBody & Data & vbCrLf 'Line Input strips the line break, so add it back
        End If
    Loop
    Close #TextFile
    If AsArray Then
        If LineCount > 0 Then ReDim Preserve TextArray(1 To LineCount) 'trim the unused slots
        ReadFrom = TextArray
    Else
        ReadFrom = TextBody
    End If
End Function
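The second step (going from the one-dimensional array of lines to columns) is not shown above. Here is a rough sketch of the idea, written in Python rather than VBA just to show the logic. It assumes every line containing "Node Name:" starts a new column, which is my reading of the sample data; group_by_node and to_rows are names invented for this example.

```python
from itertools import zip_longest


def group_by_node(lines):
    """Split a flat list of lines into columns, one per 'Node Name:' header."""
    columns = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # "in" instead of startswith, because the sample data also
        # contains a header written as "next Node Name: ...".
        if "Node Name:" in line or not columns:
            columns.append([line])    # start a new column
        else:
            columns[-1].append(line)  # add to the current column
    return columns


def to_rows(columns, fill=""):
    """Transpose columns into rows, padding short columns with `fill`."""
    return [list(row) for row in zip_longest(*columns, fillvalue=fill)]


if __name__ == "__main__":
    sample = [
        "Node Name: ServerName #1",
        "software name: abc",
        "software name: def",
        "Node Name: ServerName #2",
        "software name: xyz",
    ]
    for row in to_rows(group_by_node(sample)):
        print("\t".join(row))
```

In VBA the same loop would fill a two-dimensional array, or write straight into worksheet cells, moving one column to the right each time a header line is found.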

Showing D Coverage Results as Overlays in Source Buffer

The D language compiler DMD outputs its coverage analysis in a file containing the original source as
         |    inout(Ix)[] prefix() inout
         |    {
     2037|        assert(!keys.empty);
     2037|        final switch (keys.length)
         |        {
000000000|        case 1:
000000000|            return keys.at!0[];
     2037|        case 2:
         |            import std.algorithm.searching : commonPrefix;
     2037|            return commonPrefix(keys.at!0[], keys.at!1[]);
         |        }
         |    }
that is, the original source where each line has been prefixed by a 10-character column containing the execution count (if relevant).
When opened in Emacs, I would like this file to be presented as a read-only version of the original source buffer, with a green overlay for the lines exercised at least once and a red overlay for the lines never exercised.
How is this most conveniently implemented in Emacs Lisp? For instance, is there a way to efficiently hide the first 10 characters of each line in a buffer?
See also: https://github.com/flycheck/flycheck/issues/1074
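Independent of the editor, the listing format described above can be taken apart by splitting each line at its first "|". The following is only a sketch in Python (not Emacs Lisp) of that classification step. It assumes the count column contains nothing but digits or blanks, and the function names are invented for this example.

```python
def parse_coverage_line(line):
    """Split one listing line into (count, source).

    count is an int for lines with an execution count and None for
    lines without coverage data. Lines that do not look like listing
    lines are returned unchanged with a count of None.
    """
    prefix, sep, source = line.partition("|")
    digits = prefix.strip()
    if not sep or (digits and not digits.isdecimal()):
        return None, line
    return (int(digits) if digits else None), source


def classify(line):
    """Return 'covered', 'uncovered' or 'none' for one listing line."""
    count, _ = parse_coverage_line(line)
    if count is None:
        return "none"
    return "covered" if count > 0 else "uncovered"


if __name__ == "__main__":
    listing = [
        "         |    {",
        "     2037|        assert(!keys.empty);",
        "000000000|        case 1:",
    ]
    for text in listing:
        print(classify(text), parse_coverage_line(text)[1])
```

An Emacs implementation would do the same per line: hide or strip the text up to and including the first "|", then put a green or red overlay on the rest depending on the count.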

Manipulate UserDefined tag (TXXX frame) with Taglib-Sharp

Situation & Task
I have a large music collection and I want to clean up its ID3v2 tags with PowerShell and taglib-sharp. Some tags like comment or encoding should be deleted, while others like artist or title should not.
Usually you manipulate ID3 tags this way (Simplified version)
# Add taglib dll
[Void][System.Reflection.Assembly]::LoadFrom("$PSScriptRoot\taglib-sharp.dll")
# Load example mp3 into memory as [taglib.file]
$media = [TagLib.File]::Create("C:\path\to\musicFile.mp3")
# Change comment tag
$media.tag.tags[0].Comment = "Hello World"
# Save tags back to mp3 file
$Media.Save()
Problem
Many music files store custom information like URL or Shop Name in a frame called TXXX. Unfortunately, this frame is not accessible with the method shown above. Or I haven't found a way yet.
Instead you use
# Read UserTextInformationFrame
$media.GetTag([TagLib.TagTypes]::Id3v2).GetFrames("TXXX")
This user-defined text information frame can hold multiple values. Some of them are useful, since music players like Foobar store PERFORMER, DATE or replaygain_track_gain tags in TXXX.
Example output for the line above could be:
Description  : replaygain_track_gain
Text         : {-5.00 dB}
FieldList    : {-5.00 dB}
TextEncoding : Latin1
FrameId      : {84, 88, 88, 88}
Size         : 32
Flags        : None
GroupId      : -1
EncryptionId : -1

Description  : URL
Text         : {www.amazon.com}
FieldList    : {www.amazon.com}
TextEncoding : UTF16
FrameId      : {84, 88, 88, 88}
Size         : 43
Flags        : None
GroupId      : -1
EncryptionId : -1
After this, I was able to filter out all unnecessary TXXX values
# Create a whitelist of TXXX frames
$goodTXXX = 'performer','replaygain_track_gain','date'
# Read UserTextInformationFrame AND filter it
$newTXXX = $Media.GetTag([TagLib.TagTypes]::Id3v2).GetFrames("TXXX") |
where { $goodTXXX -contains $_.Description }
Question: How to write multiple values to TXXX frame
So my question is, how do I save my filtered results back to mp3 file?
My failed attempts were:
$media.GetTag([TagLib.TagTypes]::Id3v2).RemoveFrames("TXXX")
$media.GetTag([TagLib.TagTypes]::Id3v2).SetTextFrame("TXXX",$newTXXX)
# Removes old values, but does not show anything in Foobar
#$media.GetTag([TagLib.TagTypes]::Id3v2).GetFrames("TXXX").SetText("Hello World")
# Shows garbage in Foobar. And it's not usable for multiple values
Taglib-Sharp documentation for SetTextFrame
Bonus question: Is taglib-sharp able to strip out Id3v1 and ID3v2.4 tags while saving new tags as ID3v2.3 tags? (Related SO answer, but doesn't distinguish between v2.3 and v2.4)

I found a way by trial and error. It's not elegant, since you have to remove all TXXX values and add them back even if you just want to change a single one.
# Add taglib dll
[Void][System.Reflection.Assembly]::LoadFrom("$PSScriptRoot\taglib-sharp.dll")
# Load example mp3 into memory as [taglib.file]
$media = [TagLib.File]::Create("C:\path\to\musicFile.mp3")
# Get or create the ID3v2 tag.
[TagLib.Id3v2.Tag]$id3v2tag = $media.GetTag([TagLib.TagTypes]::Id3v2, 1)
# Create new 'TXXX frame' object
$TXXXFrame = [TagLib.Id3v2.UserTextInformationFrame]("WWW")
# Delete complete TXXX frame first, or else all values are just appended
$id3v2tag.RemoveFrames("TXXX")
# Set the value/text in the newly created TXXX frame, default Text encoding is UTF8
# Use curly brackets instead of single quotation marks
$TXXXFrame.Text = {www.myurl.com}
# Add TXXX frame to tag
$id3v2tag.AddFrame($TXXXFrame)
# Write all changed tags back to file
$media.Save()

itextsharp text extraction fails for some pdfs

I have a couple of PDF files from which I am not able to extract text. These PDF files were created by converting Word files to PDF.
The main purpose of extracting the text is to index it and make it searchable.
PdfReader reader = new PdfReader(inFileName);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // where strPDFText is string builder
    strPDFText.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page) + " ");
}
string str = strPDFText.ToString();
I get an empty string. What could be the reason for the same. I am using Itextsharp 5.5

While the sample PDF provided by the OP indeed indicates that it is a MS Word export, it simply does not contain any text, only an image (which incidentally shows text).
The content of the PDF is this:
/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 540.1 500.95 Tm
/GS7 gs
0 g
0 G
[( )] TJ
ET
EMC /P <</MCID 1>> BDC q
0.000000071 488.88 612 231.12 re
W* n
468 0 0 219.05 72 500.95 cm
/Image8 Do Q
EMC
As you see the only actual text displayed is a single space ([( )] TJ), and the only remaining content is a bitmap image (/Image8 Do).
Thus,
I get an empty string. What could be the reason for the same.
The reason is that there is no text in your document.

PowerBuilder 12 how to determine encoding of input file

I'm new to PowerBuilder 12 and would like to know whether there is any way to determine the encoding (e.g. Unicode, BIG5) of an input file. Any comments and code samples are appreciated! Thanks!

From the PB 12.5 help file :
FileEncoding ( filename )
filename : The name of the file you want to test for encoding type
Return Values
EncodingANSI!
EncodingUTF8!
EncodingUTF16LE!
EncodingUTF16BE!
If filename does not exist, returns null.

Finding Unicode is pretty easy, if you assume the Unicode file has a BOM prefix (although reality is that not all Unicode files do have this). Some code to do this is below. However, I have no idea about Big5; it looks to me (at first glance at the spec, never had occasion to use it) like it doesn't have a similar prefix.
Good luck,
Terry
function of_filetype (string as_filename) returns encoding
integer li_NullCount, li_NonNullCount, li_OffsetTest
long ll_File
encoding le_Return
blob lblb_UTF16BE, lblb_UTF16LE, lblb_UTF8, lblb_Test, lblb_BOMTest, lblb_Null
lblb_UTF16BE = Blob ("~hFE~hFF", EncodingANSI!)
lblb_UTF16LE = Blob ("~hFF~hFE", EncodingANSI!)
lblb_UTF8 = Blob ("~hEF~hBB~hBF", EncodingANSI!)
lblb_Null = blobmid (blob ("~h01", encodingutf16le!), 2, 1)
SetNull (le_Return)
// Get a set of bytes to test
ll_File = FileOpen (as_FileName, StreamMode!, Read!, Shared!)
FileRead (ll_File, lblb_Test)
FileClose (ll_File)
// test for BOMs: UTF-16BE (FE FF), UTF-16LE (FF FE), UTF-8 (EF BB BF)
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16BE))
IF lblb_BOMTest = lblb_UTF16BE THEN RETURN EncodingUTF16BE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16LE))
IF lblb_BOMTest = lblb_UTF16LE THEN RETURN EncodingUTF16LE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF8))
IF lblb_BOMTest = lblb_UTF8 THEN RETURN EncodingUTF8!
//I've removed a hack from here that I wouldn't encourage others to use, basically checking for
//0x00 in places I'd "expect" them to be if it was a Unicode file, but that makes assumptions
RETURN le_Return
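For comparison, the same BOM test looks like this in Python, using the byte sequences from the standard codecs module. This is only a sketch of the technique, not PowerBuilder code. The names detect_bom and detect_file_bom are invented for this example, and like the function above it cannot tell anything about files without a BOM (Big5 included).

```python
import codecs

# BOMs to test, paired with the encoding name to report.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none.

    UTF-32 is not handled: a UTF-32LE BOM (FF FE 00 00) starts with the
    UTF-16LE BOM and would be reported as 'utf-16-le' here.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_file_bom(path):
    """Read the first few bytes of a file and test them for a BOM."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))


if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbfhello"))
    print(detect_bom(b"plain ascii"))
```

A result of None only means that no BOM was found. The file may still be UTF-8 or UTF-16 without a BOM, or use any other encoding.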
