I am having a strange problem while generating a PDF using Apache FOP. When I generate it on a Windows machine, it comes out correctly.
But when I generate it on a Linux machine, some of the text overlaps in the generated PDF. I am using the same set of XSLT and input XML files on both machines. Please refer to the attached snapshot, which shows the difference between the generated PDFs.
Is there any difference in how FOP performs the transformation when run on different platforms?
Please provide help here.
Thanks and Regards,
Maneesh Sharma
I am trying to perform OCR with Tesseract. I can convert PDF to text using the Tesseract Java library as expected. My requirements have now been extended a bit. I need to extract metadata based on a template form (something like a passport, where there is a fixed place for first name, date of birth, etc.). The input could be either a PDF or an image with the same template form.
I am having a hard time finding any example or article that shows how to achieve this using Tesseract.
So my basic questions are:
Is this possible using tesseract?
Is there any example/articles about how to achieve this using tesseract?
Is there any other software/library which is recommended to achieve this?
Thanks for reading this.
I have been using Apache Tika for extracting text from different document formats. Now I want to make it handle headers, footers and text boxes differently. So I downloaded the Tika source code from GitHub and am trying to make changes to it.
I want to run the Apache Tika source code from Eclipse and debug its execution by passing an input document. How can I do that? There are so many main classes. Where do I start? I understand it's a Maven project, and I am new to Maven.
And once I make changes, how can I create a new jar file?
Take a look at Tika's XHTML output first; it may already mark up headers and footers, in which case you can use the parser API to handle those parts as you wish. If so, use the API as the examples show, passing a custom SAX-like handler to it.
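The custom handler idea can be sketched with nothing but the JDK's SAX classes. This is only a sketch under one loud assumption: that headers and footers arrive wrapped in `<div class="header">` and `<div class="footer">`. Check the actual XHTML output for your documents before relying on those class names. `SectionSplittingHandler` and `split` are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SectionSplittingHandler extends DefaultHandler {
    // One buffer per kind of content we want to treat differently.
    public final StringBuilder header = new StringBuilder();
    public final StringBuilder footer = new StringBuilder();
    public final StringBuilder body = new StringBuilder();

    // Remembers which buffer was active when each element was opened.
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private StringBuilder current = body;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(current);
        // localName is empty when the parser is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        String cls = atts.getValue("class");
        // ASSUMPTION: these class names mark header and footer blocks.
        if ("div".equals(name) && "header".equals(cls)) {
            current = header;
        } else if ("div".equals(name) && "footer".equals(cls)) {
            current = footer;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Leaving an element restores whatever buffer was active before it.
        current = stack.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.append(ch, start, length);
    }

    // Helper for trying the handler on an XHTML string without Tika.
    public static SectionSplittingHandler split(String xhtml) throws Exception {
        SectionSplittingHandler handler = new SectionSplittingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        SectionSplittingHandler h = split(
                "<html><body><div class=\"header\">Page header</div>"
                + "<p>Main text</p><div class=\"footer\">Page footer</div></body></html>");
        System.out.println("header=" + h.header + " body=" + h.body + " footer=" + h.footer);
    }
}
```

With Tika on the classpath, the same handler object could be passed as the `ContentHandler` argument of `Parser.parse(stream, handler, metadata, context)` instead of going through `split`.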
I'm not sure this is the right Stack Exchange site, but it seems to be the place with the most questions about Alfresco that I can find, so here goes.
I have Alfresco Community Edition 4.2.d installed on a RHEL5 64-bit box (mainly a default install, apart from using MySQL as a local database). Uploading PDFs to the documentLibrary is fine, and thumbnail previews and flash previews are generated. If the PDF has been processed by ABBYY OCR (which we have running on a separate server and use to OCR scanned PDFs), then the flash preview generates fine, but the thumbnail is incredibly dark and looks as if it has been attacked by a can of spray paint.
I initially thought it could be a Ghostscript issue, but I have updated that to 9.14 and am still getting this issue. I have also tried playing around with ImageMagick, but I can't get a nice clear thumbnail to generate. I am guessing it is a switch in the convert command that Alfresco is using, but I am struggling to work out a combination of switches that will work, and then where Alfresco would store these parameters, or indeed what switches are currently being used.
I was wondering if anyone had seen this behaviour before with ImageMagick previews in Alfresco 4.2.d? It seems to be unique to PDFs that have been through the OCR process, so I am guessing I will need to create a separate transformation for them at a later stage.
EDIT: So it was suggested that a later version of ImageMagick and GS should resolve it. I have therefore installed GS 9.14 and IM 6.8.9-0 (both compiled from source). Running the following from a command line:
convert /root/test1.pdf[0] /root/test1.png
results in a crystal clear image thumbnail preview. Thinking I was on to a winner I have amended the following lines in alfresco-global.properties to point to the system location of GS and IM:
img.root=/usr
img.dyn=${img.root}/lib
img.exe=${img.root}/bin/convert
img.gslib = /usr/local/share/ghostscript/9.14/lib/
and Alfresco loads. However, the thumbnail previews that Alfresco generates with the new versions of IM and GS are still not nice and clean.
I am guessing that Alfresco is passing some command line switch during the conversion that is undoing the good work of the later versions of these programs. Does anyone know where the switches for thumbnail creation might be stored in Alfresco?
I guess it's related to transparency and the default black background. I didn't find an easy way to add the required parameters to the script, other than registering a new transformer that supports more parameters, such as:
-fill white -opaque none
I need to extract text from DOC, DOCX and PDF files.
I've downloaded two Windows Forms Application demo projects, one here:
http://www.codeproject.com/Articles/31944/Implementing-a-TextReader-to-extract-various-files
the other here:
http://www.codeproject.com/Articles/13391/Using-IFilter-in-C
Both work just fine within the Windows Forms application. But neither works within my ASP.NET MVC2 application: it either says it can't find a filter or throws a "method not implemented" exception.
Could this be a security error? For example, ASP.NET might not be able to reach the various IFilters installed on the machine.
DOC files are working by the way.
Everything is x64, and the files are first saved to disk with the proper permissions set.
Any help is appreciated.
edit:
PDF error is: The method or operation is not implemented.
DOCX error is: No filter defined
In the Windows Forms application, both are working. I'm considering getting the output from a Windows Forms/Console application.
Hopefully Chris Noe is in the house...
Selblocks is an extension for Selenium IDE that provides control-flow constructs such as if/then/else, looping and subroutines.
I'm trying to give iteration over an XML file a whirl and am running into an error. It seems it can't find the XML file. The XML file is co-located with my Sel scripts. Please see the attached screenshot.
Is there a source for more documentation or examples? Like the sample test suite you have a picture of on the extension page?
Thanks,
Cameron
http://cl.ly/AzzT
I ran into the same issue, and the problem turned out to be that my XML was invalid. In my case it was because one of the parameters I was using was a URL containing ampersands. Changing & to &amp; fixed the problem for me, and the variables loaded perfectly.
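For anyone who wants to check a data file before loading it, here is a small sketch of the ampersand problem using only the JDK's SAX parser. The element and attribute names (`testdata`, `vars`, `url`) and the class `AmpersandCheck` are made up for illustration and are not taken from the Selblocks documentation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class AmpersandCheck {
    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A raw '&' in an attribute value ends up here as a parse error.
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Naive escaping for a raw value; it would double-escape text that
    // already contains entities such as "&amp;".
    static String escapeAmpersands(String rawValue) {
        return rawValue.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String url = "http://example.com/?a=1&b=2"; // hypothetical parameter value
        String broken = "<testdata><vars url=\"" + url + "\"/></testdata>";
        String fixed = "<testdata><vars url=\"" + escapeAmpersands(url) + "\"/></testdata>";
        System.out.println("raw & well-formed:     " + isWellFormed(broken));
        System.out.println("escaped & well-formed: " + isWellFormed(fixed));
    }
}
```

Running a check like this on the file first should tell you whether the "file not found" style error is really a parsing failure.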