How to make uploaded PDF text searchable in Apache Sling

I am exploring Apache Sling 11 to build a content-driven web application. I have a page where files (PDF/TXT/DOC) can be uploaded to the path /content/company/uploads as nt:file nodes. In the search module I use a JCR query to search for particular text, and I want the text inside PDF/TXT files to be searchable. Right now the search picks up text in TXT files but not in PDF files. The PDF file I used for testing contains nothing but text.
I have configured Tika in oak:index/lucene and ran a re-index, but the query result did not change.
Apache Sling version - 11
Backend - Mongo DB(oak-mongo)
Query that is used
SELECT * FROM [nt:base] WHERE ISDESCENDANTNODE('/content/company/uploads') AND lower([*]) LIKE 'test word'
Tika configuration screenshot below
I am just starting to learn Sling, so any help is highly appreciated. Thanks.

Instead of using LIKE, I used CONTAINS(*, '%test word%') in the query. But now the problem is that the text inside TXT files is not picked up.

Related

Why does DITA Open Toolkit PDF plugin rename image href attributes?

I'm sorry if this doesn't have enough information. I don't typically ask for help online like this.
I'm using DITA Open Toolkit 3.4 on Windows. I generated a plugin called "vcr2" using Jarno's (very excellent and helpful) PDF Plugin Generator and then made a handful of customizations. The plugin uses the pdf2 plugin as a base. When I try to use the vcr2 plugin, my images are not working. I've tracked the problem down to malformed image filenames in the image's href attribute.
For example:
In my source file (a DITA Task), the markup for one of my images looks like this:
<image href="MyRemindersChooseReminder.png"/>
If I run a transform with the pdf2 plugin, the images work fine. In the merged stage1.xml file in the Temp folder, the XML for that same image looks like this:
<image class="- topic/image " href="df2d132af27436c59c5c8c4282e112d62bec8201.png" placement="inline" xtrc="image:1;10:66" xtrf="file:/V:/Vasont/Extract/t12340879-minimal/t12340879.xml"/>
It is processed into a file Topic.fo, and looks like this:
<fo:external-graphic  src="url('file:/V:/Vasont/Extract/t12340879-minimal/MyRemindersChooseReminder.png')"/>
Everything works fine and the image looks fine.
If I run the same file through my 'vcr2' plugin, which just calls the same pdf2 plugin with some overrides, all the images get broken:
stage1.xml
<image class="- topic/image " href="df2d132af27436c59c5c8c4282e112d62bec8201.png" placement="inline" xtrc="image:1;10:66" xtrf="file:/V:/Vasont/Extract/t12340879-minimal/t12340879.xml"/>
Topic.fo
<fo:external-graphic  src="url('file:/V:/Vasont/Extract/t12340879-minimal/df2d132af27436c59c5c8c4282e112d62bec8201.png')" />
As I track this down further, it appears that somewhere in the map-reader Ant task, this filename gets changed to that cryptic string of pseudo-hexadecimal. I think later on it's supposed to be changed back or resolved to a complete URI or something.
So, the two-part question is: Why does Open Toolkit change my filenames, and what's supposed to change them back?

DITA-OT's preprocess uses hashes for temporary filenames because it allows the code to not deal with directory structures. This enables preprocess to work in so-called "map-first" mode, where it first processes all DITA map resources and only then starts to process DITA topic and image resources.
The preprocess has a step called clean-preprocess that can rewrite the temporary file names to match the source resource file names. However, this rewrite operation is disabled for PDF output because the original file names are not used for anything in that output type.
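To illustrate the idea, here is a minimal Python sketch of hash-based temporary naming. It is not DITA-OT's actual code: the function names are made up for this example, and hashing the source URI with SHA-1 is an assumption (suggested only by the 40 hex characters in the file name from the question), so it will not necessarily reproduce that exact hash.

```python
import hashlib
from pathlib import PurePosixPath


def temp_name(source_uri: str) -> str:
    """Return a flat, hash-based temporary file name that keeps the extension.

    Assumption: SHA-1 over the source URI. The real input to the hash
    is not stated in this thread.
    """
    digest = hashlib.sha1(source_uri.encode("utf-8")).hexdigest()  # 40 hex chars
    suffix = PurePosixPath(source_uri).suffix  # e.g. ".png"
    return digest + suffix


def build_job_map(source_uris):
    """Map each temporary name back to the source it was derived from."""
    return {temp_name(uri): uri for uri in source_uris}


sources = [
    "file:/V:/Vasont/Extract/t12340879-minimal/MyRemindersChooseReminder.png",
    "file:/V:/Vasont/Extract/t12340879-minimal/t12340879.xml",
]
job_map = build_job_map(sources)
for temp, original in job_map.items():
    print(temp, "->", original)
```

Because every temporary name is flat, no directory structure has to be recreated in the temp folder, but anything that needs the original name has to go through a lookup like job_map.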

How to make a section optional when mapped to optional data in a Word OpenXml Part?

I'm using OpenXml SDK to generate word 2013 files. I'm running on a server (part of a server solution), so automation is not an option.
Basically I have an xml file that is output from a backend system. Here's a very simplified example:
<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Customer Template</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>2</my:Count>
      <my:OrderList>
        <my:Order>
          <my:Id>1</my:Id>
          <my:Date>19/04/2017 10:16:04</my:Date>
        </my:Order>
        <my:Order>
          <my:Id>2</my:Id>
          <my:Date>20/04/2017 10:16:04</my:Date>
        </my:Order>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
Then I use Word's Xml Mapping pane to map this data to content control:
I simply duplicate the word file, and write new Xml data when generating new files.
This is working as expected. When I update the xml part, it reflects the data from my backend.
However, there's a case that does not work. If a customer has no orders, the template content is kept in the document. The XML data is:
<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Some customer</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>0</my:Count>
      <my:OrderList>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
(see the empty order list).
In Word, the xml pane reflects the correct data (meaning no Order node):
But as you can see, the template content is still here.
Basically, I'd like to hide the order list when there's no order (or at least an empty table).
How can I do that?
PS: If it can help, I uploaded the word and xml files, and a small PowerShell script that injects the data : repro.zip

Thanks for sharing your files so we can better help you.
I had a difficult time trying to solve your problem with your existing Word Content Controls, XML files and the PowerShell script that added the XML to the Word document. I found what seemed to be Microsoft's VSTO example solution to your problem, but I couldn't get this to work cleanly.
I was however able to write a simple C# console application that generates a Word file based on your XML data. The OpenXML code to generate the Word file was generated code from the Open XML Productivity Tool. I then added some logic to read your XML file and generate the second table rows dynamically depending on how many orders there are in the data. I have uploaded the code for you to use if you are interested in this solution. Note: The xml data file should be in c:\temp and the generated word files will be in c:\temp also.
Another added bonus to this solution is if you were to add all of the customer data into one XML file, the application will create separate word files in your temp directory like so:
customer_<name1>.docx
customer_<name2>.docx
customer_<name3>.docx
etc.
Here is the document generated from the first xml file
Here is the document generated from the second xml file with the empty row
Hope this helps.
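For readers who only want the data-driven part of this approach, here is a rough Python sketch of the "read the XML and decide which rows to generate" step. It is not the uploaded C# code and it does not write a .docx file; plan_documents and the combined SAMPLE data are made up for illustration, and only the element names and namespace come from the question.

```python
import xml.etree.ElementTree as ET

NS = {"my": "https://schemas.mycorp.com"}

# Both customers from the question, combined into one file for illustration.
SAMPLE = """<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details><my:Name>Customer Template</my:Name></my:Details>
    <my:Orders>
      <my:Count>2</my:Count>
      <my:OrderList>
        <my:Order><my:Id>1</my:Id><my:Date>19/04/2017 10:16:04</my:Date></my:Order>
        <my:Order><my:Id>2</my:Id><my:Date>20/04/2017 10:16:04</my:Date></my:Order>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
  <my:Customer>
    <my:Details><my:Name>Some customer</my:Name></my:Details>
    <my:Orders><my:Count>0</my:Count><my:OrderList/></my:Orders>
  </my:Customer>
</my:Data>"""


def plan_documents(xml_text):
    """Return one (file_name, order_rows) pair per customer in the data."""
    root = ET.fromstring(xml_text)
    plans = []
    for customer in root.findall("my:Customer", NS):
        name = customer.findtext("my:Details/my:Name", default="", namespaces=NS)
        rows = [
            (order.findtext("my:Id", namespaces=NS),
             order.findtext("my:Date", namespaces=NS))
            for order in customer.findall("my:Orders/my:OrderList/my:Order", NS)
        ]
        plans.append((f"customer_{name}.docx", rows))
    return plans


for file_name, rows in plan_documents(SAMPLE):
    if rows:
        print(file_name, "-> order table with", len(rows), "row(s)")
    else:
        # No orders: the generator can skip the table (or emit an empty one).
        print(file_name, "-> no order table")
```

The point of the design is that the table is built from the orders actually present, so a customer without orders never inherits template rows.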

Creating PDF documents and exporting download links from the Tableau server

Is it possible to create PDF documents (e.g. on a nightly schedule) with Tableau and have those documents exposed by a URL by the Tableau server?
This sort of approach is common in the Jasper Reports and BIRT world, so I was wondering if the same approach is possible with Tableau?
I couldn't see any documentation on the Tableau site for creating PDFs, other than print to PDF

With Tableau Server, you can access your published workbook in a pdf format with this URL:
http://nameofyourtableauserver/views/NameOfYourWorkbook/NameOfYourView.pdf
In other words, take the URL of your view and append ".pdf".
The pdf file will be generated dynamically when accessing the URL.
Another option is to program your own script with tabcmd.
You can have more info on tabcmd here: http://kb.tableausoftware.com/articles/knowledgebase/using-tabcmd

The same technique also works for PNG. You can control filters using ?field_name=value. You can even select multiple values like this ?field_name=value1,value2.
Parameters can be set the same way.
Personally I've had the best luck with discrete dimensions instead of continuous ones.
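As a small sketch of how such URLs could be assembled in a script, here is a Python helper. The function name view_export_url and its parameters are made up for this example; the URL layout and the ?field_name=value1,value2 filter syntax are taken from the answers above, and whether your server expects http or https is something to check yourself.

```python
from urllib.parse import quote, urlencode


def view_export_url(server, workbook, view, fmt="pdf", filters=None):
    """Build the export URL of a view, optionally with filter values.

    filters maps a field name to one value or to a list of values;
    multiple values are joined with commas.
    """
    base = f"http://{server}/views/{quote(workbook)}/{quote(view)}.{fmt}"
    if not filters:
        return base
    pairs = {
        field: ",".join(values) if isinstance(values, (list, tuple)) else values
        for field, values in filters.items()
    }
    # Keep the commas literal so multi-value filters stay readable.
    return base + "?" + urlencode(pairs, safe=",", quote_via=quote)


print(view_export_url("nameofyourtableauserver", "NameOfYourWorkbook", "NameOfYourView"))
print(view_export_url(
    "nameofyourtableauserver", "NameOfYourWorkbook", "NameOfYourView",
    fmt="png", filters={"field_name": ["value1", "value2"]},
))
```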

I use the Windows Task Scheduler with batch files and Tabcmd.
Programs needed:
Tabcmd (how it works: http://onlinehelp.tableausoftware.com/v8.1/server/en-us/tabcmd_overview.htm)
Windows Task Scheduler (All Programs > Accessories > System Tools)
Batchfile (create a text file and then save with file extension .bat):
1- Locate tabcmd and login
2- use function tabcmd get "http:\..." and -f "C:...pdf" to save to file.
3- concatenate the filters you want to use to the end of your URL as shown in other answers(all filters on the view must be included(filled out))
4- Save Batch file
Windows Task Scheduler:
1- create a task that will execute the batch file
2- TEST
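The steps above can also be sketched in Python instead of a batch file. This version only builds the two tabcmd command lines and prints them; it does not run anything. The helper name tabcmd_commands, the server, user, view path, and output file are all placeholders, and using a server-relative view path with tabcmd get is my assumption about how you would call it.

```python
import shlex


def tabcmd_commands(server, user, password, view_path, out_file, filters=None):
    """Return the login and export commands as argument lists."""
    query = ""
    if filters:
        # Filters are appended to the view URL as field=value pairs.
        query = "?" + "&".join(f"{field}={value}" for field, value in filters.items())
    return [
        ["tabcmd", "login", "-s", server, "-u", user, "-p", password],
        ["tabcmd", "get", f"{view_path}.pdf{query}", "-f", out_file],
    ]


commands = tabcmd_commands(
    server="http://nameofyourtableauserver",
    user="report_user",            # placeholder account
    password="change-me",          # placeholder; load the real one securely
    view_path="/views/NameOfYourWorkbook/NameOfYourView",
    out_file="report.pdf",
    filters={"field_name": "value1"},
)
for command in commands:
    print(shlex.join(command))
```

Each list can be handed to subprocess.run(command, check=True) once the values are real, and the script itself can then be scheduled the same way as the batch file.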

You can do this by typing
http://server/views/WorkbookName/SheetName.pdf?:format=pdf
Another option is to use the JavaScript API, like this:
function exportPDF() {
    viz.showExportPDFDialog();
}

Regarding external-document in FOP

I am creating a PDF file from XML and XSL using FOP. I want the PDF to display the contents of an external file, such as a Word document.
I know that to display an image in a PDF we use fo:external-graphic, but which tag should we use to display file contents other than the PDF file type?
There's a FOP extension that claims to be able to do this:
jeremias-maerki.ch/development/fop/index.html
Also see xmlgraphics.apache.org/fop/1.0/extensions.html#external-document
When I used it this way:
<fox:external-document xmlns:fox="http://xmlgraphics.apache.org/fop/extensions"
content-type="pdf" src="C:\temp\reports\p2.pdf"/>
I got this exception:
org.apache.fop.apps.FOPException: Error(Unknown location): No element mapping definition found for fox:external-document
Please let me know the reason.
Thanks in advance.

I'd say you're probably using an old Apache FOP version which doesn't have the fox:external-document extension, yet. Please upgrade to FOP 1.0 (or at least 0.95).

Change the namespace from:
http://xml.apache.org/fop/extensions
to
http://xmlgraphics.apache.org/fop/extensions
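If you are not sure which namespace your generated FO actually ends up with, a quick check along these lines may help. This is only a diagnostic sketch in Python, not part of FOP; the function name and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://xmlgraphics.apache.org/fop/extensions"

OLD_FO = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
         xmlns:fox="http://xml.apache.org/fop/extensions">
  <fox:external-document content-type="pdf" src="p2.pdf"/>
</fo:root>"""

NEW_FO = OLD_FO.replace("http://xml.apache.org/fop/extensions", EXPECTED_NS)


def external_document_namespaces(fo_text):
    """Return the namespace URI of every external-document element."""
    root = ET.fromstring(fo_text)
    return [
        element.tag[1:].split("}", 1)[0]   # tag looks like "{namespace}local-name"
        for element in root.iter()
        if element.tag.endswith("}external-document")
    ]


for label, text in (("old", OLD_FO), ("new", NEW_FO)):
    for namespace in external_document_namespaces(text):
        status = "ok" if namespace == EXPECTED_NS else "unexpected namespace"
        print(label, namespace, status)
```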

Do we have any Equivalent of Response.AppendHeader in windows application

I came across this technique for converting a DataTable to Excel:
http://www26.brinkster.com/mvark/dyna/downloadasexcel.html
Is there an equivalent of Response.AppendHeader for a Windows application in C#?
Regards
Hema

The trick that the code sample you mentioned uses to generate an Excel file dynamically relies on the fact that documents can be converted from Word/Excel to HTML (File->Save As) and vice versa. Essentially, an HTML page containing Office XML is created, and in a web application a file download is triggered with the following Response.AppendHeader statements:
Response.AppendHeader("Content-Type", "application/vnd.ms-excel");
Response.AppendHeader("Content-disposition", "attachment; filename=my.xls");
If you want to use this technique in a WinForms application, just save the string content as a text file and give the file an ".xls" extension. Replace the last 3 lines of the sample's Page_Load method with this line:
System.IO.File.WriteAllText(@"C:\Report.xls", strBody);
HTH
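The same idea, sketched in Python for anyone who wants to try it outside .NET: build an HTML table as a string and save it with an ".xls" extension. This is a simplified illustration, not the linked sample; it writes a plain HTML table rather than Office XML, the helper names are made up, and the file goes to the system temp directory instead of C:\.

```python
import html
import os
import tempfile


def rows_to_html_table(headers, rows):
    """Build the HTML document that will be saved with an .xls extension."""
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<html><body><table><tr>{head}</tr>{body}</table></body></html>"


def save_as_xls(path, headers, rows):
    """Write the HTML to disk; no response headers are involved."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(rows_to_html_table(headers, rows))
    return path


report_path = os.path.join(tempfile.gettempdir(), "Report.xls")
save_as_xls(report_path, ["Id", "Name"], [[1, "A & B"], [2, "C"]])
print("wrote", report_path)
```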
