Crawl Image using Apache Nutch - mongodb

I installed Apache Nutch 2.3.1, Solr 6.5.1, and MongoDB 3.4.7.
After crawling URLs that contain many images, neither Solr nor MongoDB contains any images or videos.
I edited the regex-urlfilter.txt file in Apache Nutch and deleted the suffixes related to images (.png, .jpeg, .gif, ...).
I then edited the suffix-urlfilter.txt file and commented out jpeg, gif, and png as well.
Even after these changes, Apache Nutch does not crawl images.
How can I crawl images and see them in Solr?
From what I have read, I understand that I would need to write a plugin. Is that correct?

Nutch supports several formats: Plain Text, HTML/XHTML+XML, XML, MS Office files, Adobe PDF, RSS, RTF, MP3. Unfortunately, there is no support for any kind of image file. Apart from this, I'm curious: what do you want to index from an image file?

If I understand your question correctly, what you want to accomplish is extracting all the metadata from the images and indexing only that in Solr, right?
If Nutch is not even fetching your images, then most likely one of the URL filters is excluding those URLs from being fetched (check the logs). You need to describe your changes to the different files; otherwise it will be impossible to help you.
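To check whether a filter file would still exclude an image URL, it can help to replay the rules by hand. The sketch below is not Nutch code; it only mimics, in Python, the behaviour described in the comments of regex-urlfilter.txt (the first matching rule decides, and a URL that matches no rule is ignored). The two rule sets and the example.com URLs are made up for illustration, and Python's `re` is not identical to Java's regex engine, so treat the result as a hint rather than proof.

```python
import re

# Rules in the style of regex-urlfilter.txt: a leading '+' accepts and a
# leading '-' rejects. Both rule sets are illustrative assumptions, not
# copies of the files shipped with any particular Nutch release.
DEFAULT_RULES = r"""
# skip URLs ending in common image suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG)$
# accept anything else
+.
"""

RELAXED_RULES = r"""
# image suffixes removed from the reject rule
-\.(css|CSS|zip|ZIP|exe|EXE)$
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match at all means the URL is ignored."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    for name, text in (("default", DEFAULT_RULES), ("relaxed", RELAXED_RULES)):
        rules = parse_rules(text)
        print(name, accepts(rules, "http://example.com/logo.png"))
```

If the relaxed rules accept the URL here but Nutch still skips it, the next suspects are the other active URL filter plugins, since every enabled filter has to accept the URL.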
Now, back to the original question: if you want to index only image URLs (along with their metadata), then you need to filter what you index into Solr. Unfortunately, Nutch 2.3 doesn't offer this feature out of the box. In Nutch 1.x you could use mimetype-filter, which lets you specify what to index into Solr/ES depending on the MIME type of the URL. My suggestion is to use Nutch 1.x unless you have a very good reason to use Nutch 2.x. Otherwise you could port the mimetype-filter plugin to 2.x or write your own IndexingFilter that implements your own logic.
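The idea behind filtering by MIME type at indexing time is small enough to sketch. This is only an illustration of the logic in Python, not the mimetype-filter plugin or any Nutch API; the `url` and `content_type` field names and the sample records are assumptions made up for the example.

```python
def keep_for_index(doc, allowed_prefixes=("image/",)):
    """Return True if the document's MIME type starts with an allowed prefix."""
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = (doc.get("content_type") or "").split(";")[0].strip().lower()
    return mime.startswith(tuple(allowed_prefixes))


def filter_docs(docs, allowed_prefixes=("image/",)):
    """Keep only the documents that should be sent to the index."""
    return [doc for doc in docs if keep_for_index(doc, allowed_prefixes)]


if __name__ == "__main__":
    # Hypothetical fetched records.
    fetched = [
        {"url": "http://example.com/", "content_type": "text/html"},
        {"url": "http://example.com/a.png", "content_type": "image/png"},
        {"url": "http://example.com/b.jpg", "content_type": "IMAGE/JPEG; charset=binary"},
        {"url": "http://example.com/c", "content_type": None},
    ]
    for doc in filter_docs(fetched):
        print(doc["url"])
```

A custom IndexingFilter would apply the same kind of test to each document and drop the ones that fail it.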
Keep in mind that what you'll get in Solr is limited to what Tika can extract from the image file (its metadata), which is usually not very well curated.

Related

Is it possible to index custom documents in Faceted Search (ke_search) ?

My requirement is to index the content of documents uploaded by users, e.g. PDF or DOC files. Is it possible to index the content of these documents by building custom indexers for the Faceted Search extension ke_search? If so, can anyone point me to a guide on how to create such indexers?
I am new to TYPO3, so any help would be appreciated.
Hello and welcome to TYPO3 - the almost everything is possible CMS :-)
For your own indexers you will need a small extension of your own, in which you register your indexers with ke_search. Since custom indexers are written in PHP, you can use the whole power of PHP. So you just need some PHP libraries on your server that can read the content of PDF and DOC files, and then store the result in the TYPO3 database.
Check the docs:
https://www.typo3-macher.de/facettierte-suche-ke-search/dokumentation/ein-eigener-indexer/
Example configuration with an indexer:
https://github.com/teaminmedias-pluswerk/ke_search_hooks
EDIT: You should also check the built-in indexers of ke_search. I guess an indexer for PDF / DOC files is already included.

Run Tika source code from Eclipse

I have been using Apache Tika to extract text from different document formats. Now I want to make it handle headers, footers, and text boxes differently, so I downloaded the Tika source code from GitHub and am trying to make changes to it.
I want to run the Apache Tika source code from Eclipse and debug its execution by passing in an input document. How can I do that? There are so many main classes. Where do I start? I understand it's a Maven project, and I am new to Maven.
And once I have made my changes, how can I build a new jar file?
Take a look at Tika's XHTML output first; it may already mark up headers and footers, in which case you can use the parser API to handle those parts as you wish. If so, follow the API examples and pass your own SAX-style handler to the parser.
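To show what "handle these parts in a handler" can look like, here is a rough sketch using Python's standard `xml.sax` module rather than Tika's Java API. The sample XHTML is hand-written, and the `class="header"` / `class="footer"` markers are an assumption about what the output might contain; check the real XHTML Tika produces for your documents before relying on them.

```python
import xml.sax

# Hand-written sample, NOT real Tika output. The div class names are an
# assumption to be verified against what Tika actually emits.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="header"><p>Quarterly report</p></div>
<p>Main body text.</p>
<div class="footer"><p>Page 1</p></div>
</body>
</html>"""


class SectionCollector(xml.sax.ContentHandler):
    """Routes character data into header, footer, or body buckets."""

    def __init__(self):
        super().__init__()
        self._chunks = {"header": [], "footer": [], "body": []}
        self._stack = []  # one entry per open element: section name or None

    def _current_section(self):
        # The nearest enclosing header/footer div wins; otherwise it is body.
        for name in reversed(self._stack):
            if name:
                return name
        return "body"

    def startElement(self, name, attrs):
        cls = attrs.get("class") if name == "div" else None
        self._stack.append(cls if cls in ("header", "footer") else None)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        self._chunks[self._current_section()].append(content)

    def text(self, section):
        # Join the chunks and collapse runs of whitespace.
        return " ".join("".join(self._chunks[section]).split())


def split_sections(xhtml):
    handler = SectionCollector()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return {name: handler.text(name) for name in ("header", "footer", "body")}


if __name__ == "__main__":
    print(split_sections(SAMPLE_XHTML))
```

In Java the same shape applies: a ContentHandler subclass that tracks which element it is inside and treats the text accordingly.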

How do I Serve My WebApp files with CouchDB

I would like to develop an application with CouchDB. I believe it is possible to use ONLY CouchDB to serve HTML, CSS, fonts, icons, JS, and other files, as well as to store and handle the data.
The problem I am facing is:
How do I serve my files using CouchDB, without any middleware such as Node.js? What I have found is that I can upload them as attachments to a _design document, but doing that by hand for every single file is not practical.
You are looking for couchapps. There are tools that take care of the uploading part for you, such as erica and couchapp.
Couchapp documentation is in the wiki part of the repo. Here is the file structure to design doc mapping guide.
For erica everything is in the readme.
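For a feel of what such tools do with static files, here is a rough Python sketch that walks a directory and builds a design document with every file as an inline attachment (base64 `data` plus `content_type` under `_attachments`, which is the inline attachment form CouchDB accepts). It is not how erica or couchapp are implemented; the `app` design document name is a placeholder, and using relative paths with slashes as attachment names is my assumption about a sensible mapping.

```python
import base64
import json
import mimetypes
from pathlib import Path


def build_design_doc(root, ddoc_name="app"):
    """Map every file under `root` to an inline attachment of one design doc."""
    root = Path(root)
    attachments = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Attachment name is the path relative to root, e.g. "css/site.css".
        name = path.relative_to(root).as_posix()
        content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
        attachments[name] = {
            "content_type": content_type,
            "data": base64.b64encode(path.read_bytes()).decode("ascii"),
        }
    return {"_id": "_design/" + ddoc_name, "_attachments": attachments}


if __name__ == "__main__":
    import sys

    # Prints the JSON body that could be sent with PUT to /<db>/_design/app.
    print(json.dumps(build_design_doc(sys.argv[1]), indent=2))
```

The sketch stops at building the JSON. Sending it, and passing the current `_rev` when the design document already exists, is left to whatever HTTP client you prefer.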

How to encode files using ID3v2.3?

I am trying to get files ready for HTTP Live Streaming. I am then trying to embed ID3 metadata into the TS streams so that a jwplayer plugin can read the timed metadata.
However, I am pretty sure the plugin only reads ID3v2.3 and not ID3v2.4, which is what id3taggenerator creates.
Does anyone know how I can either convert the tags to the older version, or create tags in that version and then insert them into the files?
Thanks.
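Not a conversion, but a first step is confirming which version a generated tag actually is. An ID3v2 tag starts with a 10-byte header: the bytes "ID3", a major version byte, a revision byte, a flags byte, and a 4-byte synchsafe size (7 bits per byte). The Python sketch below only reads that header; the function names and sample bytes are made up for illustration, and it does not rewrite or convert anything.

```python
def synchsafe_to_int(data):
    """Decode a synchsafe integer: 7 significant bits per byte, big-endian."""
    value = 0
    for byte in data:
        if byte & 0x80:
            raise ValueError("synchsafe bytes must have the top bit clear")
        value = (value << 7) | byte
    return value


def read_id3v2_header(blob):
    """Return header fields of an ID3v2 tag, or None if the blob has no tag."""
    if len(blob) < 10 or blob[:3] != b"ID3":
        return None
    return {
        "major": blob[3],      # 3 for ID3v2.3, 4 for ID3v2.4
        "revision": blob[4],
        "flags": blob[5],
        "size": synchsafe_to_int(blob[6:10]),  # tag size excluding the header
    }


if __name__ == "__main__":
    # Hypothetical header bytes: an ID3v2.4 tag announcing 257 bytes of data.
    sample = b"ID3" + bytes([4, 0, 0]) + bytes([0, 0, 2, 1])
    print(read_id3v2_header(sample))
```

Changing only the version byte is not enough to get a valid v2.3 tag: v2.4 stores frame sizes as synchsafe integers while v2.3 uses plain 32-bit sizes, so the frames would have to be rewritten too.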

silverstripe upload tiff and convert to jpeg

I'm working on an image database built with the SilverStripe framework. It needs to be able to accept uploads of large TIFF files and convert them to JPEG. Is there a built-in function for that, or do I have to use a library?
Regards,
Florian
SilverStripe doesn't have built-in image conversion utilities, just a GD class which mainly handles resizing of existing images (in gif/jpg/png).
ImageMagick supports conversion of TIFFs (see supported formats). I'm sure you can find PHP wrapper libraries for it (doesn't have to be specific to SilverStripe), or use the commandline tool directly via exec().
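As a rough illustration of the command-line route, here is a Python sketch that builds and runs an ImageMagick call (in PHP the same argument list would go through exec()). It assumes ImageMagick is installed and on the PATH as `magick` or `convert`; the quality value is an arbitrary choice, and the `[0]` suffix picks only the first image of a multi-page TIFF, which may or may not be what you want.

```python
import shutil
import subprocess


def build_convert_command(src, dst, quality=85, binary="convert"):
    """Build the argument list for an ImageMagick TIFF to JPEG conversion."""
    # "[0]" selects the first image in the file (relevant for multi-page TIFFs).
    return [binary, f"{src}[0]", "-quality", str(quality), dst]


def convert_tiff_to_jpeg(src, dst, quality=85):
    """Run the conversion; raises if ImageMagick is missing or the call fails."""
    # Note: on Windows, "convert" may resolve to an unrelated system tool,
    # so "magick" is tried first.
    binary = shutil.which("magick") or shutil.which("convert")
    if binary is None:
        raise RuntimeError("ImageMagick not found on PATH")
    # Passing a list (no shell) avoids quoting problems with odd file names.
    subprocess.run(build_convert_command(src, dst, quality, binary), check=True)
    return dst


if __name__ == "__main__":
    print(build_convert_command("scan.tiff", "scan.jpg"))
```

Whatever wrapper you use, validate the uploaded file name before it reaches the command line.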
Other than that, we have the UploadField class to handle uploads. It uses the jQuery fileupload plugin, which supports larger files (although server timeouts and PHP config play a role here as well). You might want to look into chunked uploads.