I am trying to extract metadata for the component files inside a package file using Tika at the command line, but I can only seem to get it to output metadata for the containing package file. Example: test_files.zip contains two files, test1.doc and test2.doc. I want to get the metadata for test1.doc and test2.doc, but cannot figure out how to do so.
I tried to run this:
java -jar tika-app-1.5.jar -m test_files.zip
but that just outputted the Content-Length, Content-Type, and resourceName for test_files.zip.
I also tried to run this:
java -jar tika-app-1.5.jar -h test_files.zip
That outputted the HTML for each component file, wrapped in a <div> with class "package-entry", but the metadata tags were again outputted only for the containing package file test_files.zip. I tried using the -x parameter instead of -h, and no parameter at all, and got exactly the same result.
How do I get the metadata for the component files? I don't mind parsing the embedded metadata from the XHTML, but I cannot figure out how to get it injected into the XHTML or otherwise outputted.
Any help much appreciated. Thank you.
Since you've said you want to do it with only the tika-app jar, your best option is something like:
# Create a temp directory
cd /tmp
mkdir tika-extracted
cd tika-extracted
# Have Tika extract all the embedded resources into it
# (note: $INPUT must be an absolute path, since we've changed directory)
java -jar tika-app-1.5.jar --extract "$INPUT"
# Process each extracted file in turn
for e in *; do
    java -jar tika-app-1.5.jar --metadata "$e"
done
# Tidy up
cd /tmp
rm -rf tika-extracted
Using Java, you'd be able to register your own EmbeddedDocumentExtractor on the ParseContext, and use that to trigger the metadata extraction for each embedded document individually.
Background
I am attempting to use a command called xmllint to parse an HTML file for a specific value inside a tag. All of the examples I have seen online use the --html option alongside the --xpath option, as in nwellnhof's example:
xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF
However, my local version of xmllint does not contain the --xpath option. I would like to figure out which version of the command I am using so I can parse HTML properly.
Question
How do I find out which version of a command I am using on Linux?
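For what it's worth, most command-line tools report their version via a standard flag, and xmllint is no exception; a quick check might look like:
xmllint --version       # prints the libxml version it was built against
command -v xmllint      # shows which xmllint binary is actually being run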
I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=x will ignore x remote directory components, avoiding the recreation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my list-of-files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but that doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of url's the way I do now, keeping my parameters.
b.- get all files at the same directory and being able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, >Wget's behavior depends on a few options, including -nc. In certain >cases, the local file will be
clobbered, or overwritten, upon repeated download. In other >cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the >same file in the same directory will result in the original copy of file >being preserved and the second copy
being named file.1. If that file is downloaded yet again, the >third copy will be named file.2, and so on. (This is also the behavior >with -nd, even if -r or -p are in
effect.) When -nc is specified, this behavior is suppressed, >and Wget will refuse to download newer copies of file. Therefore, ""no->clobber"" is actually a misnomer in
this mode---it's not clobbering that's prevented (as the >numeric suffixes were already preventing clobbering), but rather the >multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, >re-downloading a file will result in the new copy simply overwriting the >old. Adding -nc will prevent this
behavior, instead causing the original version to be preserved >and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the >decision as to whether or not to download a newer copy of a file depends >on the local and remote timestamp and
size of the file. -nc may not be specified at the same time as >-N.
A combination with -O/--output-document is only accepted if the >given output file does not exist.
Note that when -nc is specified, files with the suffixes .html >or .htm will be loaded from the local disk and parsed as if they had been >retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However the logic is sound, and with a bit of modification, you can get it to work on other platforms/scripts as well.
Sample pseudocode bash script -
for i in `cat list-of-files-to-download.txt`; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check if the file exists on disk, and then decide whether to rename the temp file to the name that you want.
This offers much more control over your downloads.
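A minimal sketch of that idea, assuming the URL-to-filename mapping below suits you (note that -N cannot be combined with -O, so this version skips files that already exist locally instead of timestamp-checking them):
dest=/directory/for/downloaded/files
tmp=$(mktemp -d)
while read -r url; do
    # derive a unique local name from the URL path, e.g.
    # http://website/directory2/picture-same-name.jpg -> directory2_picture-same-name.jpg
    name=$(echo "$url" | sed 's|^[a-zA-Z]*://[^/]*/||; s|/|_|g')
    if [ ! -e "$dest/$name" ]; then
        wget --user-agent='some-agent' --timeout=30 -O "$tmp/$name" "$url" \
          && mv "$tmp/$name" "$dest/$name"
    fi
done < list-of-files-to-download.txt
rm -rf "$tmp"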
I am using JasperStarter to create PDFs from several .jrprint files and then print them using JasperStarter functions.
I want to create one single pdf file with all the .jrprint files.
If I give command like:
jasperstarter pr a.jrprint b.jrprint -f pdf -o rep
It does not recognise the files after the first input file.
Can we create one single output file with many input jasper/jrprint files?
Please help.
Thanks,
Oshin
Looking at the documentation, this is not possible:
The command process (pr)
The command process is for processing a report.
In direct comparison to the command for compiling:
The command compile (cp)
The command compile is for compiling one report or all reports in a directory.
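As a workaround, you can at least script the per-report processing and merge the resulting PDFs afterwards with an external tool if you need a single file. A minimal sketch (the output names are assumptions):
# produce one pdf per report: rep_a.pdf, rep_b.pdf, ...
for f in a.jrprint b.jrprint; do
    jasperstarter pr "$f" -f pdf -o "rep_${f%.jrprint}"
done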
I'm using Perl's File::Fetch to download a file from the .lastFinished build in Teamcity. This is working fine, except the file is versioned and I'm not getting the version number.
use File::Fetch;

sub GetTeamcityFiles {
    my $latest_version = "C:/downloads";
    my $uri = "http://<teamcity>/guestAuth/repository/download/bt11/.lastFinished/MyApp.{build.number}.zip";

    # fetch the uri to the extract directory
    my $ff = File::Fetch->new(uri => $uri);
    my $where = $ff->fetch( to => $latest_version );
    return $where;
}
This gives me a file:
C:\downloads\MyApp.{build.number}.zip.
However, the name of the file downloaded has a build number in the name. Unfortunately there is no version file within the zip, so this is the only way I have of telling what file I've downloaded. Is there any way to get this build number?
c:\downloads\MyApp.12345.zip
With build configs modification
If you have the ability to modify the build configs in TeamCity, you can easily embed the build number into the zip file.
Create a new build step - choose command line
For the script, do something like: echo %build.number% > version.txt
That will put version.txt at the root directory of your build folder in TeamCity, which you can include in your zip later when you create it.
You can later read that file in.
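For example, after unpacking the zip, a shell script could pick it up with something like (version.txt being the file written by the build step above):
version=$(cat version.txt)
echo "Downloaded build $version"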
I'm not able to access my servers right now so I don't have the exact name of the parameter, but typing %build will pull up a list of TeamCity parameters to choose from, and I think it is %build.number% that you're after.
Without build configs modification
If you're not able to modify the configs, you're going to need something like egrep:
$ echo MyApp.12.3.4.zip | egrep -o '([0-9]+\.){2}[0-9]+'
> 12.3.4
$ echo MyApp.1234.zip | egrep -o '[0-9]+'
> 1234
It looks like you're running on Windows; in those cases I use UnxUtils & UnxUpdates to get access to utilities like this. They're very lightweight and do not install to the registry, just add them to your system PATH.
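For example, to capture the number in a shell variable for later use (a sketch, assuming the downloaded file name is known):
file=MyApp.1234.zip
build=$(echo "$file" | egrep -o '[0-9]+')
echo "$build"    # prints 1234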
I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
However, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command, this will result in:
index.html?p=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#p=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output. So the name that appears after Saving to: is the one I need.
However, this is wrapped in a strange character, â. Rather than just removing it with a hardcoded rule, where does this come from?
If you try the --adjust-extension parameter:
wget -p -k --adjust-extension "www.onlinetechvision.com/?p=566"
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is now simple to change that file to index.html, even with a script.
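For example, the rename is a one-liner (the quoting matters because of the ? in the name):
cd www.onlinetechvision.com
mv 'index.html?p=566.html' index.html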
If you are on a Microsoft OS, make sure you have a recent version of wget; it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
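To illustrate with a hypothetical site, two requests that differ only in their query string are saved side by side rather than one overwriting the other:
wget "http://www.example.com/index.html?page=52"
wget "http://www.example.com/index.html?page=53"
# the local directory now holds both:
#   index.html?page=52
#   index.html?page=53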
My solution is to do recursive crawling outside wget:
get directory structure with wget (no file)
loop to get main entry file (index.html) from each dir
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website design: the site is using the same standard index.html for all content, and then using the query string to pull in the content from another page, like with a script on the server side (it may be client side if you look in the JavaScript).
Have you tried using --no-cookies? It could be storing this information via a cookie and pulling it in when you hit the page. This could also be caused by URL rewrite logic, which you will have little control over from the client side.
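For example (same flags as before, just with cookies disabled):
wget --no-cookies -p -k "http://www.onlinetechvision.com/?p=566"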
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/
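For example (note that -O writes everything wget downloads to that single file, so it does not combine well with -p):
wget -O index.html "http://www.onlinetechvision.com/?p=566"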