How do you trim the XMP XML contained within a JPG (iPhone)?

Through the use of Sanselan I've found that the root cause of iPhone photos imported to Windows becoming uneditable is that there is content (whitespace?) after the actual XML (for more details and a linked example of the bad XMP XML, see https://apple.stackexchange.com/questions/45326/why-can-i-not-edit-some-photos-imported-from-an-iphone-to-windows-vista).
I'd like to scan through my photo archive and 'trim' the XMP XML.
Is there an easy way to do this?
I have some Java code that can recursively navigate my photo archive and DETECT the issue. I'm not sure how to trim the XML and write it back, though.

Obtain the existing XML using any means.
The following works if using the Apache Sanselan library:
String xmpXml = Sanselan.getXmpXml(new File("/path/to/jpeg"));
Then trim it...
xmpXml = xmpXml.trim();
Then write it back to the file using the solution described in "Serializing Xmp XML to an existing jpeg".

Try the following steps:
Collect all of the photos in a single folder (e.g. a folder xmlToConvert on your Desktop).
Open a Terminal.app window.
cd to the directory you put the files in (e.g. cd ~/Desktop/xmlToConvert).
Run the following command from your command-line prompt:
mkdir converted ; for f in *.xml ; do head -n $(wc -l < "$f") "$f" > "converted/$f" ; done
The converted/ sub-directory should now contain all the files without the whitespace at the end
(i.e. a folder called converted inside the xmlToConvert folder you created on your Desktop).
Hope that helps.

How to partially rename downloaded files using wget?

I'd like to download many files (about 10,000) from an FTP server. The file names are too long; I'd like to save them with only the date in the name. For example, I would prefer ABCDE201604120000-abcde.nc to be saved as 20160412.nc.
Is it possible?
I am not sure whether wget provides similar functionality; with curl, however, one can take advantage of the relatively rich syntax it provides for specifying the URL of interest. For example:
curl \
"https://ftp5.gwdg.de/pub/misc/openstreetmap/SOTMEU2014/[53-54].{mp3,mp4}" \
-o "file_#1.#2"
will download the files 53.mp3, 53.mp4, 54.mp3, and 54.mp4. The output file is specified as file_#1.#2: here, #1 is replaced by curl with the value of the sequence [53-54] corresponding to the file being downloaded, and #2 is replaced with either mp3 or mp4. Thus, e.g., 53.mp3 will be saved as file_53.mp3.
ewcz's answer works fine if you can enumerate the file names as shown in the post. However, if the filenames are difficult to enumerate, for example, because the integers are sparsely populated, this solution would result in a lot of 404 Not Found requests.
If this is the case, then it is probably better to download all the files recursively, as you have shown, and rename them afterwards. If the file names follow a fixed pattern, you can select the substring from the original name and use it as the new name. In the given example, the new file names start at position 5 and are 8 characters long. The following bash command renames all *.nc files in the current directory.
for f in *.nc; do mv "$f" "${f:5:8}.nc" ; done
If the filenames do not follow a fixed pattern and might vary in length, you can use more complex pattern substitution with sed; see this SO post for an example.
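As an alternative to sed, here is a minimal Perl sketch (my own illustration, using the file naming from the question) that pulls out the first 8-digit date it finds in each .nc filename and renames the file to that date; adjust the regex if your names contain other digit runs:
use strict;
use warnings;
use File::Copy qw(move);

for my $file (glob '*.nc') {
    # Assume the date is the first 8-digit run (YYYYMMDD) in the name,
    # e.g. ABCDE201604120000-abcde.nc becomes 20160412.nc
    if ( $file =~ /(\d{8})/ ) {
        move($file, "$1.nc") or warn "Could not rename $file: $!\n";
    }
}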

Perl: extract TOC from a PDF file

I have checked through CAM::PDF and other PDF-related modules, but cannot figure out whether there is a way to extract the table of contents from a clear PDF file.
If there are any ideas I would be grateful!
I have not been able to find a library that reliably supports extracting PDF bookmarks (which is what I assume you mean by table of contents).
However, pdftk does a great job at this and can be run from the command line:
pdftk myfile.pdf dump_data | grep BookmarkTitle > outline.txt
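If you want to stay inside Perl, here is a minimal sketch (my own illustration, not from the pdftk documentation) that shells out to pdftk, captures its dump_data output, and keeps the BookmarkTitle lines; myfile.pdf is a hypothetical input name and pdftk is assumed to be on your PATH:
use strict;
use warnings;

my $pdf = 'myfile.pdf';                    # hypothetical input file
my @dump = `pdftk $pdf dump_data`;         # capture pdftk's metadata dump
die "pdftk failed: $?\n" if $?;

my @titles;
for my $line (@dump) {
    # bookmark entries look like "BookmarkTitle: Chapter 1"
    push @titles, $1 if $line =~ /^BookmarkTitle:\s*(.+)/;
}
print "$_\n" for @titles;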

MATLAB: How to delete a specific page from a .pdf file?

I recently learned how to download .pdf files using urlwrite, but I was wondering if there is any way to specify which pages of the .pdf to save.
The files are always either 1 or 2 pages long, and I only want to keep the first page of the .pdf. Is there any way to directly download just the first page, and if not, is there a way to download the entire .pdf and then get rid of the 2nd page?
I know that it is possible to manually get rid of the second page in Preview, Adobe Acrobat, or other applications, but it would make things a lot easier if I could automate the process in MATLAB.
Any help would be greatly appreciated!
Find an appropriate command-line tool (this example uses pdftk), and then make a call to it from MATLAB: use sprintf to assemble the command and then pass it to system. The code below puts the output in a temporary file and then uses movefile to change the filename back:
temp = 'sometempfile.pdf';                                % temporary output name
urlwrite(someurl, filename);                              % someurl and filename are assumed to be defined already
system(sprintf('pdftk %s cat 1 output %s dont_ask', filename, temp));  % keep only page 1
movefile(temp, filename);                                 % replace the original with the one-page file

Grabbing all .nc files from a URL to get data using MATLAB

I'd like to get all the .nc files from a URL and read the data using MATLAB. However, the file names are always very long and vary from file to file.
For instance, I have
url = 'http://sourcename/filename.nc'
The sourcename is always the same, but the filename is very long and varies, so I would like to just use * to be able to grab whatever .nc files are at the URL
e.g.
url = 'http://sourcename/*.nc'
but this does not work, and I am guessing I need to get the exact name, so I am not sure what to do here.
On the other hand, it would also be useful to get the name of each file and record it, but I am not sure how to do that either.
Thanks a lot in advance!!
HTTP does not implement a filesystem abstraction. This means that each of the URLs you request could be handled in a completely different way. In many cases there is also no way to get a list of allowable URLs from a parent (a directory listing, in other words).
It may be the case for you that http://sourcename/ actually returns an index document containing a list of the files. In that case, first fetch that document. Then you'll have to parse the contents to extract the list of files. Then you can loop over those files, form new URLs for each one, and fetch them in sequence.
If you have a list of the file names in a text file, you can use the wget utility to process the file and fetch all the listed files. This file would be formatted as follows:
http://url.com/file1.nc
http://url.com/file2.nc
(etc)
You would then invoke wget as follows:
$ wget -i url-file.txt
Alternatively, you may be able to use wget to fetch the files recursively, if they are all located in the same directory on the web server, e.g.:
$ wget -r -l1 http://url.com/directory
The -r flag says to recurse, the -l1 flag says to go no deeper than 1 level when recursing.
This solution is external to Matlab, but once you have all of the files downloaded, you can work with them all locally.
wget is a fairly standard utility available on Linux systems. It is also available for OS X and Windows. The wget homepage is here: https://www.gnu.org/software/wget/

Batch merging image files to PDF files using Perl in Windows

I have a bunch of image files in this naming format:
313024_Page_1_Image_0001.png
313024_Page_1_Image_0002.png
313025_Page_1_Image_0001.png
313025_Page_1_Image_0002.png
313025_Page_2_Image_0001.png
I would like to combine the files with the same number (the part before "Page_") into a single PDF with that name. For example, using the above five files:
313024_Page_1_Image_0001.png
313024_Page_1_Image_0002.png
would merge to 313024.pdf
and
313025_Page_1_Image_0001.png
313025_Page_1_Image_0002.png
313025_Page_2_Image_0001.png
would merge to 313025.pdf
I would like to be able to run this script in Perl on Windows.
Thanks in advance,
Jake
ImageMagick includes a convert program that will take PNG files and make PDF files from them, e.g.:
$ convert source.png -compress zip source.pdf
You can also append image files into a larger image file, before converting to PDF:
$ convert {listOfImageFilenames} -append -compress zip verticallyStitchedFilename.pdf
You can run this within a Perl script via system() or through the ImageMagick API; a rough sketch of the system() approach follows below.
You'll probably need to adjust these calls for the special way that Microsoft Windows does things, but it shouldn't be too hard.
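Here is a minimal Perl sketch of that approach (my own illustration, untested on Windows): it groups the PNG files by the number before "_Page_" and calls convert once per group. On Windows you may need to invoke magick convert or give the full path to ImageMagick's convert.exe so it is not confused with the built-in Windows convert command.
use strict;
use warnings;

# Group the PNG files by the number before "_Page_",
# e.g. 313024_Page_1_Image_0001.png belongs to group 313024.
my %groups;
for my $png (glob '*.png') {
    push @{ $groups{$1} }, $png if $png =~ /^(\d+)_Page_/;
}

# Build one PDF per group (e.g. 313024.pdf) by shelling out to ImageMagick.
# On Windows, "magick convert" or the full path to convert.exe may be needed.
for my $number (sort keys %groups) {
    my @files = sort @{ $groups{$number} };
    my $status = system('convert', @files, '-compress', 'zip', "$number.pdf");
    warn "convert failed for $number.pdf\n" if $status != 0;
}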