I am building an iPhone application that reads .epub files. Where can I find sample applications or references for unzipping and parsing these files? Can anyone point me to a good link? Thank you in advance.
An .epub file is just a .zip file. It contains a few directory files in XML format and the actual book content is usually XHTML. You can use Objective-Zip to unzip the .epub file and then use NSXMLParser to parse the XML files.
More info: Epub Format Construction Guide
On top of Ole's answer (that's a pretty good how-to guide), it's definitely worth reading the specification for the Open Container Format (OCF); sorry, it's a Word file. It's the formal specification for the zip structure used.
In brief you parse the file by
Checking it's plausibly valid by looking for the text 'mimetype' starting at byte 30 and the text 'application/epub+zip' starting at byte 38.
Extracting the file META-INF/container.xml from the zip
Parsing that file and extracting the value of the full-path attribute of the first rootfile element in it.
Loading the referenced file (the full-path attribute is a URL relative to the root of the zip file)
Parsing that file. It contains all the metadata required to reference all the other content (mostly XHTML/CSS/images). In particular, you want to read the contents of the spine element, which lists all content files in reading order.
If you want to do it right, you should probably handle DTBook content as well.
If you want to do this right, you need to read and understand the Open Packaging Format (OPF) and Open Publication Structure (OPS) specifications as well.
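The steps above can be sketched in Python using only the standard library (not iPhone code, just an illustration of the parsing order). The function names `looks_like_epub` and `read_spine` are made up for this example, and the sketch assumes the OPF file uses the usual `http://www.idpf.org/2007/opf` namespace and that manifest hrefs are relative to the OPF file's own folder. It does no error handling.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespaces assumed for container.xml and the OPF package file
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_NS = "http://www.idpf.org/2007/opf"


def looks_like_epub(path):
    """Plausibility check: 'mimetype' at byte 30, the media type at byte 38."""
    with open(path, "rb") as f:
        head = f.read(58)
    return head[30:38] == b"mimetype" and head[38:58] == b"application/epub+zip"


def read_spine(path):
    """Return the archive paths of the content files in reading order."""
    with zipfile.ZipFile(path) as zf:
        # Find the package (OPF) file via META-INF/container.xml
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(".//{%s}rootfile" % CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # The manifest maps item ids to hrefs; the spine lists ids in reading order
    manifest = {
        item.get("id"): item.get("href")
        for item in package.iter("{%s}item" % OPF_NS)
    }
    opf_dir = posixpath.dirname(opf_path)
    return [
        posixpath.join(opf_dir, manifest[ref.get("idref")])
        for ref in package.iter("{%s}itemref" % OPF_NS)
    ]
```

Calling `read_spine("book.epub")` (the file name is a placeholder) gives a list of paths that can be passed straight to `ZipFile.read` to load each chapter.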
1) How can I differentiate doc and docx files from requests?
a) For instance, if I have
url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
This file is a docx.
b) If I have
url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this
application/msword
This file is a doc.
2) Are there other options?
3) If I save a docx file as doc, or vice versa, could I run into recognition problems (for instance, when converting to pdf)? Is there a best practice for dealing with this?
The mime headers you get appear to be the correct ones: What is a correct mime type for docx, pptx etc?
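If you want to act on the header, a minimal sketch could look like this. The helper name `word_kind_from_content_type` is invented for this example; it only knows the two MIME types shown in the question and ignores any parameters (such as a charset) that a server might append.

```python
# The two MIME types from the question, mapped to a short label
WORD_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def word_kind_from_content_type(header_value):
    """Return 'doc', 'docx', or None for a Content-Type header value."""
    if not header_value:
        return None
    # Drop parameters such as '; charset=...' and normalise case
    media_type = header_value.split(";", 1)[0].strip().lower()
    return WORD_TYPES.get(media_type)
```

With requests you would pass in `r.headers.get('content-type')`; a result of `None` means the server reported something else and the file content itself has to be inspected.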
However, the sending software can only go on what file its user selected, and there still are a lot of people sending files with the wrong extension. Some software can handle this, others cannot. To see this in action, change the name of a PNG image to end with JPEG instead. I just did this on my Mac and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it gets correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)
But after the file is downloaded, you can reliably distinguish between a DOC and a DOCX file, even if the author got its extension wrong.
A DOC file starts with a Microsoft OLE header, which is quite a complicated structure. A DOCX file, on the other hand, is a compound file format containing lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.
This check is compatible with Python 2.7 and 3.x (only 3.x needs the decode):
import sys

if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    with open(sys.argv[1], 'rb') as testMe:
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        if startBytes == 'PK':
            print('This is a DOCX document')
        else:
            print('This is a DOC document')
Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
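If those false positives matter, a slightly stricter sketch is possible with the standard library alone. The function name `sniff_word_format` and its return labels are made up for this example. It compares the first eight bytes against the OLE compound file signature and, for ZIP files, looks for the `word/document.xml` entry. An OLE match still only means "some OLE file" (older Excel and PowerPoint files use the same container), so treat the result as a hint rather than proof.

```python
import zipfile

# Signature at the start of every OLE compound file (.doc, .xls, .ppt, ...)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_word_format(path):
    """Return 'ole', 'docx', 'zip' or 'unknown' based on the file content."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLE_MAGIC:
        return "ole"
    if head[:2] == b"PK" and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # A Word document keeps its main part at word/document.xml
            if "word/document.xml" in zf.namelist():
                return "docx"
        return "zip"
    return "unknown"
```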
I know that the Microsoft docx file format is a compressed zip archive. I analyzed it and I think I understand that I can manipulate it by changing the content of the /word/document.xml file inside this file structure.
But after I zip the folder again and try to open it, MS Word complains with a message like:
"The file ... cannot be opened because its content is causing problems."
What is the correct way to zip the content again after manipulating the xml files? Or is there something like a checksum I have overlooked?
MS Word complained about my manipulated docx file because the compressed file structure was nested inside another folder, which I had created to edit the document.xml file.
It is important that the xml files sit at the same paths, relative to the root of the archive, as in the original file.
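One way to avoid the extra folder altogether is to never unpack to disk and instead copy the archive entry by entry. The sketch below does that in Python with the standard `zipfile` module. The function name `replace_document_xml` is made up for this example, `dst_docx` must be a different path from `src_docx`, and whether Word accepts the result still depends on the new XML being valid.

```python
import zipfile


def replace_document_xml(src_docx, dst_docx, new_xml):
    """Copy src_docx to dst_docx, swapping in new_xml for word/document.xml."""
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                data = new_xml
            # Reusing the original entry name keeps every file at the
            # same path relative to the archive root
            dst.writestr(info.filename, data)
```

Because every entry is written under its original name, the new archive has the same root layout as the source file.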
A similar question has been asked on StackOverflow here.
I am trying to write a script that will allow me to download numerous (1000s) data files from a data server (e.g., http://hydro1.sci.gsfc.nasa.gov/thredds/catalog/GLDAS_NOAH10SUBP_3H/2011/345/). Unfortunately, the names of the files in each directory are not formatted consistently (the time each file was created is appended to the end of its name). I need to be able to specify the file name to subset the data (I have a special tool for these data types) and download it. I cannot find a function in MATLAB that will extract the file names.
I have looked at URLREAD, but it downloads everything, including the HTML code.
Thanks for your help!
You can easily parse the links out of the page.
x=urlread(url)
links=regexp(x,'<a href=''([^>]+)''>','tokens')
This reads every link; you then have to filter out the unwanted ones.
For example, this gets all grb files (note the escaped dot):
a=regexp(x,'<a href=''([^>]+\.grb)''>','tokens')
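For comparison, here is the same idea as a small Python sketch, in case MATLAB is not a hard requirement. The function name `extract_links` is made up for this example. It accepts single- or double-quoted href values and filters by file suffix. It is regex-based like the MATLAB lines above, so it will not cope with every kind of HTML.

```python
import re


def extract_links(html, suffix=""):
    """Return href values from anchor tags, optionally filtered by suffix."""
    # Match <a href='...'> or <a href="..."> and capture the quoted value
    pattern = r"<a\s+href=(['\"])(.*?)\1"
    links = [m.group(2) for m in re.finditer(pattern, html, flags=re.IGNORECASE)]
    return [link for link in links if link.endswith(suffix)]
```

Calling `extract_links(page, ".grb")` on the downloaded catalog page text would keep only the links that end in `.grb`.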
I need to join 2 video files by appending their NSMutableData. I have already done this with audio files and it works correctly, but it does not work with video files.
It may be because the data bytes contain some header info that I need to remove from the 2nd video, but I don't know how many bytes I should remove.
You haven't told us the file format in question, but generally:
Look up the specification for the file format in question and find out what the header looks like. Then code a solution based on this information.
There are many resources on the net with file format information, but one place you might want to look is http://www.wotsit.org/.
How can I parse .chm files in Perl? Which module is used for it?
How about Archive::Chm?
Performs some read-only operations on HTML help (.chm) files. Range of operations includes enumerating contents, extracting contents and getting information about one certain part of the archive