How can I figure out whether a file is big-endian or little-endian? I just tried to write a big-endian file with MATLAB, but it probably didn't work. Now I want to find out whether it is possible to determine which byte order the file actually uses. Any suggestions?
There's no way in general to know whether a given data file was created using big-endian or little-endian byte formatting. You would need to know something about the type of file it is (if it has a standard format that only ever uses one or the other) or the type of system the file was created on.
When you use fopen to open a file in MATLAB, you have to specify which endian format to use for that file using the machinefmt input argument. If you don't specify a format, MATLAB will use the native machine format by default (which won't be correct for files created on machines with different endian formats). This usage of fopen:
[filename, permission, machinefmt] = fopen(fileID);
Will simply tell you what endian format you used to open the file in the first place, which may or may not be correct for the given file.
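If you know roughly what the file should contain, one heuristic is to read the same field under both byte orders and see which result makes sense. A minimal sketch, assuming (hypothetically) that the file begins with a 32-bit record count; 'data.bin' is a placeholder name:
% Hypothetical example: the file is assumed to start with a uint32
% record count; 'data.bin' is a placeholder filename.
fidBE = fopen('data.bin', 'r', 'ieee-be');  % open as big-endian
countBE = fread(fidBE, 1, 'uint32');
fclose(fidBE);
fidLE = fopen('data.bin', 'r', 'ieee-le');  % open as little-endian
countLE = fread(fidLE, 1, 'uint32');
fclose(fidLE);
% Whichever value is plausible (e.g. small and positive) hints at the
% byte order the file was written with.
fprintf('big-endian: %d, little-endian: %d\n', countBE, countLE);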
Use FOPEN:
fileID = fopen(fileName)
[filename, permission, machineformat] = fopen(fileID)
The third output, machineformat, tells you the byte order the file was opened with, e.g. 'ieee-be' for big-endian or 'ieee-le' for little-endian.
I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However that information is wrong.
The problem this causes is that I use unicode2native to generate correct XLS files, and it produces bad output when the file is actually UTF-8 encoded.
How can I programmatically detect the real encoding of each file?
Locating them manually with a text editor is not feasible: there are hundreds of files, and my solution must also work on future files I don't yet have access to.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
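For example, here is a rough sketch of shelling out to chardet from Octave. It assumes the command-line tool is installed and on the PATH (the executable may be named chardet or chardetect depending on the install), and its exact output format varies between versions, so the parsing is illustrative; xmlFile is a placeholder variable:
[~, output] = system(['chardet "', xmlFile, '"']);
% typical output looks like: "file.xml: utf-8 with confidence 0.99",
% so pull out the token right after the colon
tokens = regexp(output, ':\s*(\S+)', 'tokens', 'once');
encoding = tokens{1};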
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
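A sketch of that round-trip check in pure Octave (hedged: it relies on the decoder mangling invalid byte sequences so that the round trip fails, which may differ between Octave versions; xmlFile is a placeholder):
fid = fopen(xmlFile, 'r');
raw = fread(fid, Inf, 'uint8=>uint8')';        % raw file bytes
fclose(fid);
decoded = native2unicode(raw, 'UTF-8');        % decode as UTF-8
roundTrip = unicode2native(decoded, 'UTF-8');  % re-encode
isUtf8 = isequal(roundTrip, raw);  % invalid UTF-8 won't survive the round trip
% files failing the check are assumed to be ANSI (Windows-1252/ISO 8859-1)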
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
A stopgap solution I found for Windows 10, until I find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
% `file` prints "filename: encoding"; keep the last whitespace-delimited token
tokens = strsplit(strtrim(output));
encoding = tokens{end};
if strcmp('utf-8', encoding)
  % re-encode UTF-8 text as ISO-8859-1 bytes before writing it to the sheet
  sheet(1, 1) = {char(unicode2native(myText, 'ISO-8859-1'))};
else
  sheet(1, 1) = {myText};
endif
How can I manipulate a file so that its MD5/SHA-1 signature changes? I would like to do it in a generic way, without knowing the file type (exe, pdf, jpg, etc.). Is it possible to do this without corrupting the file?
I'm not trying to construct a specific known signature, just to change the existing one; I don't care what the new signature is.
I had been thinking about just appending a null byte to the end of the file. Would that break any format?
Binary formats define their content at the bit level, so what each bit means depends on the format. In other words: you cannot change the file and be sure it remains valid without understanding its format. Appending a zero byte may work for many formats, since trailing data is often simply ignored, but this is not a given.
Alternatively, you could look up safe-to-modify locations for existing formats and store them in a table. You can usually identify a file's format using the file command on POSIX systems (Linux, Cygwin, ...), as sketched below.
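On a POSIX system you can call file straight from Octave/MATLAB to identify a format by its magic bytes; 'mydata.bin' is a placeholder:
[status, description] = system('file mydata.bin');
disp(description)   % e.g. "mydata.bin: gzip compressed data"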
I agree with @Maarten. It depends on the file format, so you need a map of rules per format.
I would still try appending a zero byte at the end of the file; it is likely to work for the majority of common formats.
import java.io.*;
// Copy the file into a buffer one byte longer than the file itself;
// the extra byte stays 0, so the copy ends with an appended null byte.
RandomAccessFile f = new RandomAccessFile("filePath", "r");
byte[] b = new byte[(int) f.length() + 1];
f.readFully(b, 0, b.length - 1);   // read() may stop short; readFully won't
f.close();
FileOutputStream fos = new FileOutputStream("newFilePath");
fos.write(b);
fos.close();
I'm trying to read the binary content of a text file (which contains the compressed version of a different text file). The first two characters (01111011 and 00100110) are correct, going by the values that were originally put there during the compression process.
However, when it gets to the third character, it should be reading 10010111 (again, going by what was written during compression), but instead it reads 10000000010100 (i.e. 8212). Does anyone know what is causing this discrepancy, or how to fix it? Thanks!
Java's FileReader should not be used to read binary data from files, since it reads a character at a time using the default encoding, which silently transforms the bytes. That is exactly what you are seeing: your third byte, 10010111, is 0x97, and decoded under the common Windows-1252 default charset, 0x97 maps to the em dash U+2014, which is decimal 8212 (binary 10000000010100).
Instead, use FileInputStream, whose read methods return the actual raw bytes without any encoding applied.
How do I access data from the StackExchange API using Matlab?
The naive
sitedata = urlread('http://api.stackoverflow.com/1.1/questions?tagged=matlab')
fails since the data is compressed. However, when I write this to file (using fprintf(fileID,'%s',sitedata)), I get a zip file that cannot be decompressed.
Try urlwrite() instead:
urlwrite('http://api.stackoverflow.com/1.1/questions?tagged=matlab',...
'tempfile.zip')
gunzip('tempfile.zip')
fid = fopen('tempfile');
str = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
A better version of this snippet would use tempname to dynamically generate temporary filenames.
Matlab's urlread assumes you're getting text data back, not binary. The gzip binary data is getting mangled either when urlread is decoding the character data to Unicode values to stick in Matlab chars, or when the formatted-output fprintf function is writing them out, encoding them to UTF-8 or whatever default character encoding you're using for fileID and changing the byte sequence, or maybe both.
IIRC, urlread will default to using ISO-8859-1 encoding, which means the bytes will be turned into the Unicode code points with the same numeric values - effectively just a widening. So you can get the byte data back by doing sitebytes = uint8(sitedata). (That's a regular uint8() conversion, not a typecast().) (If this isn't the case, you can probably fiddle with urlread's CharSet option.)
If you can't get the right bytes out from urlread by fiddling with the encoding and casts, then you can drop down and make calls against the Java HttpAgent like urlread does and bypass the character set decoding step, or fiddle with its options. See the urlread source for how to do it.
Once you have the right bytes in memory, you can write them out to a file using the lower-level fwrite() function, which won't mangle them by doing character set encoding. Then you'll have a valid gzip file of the site's original response. (I think it'll work if you also just use fwrite(fileID, sitedata, 'uint8') directly on the char string, but it's uglier IMHO.)
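Putting that together, a minimal sketch (it uses the URL from the question; exactly how urlread decodes the response can vary by release, so treat the uint8 cast as an assumption, and the filenames are arbitrary):
sitedata = urlread('http://api.stackoverflow.com/1.1/questions?tagged=matlab');
sitebytes = uint8(sitedata);        % undo the ISO-8859-1 widening
fid = fopen('response.gz', 'w');
fwrite(fid, sitebytes, 'uint8');    % raw bytes, no character encoding applied
fclose(fid);
gunzip('response.gz');              % yields the decompressed file 'response'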
You can also unzip it in memory using Java classes and save a trip to the filesystem. Do jsitebytes = typecast(sitebytes, 'int8') to get them as Java-friendly signed bytes, then stick them into a ByteArrayInputStream and read it out through a GZIPInputStream. For good performance you'll need to build a little Java helper class, because Matlab doesn't play well with passing byte[] buffers by reference like java.io wants, but it may be worthwhile if you do a lot of in-memory munging like this.
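As a rough sketch of that route using only standard Java classes (it sidesteps the helper class by reading one byte at a time through GZIPInputStream, which is simple but slow):
jsitebytes = typecast(sitebytes, 'int8');          % Java-friendly signed bytes
bais = java.io.ByteArrayInputStream(jsitebytes);
gzin = java.util.zip.GZIPInputStream(bais);
out = zeros(1, 0, 'uint8');
b = gzin.read();                                   % returns -1 at end of stream
while b ~= -1
    out(end+1) = uint8(b); %#ok<AGROW>
    b = gzin.read();
end
gzin.close();
str = native2unicode(out, 'UTF-8');                % decompressed response text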
When working with web services or fancier data downloads (e.g. sites that need sessions or certificates), I've often ended up dropping down and coding directly against the HttpAgent and java.io classes from within Matlab.
I have a string that looks like this
H4sIALYnhUsCA9VXW5aDIAz9zypcgiU8dDnTWtfQ5Q8kEgSR
ap05c+YnhxLyumBu2r/s2PUvO3nh+rCaw0oFob1Q+Z51HfjNZ1jexCSsLAYx
BGG6eATZGJYALIIzG9QOy4NeaPYAyyarKfQY7TgypTjGI3ogkxDahSTw7kX/
FQUHeIgxsoClQD1JGRKF7Jy4oXNeQFou5TvJzlkJoAUIMuGAOlePMTEGWQry
2liLCfHNJPEwuiU7jmzEhM6gnGawSO3ORMnqLQRsNgki7AV4jEI9xKRU65V6
q7UUZVetqsZQC13z3UzMXkkM24nlvs+B/EktqmsnC0dxelvLycTaN+QugYw/
DTJeeTD4iy/ZXQHZ/KuXjH/2kvFKYtfaBfXtaUtlVZCZiIxw5WPLLxkFQZ2D
mMBmUaQJYCKyyBlShVqMuHUFSzu5/UTY1sVMVpwzSnimpEFOz5G7nKSoheIt
yqjg+pxU54zE64jd3zzdrYmW6Ybic2mVvcjAUKfg0s0QMfAXDadyotuGxOdH
hwZIU4NPR2fqbApbVnirTRdFGc/cjr7KwhmV+m6GGbMnf+RetoNNGwiohW4D
AREJ1R0FAhqo7gDx4b18iBh/uWPeGkwc07mMmdtKbBe0WQy9PMpr6TpLZwhR
whmj8/8FjTEWsv8ckhimqgj9+2q0hfWH1WpFCXPYfX27mEMGupKe1QA+gkwd
PDVv/xO+AbHzd9RzDQAA
My initial guess was that this was a Base64-encoded file of some sort. Any ideas on how I can figure out if, and which, type of file it is? I guess it should contain MIME info, but how would I save it to a file without fragmenting it?
It's base64. When you decode it, you get a gzipped file, which consists of a boatload of hex characters (literally, as ASCII 0xNN hex characters). They're mostly in the A-Z,a-z range.
I'd paste it here, but from this, I suspect this is part of some exercise you're doing, so I think I'll leave it to you to figure out.
P.S. For edification, I determined the binary output was a gzipped file by using the unix file command, to identify the "magic" bytes, which showed that it was gzipped. Use your decode_base64 function or whatever it is, then dump the return value into a file and gunzip it.
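For a MATLAB take on those same steps, a minimal sketch (assumes a release with matlab.net.base64decode, R2016b or later; b64string holds the text above and the filenames are arbitrary):
bytes = matlab.net.base64decode(b64string);   % decode base64 to a uint8 vector
fid = fopen('payload.gz', 'w');
fwrite(fid, bytes, 'uint8');                  % dump the raw gzip bytes
fclose(fid);
gunzip('payload.gz');                         % produces 'payload' with the content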