In his famous blog post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel said:
The earliest idea for Unicode encoding, which led to the myth about
the two bytes, was, hey, let's just store those numbers in two bytes
each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn't it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
The second representation is faster? Why?
How does swapping the high and low bytes affect performance?
The sentence "Not so fast!" isn't about computing performance but a way to say "hey, don't make assumptions so fast, here's another way to look at it".
The question is Mu.
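For reference, the two byte sequences quoted above are simply the big-endian and little-endian UTF-16 encodings of the same string; a quick Python sketch (3.8+ for the hex separator argument) makes that concrete:
# The two orderings Joel shows are UTF-16 big-endian vs. UTF-16 little-endian.
text = "Hello"
print(text.encode("utf-16-be").hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f
print(text.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00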
I have an application running on Windows; I don't have the source code. The GUI presents the date as 22/06/2018 08:44, and this date/time is written to and read from a file. The file contains a hex representation of the date; some examples are below (the latter two have been edited by me, hence the weird years).
2C 05 0A D4 01 (22/06/2018 08:44)
2C 06 0A D4 01 (22/06/2018 08:51)
2C 08 11 D4 01 (01/07/2018 06:53)
B4 AE 08 D4 01 (06/12/5671 13:13)
B4 AE 11 12 10 (31/07/5270 10:53)
I'm trying to understand the conversion from hex to the GUI date/time, so that I can modify the hex in the file directly and see the GUI date/time change accordingly.
Thanks
Edit: The hex numbers are standard Windows 64-bit file time values, representing the number of 100-nanosecond intervals since January 1, 1601, with the three least significant bytes omitted and the remaining bytes written in little-endian order (least significant byte first). For example, your first hex string, 2C 05 0A D4 01, means hex 01D4 0A05 2C00 0000 units of 100 nanoseconds since January 1, 1601 UTC (this is precisely 22/06/2018 08:44:02.9898752 UTC, but your GUI omits the seconds and fraction of a second).
You can read more here: File Times on MSDN.
For the conversion from date and time to hex you can, for example, use http://www.silisoftware.com/tools/date.php?inputdate=2018-06-22T08%3A44%3A00%2B00%3A00&inputformat=text: enter your date as 2018-06-22T08:44:00+00:00 and get the hex out as 01D40A05:2A37C800. Round up so it ends in three zero bytes, 01D40A05:2B000000, then reorder the remaining bytes: 2B 05 0A D4 01.
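Putting the same conversion into code, here is a minimal Python sketch (the function names are my own) for decoding the 5 stored bytes and re-encoding a date/time under the interpretation above:
from datetime import datetime, timedelta, timezone
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
def decode(stored_hex):
    # Restore the three omitted least-significant bytes, then read little endian.
    ticks = int.from_bytes(b"\x00\x00\x00" + bytes.fromhex(stored_hex), "little")
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)  # ticks are 100 ns units
def encode(dt):
    delta = dt - FILETIME_EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    ticks = -(-ticks // 0x1000000) * 0x1000000               # round up so the 3 low bytes are zero
    return ticks.to_bytes(8, "little")[3:].hex(" ").upper()  # drop the 3 zero bytes
print(decode("2C 05 0A D4 01"))                                   # 2018-06-22 08:44:02.989875+00:00
print(encode(datetime(2018, 6, 22, 8, 44, tzinfo=timezone.utc)))  # 2B 05 0A D4 01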
Original answer
It’s not a date-time encoding scheme that I have met before, and from the data you have provided I am not able to deduce the full scheme. I believe I have found a bit of it, but I cannot get further.
Assuming some linear correspondence, I first note by comparing the first two samples that a difference of 1 unit in the second group of hex digits (the second byte, if you will) makes for a difference of 7 minutes. Or approximately 7 minutes: we don’t know whether the times have seconds, or even fractions of seconds, that are not displayed.
I used this information when comparing to the third sample. The third byte has increased by 7 from the first to the third sample (hex 11 - hex 0A = 7). Taking the increase on the second byte into account it would seem that one unit of the third byte approximates 1832 minutes, which is suspiciously close to 256 * 7 minutes = 1792 minutes. So it would seem that the 2nd and 3rd bytes have a “little endian” relationship, where the 3rd byte is more significant than the 2nd. Using this information we can obtain a little more accuracy: The difference in the times is 12849 minutes, and the difference on the 2nd and 3rd byte is hex 1108 - 0A05 = decimal 1795, so each unit is 7.1582 minutes (it agrees with the 7 minutes from before, only it’s more precise). Using this value I interpolated the second date-time from the hex value 2C 06 0A D4 01 and got 2018-06-22T08:51:09. It agrees. Hypothesis confirmed!
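That interpolation is quick to check in Python, using only the figures already mentioned:
from datetime import datetime
t1 = datetime(2018, 6, 22, 8, 44)      # sample 1, bytes 2-3 = 0A05
t3 = datetime(2018, 7, 1, 6, 53)       # sample 3, bytes 2-3 = 1108
unit = (t3 - t1) / (0x1108 - 0x0A05)   # length of one unit of the 2nd/3rd byte pair
print(unit)                            # 0:07:09.493036, i.e. about 7.16 minutes
print(t1 + (0x0A06 - 0x0A05) * unit)   # 2018-06-22 08:51:09.493036, matching sample 2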
The information found so far suffices for encoding values between 09/06/2018 14:43 (2C 00 00 D4 01) and 01/05/2019 09:17 (2C FF FF D4 01) with a precision of 7 minutes. I’d be surprised if that were enough for you.
Comparing to the value in the 4th sample it would seem that one unit on the first byte corresponds to 14 128 940 minutes (26.86 years). It doesn’t divide nicely by the 7.1582 minutes from before, as we might have hoped, so I’m not sure how we might use this observation.
Comparing the last two samples, it seems that the 4th and 5th bytes cannot have the same little-endian relationship, since the 5th byte increases while the date decreases. It’s still possible, though, if we assume that at least one of the years is before the common era (“BC”), since the era is not printed. Another possibility might be that the fifth byte is ignored. This leads to a unit of the fourth byte corresponding to 1 088 006 minutes. Again it bears no nice relationship to the 7.15 minutes from bytes 2 and 3, and it’s suspiciously close to the unit of the first byte, so it is probably incorrect.
To learn more: First try to see if you get a meaningful date-time from editing (hex) 00 00 00 00 00 into your file. If you do, next try one F at a time:
F0 00 00 00 00
0F 00 00 00 00
…
00 00 00 00 0F
If this doesn’t make a pattern that is clear enough, try one bit at a time, using hex digits 1, 2, 4 and 8 instead of F.
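If you want to be systematic about it, the single-bit probe values are easy to generate; a small Python sketch that prints all 40 of them in the file's byte order:
# One probe value per bit: write each of these into the file and note which
# GUI field (minutes, hours, days, ...) changes.
for bit in range(40):
    print((1 << bit).to_bytes(5, "little").hex(" ").upper())
# 01 00 00 00 00
# 02 00 00 00 00
# ...
# 00 00 00 00 80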
I have these 3 events in a MIDI file:
00 FF 51 03 0E 15 C3 86 A6
20 FF 51 03 15 20 A5 83
5C FF 51 03 0E 15 C3
But what is important in this case is that FF 51 stands for a tempo change and the 03 for the number of following byte pairs describing the tempo. As it is "3 byte pairs" in each event, why are there 5 byte pairs describing the first event, 4 describing the second, and 3 describing the third?
How does the encoding program know when a new event starts? The file can be played without any problems.
All three events have three data bytes.
The delta times between the events are encoded as variable-length quantities, so you have to keep reading bytes until you reach one whose most significant bit is clear. The three delta times before the events are 00, 86 A6 20, and 83 5C, which decode to 0, 103200, and 476.
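A small Python sketch of that decoding, applied to the byte stream from the question (simplified to the case where every event is an FF 51 03 meta event, as here):
def read_vlq(data, pos):
    # MIDI variable-length quantity: 7 payload bits per byte, high bit means "more bytes follow".
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos
stream = bytes.fromhex(
    "00 FF 51 03 0E 15 C3 86 A6"
    "20 FF 51 03 15 20 A5 83"
    "5C FF 51 03 0E 15 C3"
)
pos = 0
while pos < len(stream):
    delta, pos = read_vlq(stream, pos)                     # delta time
    event, length = stream[pos:pos + 2], stream[pos + 2]   # FF 51, then the data length
    data, pos = stream[pos + 3:pos + 3 + length], pos + 3 + length
    print(delta, event.hex(), data.hex())
# 0 ff51 0e15c3
# 103200 ff51 1520a5
# 476 ff51 0e15c3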
According to Wikipedia:
[Ascii85 uses] the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value).
[btoa] Version 4.2 added a "y" exception for a group of all ASCII space characters
While 0 data might be quite common, that use of z to compress 0's seems like an arbitrary optimization that won't always be of use.
Likewise, the less frequent use of y is only of use if the raw bytes contain adjacent spaces. The UTF-16 (little-endian) encoding of a space is actually 20 00, so 0x20202020 isn't all that common in UTF-16 text.
Binary data does often have adjacent 00's, but it also often contains adjacent FF's.
Text data does often contain adjacent spaces, but it also often contains adjacent tab characters, or adjacent new-line characters.
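As an aside, Python's standard base64 module implements both special cases, so they are easy to observe (foldspaces corresponds to btoa's y exception; I'm assuming the stock CPython behaviour here):
import base64
print(base64.a85encode(b"\x00" * 8))                   # b'zz'  (two all-zero groups)
print(base64.a85encode(b"    " * 2, foldspaces=True))  # b'yy'  (two all-space groups)
print(base64.a85encode(b"\xff" * 8))                   # b's8W-!s8W-!'  (no shortcut for FF runs)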
It would seem that a frequency analysis, and usage of 9 or 10 characters (Ascii chars 118-126/127, or v through ~/DEL) to represent the 9/10 most frequent 32-bit values, might lead to better compression.
The mapping of compression-character to 32-bit value could perhaps sit at the start of the encoded string enclosed between <[ and ]>. For 32-bit values that are 4 repeated bytes, the 32-bit value can be abbreviated to the repeated hex value(s).
For example:
The binary data (192 bytes):
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
Note the presence of spaces 20, hyphens 2D, tabs 09 and Unicode Carriage Return-Line Feeds 0D 00 0A 00
Could be encoded as (79 bytes)
<[00;FF;20;2D;09;0D000A00]><~vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|~>
Is there merit in an encoding approach that uses such compression? Why aren't the various Ascii85 specs more aggressive with compression?
Because you would normally use a compression program before encoding with ASCII85, which can do a much better job than the suggested ad hoc encodings.
There are some applications for which it is useful to be able to find the Nth octet of an encoded string without having to scan the whole thing. Compression would interfere with that. There are, however, other applications for which certain forms of compression could be useful. If one can use more than 85 distinct characters, a base-85 coding will allow for easy compression using characters outside the primary set. Even if one is limited to a set of precisely 85 characters, the number of sequences of five base-85 characters is greater than the combined number of sequences of one, two, three, and four base-256 bytes, so there would be room to use some special combinations of characters to indicate e.g. runs of certain character values. The biggest problem is that doing so would forfeit the ability to perform random seeks within the encoded data stream.
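The counting claim in that answer is easy to verify with a few lines of Python (just arithmetic, not tied to any particular encoder):
# Five base-85 characters cover more values than all byte strings of length 1-4
# combined, leaving spare 5-character codes for special markers such as run lengths.
five_char_codes = 85 ** 5                           # 4437053125
short_strings = sum(256 ** n for n in range(1, 5))  # 4311810304
print(five_char_codes - short_strings)              # 125242821 codes to spare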
I have to say that I don't really understand the mechanics of CRC-32, but I was hoping to be able to at least calculate a CRC based on a chunk.
I have a PNG with the following information: 2px by 5px, RGBA, no interlace.
The image header chunk results in:
00 00 00 0d = data is 13 bytes long
49 48 44 52 = ascii for IHDR (image header)
00 00 00 02 00 00 00 05 08 06 00 00 00 = data; dimensions, bit-depth, etc.
6f b3 3d 9c = CRC
I wanted to see if CRC could be easily calculated so I tried using:
http://depa.usst.edu.cn/chenjq/www2/wl/software/crc/CRC_Javascript/CRCcalculation.htm
The calculator's default polynomial for CRC-32 is 04C11DB7.
When I plug in "0000000d4948445200000002000000050806000000" I get 35F0A255.
I looked it up on Wikipedia and tried the other various representations used by PNG (EDB88320 & 82608EDB) and I tried leaving off the length and chunk type with the various polynomials I used before; I also tried including the information before the chunk which defines the PNG signature. I never got 6fb33d9c.
Any ideas on why I can't get the right CRC via calculator?
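For what it's worth, PNG computes the CRC over the chunk type and chunk data only (the length field is excluded), using the same reflected CRC-32 that zlib implements, so a quick Python check along these lines should reproduce the stored value; the online calculator is likely using a different bit ordering, initial value, or final XOR:
import zlib
chunk_type = bytes.fromhex("49 48 44 52")                             # "IHDR"
chunk_data = bytes.fromhex("00 00 00 02 00 00 00 05 08 06 00 00 00")
print(format(zlib.crc32(chunk_type + chunk_data), "08x"))             # should print 6fb33d9c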
I have integrated a TI library for H.264 encoding on a DaVinci board with a DM6446 processor.
I could verify the encoded bit stream when it was saved to the HDD and opened in the Elecard stream analyser.
But I could not stream it over RTSP and view it in the VLC player. VLC would switch to TCP/IP and then stop, showing a message that there was nothing to play. On further debugging I found out that each encoded bit stream generated is of type IVIDEO_IDR_FRAME.
The NAL header for each frame looks like:
00 00 00 01 67 42 80 1E DA 05 C7 D9 74 00 00 00 01 68 CE 3C 80 00 00 00 01 65
As I understand it, 00 00 00 01 is used as a delimiter, and 67 and 68 are for the SPS and PPS respectively. After streaming the first two frames as is, I tried to stream the next frames starting from the data 00 00 00 01 65. But VLC still could not play the encoded stream; it showed a message about having found the PPS and stopped there.
What should I do to resolve this issue? I am quite a newbie to this field.
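One way to see what the player is being given is to split the Annex B stream on its start codes and print each NAL unit type (the low 5 bits of the first byte after the start code); a minimal Python sketch using the bytes from the question, assuming 4-byte start codes as in that dump:
NAL_TYPES = {5: "IDR slice", 7: "SPS", 8: "PPS"}
stream = bytes.fromhex(
    "00 00 00 01 67 42 80 1E DA 05 C7 D9 74"
    "00 00 00 01 68 CE 3C 80"
    "00 00 00 01 65"
)
for unit in stream.split(b"\x00\x00\x00\x01"):
    if unit:                                    # skip the empty piece before the first start code
        nal_type = unit[0] & 0x1F
        print(f"{unit[0]:02X}: {NAL_TYPES.get(nal_type, nal_type)}")
# 67: SPS
# 68: PPS
# 65: IDR slice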