How does an H.265 bitstream relate to its trace file?

I am using the https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.9 version of the H.265 video encoder, and used it to generate a TraceDec.txt file which contains VPS, SPS, PPS, and slice header symbols (e.g. vps_video_parameter_set_id u(4): 0).
I was told to match the bits of the .265 file with the symbols in the TraceDec.txt file using a hex editor. I opened the .265 file in a hex editor, but I have no clue how those bits correspond to the symbols in the trace file. The manual doesn't seem to help either. Do the first bits of the file even start with the VPS?
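For orientation, it may help to see how the bitstream is framed: an Annex B .265 file is a sequence of NAL units, each preceded by a 00 00 01 (or 00 00 00 01) start code and a two-byte NAL unit header, and the VPS (nal_unit_type 32) is typically the first NAL unit HM writes, so the trace's vps_video_parameter_set_id u(4) corresponds to the first four bits after that header. A minimal Python sketch for locating the NAL units (the filename is hypothetical, and this ignores the emulation-prevention bytes that HM inserts inside payloads):

    NAL_NAMES = {32: "VPS", 33: "SPS", 34: "PPS"}

    with open("stream.265", "rb") as f:   # hypothetical filename
        data = f.read()

    i = 0
    while i + 3 < len(data):
        # Annex B start code: 00 00 01, often preceded by an extra 00
        if data[i:i + 3] == b"\x00\x00\x01":
            # first byte of the 2-byte NAL header: forbidden_zero_bit (1 bit) + nal_unit_type (6 bits)
            nal_type = (data[i + 3] >> 1) & 0x3F
            print(f"byte offset {i}: nal_unit_type {nal_type} ({NAL_NAMES.get(nal_type, 'other')})")
        i += 1

Each symbol listed in TraceDec.txt is then consumed bit by bit from the payload that follows the corresponding NAL header, in the same order as in the trace.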

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt, and despite this I was able to open it in Notepad. It showed random characters that aren't available on the keyboard, such as:
org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
Not just this: converting files with many other extensions, like .jar, .xapk, etc., shows these characters too. Can anyone please explain what these characters are based on, and how exactly the OS decides (or tries to decide) which characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor which can write and save text files as well as open them. You also defined the encoding that will be used to save text in binary files (all files, when saved, are binary). So your encoding looks something like the following:
    Your encoding            Emacs encoding
    TEXT  BINARY             TEXT  BINARY
    A     01000001           ă     01000001
    B     01000010           Ћ     01000010
    ...                      ...
    Z     01011010           Ϡ     01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open this file with your text editor, the editor finds 010000010100001001011010 as the file's contents, and using the above encoding it knows this is 'ABZ', so it prints 'ABZ' on the screen.
Now let's say you open the same file using Emacs. Since Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format; for example, the APK format can only be correctly understood by the Android system. When you try to open an APK file in a text editor, it just tries to make sense of the binary data in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the encoding with which the data was originally written, then you can read the contents of the file using that same encoding.
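A minimal Python sketch of the same idea with two real encodings (the byte values are chosen only for illustration):

    # the same two bytes decoded with two different encodings
    data = "é".encode("utf-8")         # stored on disk as b'\xc3\xa9'
    print(data.decode("utf-8"))        # 'é'  -- the original encoding recovers the text
    print(data.decode("latin-1"))      # 'Ã©' -- a different encoding shows "random" characters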

Read the pageviews.gz files from wikipedia

I wrote a script to download the pageviewsXXXXX.gz files from Wikipedia. So far so good.
When I unzip the files, the content is illegible. Does anyone know how to read the content of the pageviews.gz files? Is there an API for this, or does anyone have an idea of how to do it?
Thanks in advance
I don't know what software you used to decompress the .gz files; I used 7-Zip on a 64-bit Win10 machine with success. Having done that, I found that https://dumps.wikimedia.org/other/pagecounts-raw/ provides a description of the lines in the uncompressed file.
The line
de Stadio_Arena_Garibaldi_-_Romeo_Anconetani 1 11820
is from the de (German) wikipedia, page 'Stadio_Arena_Garibaldi_-_Romeo_Anconetani', which had been referenced once in the hour-long period covered by the gzipped file, and the server returned 11,820 bytes.
This line looks like gibberish.
ar %D9%85%D8%B7%D9%8A%D8%A7%D9%81%D9%8A%D8%A9 1 16742
The first two characters, however, indicate that it represents a reference to the Arabic version of Wikipedia. The '%' items are percent-encoded bytes of non-ASCII (UTF-8) characters in the page title.
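If you want to do this programmatically rather than with 7-Zip, here is a minimal Python sketch that reads a gzipped dump file directly and decodes the percent-escaped titles (the filename is hypothetical):

    import gzip
    from urllib.parse import unquote

    # each line: project code, percent-encoded page title, request count, bytes served
    with gzip.open("pagecounts-20160101-000000.gz", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, size = parts
            print(project, unquote(title), int(count), int(size))

Run on the Arabic line above, unquote() turns '%D9%85%D8%B7%D9%8A%D8%A7%D9%81%D9%8A%D8%A9' back into the readable Arabic page title.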

Windows Converting a Folder of Files From RTF to UTF-8

I am trying to analyze a corpus of 620 Korean-language newspaper articles using the konlpy module in Python. The files are in RTF format. However, konlpy only supports files encoded in UTF-8. In Windows, how can I convert a folder containing 620 RTF articles to UTF-8 articles such that, upon opening the files in Notepad, the Korean characters are still intact?
Some things I have tried (but to no avail):
Used a freeware converter program (http://www.emreakkas.com/localization-tools/convert-rtf-to-txt) that converted the files into Unicode, and then tried to use a Cygwin iconv batch file to convert the files, using the same script as this individual did:
cygwin syntax error near unexpected token `done'
When I do this, all of the files are there, but they are 0 KB and blank. (Let me know if you need more info about this method, as I needed to do another step to even get it to loop over my files.)
Used another freeware program (my memory is a little hazy on this one) that converted the RTF files, but all the characters came out as scrambled Latin characters.
I'm thinking that there has to be an easy way to do this, but everything I've tried is really complicated and does not work. Another funny thing is that whenever I manually open the original RTF file (or the file converted into Unicode), choose "Save As", and pick UTF-8, it works fine. I would love not to have to "Save As" 620 articles by hand.
Thanks!
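For what it's worth, one way to batch this without 620 manual "Save As" steps is a short Python script. This is only a sketch: it assumes the third-party striprtf package (pip install striprtf) and hypothetical folder paths, and whether the Korean text survives depends on how the RTF files encode it (Unicode \u escapes convert cleanly; legacy codepage escapes may need extra handling):

    from pathlib import Path
    from striprtf.striprtf import rtf_to_text   # third-party: pip install striprtf

    src = Path(r"C:\corpus\rtf")     # hypothetical input folder
    dst = Path(r"C:\corpus\utf8")    # hypothetical output folder
    dst.mkdir(parents=True, exist_ok=True)

    for rtf_file in src.glob("*.rtf"):
        raw = rtf_file.read_text(encoding="latin-1")   # read the RTF markup byte-for-byte
        text = rtf_to_text(raw)                        # strip RTF control words, keep the text
        (dst / (rtf_file.stem + ".txt")).write_text(text, encoding="utf-8")

Spot-check a few output files in Notepad afterwards to confirm the Hangul is intact.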

Determine whether file is a PDF in perl?

Using Perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of the file. (I've seen cases where a job-control prefix was added to the start of a PDF file, so '%PDF-1.' was not the first seven bytes of the file.)
The subsequent implementation note from the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this form isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF, because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF'. The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of the file.
Two lines above the %%EOF should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.
In sum: read the first and last 1024 bytes of the file into a byte buffer, and check that the relevant identifying byte-string tokens are approximately where they are supposed to be. If they are, you have a reasonable expectation that you have a PDF file on your hands.
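Here is that heuristic as a sketch (in Python for brevity, though the question asks about Perl; the same reads and substring checks translate directly):

    def looks_like_pdf(path, window=1024):
        # Heuristic per the notes above: header near the start, trailer tokens near the end.
        # This is a plausibility check, not a full parse.
        with open(path, "rb") as f:
            head = f.read(window)
            f.seek(0, 2)                     # seek to the end to learn the file size
            size = f.tell()
            f.seek(max(0, size - window))
            tail = f.read(window)
        return (b"%PDF-" in head
                and b"%%EOF" in tail
                and b"startxref" in tail)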
Alternatively, the module PDF::Parse has a method called IsaPDF, which:
Returns true, if the file could be parsed and is a PDF-file.

DFM file became binary and infected

We have a DFM file which began as a text file.
After some years, in one of our newer versions, the Borland Developer Studio changed it into binary format.
In addition, the file became infected.
Can someone explain to me what I should do now? Where can I find out how the binary file structure is read?
Well, I found out what happened to the DFM file, but I don't know why.
The change from text file to binary is a known occurrence, and is covered in another question on Stack Overflow; I'll describe only the infection of the file.
In Pascal, the original language of DFM files, a string is defined like this: the first byte is the length of the string (0-255), and the remaining bytes are the string's characters. (Unlike C, where a string's length is determined by a terminating null character.)
Someone (maybe BDS?), while changing the file from text to binary, also changed every string-length byte of 13 (0D) to 10 (0A). This way the string ended after 10 characters, and the next character was read as a property's value. (0x0D is the carriage-return byte and 0x0A the line-feed byte, so this looks like a newline-translation step being applied to binary data.)
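A small Python sketch of why that one-byte change breaks parsing (the buffer contents are made up for illustration):

    def read_pascal_string(buf, offset=0):
        # first byte is the length; the next 'length' bytes are the characters
        n = buf[offset]
        return buf[offset + 1 : offset + 1 + n], offset + 1 + n

    good = bytes([13]) + b"Hello, world!"   # correct length byte 0x0D (13)
    bad  = bytes([10]) + b"Hello, world!"   # corrupted length byte 0x0A (10)

    print(read_pascal_string(good))  # (b'Hello, world!', 14)
    print(read_pascal_string(bad))   # (b'Hello, wor', 11) -- 'ld!' is now read as the next field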
I downloaded a binary editor, fixed all occurrences of length 10, and the file displayed and compiled well.
(Not only the string-length bytes were infected: one byte in the Icon.Data property was also changed from 0D to 0A.)