Read the pageviews.gz files from Wikipedia - PowerShell

I wrote a script to download the pageviewsXXXXX.gz files from Wikipedia. So far so good.
When I unzip the files, the content is illegible. Does anyone know how to read the content of the pageviews.gz files? Is there some API, or any other way to do it?
Thanks in advance

I don't know what software you used to decompress the .gz files. I just used 7-Zip on a 64-bit Windows 10 machine with success. Having done that, I found that https://dumps.wikimedia.org/other/pagecounts-raw/ provides a description of the lines in the uncompressed file.
The line
de Stadio_Arena_Garibaldi_-_Romeo_Anconetani 1 11820
is from the de (German) wikipedia, page 'Stadio_Arena_Garibaldi_-_Romeo_Anconetani', which had been referenced once in the hour-long period covered by the gzipped file, and the server returned 11,820 bytes.
This line looks like gibberish.
ar %D9%85%D8%B7%D9%8A%D8%A7%D9%81%D9%8A%D8%A9 1 16742
The first two characters, however, indicate that it represents a reference to the Arabic version of Wikipedia. The '%' sequences are percent-encoded (URL-encoded) bytes for the non-ASCII characters in the page title.
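Since the question asks how to read these files programmatically, here is a minimal sketch in Python (rather than PowerShell); the file name is a placeholder and the field layout is the one described above.

import gzip
from urllib.parse import unquote

# Stream the .gz dump line by line: "<project> <percent-encoded title> <count> <bytes>".
with gzip.open("pagecounts-XXXXX.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        project, title, count, size = line.split(" ", 3)
        # unquote() turns %D9%85... sequences (as in the Arabic example above)
        # back into readable Unicode text.
        print(project, unquote(title), int(count), int(size))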

Related

HtmlHelp hhc file doesn't show Russian characters

I use Free Pascal's chmcmd command to create a chm file from an hhp. After converting, the content comes out right, but the left pane (the tree) doesn't show Russian characters. I tried setting the charset in the hhc file to cp1251 and saved the file in Windows-1251 encoding. After that, the tree shows Russian correctly in Cool Reader but not in xCHM. In Windows it still doesn't work, only weird symbols. UTF-8 doesn't work at all.
The Microsoft CHM help format is very old and no longer maintained. It wasn't created with Unicode in mind, and various tricks are needed to be able to generate CHM files for certain encodings:
Your Windows must be set up in the target language of the help file
The content HTML pages must be created using the proper charset

Windows Converting a Folder of Files From RTF to UTF-8

I am trying to analyze a corpus of 620 Korean-language newspaper articles using the konlpy module in Python. The files are in RTF format. However, konlpy only supports files encoded in UTF-8. In Windows, how can I convert a folder containing 620 RTF-encoded articles to UTF-8 such that, upon opening the files in Notepad, the Korean characters are still intact?
Some things I have tried (but to no avail):
Used a freeware converter program (http://www.emreakkas.com/localization-tools/convert-rtf-to-txt) that converted the files into UNICODE and then tried to use a Cygwin iconv batch file to convert the files using the same script as this individual did:
cygwin syntax error near unexpected token `done'
When I do this, all of the files are there, but they are 0 KB and blank. (Let me know if you need more info about this method, as I needed to do another step just to get it to loop over my files.)
Used another freeware program (memory a little hazy on this one) that converted the RTF files, but all the characters came out as scrambled Latin characters.
I'm thinking that there has to be an easy way to do this, but everything I tried is really complicated and does not work. Another funny thing is that whenever I simply take the original RTF file, or the file converted to UNICODE, and "Save As" with UTF-8 chosen, it works fine. I would love not to have to "Save As" 620 articles by hand.
Thanks!
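For what it's worth, here is a minimal sketch of that kind of batch conversion in Python. It assumes the third-party striprtf package (pip install striprtf) is available, and the folder names are placeholders.

from pathlib import Path
from striprtf.striprtf import rtf_to_text

src_dir = Path("articles_rtf")   # placeholder input folder of .rtf files
dst_dir = Path("articles_utf8")  # placeholder output folder
dst_dir.mkdir(exist_ok=True)

for rtf_path in src_dir.glob("*.rtf"):
    # RTF source is ASCII with escape sequences, so a plain read is enough here.
    rtf_source = rtf_path.read_text(errors="ignore")
    plain_text = rtf_to_text(rtf_source)
    # Write explicitly as UTF-8 so the Korean text stays intact in Notepad and konlpy.
    (dst_dir / (rtf_path.stem + ".txt")).write_text(plain_text, encoding="utf-8")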

Determine whether file is a PDF in perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of the file. (I've seen cases where a job-control prefix was added to the start of a PDF file, so '%PDF-1.' was not the first seven bytes of the file.)
The subsequent implementation note from the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this isn't part of the ISO 32000:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed, so a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF'. The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of the file.
Two lines above the %%EOF there should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.
In sum: read the first and last 1 KB of the file into a byte buffer, check that the relevant identifying byte-string tokens are approximately where they are supposed to be, and if they are, you have a reasonable expectation that you have a PDF file on your hands.
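For illustration, a minimal sketch of that first/last 1 KB check, written in Python rather than Perl; it only looks for the markers described above, nothing more.

def looks_like_pdf(path):
    with open(path, "rb") as f:
        head = f.read(1024)           # first 1 KB: should hold the header
        f.seek(0, 2)                  # jump to end of file
        size = f.tell()
        f.seek(max(0, size - 1024))
        tail = f.read(1024)           # last 1 KB: should hold the trailer
    # Header: '%PDF-1.N' (also accept the older '%!PS-Adobe-N.n PDF-M.m' form).
    has_header = b"%PDF-" in head or b"%!PS-Adobe-" in head
    # Trailer: 'startxref' followed a couple of lines later by '%%EOF'.
    has_trailer = b"startxref" in tail and b"%%EOF" in tail
    return has_header and has_trailer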
The module PDF::Parse has a method called IsaPDF, which:
Returns true, if the file could be parsed and is a PDF-file.

Perl Net::FTP and non-ASCII (UTF8) characters in file names

I am using Net::FTP to access a PVR (satellite receiver) and retrieve recorded video files. Obtaining a list of all files using the dir() subroutine works fine; however, if file names contain non-ASCII (UTF-8) characters, calls to mdtm() and get() fail for these files. Here's an example (containing a German umlaut):
Net::FTP=GLOB(0x253d000)>>> MDTM /DataFiles/Kommissar Beck ~ Tödliche Kunst.rec
Net::FTP=GLOB(0x253d000)<<< 550 Can't access /DataFiles/Kommissar Beck ~ Tödliche Kunst.rec
File names only containing ASCII characters work well. Accessing files with non-ASCII characters through other FTP software works well too.
Does anyone have an idea how I can possibly make this work? Obviously I cannot simply avoid "umlauts" in file names.
Thank you ikegame and Slaven Rezic, your suggestions helped me solve the problem.
To sum it up: it is a bug in the Topfield SRP2100's FTP implementation; the problem is not Perl- or Net::FTP-related. The MDTM command does not accept non-ASCII characters, while the RETR command does. I checked with a network sniffer that my code and Net::FTP were doing everything right: all file names sent in FTP commands were 100% correct.
I worked around the problem by parsing the date shown in the output of dir() instead of using MDTM for non-ASCII file names -- not a nice solution but it worked.
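A minimal sketch of that workaround, shown here with Python's ftplib rather than Perl's Net::FTP: take the modification date from the LIST output instead of sending MDTM for names with non-ASCII characters. The host and path are placeholders, and a Unix-style listing format is assumed.

from ftplib import FTP

ftp = FTP("pvr.example")          # placeholder host
ftp.login()
ftp.encoding = "utf-8"            # send and receive file names as UTF-8

lines = []
ftp.retrlines("LIST /DataFiles", lines.append)

for line in lines:
    # Unix-style listing: perms links owner group size month day time/year name;
    # split into at most 9 fields so names containing spaces stay intact.
    parts = line.split(None, 8)
    if len(parts) == 9:
        month, day, time_or_year, name = parts[5], parts[6], parts[7], parts[8]
        print(name, "->", month, day, time_or_year)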

Uploading Amazon Inventory UTF 8 Encoding

I am trying to upload my English inventory to various European Amazon sites. The issue I am having is that the accented characters found in certain languages are not displayed correctly when an "inventory file" is uploaded to Amazon. The inventory file is a tab-delimited text file.
current setup:
$type = 'text/tab-separated-values; charset=utf-8';
header('Content-Type:'.$type);
header('Content-Disposition: attachment; filename="inventory-'.$_GET['cc'].'.txt"');
header('Content-Length: ' . strlen($data));
header('Content-Encoding: UTF-8');
When the text file is output and saved, it looks exactly how it should when opened in Windows (all the characters are correct), but for some reason Amazon doesn't see it as UTF-8 and re-encodes it, producing all of the characters found here:
http://www.i18nqa.com/debug/utf8-debug.html
I have tried adding the BOM to the top of the file but this just results in amazon giving an error. Has anyone else experienced this?
As #fvu pointed out in his comment, Amazon is expecting ISO-8859-1, not UTF-8. That's why you should use PHP's utf8_decode function when writing to your file.
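The same re-encoding step, sketched in Python for illustration (the answer above uses PHP's utf8_decode): read the UTF-8 inventory file and write it back out as ISO-8859-1. The file names are placeholders.

# Read the tab-delimited inventory as UTF-8.
with open("inventory-de.txt", encoding="utf-8") as src:
    text = src.read()

# Characters that don't exist in ISO-8859-1 would raise an error here;
# errors="replace" substitutes '?' instead, mirroring utf8_decode's lossy
# handling of such characters.
with open("inventory-de-latin1.txt", "w", encoding="iso-8859-1", errors="replace") as dst:
    dst.write(text)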
OK, so after a lot of trying, it turns out that the characters needed to be decoded. I opened the text files in Excel and they showed up as garbled sequences like "ü"; running them through PHP's utf8_decode turned them back into the correct characters, EVEN THOUGH the text file had shown them as the right characters... very confusing.
To anyone out there having difficulties with UTF-8: try decoding first.
Thanks for your help.