Which files should I download for a full Wikipedia image dump?

I want to download all the Chinese Wikipedia data (text + images). I downloaded the articles, but I got confused by these media files. Also, the remote-media files are ridiculously huge. What are they? Do I have to download them?
From: http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121104/
zhwiki-20121104-local-media-1.tar 4.1G
zhwiki-20121104-remote-media-1.tar 69.9G
zhwiki-20121104-remote-media-2.tar 71.1G
zhwiki-20121104-remote-media-3.tar 69.3G
zhwiki-20121104-remote-media-4.tar 48.9G
Thanks!

I'd assume that they are the media files included from Wikimedia Commons, which are most of the images in the articles. From https://wikitech.wikimedia.org/wiki/Dumps/media:
For each wiki, we dump the image, imagelinks and redirects tables via /backups/imageinfo/wmfgetremoteimages.py. Files are written to /data/xmldatadumps/public/other/imageinfo/ on dataset2.
From the above we then generate the list of all remotely stored (i.e. on commons) media per wiki, using different args to the same script.
And roughly 260 GB total isn't that huge for every media file used by the Chinese Wikipedia :-)

Related

Can I upload already-encoded content to Azure Media Services, instead of uploading a video and then encoding it? How?

I want to encode locally and upload, to avoid spending money on encoding.
Is this allowed? I did not find any documentation on it.
When I simply uploaded a file to storage, the Media Services account said it could not play it without the ISM file. I had to encode (re-encode? it was already an MP4) the file I had uploaded, which is exactly what I want to avoid.
Yes, this is absolutely allowed and encouraged for customers, especially ones that have custom encoding requirements that we may not support.
You can upload an .ism file that you create along with your encoded files. It's a simple SMIL 2.0 format XML file that points to the source files used.
It's a bit hard to find searching the docs, but there is a section outlining the workflow here - https://learn.microsoft.com/en-us/azure/media-services/latest/encode-dynamic-packaging-concept#on-demand-streaming-workflow
There is also a .NET Sample showing how to do it here:
https://github.com/Azure-Samples/media-services-v3-dotnet/tree/main/Streaming/StreamExistingMp4
You can see, at line 111, the code that generates the .ism file:
// Generate the Server manifest for streaming .ism file.
// This file is a simple SMIL 2.0 file format schema that includes references to the uploaded MP4 files in the XML.
var manifestsList = await AssetUtils.CreateServerManifestsAsync(client, config.ResourceGroup, config.AccountName, inputAsset, locator);
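For reference, a server manifest is just a small SMIL 2.0 XML document whose `src` attributes point at the MP4 files in the asset. A minimal hand-written sketch might look like the following (the file name `video.mp4` and the client-manifest name are hypothetical; the exact metadata the service expects is what `CreateServerManifestsAsync` produces for you):

```xml
<?xml version="1.0" encoding="utf-8"?>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <!-- Tells the streaming endpoint which formats the asset contains -->
    <meta name="formats" content="mp4" />
    <meta name="clientManifestRelativePath" content="video.ismc" />
  </head>
  <body>
    <switch>
      <!-- Both point at the same progressive MP4 in this single-bitrate sketch -->
      <video src="video.mp4" />
      <audio src="video.mp4" systemLanguage="eng" />
    </switch>
  </body>
</smil>
```

With a multi-bitrate encode you would list one `<video>` element per MP4 rendition inside the `<switch>`.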

Providing EXIF-free images in a gallery or other webpage

First, thanks for any and all help regarding this topic.
Sites like Facebook and Twitter strip EXIF information from images as they are uploaded. My goal is to allow users to upload images to our platform (working with Nextcloud and others) with full EXIF information; however, we need to display images that do not contain EXIF information or any other metadata. Without stripping the metadata and creating a second, EXIF-free image for each upload, is it possible to simply hide the EXIF info so that, if a user downloads the image, the EXIF is not embedded?
We were told that the only way to do this is to keep a second, EXIF-free copy (whether it's created before, during, or after upload is irrelevant). I'm hoping there's a way we can simply display such a copy without doubling our physical storage requirements.
Thanks again for your help.
EXIF is metadata, along with IPTC, XMP, AFCP, ICC, FPXR, MPF, JPS and plain comment segments, and that's just for the JFIF/JPEG file format alone. Other image file formats support even more kinds of metadata.
You wrote it yourself: a download. So it's a file in any case. Pictures are files, just like executables, movies, texts, music and archives are files. Metadata is part of a file's content, so whoever accesses the raw bytes of the file can read everything in it; there is no "please don't look" protection. Whether you strip the metadata on the fly every time a download is requested, or do it once up front to preserve performance at the cost of storage space, remains your decision.
If there were something as simple as a "don't show" flag, the metadata would still be in the file and could easily be extracted by software written to ignore that flag. Seriously, there's no shortcut here: do it properly, and don't try to save yourself work at the wrong end.
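The "strip on the fly" option above can be sketched in a few lines. This is a minimal example assuming Pillow is available; it copies only the pixel data into a fresh image, so EXIF, IPTC, XMP and ICC data from the original are all left behind (the file name and JPEG quality are arbitrary choices):

```python
# Sketch: produce a metadata-free JPEG at download time (assumes Pillow).
from io import BytesIO
from PIL import Image

def stripped_jpeg_bytes(path: str) -> bytes:
    """Return the image at `path` re-encoded as JPEG with no metadata."""
    with Image.open(path) as im:
        # Copy pixels only into a brand-new image; metadata is not carried over.
        clean = Image.new(im.mode, im.size)
        clean.putdata(list(im.getdata()))
        buf = BytesIO()
        clean.save(buf, format="JPEG", quality=95)
        return buf.getvalue()
```

A web app would call this in its download handler and serve the returned bytes instead of the stored file, trading CPU per request for the saved storage.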

How do I get my site images to properly load within LowCodeUnit?

When using LowCodeUnit, after 'Unpacking Latest', syncing with GitHub, and going to my lowcodeunit.com site, some of the images on my site are not showing up as desired.
Currently, LowCodeUnit only supports file paths (and file names) without spaces. Entire file paths are also case-sensitive (including file extensions). If needed, there is a great free Microsoft tool for batch-renaming files on Windows called PowerRename, which you can download here:
https://github.com/microsoft/PowerToys/releases/
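If you'd rather script the cleanup than use a GUI tool, a small stand-alone sketch like the following removes spaces from file and directory names under a given root (the hyphen replacement is an arbitrary choice; run it on a copy of your assets first):

```python
# Sketch: batch-rename files and folders so paths contain no spaces.
import os

def remove_spaces(root: str) -> None:
    """Rename every file/dir under `root`, replacing spaces with hyphens."""
    # Walk bottom-up so renaming a directory doesn't invalidate paths
    # we still need to visit inside it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            if " " in name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, name.replace(" ", "-")))
```

Note this only fixes spaces; you would still need to update any references in your pages, and case-sensitivity issues need a separate pass.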

Embed large or non-standard files in github issues

We use Github's issue tracking for a lot of project management, both code-related and not. For simple files like images, we can share them with Github's CDN via drag and drop into an issue or comment. However, this has limitations:
GitHub imposes a file-type restriction: only GIF, JPEG, JPG, PNG, DOCX, GZ, LOG, PDF, PPTX, TXT, XLSX or ZIP are allowed.
Files larger than 25 MB or images larger than 10 MB are not supported.
While URLs are anonymized with Camo (https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/about-anonymized-image-urls), the files themselves are not securely stored or password-protected. This is really problematic when the shared files contain a lot of sensitive data.
Is there a plugin or simple solution that would let us securely attach large or non-standard file types, while maintaining the nice UI of github issues? We'd be ok using a 3rd-party storage system (like Drive/Dropbox/Sharepoint/AWS), but forcing users to upload something then copy/paste a link over into the issue isn't ideal.
There's no way for you to embed other file types in an issue without using a standard Markdown link that gets rendered through Camo. That's because GitHub has a strict Content-Security-Policy that prevents files from other domains from even loading. That's intentional, since it prevents people from trying to embed tracking content or content that changes based on the user (e.g., ads).
Even if you could use some way to embed the files in the page, your browser probably wouldn't render them due to the Content-Security-Policy.

APIs/Services to Generate Thumbnails of Document

We have a website which allows users to upload documents (Word, PDF, PPT, etc.).
We upload the files to Amazon S3, so each file has its own web URL.
For these uploaded documents, we would like to generate thumbnails. The thumbnail needs to be generated based on the document's content (like Google's document viewer).
Is there any Service/API, which generates thumbnails of documents by it's URL?
Thanks and Regards,
Ashish Shukla
You could roll your own solution. I'm evaluating 2JPEG, and it appears to support 275 formats including Word, Excel, Publisher and PowerPoint files. fCoder recommends running 2JPEG as a scheduled background task. The command-line syntax is pretty comprehensive. I don't think it can process remote AWS files directly, but you could download the file temporarily, generate the thumbnail, and then delete the local copy.
Here's a sample snippet to generate a thumbnail for a specific file:
2jpeg.exe -src "c:\files\myfile.docx" -dst "c:\files" -oper Resize size:"100 200" fmode:fit_width -options pages:"1" scansf:no overwrite:yes template:"{Title}_thumb.jpg" silent:yes
You should also take a look at AWS Lambda. In fact, this presentation from the AWS re:Invent 2014 conference shows a live example of using Lambda to generate thumbnail images. This solution will be very reliable and very cost-effective, but has the downside that you'll be responsible for maintaining the code and debugging issues.
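The core of such a Lambda function is small. Here is a minimal sketch of just the thumbnail step, assuming Pillow is packaged with the function; it only covers image inputs (for Word/PDF you would first render page 1 to an image with a separate tool), and the 200-pixel size and JPEG output are arbitrary choices:

```python
# Sketch: the resize step a thumbnail Lambda would run (assumes Pillow).
from io import BytesIO
from PIL import Image

def make_thumbnail(image_bytes: bytes, max_size: int = 200) -> bytes:
    """Return a JPEG thumbnail no larger than max_size on its longest edge."""
    with Image.open(BytesIO(image_bytes)) as im:
        im = im.convert("RGB")              # JPEG has no alpha channel
        im.thumbnail((max_size, max_size))  # resizes in place, keeps aspect
        out = BytesIO()
        im.save(out, format="JPEG")
        return out.getvalue()
```

In a real deployment, an S3 ObjectCreated event would trigger the handler, which downloads the object with boto3, calls something like `make_thumbnail`, and uploads the result to a thumbnails bucket.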