I have a text file, 150 GB in size, containing around 2.5 billion links I need to download.
I am wondering, if I feed this file to wget, whether wget reads the entire file's contents into memory and whether that could potentially crash the machine.
I am also interested in best practices for downloading this many links; for example, will the directory wget downloads into become unusable with this many files?
What about resuming? Does wget scan from the top of the list to find where the download stopped or failed? That could take a long time.
Does it keep a log? Will the log file be really big? Is the log kept in memory before it is saved somewhere? Can I disable the log?
Basically, I am interested in the best practice for a download job of this size. Each file I am downloading is around 250 bytes.
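For what it is worth, the rough approach I have in mind is sketched below (the file name links.txt, the chunk size, and the directory layout are all just placeholders):

    # Split the URL list into chunks so that no single wget run has to deal with
    # the whole 150 GB list, and give each chunk its own download directory so
    # that no one directory ends up holding hundreds of millions of files.
    mkdir -p logs
    split -l 1000000 links.txt chunk_
    for f in chunk_*; do
      # --no-clobber skips URLs whose files already exist locally, which gives a
      # crude form of resuming if a run is interrupted and restarted.
      wget --input-file="$f" --no-clobber --no-verbose \
           --append-output="logs/$f.log" --directory-prefix="downloads/$f/"
    done

Is that a reasonable starting point, or is there a better-established pattern for jobs like this?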
Many thanks
I have been trying to find a way to download files from a URL such as: https://.com//. I learned about wget and tried quite a few options, but realized it does not download any files that do not have a direct link in the index file or anywhere else.
For example, I'd like to download everything from https://somesites.com/myfolder/myfiles/.
Let's say there is an index.html under the "myfiles" directory, along with many HTML files and a couple of directories that are all referenced and linked from the index, but also a couple of other HTML files, such as sample123.html and sample456.html, that are not linked anywhere.
With pretty much all of the common and well-known options, the wget command successfully downloads everything except sample123.html and sample456.html.
Is there any other tool that will grab ALL files located in https://somesites.com/myfolder/myfiles/, regardless of whether they have a direct link?
I also tried lftp against the HTTP URL, but it downloaded far fewer files than wget.
I looked through Stack Overflow for this, but the recommended commands are the ones (using wget) that only download files with a direct link.
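For reference, my attempts looked roughly like the line below (the exact combination of flags varied between tries; this is only a representative example):

    # Recursive download starting from the index page; wget still only follows
    # links that actually appear in the pages it fetches.
    wget --recursive --level=inf --no-parent --page-requisites \
         https://somesites.com/myfolder/myfiles/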
What you want to do is not possible and could be a security problem. Imagine that someone has, for example, a file with some sensitive data inside the folder and that file is not listed anywhere. You are asking for a tool that would also download that file.
So, as said, it is not possible; that is why it is always good advice to disable directory listing in HTTP servers as a security measure, to prevent exactly what you are trying to do.
I am trying to decrease the size of my app as much as possible because my app will have a lot of moving parts. Will minimizing files and deleting files that are not used in the app decrease the size? When installing a framework, you normally get more files than you need.
When your app is delivered, it is delivered as a ZIP file (APK == ZIP, IPA == ZIP, XAP == ZIP, etc.), so it is compressed. Of course, images and similar assets are already compressed, so compression does not shrink those objects any further.
Certainly deleting unused files will help. Does your project have a single www directory in the project where the sources are located? This will help, especially with future releases. See this FAQ if you need help converting your project to one that contains a www directory: https://software.intel.com/en-us/xdk/faqs/general#www-folder
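If you want to see what is actually taking up room in the package you ship, you can list its largest entries; a rough sketch (app.apk is a placeholder for your own build output, and an .ipa works the same way):

    # APK/IPA/XAP files are ZIP archives, so unzip can list their contents.
    # Keep only the entry lines (first field is a size), sort by size, show the top 20.
    unzip -l app.apk | awk '$1 ~ /^[0-9]+$/ && NF >= 4' | sort -k1,1 -nr | head -20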
I know that normally I have two choices of location: the Temporary folder or the Caches folder.
But can anyone tell me the exact differences?
My app will download quite a lot of images at users' request. Of course, no one needs them on the iPhone's disk permanently, but I still need to cache them in case users go back to see them within a relatively short period of time.
The Temporary folder could be one place to go; as I understand it, it will be cleared by the system. But when will it be cleared?
As for the Caches folder, will it be cleared regularly as well? If it is never cleared and I keep writing images to it, that will occupy too much space in the long term, which of course is not good for users.
So, can someone give me some hints and tell me the exact difference between these two folders?
Thanks
I would go with the Caches folder; look in NSPathUtilities.h for the relevant functions to get at that one. The Caches folder won't be backed up, but it won't necessarily be emptied either, and neither will the temp folder. /tmp would normally be cleared on reboot (well, potentially), but on the iPhone that is not something that happens very often.
The best approach would be to put data into the Caches folder using some date-based scheme, so that you can clear its contents yourself when you deem it useful to do so. You could use the file's creation or modification date to inform this decision, and simply scan at each launch (or each enter-foreground event) to determine which items are old enough that they should be removed.
I'm looking for a way to quickly open files in my project's source tree. What I've been doing so far is adding files to the file-name-cache like so:
(file-cache-add-directory-recursively (concat project-root "some/sub/folder") ".*\\.\\(py\\)$")
after which I can use anything-for-files to access any file in the source tree with about 4 keystrokes.
Unfortunately, this solution started falling over today. I added another folder to the cache and Emacs started running out of memory. What's weird is that this folder contains less than 25% of the files I'm adding, and yet Emacs's memory use goes up from 20 MB to 400 MB on adding just this folder. The total number of files is around 2000, so this memory use seems very high. Presumably I'm abusing the file cache.
Anyway, what do other people do for this? I like this solution for its simplicity and speed; I've looked at some of the many, many project management packages for emacs and none of them really grabbed me...
Thanks in advance!
Simon
Testing here gives me no problem with some 50,000 files (well, I have to say that I had to wait a while, but Emacs only used 48 MB when it finished). You seem to have been hit by some bug that you should probably report.
I'd suggest you take a look at this article. I have to support Trey's comment - I don't think your approach is very good at the moment.
Dead code is easily recognised and eliminated through code reviews; when it comes to images, however, unused images still make it into our version control. Is there any clean way of organising graphic content so that a direct correlation exists between web pages and image files?
In our current project, we create master PNG files and then export the required layers for development purposes. Recently I discovered that there is some bloat in the images folder. Searching the code for image names helps, but it is very painful when it has to be done for a hundred-odd images. So I am asking the forum for suggestions.
You could walk the website with a crawler (like wget) and remove any image that was never touched (i.e. never listed in your logs).
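A rough sketch of that idea, assuming the site can be crawled from some staging URL and the repository keeps its images under images/ (both the URL and the paths are placeholders):

    # Mirror the rendered site, then compare the set of images wget actually
    # fetched against the images checked into the repository.
    wget --mirror --page-requisites --no-parent --quiet \
         --directory-prefix=crawl https://staging.example.com/
    find crawl -iname '*.png' -printf '%f\n' | sort -u > used.txt
    find images -iname '*.png' -printf '%f\n' | sort -u > all.txt
    comm -23 all.txt used.txt   # in the repo but never fetched: deletion candidates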
A quicker way would be to just dump all the image file names found in your code.
grep -rhoE '[[:alnum:]_-]+\.png' . | sort -u (caution: untested regex; extend the extension list as needed)
If you have a one-step build and can test for dead links, then you should be able to write a script that does a clean checkout of the project, deletes a single image, and builds and tests the project. If no errors pop up, the image is unneeded.
It would take a long time to run (maybe days depending on the project size) but that's computer time, not man-hours.
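A rough sketch of such a script (the repository URL and the make build / make test-links commands are placeholders for whatever your one-step build and dead-link check actually are):

    #!/usr/bin/env bash
    # For each image, start from a clean checkout, delete that one image,
    # then build and run the dead-link check; if nothing breaks, flag the image.
    set -eu
    REPO=https://example.com/project.git   # placeholder
    git clone -q "$REPO" baseline
    for img in baseline/images/*.png; do
      name=$(basename "$img")
      rm -rf work
      git clone -q "$REPO" work
      rm "work/images/$name"
      if (cd work && make build && make test-links); then
        echo "$name" >> unused-candidates.txt   # built and passed without it
      fi
    done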