Scripting with sed and wget in Windows

I have an issue with Internet connectivity in a LAN. Some users are happy and some complain about the Internet speed. So I came up with the idea of installing software on three different PCs, downloading/uploading a file on all of them at the same time, and recording the speed. Then I will be able to create a graph from the data I acquire.
I am looking for a way to download several files and check the speed. I found How to grep download speed from wget output? for wget and sed. How do I use wget -O /dev/null http://example.com/index.html 2>&1 | sed -e 's|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|' for Windows? I already installed wget and sed on Windows.
All PCs are running Windows XP or 7.

sed works the same way on Windows. The differences are that /dev/null doesn't exist (the equivalent is NUL) and that cmd.exe does not treat single quotes as quoting characters, so the sed expression needs double quotes.
So:
wget -O NUL http://example.com/index.html 2>&1 | sed -e "s|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|"
should work on Windows. The 2>&1 redirection is also supported by cmd.exe.
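To record only the speed, the same sed expression can be wrapped in a small function. This is a minimal sketch, assuming a POSIX shell is available (for example Cygwin or Git Bash on Windows); the name extract_speed and the sample log line in the comment are made up for illustration. It uses sed -n with the p flag so that lines without a speed are dropped instead of being echoed.

```shell
#!/bin/sh
# extract_speed (hypothetical helper): read wget's log on stdin and print
# only the average speed from the final "saved" line, e.g. "1.23 MB/s".
# -n plus the trailing p prints matching lines only.
extract_speed() {
    sed -n 's|^.*(\([0-9.][0-9.]* [KM]B/s\)).*$|\1|p'
}

# Assumed shape of the line being matched (illustrative only):
#   2024-01-01 10:00:00 (1.23 MB/s) - 'index.html' saved [1256/1256]
#
# Usage (needs network access, so it is left commented out):
#   wget -O /dev/null http://example.com/index.html 2>&1 | extract_speed
```

Appending the result to a file together with a timestamp would give the data points for the graph.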

Related

Perl using the -i option on a vboxsf share: Can't remove input_file Text file busy, skipping file

System: Arch Linux in VirtualBox 5.1.26 on Windows 10 Host
I am trying to use perl like sed in the terminal for in-place substitution in the input file:
perl -i -p -e 's/orig/replace/g' input_file
But I always get:
Can't remove input_file Text file busy, skipping file
This happens only if the file is inside a VirtualBox vboxsf share. With all other tools (sed, mv, vim or whatever) it is no problem to change the file.
This problem seems to be related to:
https://www.virtualbox.org/ticket/2553
https://forums.virtualbox.org/viewtopic.php?t=4437
I can't find any solution googling around :(
Update:
Using perl -i.bak -p -e 's/orig/replace/g' input_file I get a similar message:
Can't rename input_file to input_file.bak: Text file busy, skipping file.
This is exactly the same message that gedit shows.
So it is the same behavior, but googling around I can only find the gedit topic. It seems no one has noticed this with perl -i.
While you are running a unix OS, you are still using a Windows file system. NTFS doesn't support anonymous files like unix file systems, and Perl -i requires support for anonymous files.
The workaround is to use a temporary file by passing an extension to -i (-i<ext>, e.g. -i~) instead of a bare -i.
I have the same problem. My solution is a bash script: copy the files to a tmp dir, search and replace there, overwrite the original files with the tmp files, then delete the tmp dir. If you need to, you can add parameters to the script for a dynamic search and replace, and create an alias so you can call the script directly from anywhere.
#!/bin/bash
echo "Removing text from .log files..."
echo "Creating tmp-dir..."
mkdir /tmp/myTmpFiles/
echo "Copy .log files to tmp..."
cp -v /home/user/sharedfolder/*.log /tmp/myTmpFiles/
echo "Search and Replace in tmp-files..."
perl -i -p0e 's/orig/replace/g' /tmp/myTmpFiles/*.log
echo "Copy .log to sharedfolder"
cp -v /tmp/myTmpFiles/*.log /home/user/sharedfolder/
echo "Remove tmp-dir..."
rm -vr /tmp/myTmpFiles/
echo "Done..."

Batch rename with command line

I have some files: file1.txt, file2.txt and I would like to rename them like this: file1.something.txt and file2.something.txt
I looked for some similar questions and I come up with this:
for i in file*.txt; do echo mv $i file*.something.txt; done
but unfortunately the output is:
mv file1.txt file*.something.txt
mv file2.txt file*.something.txt
and therefore only 1 file is created.
Could somebody please help?
(I am using a macbook air, I am not sure if this is relevant)
Thank you very much
Try this :
rename -n 's/\.txt$/.something.txt/' *.txt
(remove -n switch when your tests are OK)
There are other tools with the same name which may or may not be able to do this, so be careful.
If you run the following command (GNU)
$ file "$(readlink -f "$(type -p rename)")"
and you have a result like
.../rename: Perl script, ASCII text executable
and not containing:
ELF
then this seems to be the right tool =)
If not, to make it the default (usually already the case) on Debian and derivatives like Ubuntu:
$ sudo update-alternatives --set rename /path/to/rename
(replace /path/to/rename with the path of your Perl rename command).
If you don't have this command, search your package manager to install it, or install it manually.
Last but not least, this tool was originally written by Larry Wall, the creator of Perl.
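If the Perl rename tool is not installed (it usually isn't on a Mac out of the box), a plain shell loop with parameter expansion does the same job. A minimal sketch: add_suffix is a made-up name, and the file*.txt pattern and the .something part come from the question. ${f%.txt} strips the trailing .txt, which is the piece the original loop was missing.

```shell
#!/bin/sh
# add_suffix DIR (hypothetical helper): rename DIR/fileN.txt to
# DIR/fileN.something.txt for every match of file*.txt.
add_suffix() {
    for f in "$1"/file*.txt; do
        [ -e "$f" ] || continue                 # glob matched nothing
        case $f in
            *.something.txt) continue ;;        # already renamed, skip
        esac
        mv -- "$f" "${f%.txt}.something.txt"    # strip .txt, add new ending
    done
}

# Usage, in the directory that holds the files:
#   add_suffix .
```

Putting echo in front of mv gives a dry run, like rename -n.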

Using wget (for windows) to download all MIDI files

I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using, like H, -t1, -N etc., though they can be looked up online. But I also had to download files matching a wildcard from a URL. The command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
After -P you specify the path where you want to save the files, and after -A you provide the wildcard pattern. In your case that would be "*.mid".
-A means accept: it lists the files to accept from the given URL. Similarly, -R gives a reject list.
You may have better luck (at least, you'll get more MIDI files), if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is now http://www.hymntime.com/tch/.

Executing perl script inside bash script

I inherited a long bash script that I recently needed to modify. The bash script is run as a cronjob on a daily basis. I am decent with bash scripting, but I do not know much about Perl.
I had to substitute all "rm" commands with a call to a perl script that does something similar (for security purposes). This script was not written by me, so there is no -f flag to skip the confirmation prompt. Therefore, to automate this script I pipe "yes" to the script.
Here is an example where I am sequentially deleting two directories:
echo REMOVING FILES TO SAVE DISK SPACE
echo "yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir1>"
yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir1>
echo "yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir2>"
yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir2>
echo DONE.
In my output file, I see the following:
REMOVING FILES TO SAVE DISK SPACE
yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir1>
yes | sudo nice -n -10 perl <path_to_delete_script.pl> -dir <del_dir2>
DONE.
It does not appear that the perl script has run. Yet when I copy and paste those two commands into the terminal, they both run fine.
Any help is appreciated. Thank you in advance.
You can simply do:
yes | ./myscript.pl
Thanks for all the comments. I ended up changing the group and permissions of the tool and all output files. This allowed me to run the perl script without using "sudo," which others pointed out is bad practice.

Multiple simultaneous downloads using Wget?

I'm using wget to download website content, but wget downloads the files one by one.
How can I make wget download using 4 simultaneous connections?
Use aria2:
aria2c -x 16 [url]
#          |
#          |
#          |
#          ----> the number of connections
http://aria2.sourceforge.net
Wget does not support multiple socket connections in order to speed up download of files.
I think we can do a bit better than gmarian's answer.
The correct way is to use aria2.
aria2c -x 16 -s 16 [url]
#          |     |
#          |     |
#          |     |
#          -------> the number of connections here
Official documentation:
-x, --max-connection-per-server=NUM: The maximum number of connections to one server for each download. Possible Values: 1-16 Default: 1
-s, --split=N: Download a file using N connections. If more than N URIs are given, first N URIs are used and remaining URLs are used for backup. If less than N URIs are given, those URLs are used more than once so that N connections total are made simultaneously. The number of connections to the same host is restricted by the --max-connection-per-server option. See also the --min-split-size option. Possible Values: 1-* Default: 5
Since GNU parallel was not mentioned yet, let me give another way:
cat url.list | parallel -j 8 wget -O {#}.html {}
I found (probably) a solution:
In the process of downloading a few thousand log files from one server
to the next I suddenly had the need to do some serious multithreaded
downloading in BSD, preferably with Wget as that was the simplest way
I could think of handling this. A little looking around led me to
this little nugget:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url]
Just repeat the wget -r -np -N [url] for as many threads as you need...
Now given this isn’t pretty and there are surely better ways to do
this but if you want something quick and dirty it should do the trick...
Note: the option -N makes wget download only "newer" files, which means it won't overwrite or re-download files unless their timestamp changes on the server.
Another program that can do this is axel.
axel -n <NUMBER_OF_CONNECTIONS> URL
For basic HTTP auth:
axel -n <NUMBER_OF_CONNECTIONS> "https://user:password@domain.tld/path/file.ext"
Ubuntu man page.
A new (but not yet released) tool is Mget.
It already has many of the options known from Wget and comes with a library that allows you to easily embed (recursive) downloading into your own application.
To answer your question:
mget --num-threads=4 [url]
UPDATE
Mget is now developed as Wget2 with many bugs fixed and more features (e.g. HTTP/2 support).
--num-threads is now --max-threads.
I strongly suggest using httrack.
Example: httrack -v -w http://example.com/
It mirrors the site with 8 simultaneous connections by default. Httrack has tons of options to play with. Have a look.
As other posters have mentioned, I'd suggest you have a look at aria2. From the Ubuntu man page for version 1.16.1:
aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP is uploaded to the BitTorrent swarm. Using Metalink's chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.
You can use the -x flag to specify the maximum number of connections per server (default: 1):
aria2c -x 16 [url]
If the same file is available from multiple locations, you can choose to download from all of them. Use the -j flag to specify the maximum number of parallel downloads for every static URI (default: 5).
aria2c -j 5 [url] [url2]
Have a look at http://aria2.sourceforge.net/ for more information. For usage information, the man page is really descriptive and has a section on the bottom with usage examples. An online version can be found at http://aria2.sourceforge.net/manual/en/html/README.html.
wget can't download over multiple connections; instead you can try another program such as aria2.
Use:
aria2c -x 10 -i websites.txt >/dev/null 2>/dev/null &
In websites.txt put one URL per line, for example:
https://www.example.com/1.mp4
https://www.example.com/2.mp4
https://www.example.com/3.mp4
https://www.example.com/4.mp4
https://www.example.com/5.mp4
try pcurl
http://sourceforge.net/projects/pcurl/
uses curl instead of wget, downloads in 10 segments in parallel.
People always say it depends, but when it comes to mirroring a website, the best tool is httrack. It is super fast and easy to work with. The only downside is its so-called support forum, but you can find your way using the official documentation. It has both a GUI and a CLI, and it supports cookies; just read the docs. (Be careful with this tool: you can download the whole web onto your hard drive.)
httrack -c8 [url]
By default the maximum number of simultaneous connections is limited to 8 to avoid server overload.
Use xargs to make wget work on multiple files in parallel:
#!/bin/bash
mywget()
{
    wget "$1"
}
export -f mywget
# run wget in parallel using 8 processes/connections
# (-n 1 is dropped because GNU xargs treats it as mutually exclusive with -I)
xargs -P 8 -I {} bash -c 'mywget "$1"' _ {} < list_urls.txt
aria2 options: the right way to work with files smaller than 20 MB
aria2c -k 2M -x 10 -s 10 [url]
-k 2M splits the file into 2 MB chunks.
-k (--min-split-size) has a default value of 20M. If you don't set this option and the file is under 20 MB, it will only run on a single connection no matter what values -x and -s have.
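To see how -k, -s and -x interact, here is a rough back-of-the-envelope model. It is my reading of the man page, not aria2's actual algorithm: the number of pieces is the file size divided by the min split size (at least 1), capped by both -s and -x. The function name and the whole-MiB units are assumptions made for the sketch.

```shell
#!/bin/sh
# estimate_connections SIZE_MIB K_MIB SPLIT MAX_PER_SERVER
# Hypothetical estimate only; sizes are whole MiB for simplicity.
estimate_connections() {
    n=$(( $1 / $2 ))              # how many K-sized pieces fit in the file
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$3" ]; then n=$3; fi   # capped by -s
    if [ "$n" -gt "$4" ]; then n=$4; fi   # capped by -x
    echo "$n"
}

# A 15 MiB file with the default -k 20M, -s 10 -x 10:
#   estimate_connections 15 20 10 10
# The same file with -k 2M:
#   estimate_connections 15 2 10 10
```

Under this model the first call gives a single connection and the second gives several, which matches the behaviour described above.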
You can use xargs
-P is the number of processes, for example, if set -P 4, four links will be downloaded at the same time, if set to -P 0, xargs will launch as many processes as possible and all of the links will be downloaded.
cat links.txt | xargs -P 4 -I{} wget {}
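The xargs one-liner can be wrapped in a function that also lets you do a dry run first. A minimal sketch: fetch_list and the FETCH_CMD variable are made-up names, and it assumes an xargs that supports -P (GNU and BSD xargs do; it is not in POSIX).

```shell
#!/bin/sh
# fetch_list LIST [JOBS] (hypothetical helper)
# Runs one downloader process per URL in LIST, JOBS at a time (default 4).
# FETCH_CMD overrides the downloader, e.g. FETCH_CMD=echo for a dry run.
fetch_list() {
    # Blank lines are dropped first. The command expansion is left
    # unquoted on purpose so that "wget -nv" splits into two words.
    grep -v '^[[:space:]]*$' "$1" |
        xargs -P "${2:-4}" -n 1 ${FETCH_CMD:-wget -nv}
}

# Usage:
#   FETCH_CMD=echo fetch_list links.txt 4   # just print what would be fetched
#   fetch_list links.txt 4                  # really download, 4 at a time
```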
I'm using GNU parallel:
cat listoflinks.txt | parallel --bar -j ${MAX_PARALLEL:-$(nproc)} wget -nv {}
cat pipes a list of line-separated URLs to parallel.
The --bar flag shows a progress bar for the parallel execution.
The MAX_PARALLEL env var sets the maximum number of parallel downloads; use it carefully. The default here is the current number of CPUs.
Tip: use --dry-run to see what will happen if you execute the command.
cat listoflinks.txt | parallel --dry-run --bar -j ${MAX_PARALLEL} wget -nv {}
make can be parallelised easily (e.g., make -j 4). For example, here's a simple Makefile I'm using to download files in parallel using wget:
BASE=http://www.somewhere.com/path/to
FILES=$(shell awk '{printf "%s.ext\n", $$1}' filelist.txt)
LOG=download.log
all: $(FILES)
	echo $(FILES)
%.ext:
	wget -N -a $(LOG) $(BASE)/$@
.PHONY: all
default: all
Consider using Regular Expressions or FTP Globbing. By that you could start wget multiple times with different groups of filename starting characters depending on their frequency of occurrence.
This is for example how I sync a folder between two NAS:
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.10 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[0-9a-hA-H]*" --directory-prefix=/volume1/foo &
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.11 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[!0-9a-hA-H]*" --directory-prefix=/volume1/foo &
The first wget syncs all files/folders starting with 0, 1, 2... F, G, H and the second thread syncs everything else.
This was the easiest way to sync between a NAS with one 10G ethernet port (10.0.0.100) and a NAS with two 1G ethernet ports (10.0.0.10 and 10.0.0.11). I bound the two wget processes to the different ethernet ports through --bind-address and ran them in parallel by putting & at the end of each line. That way I was able to copy huge files at 2 x 100 MB/s = 200 MB/s in total.
Call Wget for each link and set it to run in background.
I tried this Python code (the !wget line is IPython/Jupyter shell syntax, so it needs a notebook or IPython session rather than plain Python):
with open('links.txt', 'r') as f1:  # Open links.txt in read mode
    list_1 = f1.read().splitlines()  # Get every line in links.txt
for i in list_1:  # Iterate over each link
    !wget "$i" -bq  # Call wget in background mode
Parameters:
-b - Run in background
-q - Quiet mode (no output)
If you are doing recursive downloads, where you don't know all of the URLs yet, wget is perfect.
If you already have a list of each URL you want to download, then skip down to cURL below.
Multiple Simultaneous Downloads Using Wget Recursively (unknown list of URLs)
# Multiple simultaneous downloads
URL=ftp://ftp.example.com
for i in {1..10}; do
    wget --no-clobber --recursive "${URL}" &
done
The above loop will start 10 wget's, each recursively downloading from the same website, however they will not overlap or download the same file twice.
Using --no-clobber prevents each of the 10 wget processes from downloading the same file twice (including full relative URL path).
& forks each wget to the background, allowing you to run multiple simultaneous downloads from the same website using wget.
Multiple Simultaneous Downloads Using curl from a list of URLs
If you already have a list of URLs you want to download, curl -Z is parallelised curl, with a default of 50 downloads running at once.
However, for curl, the list has to be in this format:
url = https://example.com/1.html
-O
url = https://example.com/2.html
-O
So if you already have a list of URLs to download, simply format the list, and then run cURL
cat url_list.txt
#https://example.com/1.html
#https://example.com/2.html
touch url_list_formatted.txt
while read -r URL; do
    echo "url = ${URL}" >> url_list_formatted.txt
    echo "-O" >> url_list_formatted.txt
done < url_list.txt
Download in parallel using curl from list of URLs:
curl -Z --parallel-max 100 -K url_list_formatted.txt
For example,
$ curl -Z --parallel-max 100 -K url_list_formatted.txt
DL% UL%  Dled  Uled  Xfers  Live   Qd Total     Current  Left    Speed
100 --   2512     0     2     0     0  0:00:01  0:00:01 --:--:--  1973
$ ls
1.html 2.html url_list_formatted.txt url_list.txt
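The list-formatting step can also be written as a filter, which makes it easy to skip blank lines and commented-out URLs. This is a sketch only; to_curl_config is a made-up name, and the URLs are wrapped in double quotes, which curl's config file format accepts for parameter values.

```shell
#!/bin/sh
# to_curl_config (hypothetical helper): read one URL per line on stdin,
# write a curl -K config on stdout with a "url" entry and -O for each.
to_curl_config() {
    while IFS= read -r url; do
        case $url in
            ''|'#'*) continue ;;          # skip blank lines and # comments
        esac
        printf 'url = "%s"\n-O\n' "$url"
    done
}

# Usage (needs network access, so it is left commented out):
#   to_curl_config < url_list.txt > url_list_formatted.txt
#   curl -Z -K url_list_formatted.txt
```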