I'm trying to scrape a website recursively, but I want to exclude some webpages under that domain that contain the string "unnecessary pages". The string appears only in the page content, not in the URL. Here's the original command to build from:
wget -r --no-parent http://www.website.com
For example, I want to scrape Wikipedia, but exclude articles that contain the keyword "drugs".
Any ideas?
Thanks in advance!
One way to do this is with the following options. It will scrape a site starting at whatever path you choose and will exclude the directories you specify in LIST:
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains somesite.tld \
--no-parent \
--exclude-directories=LIST \
www.somesite.tld/path/to/start
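For instance, a minimal concrete invocation might look like this (the /ads and /sponsored directories are made-up examples; --exclude-directories takes a comma-separated list and accepts wildcards):
$ wget \
--recursive \
--no-parent \
--exclude-directories=/ads,/sponsored \
www.somesite.tld/path/to/start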
Related
I want to mirror a site using wget:
wget --mirror \
--convert-links \
--adjust-extension \
--page-requisites \
--no-parent \
--wait=2 \
--progress=bar \
--show-progress \
--output-file=$LOG_FILE \
--directory-prefix=$DIR_PATH \
$URL
Now, it has been working well, but I have come across a website where the main page I want to start from is at https://www.website.org/unique_path/here.html, yet it contains references to files or links like https://www2.website.org/unique_path/there.pdf. However, --no-parent prevents the download of the content under the www2... URL. Is there a way to circumvent this? (Or is there some option that works like --no-parent but lets me explicitly specify, via some wildcard expression, that it is OK to go and download here and there?)
"Is there a way to circumvent this?"
You are apparently looking for the Spanning Hosts options: you must provide the -H option and then you can pass a comma-separated list of acceptable domains via -D. Using your example:
wget <your current options here> -H -D www.website.org,www2.website.org <your URL here>
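Folded into the mirror command from the question above, that might look roughly like this ($LOG_FILE and $DIR_PATH are the placeholders from the question):
wget --mirror \
--convert-links \
--adjust-extension \
--page-requisites \
--no-parent \
-H -D www.website.org,www2.website.org \
--wait=2 \
--progress=bar \
--show-progress \
--output-file=$LOG_FILE \
--directory-prefix=$DIR_PATH \
https://www.website.org/unique_path/here.html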
Is it possible to build all of the packages for a specific image? I know I can build packages individually, but ideally I would like to build all of them at once, with a single command.
Alternatively, is there a way to prevent the do_rootfs task from being executed for a particular image?
Cheers, Donal
First, make an image recipe that contains a packagegroup (or just list your dependencies there directly).
$ cat sources/meta-custom/recipes-custom/images/only-packages-image.bb
SUMMARY = "All dependencies no image"
LICENSE = "CLOSED"
version = "##DISTRO_VERSION##"
BB_SCHEDULER = "speed"
# pull in the image machinery; CORE_IMAGE_BASE_INSTALL below has no effect without it
inherit core-image
# option 1 - packagegroup, package list can be reused in a real image
CORE_IMAGE_BASE_INSTALL += "\
packagegroup-alldeps \
"
# option 2 - list deps here, package list cannot be reused in a real image
CORE_IMAGE_BASE_INSTALL += "\
lshw \
systemd \
cronie \
glibc \
sqlite \
bash \
python3-dev \
python3-2to3 \
python3-misc \
python3-pyvenv \
python3-modules \
python3-pip \
wget \
apt \
pciutils \
file \
tree \
\
wpa-supplicant \
dhcpcd \
networkmanager \
curl-dev \
curl \
hostapd \
iw \
"
# remove the rootfs step
do_rootfs() {
    : # no-op so nothing is assembled into a rootfs
}
Second, make your packagegroup if you opted to reuse the list of packages (option 1):
$ cat sources/meta-custom/recipes-custom/packagegroups/packagegroup-alldeps.bb
PACKAGE_ARCH = "${MACHINE_ARCH}"
inherit packagegroup
RDEPENDS_${PN} = " \
lshw \
systemd \
cronie \
glibc \
sqlite \
bash \
python3-dev \
python3-2to3 \
python3-misc \
python3-pyvenv \
python3-modules \
python3-pip \
wget \
apt \
pciutils \
file \
tree \
\
wpa-supplicant \
dhcpcd \
networkmanager \
curl-dev \
curl \
hostapd \
iw \
"
Finally, build your new image placeholder:
$ bitbake only-packages-image
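As a quick sanity check (assuming ipk packaging and the default build directory layout), the freshly built packages should then show up under tmp/deploy/ipk:
$ ls tmp/deploy/ipk/*/*.ipk | head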
In Yocto >=4.0 this is actually pretty easy to achieve. The packagegroup method did not work for me at all.
I don't know if this works in older versions though.
Create a new file in your custom layer, e.g. meta-custom/classes/norootfs.bbclass, and put the following lines in there (as far as I can tell, the order does not matter):
deltask do_deploy
deltask do_image
deltask do_rootfs
deltask do_image_complete
deltask do_image_setscene
Then, in your meta-custom/recipes-core/images/myimage.bb, add norootfs to your other inherit directives,
e.g. the most basic one:
inherit core-image norootfs
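A minimal myimage.bb using this might look roughly like the following (the summary and package list are just illustrative):
SUMMARY = "Packages only, no rootfs"
LICENSE = "CLOSED"
IMAGE_INSTALL = "bash curl iw"
inherit core-image norootfs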
You will notice the number of tasks decreasing by a fair amount (mine went from ~4700 to ~3000), and there is no longer a complete rootfs image in build/tmp/deploy/images (apart from the bzImage and modules); you just get the plain ipk files in build/tmp/deploy/ipk.
I got this information by looking at https://docs.yoctoproject.org/ref-manual/tasks.html?highlight=do_image and .bbclass files in meta/classes where deltask is frequently used.
I'm running wget --recursive --no-parent --adjust-extension --convert-links --page-requisites --restrict-file-names=windows --keep-session-cookies --load-cookies cookies.txt http://DOMAIN/private/ and it correctly downloads the private/index.html file.
I inspected this file and it is the correct page shown only with successful authentication. It contains markup like:
<ul><li><a class="CP___PAGEID_56400" href="http://DOMAIN/private/page1.html">My private page</a></li>...
However, after fetching all the resources (images etc.) it seems to think it's finished and shuts down after 'converting links'.
If I skip --no-parent, it keeps going. So is the --no-parent flag somehow confusing wget about which subpages it may follow?
I finally realized that wget was obeying robots.txt! I changed my command to wget -e robots=off --wait 0.25 --recursive --no-parent ... and got it working. I added --wait 0.25 since I didn't want to hammer the server either.
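For reference, the full working command then looks something like this (DOMAIN is the placeholder from the question):
wget -e robots=off --wait 0.25 --recursive --no-parent --adjust-extension --convert-links --page-requisites --restrict-file-names=windows --keep-session-cookies --load-cookies cookies.txt http://DOMAIN/private/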
I've been using wget's spider mode to collect all of a website's links, but it will not return the paths to scripts. Is there any way to do this? My current wget command:
wget --spider --recursive --no-verbose --output-file=wget.txt https://www.example.com
Add the -p or --page-requisites option. That makes wget also request each page's extra assets (scripts, stylesheets, images), so their URLs show up in the log as well.
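So the spider command from the question would become something like:
wget --spider --recursive --page-requisites --no-verbose --output-file=wget.txt https://www.example.com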
I'm trying to save images from a site using wget. I have --page-requisites on the command line, but it doesn't save the images. Everything else goes fine; it even saves the extension.
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
http://leveldesigninspirationmachine.tumblr.com/
Why doesn't it get the images?