How should I download a specific file type from a folder (and ONLY its subfolders) using wget or httrack?

I'm trying to use HTTrack or Wget to download some .docx files from a website. I want to do this only for one folder and its subfolders. Ex: www.examplewebsite.com/doc (this goes down 5 more levels).
What would be a good way to do this?

The previously proposed answer is ludicrous, considering the "spider" option has ALWAYS specifically NOT downloaded files, but only followed links to check them.
Better late than never: here is the command you seek. It mirrors the files with the desired extensions locally and, as a bonus, pulls down the target HTML and adjusts it so that when you open it locally and click the links, they point to the copies on your local drive.
wget -e robots=off -r -k -A docx,doc "https://<url>"
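Note that -r follows links anywhere on the host, including above /doc. If, as in the question, you only want www.examplewebsite.com/doc and its subfolders, a minimal variant using wget's standard --no-parent flag would be:
wget -e robots=off -r -np -k -A docx,doc "https://www.examplewebsite.com/doc/"
The -np (--no-parent) flag stops the crawl from ascending above the starting directory, so only /doc and everything below it is fetched.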
If this works for you, I would appreciate the answer points!

You can use --spider with -r (the recursive option) and --accept to filter for the files you're interested in:
wget --spider -r --accept "*.docx" <url>
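Be aware that --spider makes wget only check that the files exist, without saving them (the point raised in the other answer). To actually keep the files, drop that flag; a sketch of the same command:
wget -r --accept "*.docx" <url>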

Usage
wget -r -np -A pdf,doc https://web.cs.ucla.edu/~harryxu/
Result
tree
└── web.cs.ucla.edu
    ├── ~harryxu
    │   ├── papers
    │   │   ├── chianina-pldi21.pdf
    │   │   ├── dorylus-osdi21.pdf
    │   │   ├── genc-pldi20.pdf
    │   │   ├── jaaru-asplos21.pdf
    │   │   ├── jportal-pldi21.pdf
    │   │   ├── li-sigcomm20.pdf
    │   │   ├── trimananda-fse20.pdf
    │   │   ├── vigilia-sec18.pdf
    │   │   ├── vora-asplos17.pdf
    │   │   ├── wang-asplos17.pdf
    │   │   ├── wang-osdi18.pdf
    │   │   ├── wang-osdi20.pdf
    │   │   ├── wang-pldi19.pdf
    │   │   └── zuo-eurosys19.pdf

Related

Using GitHub Actions in a Single Repository with Multiple Projects

I am fairly competent in using GitHub Actions to build a variety of languages and orchestrate deployments, and I've even done cross-repository actions using webhooks, so I'd say that I'm pretty familiar with working with them.
I often find myself doing a lot of scratch projects to test out an API or make a demo, and these don't usually merit their own repositories, but I'd like to save them for posterity rather than just making Gists out of them, Gists being largely impossible to search. I'd like to create a scratch repository, with folders per language, like:
.
└── scratch
    ├── go
    │   ├── dancing
    │   │   ├── LICENSE-APACHE
    │   │   ├── LICENSE-MIT
    │   │   ├── main.go
    │   │   └── README.md
    │   ├── gogettur
    │   │   ├── LICENSE-APACHE
    │   │   ├── LICENSE-MIT
    │   │   ├── main.go
    │   │   └── README.md
    │   └── streeper
    │       ├── LICENSE-APACHE
    │       ├── LICENSE-MIT
    │       ├── main.go
    │       └── README.md
    ├── node
    │   └── javawhat
    │       ├── index.js
    │       ├── LICENSE-APACHE
    │       ├── LICENSE-MIT
    │       └── README.md
    └── rust
        ├── logvalanche
        │   ├── Cargo.toml
        │   ├── LICENSE-APACHE
        │   ├── LICENSE-MIT
        │   ├── README.md
        │   └── src
        ├── streamini
        │   ├── Cargo.toml
        │   ├── LICENSE-APACHE
        │   ├── LICENSE-MIT
        │   ├── README.md
        │   └── src
        └── zcini
            ├── Cargo.toml
            ├── LICENSE-APACHE
            ├── LICENSE-MIT
            ├── README.md
            └── src
I'd like to generalize the GitHub Actions per language: for Go, use go test ./... and go build; for Rust, cargo test and cargo build; and so on.
I know that I could have a workflow for each created project, but this would be tedious: I'd end up copying and pasting most of the time, and every build would run on every change in the entire repository. I don't want to be building node/javawhat if only rust/zcini has changed.
Therefore I have a few questions:
Is it possible to have a workflow only run when certain files have changed, rather than running everything every single time?
Is there a way to generalize my workflows so that every dir in rust/ uses the same generic workflow, or will I need one workflow per project in the repository?
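For the first question: yes, workflow triggers support path filters. A minimal sketch of one generic per-language workflow using GitHub Actions' on.push.paths (the file name, runner, and the loop over project directories are illustrative assumptions, not a tested setup):
# .github/workflows/rust.yml (illustrative)
name: rust-scratch
on:
  push:
    paths:
      - "rust/**"   # run only when something under rust/ changes
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test and build every Rust project
        run: |
          for dir in rust/*/; do
            (cd "$dir" && cargo test && cargo build)
          done
One such workflow per language also addresses the second question: every directory under rust/ shares it, at the cost of rebuilding all Rust projects whenever any one of them changes.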

Importing json resources inside .pex (Python Executable (format by Twitter))

I'm using Pants, a build tool engineered at Twitter, to manage many projects inside my monorepo. It outputs .pex files when I complete a build; these are "binaries" that package the bare minimum dependencies each project needs (actually archives that are decompressed at runtime). My issue is that a utility my code has used for a long time fails to find some .json files I have stored under my environments library, now that I'm using Pants. All my other code seems to run fine. I'm pretty sure it has to do with my config; perhaps I'm not declaring the resources properly so my code can find them, though when I unzip my_app.pex the resources I want are in the package and located in the proper directory. Here is the method my utility uses to load the JSON resources:
if test_env:
    file_name = "test_env.json"
elif os.environ["ENVIRONMENT_TYPE"] == "PROD":
    file_name = "prod_env.json"
else:
    file_name = "dev_env.json"
try:
    json_file = importlib.resources.read_text("my_apps.environments", file_name)
except FileNotFoundError:
    logger.error(f"my_apps.environments->{file_name} was not found")
    exit()
config = json.loads(json_file)
Here is the BUILD file I currently use for these resources:
python_library(
    dependencies=[
        ":dev_env",
        ":prod_env",
        ":test_env"
    ]
)

resources(
    name="dev_env",
    sources=["dev_env.json"]
)

resources(
    name="prod_env",
    sources=["prod_env.json"]
)

resources(
    name="test_env",
    sources=["test_env.json"]
)
And here is the BUILD file for the utility that loads these resources (the Python code shown above):
python_library(
    name="environment_handler",
    sources=["environment_handler.py"],
    dependencies=[
        "my_apps/environments:dev_env",
        "my_apps/environments:prod_env",
        "my_apps/environments:test_env"
    ]
)
I always get a FileNotFoundError exception, and I'm confused because the files are available at runtime. What's causing these files to be inaccessible, and is there a different format I need to use when declaring the JSON resources?
Also, for context, here is the decompressed .pex file (actually just the source-code dir):
├── apps
│   ├── __init__.py
│   └── services
│       ├── charts
│       │   ├── crud
│       │   │   ├── __init__.py
│       │   │   └── patch.py
│       │   ├── __init__.py
│       │   └── main.py
│       └── __init__.py
├── environments
│   ├── dev_env.json
│   ├── prod_env.json
│   └── test_env.json
├── __init__.py
├── models
│   ├── charts
│   │   ├── base.py
│   │   └── __init__.py
│   └── __init__.py
└── utils
    ├── api_queries
    │   ├── common
    │   │   ├── connections.py
    │   │   └── __init__.py
    │   └── __init__.py
    ├── calculations
    │   ├── common
    │   │   ├── __init__.py
    │   │   └── merged_user_management.py
    │   └── __init__.py
    ├── environment_handler.py
    ├── __init__.py
    ├── json_response_toolset.py
    └── security_toolset.py
I figured it out: I changed the way I access the files within the library, and it works perfectly both before and after the build to .pex format, presumably because pkgutil.get_data goes through the package's loader and so also works when the package is inside a zipped archive such as a .pex. I used:
import pkgutil
#json_file = importlib.resources.read_text("my_apps.environments", file_name)
json_file = pkgutil.get_data("my_apps.environments", file_name).decode("utf-8")

Install, with CPAN, Perl modules to specific directory when several appear in use [duplicate]

This question already has answers here: How can I install a CPAN module into a local directory? (5 answers)
Closed 4 years ago.
Running the following command returns several paths:
perl -e 'print join("\n",@INC,"")'
Every path has modules installed within it. I want to install modules, as root, into the following directory:
/usr/local/share/perl5
What commands would I run to find where cpan, as root, currently installs modules? How would I alter it if it is not the path shown above?
Here's how I have configured cpan to put all new modules in a specific directory:
o conf makepl_arg 'PREFIX=/usr/local/share/perl5 INSTALLMAN3DIR=/usr/local/share/perl5/man/man3'
o conf mbuild_arg '--install_base /usr/local/share/perl5'
o conf mbuild_install_arg '--install_base /usr/local/share/perl5'
o conf mbuildpl_arg '--install-base /usr/local/share/perl5'
[o conf commit]
The first line addresses modules that use ExtUtils::MakeMaker and the next three lines are for modules that use Module::Build.
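As for finding out where cpan currently installs things before changing anything, you can inspect Perl's install-related configuration and cpan's saved settings. Two standard commands (variable names and output vary by platform):
# Print Perl's install* config variables (installsitelib and friends)
perl -V:'install.*'
# Dump cpan's current configuration, including makepl_arg and the options above
cpan -J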
You can also do this quite easily via App::cpm.
$ cpm install -L my-random-folder Open::This
DONE install Path-Tiny-0.108 (using prebuilt)
DONE install Try-Tiny-0.30 (using prebuilt)
DONE install Module-Build-0.4224 (using prebuilt)
DONE install Module-Runtime-0.016 (using prebuilt)
DONE install Open-This-0.000008 (using prebuilt)
5 distributions installed.
$ tree my-random-folder
my-random-folder
├── bin
│   ├── config_data
│   └── ot
└── lib
    └── perl5
        ├── 5.26.1
        │   └── darwin-2level
        ├── Module
        │   ├── Build
        │   │   ├── API.pod
        │   │   ├── Authoring.pod
        │   │   ├── Base.pm
        │   │   ├── Bundling.pod
        │   │   ├── Compat.pm
        │   │   ├── Config.pm
        │   │   ├── ConfigData.pm
        │   │   ├── Cookbook.pm
        │   │   ├── Dumper.pm
        │   │   ├── Notes.pm
        │   │   ├── PPMMaker.pm
        │   │   ├── Platform
        │   │   │   ├── Default.pm
        │   │   │   ├── MacOS.pm
        │   │   │   ├── Unix.pm
        │   │   │   ├── VMS.pm
        │   │   │   ├── VOS.pm
        │   │   │   ├── Windows.pm
        │   │   │   ├── aix.pm
        │   │   │   ├── cygwin.pm
        │   │   │   ├── darwin.pm
        │   │   │   └── os2.pm
        │   │   └── PodParser.pm
        │   ├── Build.pm
        │   └── Runtime.pm
        ├── Open
        │   └── This.pm
        ├── Path
        │   └── Tiny.pm
        ├── Try
        │   └── Tiny.pm
        └── darwin-2level
            └── auto
                ├── Module
                │   ├── Build
                │   └── Runtime
                ├── Open
                │   └── This
                ├── Path
                │   └── Tiny
                └── Try
                    └── Tiny

If I have a local rpm in my ansible-playbook, can I do yum install in one step?

I have downloaded an rpm into my ansible-playbook:
(djangoenv)~/P/c/apache-installer ❯❯❯ tree .
.
├── defaults
│   └── main.yml
├── files
│   ├── apache2latest.tar
│   ├── httpd_final.conf
│   ├── httpd_temp.conf
│   └── sshpass-1.05-9.1.i686.rpm
├── handlers
│   └── main.yml
├── hosts
├── meta
│   └── main.yml
├── README.md
├── tasks
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
My question is: why can't I just install it using
- yum: name=files/sshpass-1.05-9.1.i686.rpm
? It complains that files/sshpass-1.05-9.1.i686.rpm is not found on the system. For now I am doing it in two steps:
- copy: src=files/sshpass-1.05-9.1.i686.rpm dest=/tmp/sshpass-1.05-9.1.i686.rpm force=no
- yum: name=/tmp/sshpass-1.05-9.1.i686.rpm state=present
No, there is no simple way around copying the package to the remote host before installing it. Ansible's yum module expects the file to exist on the remote host when you give a file path in the name parameter.
IMHO it is not a good idea to keep packages inside the Ansible code base: they are binary files and not exactly part of the actual Ansible code. It would be cleaner to set up a private repository and store those files there. That is the only way around copying the package in this situation that I'm aware of.
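For completeness: the yum module also accepts a URL in the name parameter, so once the package is served from a private repository or any HTTP server, the install becomes one step again. A sketch with a made-up placeholder URL:
- yum: name=http://repo.example.com/sshpass-1.05-9.1.i686.rpm state=present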

Shell-script removing subdirectories in a for-loop

The small script below is intended to loop over the subdirectories of the supplied path, perform some operations on the files in those subdirectories, and then delete everything under those subdirectories, leaving them intact but empty.
#!/bin/sh
basedir=$1
for subdir in $(find "$basedir" -mindepth 1 -maxdepth 1 -type d); do
    dir=`basename $subdir`
    # Some other code here
    # ...
    echo "Removing subdirectories $subdir/*"
    rm -rfv "$dir/*"
done
For example, given the directory tree below:
.
├── test1
│   ├── db_1422323507_1421673171_272
│   │   └── rawdata
│   ├── db_1423828548_1423645476_289
│   │   └── rawdata
│   ├── db_1423837057_1423828554_290
│   │   └── rawdata
│   ├── db_1423838029_1423837138_291
│   │   └── rawdata
│   └── db_1424102912_1423838103_292
│       └── rawdata
├── test2
├── test3
│   ├── db_1430478916_1429109291_82
│   │   └── rawdata
│   ├── db_1430517825_1430478932_83
│   │   └── rawdata
│   ├── db_1430518751_1430518207_84
│   │   └── rawdata
│   └── db_1430920306_1430913191_86
│       └── rawdata
├── test4
│   └── db_1436338354_1430920324_100
│       └── rawdata
└── test5
After running the script as ./myscript.sh ., I would expect to see the following directory tree:
.
├── test1
├── test2
├── test3
├── test4
└── test5
But the script doesn't appear to delete anything; the other code works as expected on the files/folders underneath the subdirectories.
You shouldn't iterate over the output of find with a for loop here (it breaks on paths containing whitespace), but find is not necessary, either. The trailing / in the glob pattern below will prevent subdir from being set to non-directory filesystem entries. Also, don't quote the * in the argument to rm: a quoted * is passed to rm as a literal character instead of being expanded by the shell.
#!/bin/sh
basedir=$1
for subdir in "$basedir"/*/; do
    dir=`basename $subdir`
    # Some other code here
    # ...
    echo "Removing subdirectories $subdir/*"
    rm -rfv "$dir/"*
done
How about:
rm -rfv "$subdir"/*
instead of referencing $dir? I expect this will make a difference if $basedir is a path with directories in it, e.g. something like a/b/c.
Also, try omitting the -f option from the rm command; it might then print some error messages which will help you debug the problem.
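Putting the two answers together, a minimal corrected sketch of the loop, keeping the structure of the original script:
#!/bin/sh
basedir=$1
for subdir in "$basedir"/*/; do
    # Some other code here
    # ...
    echo "Removing everything under $subdir"
    # $subdir already ends in /; the * stays outside the quotes so the shell expands it
    rm -rfv "$subdir"*
done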