I want to mirror a site using wget:
wget --mirror \
--convert-links \
--adjust-extension \
--page-requisites \
--no-parent \
--wait=2 \
--progress=bar \
--show-progress \
--output-file="$LOG_FILE" \
--directory-prefix="$DIR_PATH" \
"$URL"
Now, it has been working well, but I have come across a website where the main page I want to start from is under https://www.website.org/unique_path/here.html, yet it contains references to files and links like https://www2.website.org/unique_path/there.pdf. However, --no-parent prevents the download of the content under the www2... URL. Is there a way to circumvent this? (Or some option that works like --no-parent but lets me specify, via a wildcard expression, which hosts are OK to download from?)
You are apparently looking for the Spanning Hosts options: you must pass the -H option, and then you can supply a comma-separated list of acceptable domains via -D. Using your example:
wget <your current options here> -H -D www.website.org,www2.website.org <your URL here>
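Note that -D does suffix matching on host names, so (if I'm reading the manual right) a single entry covering both hosts should also work, with the caveat that it would accept any other subdomain of website.org as well:
wget <your current options here> -H -D website.org <your URL here>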
I have a microservice ecosystem, and all users interacting with it need to authenticate against a Keycloak installation and receive a JWT token.
All is fine; I enabled audience support using this snippet:
/opt/jboss/keycloak/bin/kcadm.sh \
create clients/d3170ee6-7778-413b-8f41-31479bdb2166/protocol-mappers/models -r your-realm \
-s name=audience-mapping \
-s protocol=openid-connect \
-s protocolMapper=oidc-audience-mapper \
-s config.\"included.client.audience\"="your-audience" \
-s config.\"access.token.claim\"="true" \
-s config.\"id.token.claim\"="false"
as described here: Add protocol-mapper to keycloak using kcadm.sh
Which is fine, it works. My problem is: how do I enable multiple values for the audience? I would like to allow the same user to use two different services with the same token, where each service expects a different audience.
And the token should look like:
{
"aud": [
"audience-1",
"audience-2"
]
}
Where audience-1 is the audience expected by the first service and audience-2 is the one expected by the second.
Is it even possible to do that via command line?
I think I may have found the answer. Or at least it worked for me:
kcadm.sh create clients/CLIENT_ID/protocol-mappers/models -r REALM_NAME \
-s name=audience-mapping \
-s protocol=openid-connect \
-s protocolMapper=oidc-audience-mapper \
-s config.\"included.client.audience\"="audience" \
-s config.\"access.token.claim\"=\"true\" \
-s config.\"id.token.claim\"=\"false\"
I'm trying to crawl a website that requires login with wget, but it stops every time it finds a logout URL (https://example.com/logout/).
I've tried excluding the directories, but without success.
This is my command:
wget --content-disposition --header "Cookie: session_cookies" -k -m -r -E -p --level=inf --retry-connrefused -D site.com -X */logout/*,*/settings/* -o log.txt https://example.com/
I've tried the -R option instead of -X, but that didn't work either.
This can be solved with the --reject-regex option, like this: --reject-regex logout (see: wget-dev Tips).
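Applied to the command from the question, that would look something like this (the pattern is matched against the full URL, and logout here stands in for whatever path fragments you want to skip):
wget --content-disposition --header "Cookie: session_cookies" -k -m -r -E -p \
--level=inf --retry-connrefused -D site.com \
--reject-regex "logout" -o log.txt https://example.com/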
I have a custom provider created and deployed.
Now I go to User Federation, select my provider from the drop-down, and add it using the UI, which works fine.
Can someone please let me know how to do the same using the CLI, as I want to automate the manual process?
This worked for me:
kcadm.bat create user-federation/instances -r Test1 \
-s providerName=tatts-asg-authentication \
-s priority=0 \
-s config.debug=false
This is what works for Keycloak 3.4.3:
kcadm.bat create components -x -r MyRealm \
-s providerType=org.keycloak.storage.UserStorageProvider \
-s name=my-provider \
-s parentId=MyRealm \
-s providerId=my-provider \
-s 'config.path=["C:\\path\\to\\properties"]' \
-s 'config.priority=["0"]'
user-federation/instances has been replaced with components; see issues.jboss.org/browse/KEYCLOAK-6583.
The -x option outputs the stack trace on error.
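To verify the provider was registered, you can list the realm's components afterwards and look for my-provider in the output (as far as I know there is no dedicated filter flag for the name, so this is manual inspection):
kcadm.bat get components -r MyRealm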
I'm running wget --recursive --no-parent --adjust-extension --convert-links --page-requisites --restrict-file-names=windows --keep-session-cookies --load-cookies cookies.txt http://DOMAIN/private/ and it correctly downloads the private/index.html file.
I inspected this file and it is the correct page shown only with successful authentication. It contains markup like:
<ul><li><a class="CP___PAGEID_56400" href="http://DOMAIN/private/page1.html">My private page</a></li>...
However, after fetching all the resources (images etc.) it seems to think it's finished and shuts down after 'converting links'.
If I skip --no-parent, it keeps going. So is the --no-parent flag somehow confusing wget about the subpages?
Finally realized that wget was obeying robots.txt! I changed my command to wget -e robots=off --wait 0.25 --recursive --no-parent ... and got it working. I added --wait 0.25 since I didn't want to clobber the server either.
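Put together with the options from the question, the full working command looks like this:
wget -e robots=off --wait 0.25 --recursive --no-parent --adjust-extension \
--convert-links --page-requisites --restrict-file-names=windows \
--keep-session-cookies --load-cookies cookies.txt http://DOMAIN/private/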
I'm trying to scrape a website recursively, but I want to exclude some webpages under that domain that contain the string "unnecessary pages". The string is not present in the URL. Here's the original command to build from:
wget -r --no-parent http://www.website.com
For example: I want to scrape Wikipedia, but exclude articles that contain the keyword "drugs".
Any ideas?
Thanks in advance!
One way to do this is with the following options. It will scrape a site beginning at any path you choose and will exclude the directories you specify in LIST:
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains somesite.tld \
--no-parent \
--exclude-directories=LIST \
www.somesite.tld/path/to/start
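For instance, a hypothetical invocation that skips two directories (the directory names are placeholders; note that --exclude-directories matches URL paths, not page contents):
wget --recursive --no-parent \
--exclude-directories=/ads,/drafts \
www.somesite.tld/path/to/start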