In the AEM content hierarchy, I have a folder containing 4000 pages. Of these, 3700 pages follow the naming convention xyz-1, xyz-2, xyz-3, and so on up to xyz-3700. I have a requirement to delete the pages starting with xyz, but not the other 300 pages, which have different names. I have tried the command below with *, but it doesn't work. Can anybody help me get this resolved?
curl -F":operation=delete" -F":applyTo=/content/foo/bar/xyz*" http://localhost:4502 -u admin:admin
curl needs a full path to execute the statement. You can pass individual paths as below, but that will not solve your issue since you have a lot of pages:
curl -u admin:admin -F":operation=delete" -F":applyTo=/content/aem-flash/en/xyz-1" -F":applyTo=/content/aem-flash/en/xyz-2" http://localhost:4502
You have to write a script to delete all of them. There are multiple options: you can write standalone code and deploy it as a bundle, or connect from Eclipse.
If you do not want to deploy a bundle, you can use a Groovy script to execute your code.
The Groovy script below should work for your requirement if all the pages are under the same parent node. If you would like to query the whole site, update the query accordingly.
import javax.jcr.Node

// Build a JCR-SQL query for all cq:Page nodes named xyz-* under the given page
def buildQuery(page) {
    def queryManager = session.workspace.queryManager
    def statement = 'select * from cq:Page where jcr:path like \'' + page.path + '/xyz-%\''
    queryManager.createQuery(statement, 'sql')
}

final def page = getPage('/content/aem-flash/en/')
final def query = buildQuery(page)
final def result = query.execute()

result.nodes.each { node ->
    println 'node path is: ' + node.path
    node.remove()
    session.save()   // persist each removal; move outside the loop to batch
}
Solution with curl
I found another way to do it with the curl command alone, but it sends multiple requests to the server, which might not be ideal.
The command below makes 4000 requests, even if a page is not present in the environment; it effectively behaves like a loop.
The bracket range is curl's URL globbing, not a regex; it did not work for me in a plain Windows command prompt, but it works fine in a Linux environment. If you want to execute this on Windows, you can use the Git Bash console or another such tool. I tried it in Git Bash and it worked fine.
curl -X DELETE "http://localhost:4502/content/aem-flash/en/xyz-[0-4000]" -u admin:admin
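If you prefer an explicit loop over curl's URL globbing (for example, to see each response code), here is a rough bash sketch along the same lines; the range, path, and credentials are taken from the question:
#!/bin/bash
# Sketch only: issue one DELETE per page name; pages that do not exist
# just return an error status, which we print and move past.
for i in $(seq 1 3700); do
    status=$(curl -s -o /dev/null -w "%{http_code}" -X DELETE \
        -u admin:admin "http://localhost:4502/content/aem-flash/en/xyz-$i")
    echo "xyz-$i -> HTTP $status"
done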
Related
So I tried to put a docker-compose.yml file in the .github/workflows directory; of course it tried to pick that up and run it... which didn't work. However, now this always shows up as a workflow. Is there any way to delete it?
Yes, you can delete the results of a run. See the documentation for details.
To delete a particular workflow from your Actions page, you need to delete all runs belonging to that workflow. Otherwise, it persists even after you have deleted the YAML file that triggered it.
If you have just a couple of runs in a particular workflow, it's easier to delete them manually. But if you have a hundred runs, it might be worth running a simple script. For example, the following Python script uses the GitHub API:
Before you start, you need to install the PyGithub package (pip install PyGithub) and define three things:
PAT: create a new personal access GitHub token;
your repo name;
your action name (even if you have already deleted it, just hover over the action on the Actions page):
from github import Github
import requests

token = "ghp_1234567890abcdefghij1234567890123456"  # your PAT
repo = "octocat/my_repo"
action = "my_action.yml"

g = Github(token)
headers = {'Accept': 'application/vnd.github.v3',
           'Authorization': f'token {token}'}

# Iterate over every run of the workflow and delete it via the REST API
for run in g.get_repo(repo).get_workflow(id_or_name=action).get_runs():
    response = requests.delete(url=run.url, headers=headers)
    if response.status_code == 204:
        print(f"Run {run.id} got deleted")
After all the runs are deleted, the workflow automatically disappears from the page.
Yes, you can delete all the workflow runs of the workflow you want to remove; the workflow will then disappear.
https://docs.github.com/en/rest/reference/actions#delete-a-workflow-run
To delete programmatically
Example (from the docs)
curl \
  -X DELETE \
  -H "Authorization: token <PERSONAL_ACCESS_TOKEN>" \
  -H "Accept: application/vnd.github.v3+json" \
  https://api.github.com/repos/octocat/hello-world/actions/runs/42
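To delete every run of a workflow from the shell in one go, here is a hedged bash sketch using the same endpoints; the repo, workflow file name, and token are placeholders, jq is assumed to be installed, and only the first page of up to 100 runs is handled (repeat or paginate for more):
#!/bin/bash
# Sketch: list all run IDs for one workflow, then delete them one by one.
TOKEN="<PERSONAL_ACCESS_TOKEN>"
REPO="octocat/my_repo"
WORKFLOW="my_action.yml"

curl -s -H "Authorization: token $TOKEN" \
    "https://api.github.com/repos/$REPO/actions/workflows/$WORKFLOW/runs?per_page=100" \
  | jq -r '.workflow_runs[].id' \
  | while read -r run_id; do
      curl -s -X DELETE -H "Authorization: token $TOKEN" \
          "https://api.github.com/repos/$REPO/actions/runs/$run_id"
      echo "Deleted run $run_id"
    done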
I'm trying to retrieve working webpages with wget, and this works well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command, this will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output; the name that appears after Saving to: is the one I need.
However, this is wrapped by this strange character â. Rather than just removing that hardcoded, where does it come from?
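A minimal sketch of that parsing approach, under the assumption that the â sequences are wget's Unicode quote characters (‘…’) being decoded as Latin-1; the URL is the question's placeholder:
# Sketch: capture the first "Saving to:" filename from wget's output.
# wget prints progress to stderr and wraps the name in Unicode quotes,
# so redirect stderr and strip the quote characters explicitly.
file=$(wget -p -k "http://www.example.com" 2>&1 \
  | sed -n "s/.*Saving to: ‘\(.*\)’.*/\1/p" | head -n 1)
echo "$file"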
If you try the --adjust-extension parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html or index.html?p=566.html on *NiX systems. It is simple now to change that file to index.html, even with a script (see the sketch below).
If you are on a Microsoft OS, make sure you have the latest version of wget; it is also available here: https://eternallybored.org/misc/wget/
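For that rename step, a one-line sketch; the saved file name is an assumption based on the example above:
# Sketch: rename the adjusted file back to a plain index.html
mv "www.onlinetechvision.com/index.html?p=566.html" "www.onlinetechvision.com/index.html"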
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
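As an illustration with hypothetical URLs, the two requests below are distinct resources, and wget keeps them apart on disk:
# Each query string yields its own local file:
wget "http://www.example.com/index.html?page=52"   # saved as index.html?page=52
wget "http://www.example.com/index.html?page=53"   # saved as index.html?page=53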
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop to get the main entry file (index.html) from each dir
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links \
        --execute=robots=off --domains=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull the content in from another page, like a script on the server side (it may be client side if you look in the JavaScript).
Have you tried using --no-cookies? The site could be storing this information via a cookie and pulling it in when you hit the page (see the example below). This could also be caused by URL-rewrite logic, over which you will have little control from the client side.
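For reference, that suggestion applied to the URL from the question would look like this (a sketch only):
# Fetch the page without sending or storing cookies
wget --no-cookies -p -k "http://www.onlinetechvision.com/?p=566"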
Use the -O or --output-document option; see http://www.electrictoolbox.com/wget-save-different-filename/
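Applied to the URL from the question, that would be something like the following sketch; note that -O sends everything wget fetches to the one named file:
# Save the response under a fixed name instead of the query-string name
wget -O index.html "http://www.onlinetechvision.com/?p=566"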
I've got a Perl script that uploads documents into Alfresco using curl.
Some of the documents have an ampersand in the file name, and initially this caused curl to fail. I fixed it by placing a caret symbol in front of the ampersand. But now I'm finding that some documents fail to upload when they don't have a space on either side of the ampersand, while other documents with spaces in the file name and an ampersand do load successfully.
The snippet of Perl code that is running is:
# Escape & in the file name with a ^ for the shell
my $downloadFileNameEsc = ${downloadfile};
$downloadFileNameEsc =~ s/&/^&/g;
$command = "curl -u admin:admin -F file=\@${downloadFileNameEsc} -F id=\"${docId}\" -F title=\"${docTitle}\" -F tags=\"$catTagStr\" -F abstract=\"${abstract}\" -F published=\"${publishedDate}\" -F pubId=\"${pubId}\" -F pubName=\"${pubName}\" -F modified=\"${modifiedDate}\" -F archived=\"${archived}\" -F expiry=\"${expiryDate}\" -F groupIds=\"${groupIdStr}\" -F groupNames=\"${groupNameStr}\" ${docLoadUrl}";
logmsg(4, $command);
my @cmdOutput = `$command`;
$exitStatus = $?;
my $upload = 0;
logmsg(4, "Alfresco upload status $exitStatus");
if ($exitStatus != 0) {
You can see that I am using backticks to execute the curl command so that I can read the response. The Perl script is being run under Windows.
What this effectively tries to run is:
curl -u admin:admin -F file=@tmp-download/Multiple%20Trusts%20Gift%20^&%20Loan.pdf -F id="e2ef104d-b4be-4896-8360-7d6f2e7c7b72" ....
This works.
curl -u admin:admin -F file=@tmp-download/Quarterly_Buys^&sells_Q1_2006.doc -F id="78d18634-ee93-4c29-b01d-270aeee3219a" ....
This fails!!
The only difference I can see is that in the one that works, the file name has spaces (%20) somewhere around the ampersand, not necessarily next to it.
I can't see why one runs successfully and the other doesn't. It must be something to do with backticks and ampersands in the file name. I haven't tried using system because I wanted to capture the response.
Any thoughts? I've exhausted all options.
You should learn to use Perl modules. Perl has some great modules to handle web requests. If you depend upon operating system commands, you will end up with dependencies not only on those commands but also on shell interactions and on whether you need to quote special characters.
Perl modules remove a lot of the issues you can run into. You are no longer dependent upon particular commands or even particular implementations of those commands. (The curl command can vary from system to system, and may not even be present on the system you're on.) Plus, most of these modules handle the piddling details for you (such as URI-escaping strings).
LWP is the standard Perl library for making these requests. Take a look at the LWP Cookbook, a tutorial on the whole HTTP process. Basically, you create an agent, which is really just a virtual web browser for you to use. Then you can configure it (for example, setting the machine, browser type, etc.) as you might need.
What is really nice is HTTP::Request::Common, which provides a simple interface for working with HTTP forms:
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;
my $response = $ua->request(POST $docLoadUrl,
    Content_Type => 'form-data',            # multipart upload, like curl -F
    Content      => [
        file      => [ $downloadFileName ], # an arrayref means "upload this file"
        id        => $docId,
        title     => $docTitle,
        tag       => $catTagStr,
        abstract  => $abstract,
        published => $publishedDate,
        pubId     => $pubId,
        pubName   => $pubName,
        ...
    ],
);
This is a lot easier to read and maintain. Plus, it will handle URI encoding for you.
I need to load a shell script from a raw gist, but I can't find a way to get the raw URL.
curl -L address-to-raw-gist.sh | bash
And yet there is: look for the Raw button (at the top right of the source code).
The raw URL should look like this:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{commit_hash}/{file}
Note: it is possible to get the latest version by omitting the {commit_hash} part, as shown below:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{file}
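With that pattern, the question's one-liner becomes something like the following sketch; the placeholders must be replaced with your gist's actual values:
# Pipe the latest raw version of the gist straight into bash
curl -sL "https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{file}" | bash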
February 2014: the raw URL just changed.
See "Gist raw file URI change":
The raw host for all Gist files is changing immediately.
This change was made to further isolate user content from trusted GitHub applications.
The new host is
https://gist.githubusercontent.com.
Existing URIs will redirect to the new host.
Before it was https://gist.github.com/<username>/<gist-id>/raw/...
Now it is https://gist.githubusercontent.com/<username>/<gist-id>/raw/...
For instance:
https://gist.githubusercontent.com/VonC/9184693/raw/30d74d258442c7c65512eafab474568dd706c430/testNewGist
KrisWebDev adds in the comments:
If you want the last version of a Gist document, just remove the <commit>/ from URL
https://gist.githubusercontent.com/VonC/9184693/raw/testNewGist
One can simply use the GitHub API:
https://api.github.com/gists/$GIST_ID
Reference: https://miguelpiedrafita.com/github-gists
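For example, to pull the raw URL of every file out of the API response (a sketch assuming jq is installed; $GIST_ID is a placeholder):
# List the raw_url of each file in the gist
curl -s "https://api.github.com/gists/$GIST_ID" | jq -r '.files[].raw_url'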
GitLab snippets provide short, concise URLs, are easy to create, and go well with the command line.
Sample example: enable bash completion by patching /etc/bash.bashrc
sudo su -
(curl -s https://gitlab.com/snippets/21846/raw && echo) | patch -s /etc/bash.bashrc
I am using the following command to get a brief history of the CVS repository:
cvs -d :pserver:*User*:*Password*@*Repo* rlog -N -d "*StartDate* < *EndDate*" *Module*
This works just fine except for one small problem: it lists all tags created on each file in that repository. I want the tag info, but only the tags created in the specified date range. How do I change the command to do that?
I don't see a way to do that natively with the rlog command. Faced with this problem, I would write a Perl script to parse the output of the command, correlate the tags to the date range I want, and print them.
Another solution would be to parse the ,v files directly, but I haven't found any robust libraries for doing that. I prefer Perl for that type of task, but the available parsing modules don't seem to be very high quality.
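A rough shape of that post-processing idea, sketched here in shell and awk rather than Perl. This is only a sketch: it assumes the classic "date: YYYY/MM/DD ..." rlog output format (which varies between CVS versions), the connection string and dates are placeholders, and -N is dropped because that flag suppresses the symbolic-names list the script needs:
#!/bin/bash
# Sketch: print only the tags whose tagged revision was committed in range.
START="2012/01/01"; END="2012/12/31"   # rlog-style, zero-padded dates

cvs -d :pserver:user:password@host:/repo rlog module 2>/dev/null |
awk -v start="$START" -v end="$END" '
  /^RCS file:/       { file = $3 }                    # a new ,v file begins
  /^symbolic names:/ { intags = 1; next }             # tag list follows
  intags && /^\t/    { tag = $1; sub(/:$/, "", tag)   # "TAG: rev" lines
                       tagrev[file SUBSEP tag] = $2; next }
  intags             { intags = 0 }                   # tag list ended
  /^revision /       { rev = $2 }
  /^date: /          { revdate[file SUBSEP rev] = substr($2, 1, 10) }
  END {
    # Match each tag to the date of the revision it points at
    for (k in tagrev) {
      split(k, a, SUBSEP)
      d = revdate[a[1] SUBSEP tagrev[k]]
      if (d >= start && d <= end)
        printf "%s\t%s\t%s\t%s\n", a[1], a[2], tagrev[k], d
    }
  }'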