PowerShell, Invoke-WebRequest and the downloaded content

I often make the same mistake over and over: in PowerShell, I run
wget http://example.com
instead of
wget http://example.com -OutFile somename
and when the command (wget, an alias for Invoke-WebRequest) finishes executing, the downloaded content is stored... apparently, nowhere.
Q: Is there a way to store the downloaded content post-factum?

No. If you don't specify -OutFile, the content is only returned to the pipeline to be used in the next statement.
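As far as I know there is no automatic variable that keeps the last pipeline output, but next time you can capture the response object yourself; a minimal sketch (the file name is illustrative):
$response = Invoke-WebRequest http://example.com
# Save the text content afterwards
$response.Content | Set-Content somename
# Or write straight to disk and still get the response object back in one call
$response = Invoke-WebRequest http://example.com -OutFile somename -PassThru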

Related

Invoke-WebRequest equivalent of wget -N (timestamping)

wget -N (or, more verbosely, wget --timestamping) has the nice effect that files that have already been downloaded are not downloaded again.
That way you can save time and resources. I'm looking for the equivalent in PowerShell's Invoke-WebRequest.
Is there a way to respect the file's and the server's time stamp in Invoke-WebRequest?
Based on what I can find in the documentation, no, it doesn't appear that Invoke-WebRequest has a similar option.
The best I can suggest is to check yourself in a script, using conditionals and saving the new file under a different file name; since you're using Invoke-WebRequest to download a file, I can only assume you're also using the -OutFile option:
# Creation time of the file from the previous download (placeholder paths kept as-is)
$File1Creation = (Get-ChildItem <PathToFile1> -Force).CreationTime
# Download the new copy to a second path and read its creation time
Invoke-WebRequest -Uri https://website.com -OutFile <PathToFile2>
$File2Creation = (Get-ChildItem <PathToFile2> -Force).CreationTime
if ($File1Creation -eq $File2Creation)
{
    # do something here
} else {
    # do something else here
}
The biggest problem is that, because Invoke-WebRequest doesn't have such an option, there is no way to check the timestamp before actually downloading the file unless it is embedded somewhere on the originating webpage.
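One rough workaround, if the server sends a Last-Modified header, is to ask for it with a HEAD request and compare it against the local file's timestamp before downloading; a sketch with placeholder URL and path (not every server answers HEAD or sends Last-Modified):
$uri  = 'https://website.com/file.zip'   # placeholder
$path = 'C:\temp\file.zip'               # placeholder
$head = Invoke-WebRequest -Uri $uri -Method Head
$remote = [datetime]::Parse("$($head.Headers['Last-Modified'])")
if (-not (Test-Path $path) -or (Get-Item $path).LastWriteTime -lt $remote) {
    Invoke-WebRequest -Uri $uri -OutFile $path
    # Mirror the server timestamp so the next run has something to compare against
    (Get-Item $path).LastWriteTime = $remote
}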

Powershell download file onedrive/googledrive

I want to download a file to a PC from OneDrive/Google Drive.
After some digging into this subject I found that Invoke-WebRequest was the best command to use.
# Download the file
$zipFile = "https://xxxxxxmy.sharepoint.com/:u:/g/personal/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxfRW5c"
Invoke-WebRequest -Uri $zipFile -OutFile "c:\temp\xxxxxx.exe"
only to find out that the code was working but only downloaded a .exe file of 156 kB.
The file I wanted to download is 22 MB. I get no errors in PowerShell; do you have any idea what is going on?
Zip files work, but then I need to extract the zip file in the code and I don't know the working code for that (Expand-Archive didn't work).
There is no login context for the session spawned by your script. If you open OneDrive in your browser, once authentication is established and a session exists, the browser is given access to the file.
If you open your 156 kB file in Notepad, you should find it's just a webpage saying the URL is not available.
I believe this will help the situation, but it's more complex:
https://techcommunity.microsoft.com/t5/onedrive-for-business/how-to-download-root-level-files-from-onedrive-using-powershell/m-p/758689
Thank you for your reply and sorry for my late response.
It turns out that the link I was using didn't give direct access to the file.
When you add download=1 to the OneDrive/Google Docs link, it will skip the "virus scan" page;
&download=1 needs to be added.
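A minimal sketch of that tip; the sharing link and output path below are placeholders:
$shareLink = 'https://example-my.sharepoint.com/:u:/g/personal/...'   # placeholder sharing link
# Append download=1 so the file itself is returned instead of the preview page
$directLink = if ($shareLink -match '\?') { "$shareLink&download=1" } else { "$shareLink?download=1" }
Invoke-WebRequest -Uri $directLink -OutFile 'C:\temp\file.exe'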

Dockerfile RUN powershell wget and see progress

In my Dockerfile I want to use the following sequence of commands to download and extract a large zip file:
RUN powershell -Command \
wget http://my_server/big_huge.zip \
-OutFile C:\big_huge.zip ; \
Expand-Archive -Path C:\big_huge.zip \
-DestinationPath C:\big_huge ; \
Remove-Item C:\big_huge.zip -Force
I don't want to use ADD to download the file; the zip file isn't going to change and I want this step to be cached.
What I have above seems to work, but I do not get any indication of the progress of the download like I normally would. That's a bummer because this is a large download. The progress of the download is obscured, I suppose, because Invoke-WebRequest (which wget is an alias for) is a cmdlet. Is there any way to pipe the output of a cmdlet to stdout so I can see it when I am running docker build?
I gave up on trying to do the download from the Dockerfile and instead wrote a separate script that pre-downloads the files I need and expands their archives if the files aren't already present. This script then calls docker build, docker run, etc. In the Dockerfile I am copying the directory where I expanded the archives.
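That pre-download approach might look roughly like this (the script name, URL handling around the existing address, and image tag are illustrative); the Dockerfile then just copies the already-expanded directory with COPY:
# pre-build.ps1: download and expand the archive only if it isn't already present
$zipUrl  = 'http://my_server/big_huge.zip'
$zipFile = '.\big_huge.zip'
$dir     = '.\big_huge'
if (-not (Test-Path $dir)) {
    Invoke-WebRequest -Uri $zipUrl -OutFile $zipFile
    Expand-Archive -Path $zipFile -DestinationPath $dir
    Remove-Item $zipFile -Force
}
# Then build the image; the Dockerfile contains something like: COPY big_huge C:/big_huge
docker build -t my_image .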
I don't know Docker, but maybe you can pipe the output through the PowerShell cmdlet Out-Host. Type help Out-Host for more information.

Ctools do not show up in pentaho UI

I am using Pentaho CE 5 on Windows. I would like to use CTools, but I can't make them show up in the File -> New menu.
Since I am behind a proxy, I cannot use the Marketplace plugin, so I have tried a manual installation.
First, I tried to use ctools-installer.sh. I ran the following command line in Cygwin (wget and unzip are installed):
./ctools-installer.sh -s /cygdrive/d/Users/[user]/Mes\ Programmes/pentaho/biserver-ce/pentaho-solutions/ -w /cygdrive/d/Users/[user]/Mes\ programmes/pentaho/biserver-ce/tomcat/webapps/pentaho/
The script starts, asks me what module I want to install, and begins the downloads.
For each module, I get output like the following (set -x added to the script):
echo -n 'Downloading CDF...'
Downloading CDF...
+ wget -q --no-check-certificate 'http://ci.analytical-labs.com/job/Webdetails-CDF-5-Release/lastSuccessfulBuild/artifact/bi-platform-v2-plugin/dist/zip/dist.zip' -O .tmp/cdf/dist.zip
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
'[' '!' -z '' ']'
rm -f .tmp/dist/marketplace.xml
unzip -o .tmp/cdf/dist.zip -d .tmp
End-of-central-directory signature not found. Either this file is not a zipfile, or it
constitutes one disk of a multi-part archive. In the latter case
the central directory and zipfile comment will be found on the last
disk(s) of this archive.
unzip: cannot find zipfile directory in .tmp/cdf/dist.zip,
and cannot find .tmp/cdf/dist.zip.zip, period.
chmod -R u+rwx .tmp
echo Done
Done
Then the script ends. I have seen on this page (pentaho-bi-suite) that this is the normal output. Nevertheless, it seems a bit strange to me, and when I start my Pentaho server (login: admin/password), I cannot see any new tools in the menus.
After looking at a few other tutorials and at the script itself, I downloaded the .zip snapshots for every tool and unzipped them into the system directory of my Pentaho server. Same result.
I would like to make the .sh script work; what can I try or adjust?
Thanks
EDIT 05/06/2014
I checked the dist.zip files downloaded by the script and they are all empty. It seems that wget cannot fetch the zip files, and therefore the installation fails.
When I try to get any webpage through wget, it fails. I think it is because of the proxy.
Here is my .wgetrc file, located in my user's cygwin home folder:
use_proxy=on
http_proxy=http://[url]:[port]
https_proxy=http://[url]:[port]
proxy_user=[user]
proxy_password=[password]
How could I make this work?
EDIT 10/06/2014
In the end, I changed my network connection settings to bypass the proxy. It seems that there is an offline mode for the installer, so you can download all the needed files in a proxy-free environment and then run the script offline.
I guess this is related to the -r option.
I consider this post solved, since it is not a CTools issue anymore.
It is difficult to identify the issue in the above procedure, but you can refer to this blog; he is a key member of Pentaho itself.
You can manually install the components from http://www.webdetails.pt/ctools/, or, if you have Pentaho 5.1 or above, you can add the following parameters to the CATALINA_OPTS option (in start-pentaho.bat or start-pentaho.sh):
-Dhttp.proxyHost= -Dhttp.proxyPort= -Dhttp.nonProxyHosts="localhost|127.0.0.1|10.*.*.*"
http://docs.treasuredata.com/articles/pentaho-dataintegration#tips-how-can-i-use-pentaho-through-a-proxy

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
However, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command, this will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output, so the name that appears after "Saving to:" is the one I need.
However, this is wrapped in the strange character â; rather than just removing it hardcoded, where does this come from?
If you try the --adjust-extension parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is now simple to change that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have the latest version of wget; it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop to get the main entry file (index.html) from each dir
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, as with a script on the server side (it may be client-side if you look at the JavaScript).
Have you tried using --no-cookies? The site could be storing this information via a cookie and pulling it in when you hit the page. This could also be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option; see http://www.electrictoolbox.com/wget-save-different-filename/