Say I have a website like this:
https://mywebsite.com/dir1/id-1
https://mywebsite.com/dir1/id-2
https://mywebsite.com/dir1/id-3
https://mywebsite.com/dir2/foo-id-1
https://mywebsite.com/dir2/foo-id-2
https://mywebsite.com/dir2/foo-id-3
https://mywebsite.com/dir3/list-1
https://mywebsite.com/dir3/list-2
https://mywebsite.com/dir3/list-...
https://mywebsite.com/dir3/list-n
https://mywebsite.com/dir4/another-list-type-1
https://mywebsite.com/dir4/another-list-type-2
https://mywebsite.com/dir4/another-list-type-...
https://mywebsite.com/dir4/another-list-type-n
https://mywebsite.com/random-other-directories-i-dont-care-about...
I would like to download all the /dir1/:id and /dir2/foo-:id pages, but I would also like to follow the links from all the pages in /dir1 through /dir4, since some of those directories are just lists of links to pages like /dir1/:id.
I'm wondering how to do this. Ideally it would prioritize downloading all the :id pages, rather than getting stuck working through the thousands or millions of list pages first.
This is not just a simple "mirror the site" job. When I've tried that, wget gets overly absorbed in links I don't care about. I want it to _maximize_ downloading /dir1/:id and /dir2/foo-:id, while still gathering whatever links it finds on the other pages it encounters. Basically, I need some way to prioritize.
So you want neither a breadth-first nor a depth-first approach, but rather one that uses some notion of priorities.
This is unfortunately not possible with Wget alone. However, with a little bash scripting, you can get quite close. There are two simple approaches I can think of:
(1) Give Wget the links to /dir1/ and /dir2/ first and let it download those recursively. Once it is done, invoke Wget again on mywebsite.com/ to download the rest of the site. It will waste a few seconds sending HEAD requests for the files you've already downloaded, but that's it. A sketch follows.
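A minimal sketch of approach (1), assuming /dir1/ and /dir2/ serve index pages that link to their :id pages (the exact flags here are my assumption; note that Wget's default recursion depth is 5, hence --level=inf):

# Pass 1: recursively fetch the pages you care about most
wget --recursive --no-parent --level=inf --timestamping https://mywebsite.com/dir1/ https://mywebsite.com/dir2/
# Pass 2: sweep the rest of the site; --timestamping skips files you already have
wget --recursive --level=inf --timestamping https://mywebsite.com/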
(2) This is similar to (1) above, except that you invoke Wget with an --accept-regex for each of the directories, causing them to be downloaded one after another. Again, a sketch follows.
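A sketch of approach (2), with one caveat: as far as I can tell, Wget will not follow links found on pages whose URLs the regex rejects, so each run's regex also has to accept the list directories that lead to your target pages. The regexes below are assumptions about your URL scheme:

# Run 1: keep /dir1/:id pages, plus the /dir3 and /dir4 list pages needed to reach them
wget --recursive --timestamping --accept-regex '/(dir1|dir3|dir4)/' https://mywebsite.com/
# Run 2: the same crawl again, this time keeping /dir2/foo-:id
wget --recursive --timestamping --accept-regex '(/dir2/foo-|/dir3/|/dir4/)' https://mywebsite.com/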
I’m building a super simple website with 5 pages and I want a CMS that allows me to change the text and the pictures in a couple of them.
In the past I used WordPress, but it has way too many features that I don't need in this case.
I've been trying to learn Gatsby.js, so I would like to build the site on that, but when I looked into how to source content from Netlify CMS, I ran into an overwhelming amount of information that I'm not sure I need.
Any tips?
Thanks!
M
Netlify has a built-in CMS, and it's compatible with Gatsby! You can find examples online. It should be good for smaller sites, but for larger projects I really like Prismic.io. Contentful is another popular one, but it's a bit pricier than Prismic.
Edit: I reread your comment about sourcing from Netlify. Netlify CMS is not a "source" plugin in Gatsby. You use a local-file + markdown source and do the configuration for Netlify CMS, which adds an admin interface at an endpoint. You configure your data models in that interface, create a login, etc. Then, when you submit changes, it modifies files in your connected Git repo, so the local-file + remark plugins make the data available in your GraphQL queries.
In the end I used Forestry.io, a good simple solution that did exactly what I needed in combination with Jekyll.
I want to run an A/B test or an experiment for a whole section of the site. For example, on my /blog/ pages, one variation would have a newsletter form and the other a free ebook download button.
The problem is that I have to use a full URL path for the experiments, for example /blog/2013/article/1?var=1 and /blog/2013/article/1?var=2. With this method I would need to create a new experiment for each blog post. That is not feasible.
Any tips on how to approach this?
It's possible, but the documentation is lacking.
When you choose your variation URLs, you need to use relative URLs instead of http://. This lets you use query parameters to define the variations, instead of the full URL. In your example, you would define your original page as:
http://www.example.com/blog/2013/article/1
and your variation URLs would be ?var=1, ?var=2, etc., using "relative" as the option in the dropdown (instead of http:// or https://).
Here's the not-so-clear documentation on using relative URLs for your variations:
https://support.google.com/analytics/answer/2664470?hl=en&ref_topic=1745208
One important thing to remember is that if you're doing it this way, you need to include the content experiment code on every "original" page.
There's also another way to get even more control over serving the variation pages and controlling the experiment, using the Content Experiments JavaScript API. This is a relatively new feature; you can see the developer documentation here:
https://developers.google.com/analytics/devguides/collection/gajs/experiments
I am not sure this is possible. You might look at a more robust yet simple-to-use tool like Visual Website Optimizer or Optimizely.
This is not a direct code question; however, I think it may be useful. After Googling for a while, I can't find a definitive answer...
A while back, I built a rudimentary CMS for school: image upload, gallery, text, a basic captcha, etc. Basically a blog that you could upload images to. My question is this:
Could any of you clever ducks tell me what features a robust, solid, home-made CMS should contain? I don't want to make a super fancy-pants sort of site, but I do want to flesh it out a little. My current job is in SharePoint design, and I don't want to lose any of the PHP skills that course taught me.
Any input would be greatly appreciated.
Thanks.
Well... the best product is one that meets the customer's requirements.
But I would say:
Dynamic menu
Dynamic pages
Different types of pages: front page, posts, lists, media, gallery
Secure back end
Dynamic user configuration
An install script
Template editor, where you can define modules
Maybe an offline post editor, with an uploader (drag a .doc file into a folder, and the file is automatically added as a post on the site)
As a side project I tutor grandparents and other computer novices in Computer & Internet 101, from physically using a mouse to dealing with e-mail, searching, etc. Web development isn't really my area of focus, but I do have reasonable HTML/CSS/JavaScript skills, so I can throw together a decent-looking, simple, static site. Occasionally, though, I get asked to put together extremely simple websites for these people that they can update themselves; that is, to edit text-based content without giving Grandpa a heart attack by making him come face-to-face with HTML or JavaScript.
I've waded through a mile-long list of CMS software, largely culled from the many other similar questions on SO, but they've all got something ruling them out: hosted, restricts the design (can't use my existing CSS, looks "WordPress-y", etc.), not free/FOSS, and so on. I wonder if "CMS" is even the right word for what I'm looking for. What I need is a simple text editor for the client: something that will give the client a text box of some variety, let them edit it, and update the content with that info. They can't mess with navigation, add new pages, or change anything other than text. If it was really fancy, they could upload a picture.
I was planning to do this with just a couple of password-protected PHP forms, but thought I'd ask if there's anything already out there that might provide this functionality. Any suggestions on building my own version of this, in PHP or something else?
What I'm really interested in is:
1) the simplicity/customizability of the admin interface (or lack of an admin interface, if the client could somehow edit directly in the page), and
2) ease of setup for me (I'm not getting paid much, if at all, for this, and I don't want to wade through three million plugin options to figure out how to get some unwieldy, high-learning-curve framework to do what I want).
Try pulsecms.
Here is another very simple CMS that uses jQuery, Modernizr, HTML5 Boilerplate, and TinyMCE.
I have my wife set up with Windows Live Writer:
http://explore.live.com/windows-live-writer?os=other
This means that she just writes her articles as if she were using a word processor (it's almost exactly the same) and then uploads them to her blog. I use BlogEngine.NET to host the blog on a GoDaddy hosting plan.
BlogEngine comes with built-in support for Live Writer and only requires that you enter the address, username, and password.
I understand this is an old post, but I hope someone finds this of interest.
You could instruct the users to upload text files to the site, and then have the HTML/PHP/ASP pages load the contents of those .txt files.
Each web page would have a specifically named .txt file associated with it.
I understand that no matter what I do, someone will be able to copy it. However, I can still make them work hard for it. What are some good ways of making data not easily copied, using PHP-compatible code?
---- Added ----
The data is a listing of results for certain local sports events. We send people out to collect the information, post it, make corrections, and so on. However, a competing website takes our results (I know they are directly copying them) and never updates them, which causes people to call our office and complain.
---- Answer for my Use ----
I picked one of them; however, I am going to use several of your answers. I am going to add my link using the copy-pasta trick, and I am going to put fake hidden text into it. I am also going to do the hidden-text trick with several fake variants of the div tag (making it even harder to scrape, or to copy into a text editor and clean up with a simple find-and-replace). And I am going to talk to a lawyer about legal recourse and what I can do to make it illegal for them to copy the data (such as adding creative bios or something cool like that). Thanks for your help.
Joe, you can't really make them work hard to get your data; it's essentially just a single request to any of your pages. Your best option is to explicitly state that you own the rights to all of your content, and that any infringement of that ownership will lead to legal ramifications*.
* Not a lawyer
Your data will be copied to every computer that requests the page and it will stay there until the person clears their cache. To answer your question, you can't.
What you can do is create a CSS style such as:
.copy-pasta { display: none; }
And then throughout your content, add something like this:
<p class="copy-pasta">Content provided via [your website here]</p>
This will increase your page rank when copy-pasters blatantly steal your content, helping you show up above them in search results.
Place some <div style="display: inline; position: absolute; overflow: hidden; width: 0px">useless words</div> in the text. It won't display for reading, but if someone copies and pastes... "WOW, where did that come from?! *CRY*"
How about putting links to your site in with the displayed data? No big fanfare; just suggest that for the most up-to-date figures, they can go to the real website that publishes them.
Most of what you try will only work for a time. Until you exceed their laziness factor. (What they're doing suggests a high laziness factor.)
Copyright generally doesn't protect raw, publicly available data, but you may be able to protect the packaging and presentation.
Programs used to copy out data look for it using pattern matching. You could 'decorate' your data with randomly chosen tags (one row would have a span tag surrounding it, the next row a div, etc.). Just a thought.
Clarification:
With screen-scraper, at least, the user of the program specifies what HTML comes before the data they want and what HTML comes after it. Varying the markup makes it more difficult for them to retrieve the data automatically.
Why are people calling your office to complain if the data is on a competing website? If their domain name is similar enough to yours that people are confusing the two of you, or if they've put something on their site that makes it look like you've endorsed them, then you've got them for trademark infringement.
Disabling the context menu is a start:
$(document).bind('contextmenu', function (e) {
    return false;
});
Or
<body oncontextmenu="return false;">
Preventing people from getting your data is almost impossible. You can mess up your tags and make the markup really dirty and hard to parse... but it's not really enough. You could also generate a big image with the data in it, which would be painful to parse! ... but you don't want to do that.
Because you said...
"However, a competing website takes our results (I know they are directly copying them) and never updates them, which causes people to call our office and complain."
... my call would be to take this the other way and create an API allowing people to get your content in a way that YOU designed.
Also if they are just shamelessly stealing your data and they don't have the right to do it, consider a legal option.
Another option is to use PHP code to generate images from the site's HTML. You would display the content as images instead of HTML, which can be easily copied out. Example code is here, and I bet you could find more by Googling:
http://www.acasystems.com/en/web-thumb-activex/faq-php-convert-html-to-image.htm
Try Copyscape. It won't prevent your content from being copied, but it will make finding the copies very easy.
You could encrypt the data on the page and have an obfuscated JavaScript decoding routine decode it for your viewers. You could switch keys and encryption algorithms from time to time. The same JavaScript could disable the ability to select text and/or copy it, to prevent manual copy-pasting.
They won't be able to copy manually, and their scraper would have to be able to run JavaScript to get the data.
The caveat is that the data won't be visible to Google, but if the data is mostly numeric, that might not be such a big loss.
If they scrape automatically and very often, you may also try to pinpoint their IP by watching for the most active IPs on your site, and serve those IPs fake data (see the one-liner below).
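For example, assuming a standard combined-format access log (the path below is an assumption; adjust for your server), this finds the heaviest requesters:

# Top 20 client IPs by request count
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20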
Please don't use lawyers, that's hitting below the belt.
Use SWF to display your data, just like other online book readers do.