One robots.txt to allow crawling of only the live website; the rest should be disallowed

I need some guidance about using robots.txt. The problem is as follows.
I have one live website, "www.faisal.com" or "faisal.com", and two testing web servers: "faisal.jupiter.com" and "faisal.dev.com".
I want one robots.txt to handle all of this. I don't want crawlers to index pages from "faisal.jupiter.com" or "faisal.dev.com"; they should only be allowed to index pages from "www.faisal.com" or "faisal.com".
In other words, I want a single robots.txt file that will sit on all web servers and allow indexing of only the live website.

The Disallow directive only specifies relative URLs, so I guess you cannot have the same robots.txt file for all of them.
Why not force HTTP authentication on the dev/test servers?
That way the robots won't be able to crawl those servers.
It seems like a good idea if you want to allow specific people to check them, but not everybody trying to find flaws in your not-yet-debugged new version...
Especially now that you have given the addresses to everybody on the web.
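As a rough sketch, assuming Apache and a password file at a path like /etc/apache2/.htpasswd (both the path and the user name below are just examples), an .htaccess on the dev/test servers could look like this:

AuthType Basic
AuthName "Development server"
# Password file created beforehand with: htpasswd -c /etc/apache2/.htpasswd someuser
AuthUserFile /etc/apache2/.htpasswd
Require valid-user

You would also need AllowOverride AuthConfig enabled for that directory, or you can put the same directives straight into the virtual host configuration.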

Depending on who needs to access the dev and test servers, and from where, you could use .htaccess or iptables to restrict access at the IP address level.
Or, you could separate your robots.txt file from the web application itself, so that you can control its contents per environment.
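For instance (the file paths and host names here are only placeholders), the dev/test virtual hosts could map /robots.txt to a deny-everything file kept outside the application, while the live site keeps its normal one:

# robots-dev.txt, served only on the dev/test hosts
User-agent: *
Disallow: /

# In the dev/test Apache virtual host configuration
Alias /robots.txt /var/www/robots-dev/robots-dev.txt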

Related

Robots.txt taking out form keys

I was wondering if someone would be able to take a look at this and tell me if I configured it correctly. I am not trying to block all parameters (?), just the ones with hsformkey. Here is how I wrote the directive. I tested it in Search Console and it says it is blocked, but I'm not sure I trust it.
Disallow: *?hsFormKey*
Thanks!
Robots.txt is a convention that many automated web crawlers (most often used by search engines to index sites) consult to learn which pages you don't want them to crawl. It's a fairly loose convention, and different crawlers support different features.
The original documentation, which is the closest there is to a universal standard, doesn't include the concept of "wildcards" or "globbing". Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common addition that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. Note that since Disallow directives are beginning-of-URL instructions anyway, the asterisk at the end doesn't do anything useful.
The thing to keep in mind is that the exclusion rules work directly on the URL, not on any path or conventions used by your web server or application framework. While your application may treat characters like ? and & as delimiting parameter information (as it's a pretty common and standard thing to do), the web crawler is just interpreting the entire URL. Also, URL paths are case sensitive. So, you are excluding the web crawler from loading any URL that has ?hsFormKey in it, but not excluding URLs with ?somethingelse=value&hsFormKey=123 in them, nor are you excluding URLs with ?hsformkey in them. This may be exactly what you want, but I'm not sure of your requirements. And as I said, it will only be effective for those crawlers that recognize this style of wildcard.
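For what it's worth, a broader set of rules for crawlers that understand this wildcard style might look something like this (whether you also want lowercase variants depends on which URLs actually exist on your site):

User-agent: *
# Block the parameter whether it is the first parameter or follows others
Disallow: /*?hsFormKey
Disallow: /*&hsFormKey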
Also be aware that robots.txt only prevents loading of the excluded URLs by crawlers. Excluded URLs can still show up other places that people can find (bookmarked links, links shared on social media, etc.) and can still show up in search engines. Google explicitly states they will still index and list a URL, even if they're not allowed to crawl it:
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
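As a sketch of those other blocking methods, a noindex can go either in the page itself or in a response header (the Apache line assumes mod_headers is enabled). Note that the URL must remain crawlable, i.e. not blocked in robots.txt, for the noindex to be seen at all.

In the page's HTML head:
<meta name="robots" content="noindex">

Or as a response header in the Apache configuration:
Header set X-Robots-Tag "noindex"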

How to refuse wget?

I am uploading images to a public directory and I would like to prevent users from downloading the whole lot using wget. Is there a way to do this?
As far as I can see, there must be. I have found a number of sites where, as a public browser, I can download a single image, but as soon as I run wget against them I get a 403 (Forbidden). I have tried using the no-robot argument, but I'm still not able to download them. (I won't name the sites here, for security reasons).
You can restrict access based on the User-Agent string; see Apache 2.4's mod_authz_core, for example.
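A minimal sketch, assuming Apache 2.4 and an image directory at /var/www/html/images (the path is just an example):

<Directory "/var/www/html/images">
    # Deny any request whose User-Agent contains "wget" (case-insensitive).
    # The User-Agent header is trivially spoofed, so this only stops casual use.
    Require expr "%{HTTP_USER_AGENT} !~ /wget/i"
</Directory>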
Wget also respects robots.txt directives by default, which should deter any casual user.
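For example, a robots.txt entry aimed specifically at Wget, which should match its own name in the User-agent line (unless the user turns robot handling off), would be:

User-agent: wget
Disallow: /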
However, a careful look at the wget manual shows how to bypass these restrictions. Wget can also add random delays between requests, so even advanced techniques based on access-pattern analysis may be defeated.
So the more robust way is to defeat wget's link/reference recognition engine. Namely, the content you want to keep from being mirrored should be loaded dynamically using JavaScript, and the URLs should be encoded in a way that requires JS code to decode. This would protect your content, but it would require you to manually provide an unobfuscated version for the web bots you do want indexing your site, such as Googlebot (and no, it is not the only one you should care about). Also, some people do not run JS by default (esoteric browsers, low-end machines, and mobile devices may demand such a policy).

How do I set up an intranet that can be accessed in different locations?

I want to set up an intranet that can be accessed in more than one location.
I want the server to be located in one location and accessed from another; for example, from a user's home or from one of our many offices. At the moment I can't see more than 7 people using it, so we won't need anything large to start off with.
I use Wampserver for building our web pages, but I don't think Wampserver on its own will be enough to do what we need: if I just set it up, it is only accessible from the building we are in. I do not want to open the firewall and put it online, as the pages we will be serving are not for the public.
The typical way of doing this is to set up and configure a VPN solution for your home users. You could do this yourself or use a third party solution. Normally, you would allow VPN users access to specific resources, such as your intranet server.
The other alternative is to allow public access to the intranet server, but implement authentication on the intranet server so only your users can access the content.
I would normally go for the former as a more secure solution, but it depends on your environment and requirements.
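As a rough illustration of the VPN route, assuming Apache (as bundled with Wampserver), a hypothetical VPN subnet of 10.8.0.0/24 and an office LAN of 192.168.1.0/24 (all names, paths and ranges below are placeholders), the intranet could be restricted like this:

<VirtualHost *:80>
    ServerName intranet.example.com
    DocumentRoot "C:/wamp/www/intranet"
    <Directory "C:/wamp/www/intranet">
        # Only clients arriving over the VPN or the local office LAN can reach the intranet
        Require ip 10.8.0.0/24 192.168.1.0/24
    </Directory>
</VirtualHost>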

Fetching a file from a URL for a mobile app: how to manage the server side running Joomla?

I'm new to website development and design, so I apologize in advance if the question is redundant.
I have a program where a client, using a URL string, fetches an XML file from a web server. This would be no problem if it were a simple URL with no security or CMS (like Joomla) involved: just use the exact URL string and the client gets the file from the web server, done.
But how would the process work if the URL is on my site hosted on GoDaddy and using the Joomla CMS?
I'm trying to understand how the same process of fetching a file works on a hosted server using a CMS, since I just made the transition from having my site on my school's servers to a Joomla website I'm hosting on GoDaddy.
I mean, where would I put the file if I also want it to be accessible only after the client authenticates itself, just to be on the safe side? Is this how things normally work in mobile apps? My client program is an iPhone app, and within the app I have an XML file which is used as the data source for my UITableView, but I want to check some URL to see if an updated version of the XML file exists. My app-side programming is mostly done; now I'm trying to learn the server-side things I need to do to make this process happen with Joomla and my own hosted site.
I do not understand how the process would work in that case. What are the things I would need to do on the server side and the client side to make this possible?
Please help me understand, or point me to some links where these steps are illustrated... or give me some Google keywords I can search for to learn about this process.
Thanks a lot.
The fact that you have a CMS does not generally change how you access a file within the file structure of your domain, unless the CMS protects certain directories. In this case, Joomla does not, so you can directly access any file you wish. Depending on the sensitivity of the information you are trying to retrieve, you can protect the directory through your domain management panel. If it's not particularly sensitive, the authentication can be done by the app, since the URL you are accessing can easily be hidden from the user.
It seems like that would be the simplest solution since the app will have access to user information by nature of where it resides.
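As a sketch of what the app's request could look like if you do protect that directory with HTTP Basic authentication (the path and the credentials, user:password, are made up):

GET /data/feed.xml HTTP/1.1
Host: www.example.com
Authorization: Basic dXNlcjpwYXNzd29yZA==

The Authorization value is just the base64 of "user:password", so use HTTPS if the data matters.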

Building a web portal which will be rented to customers. Need an architecture suggestion

I am building a web portal which will be rented to customers on a hosted model (SaaS), where they will use the entire portal's features on their own domains with their own branding.
Now, I don't want them to get the files of my web portal, but they should still be able to use a custom-branded portal.
One solution someone suggested here was to host the branded version on my server and embed it via an iframe on the customer's domain. However, I didn't like that idea very much.
A second approach, which I researched and found, was to host the portal on a fresh IP on my server and ask the customer to point his domain to that IP.
The web portal will be sold to lots of customers, and they will all have separate user interfaces and branding, so this is needed.
Please tell me what you think of my approach, or if you have a better idea in mind, please pour in your suggestions.
iFrames are evil.
With that said, I would probably go with a subdomain approach: they add a subdomain like webportal.somecompany.com that points to you, and you have your web server route requests to the correct hosted instance of your application based on the subdomain. That way their www.somecompany.com still goes to their own website.
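As a sketch, the customer then only needs a DNS record for that subdomain pointing at your infrastructure (all names here are placeholders):

; in somecompany.com's DNS zone
webportal    IN    CNAME    portal.yourdomain.com.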
We're running a SAAS application that supports branding, and we do it by dynamically serving up CSS. If all of your customers have a unique domain name pointed at your server, you could select your CSS files by domain name: If a customer logs in at "http://portal.customer.com/login", you can have his HTML link to the file "/stylesheets/portal.customer.com.css", and so forth. Alternatively, you can create a subdomain for each of your customers, and point them all at your master server, using very similar code to pick the CSS.
This lets you have a single IP address for all customers (and only as many servers as you need to support all your customers behind that IP address), instead of one IP address/server per customer, which should save on hosting costs!
(NOTE: I'm leaning toward the subdomain approach, the more I think about it. If you're using HTTPS, it would let you use a single "*.yourdomain.com" certificate, rather than trying to mess with separate certificates for each client domain.)
You don't need to run different IPs for different customers. HTTP/1.1 supports the Host header, like so:
GET / HTTP/1.1
Host: example.com
This is how most shared hosts work. When a customer sets up their DNS records to point at your server/load balancer, the incoming requests will have your client's hostname in the headers. Whether you set up virtual hosts in say Apache or do it at the application level is up to you.
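For instance, name-based virtual hosts in Apache might look roughly like this (the domains and paths are placeholders); both hosts can point at the same application, which then picks the tenant's branding from the Host header:

<VirtualHost *:80>
    ServerName www.customer-one.com
    DocumentRoot "/var/www/portal"
</VirtualHost>

<VirtualHost *:80>
    ServerName www.customer-two.com
    DocumentRoot "/var/www/portal"
</VirtualHost>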
Please, for your own sake, don't do iframes. There's a lot of information on the web about architecture for multi-tenant applications.
My experience is that in such a scenario your customers will come up with every web UI requirement you can imagine. It is therefore rather difficult to build a web UI framework that can accommodate all of their needs; in fact, it would effectively become a content management system.
Furthermore, for building the web UI, you may encounter any combination of customer in-house development, a third-party web agency, or a request for you to develop it yourself.
In such situations I have had good experiences with offering the SaaS as actual web services that custom-developed portals can run on top of. With this, anybody can build the actual portal with the client's look and feel. You could offer development and hosting as an option.