We're developing an educational multiplayer game for kids and want to allow players to chat with each other using a whitelist system. When using whitelist chat, players will be able to type only words which appear in the whitelist.
We're aware of the limitations of whitelists in general, but we think a whitelist chat system is something that would allow our players to express themselves better in the game, while allowing a higher level of security than moderated or blacklist chat.
While the system is easy enough to implement, we haven't been able to find a sample whitelist of "safe" words online. Does anyone know where we can find such a list, preferably with a license that allows us to use it in a commercial project?
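For concreteness, the check we have in mind is roughly the following sketch (the word-list file name and the punctuation handling are placeholders, not a final design):

    # Rough sketch of the whitelist check; "whitelist.txt" is a placeholder
    # for whatever licensed word list we end up using.
    with open("whitelist.txt", encoding="utf-8") as f:
        ALLOWED = {line.strip().lower() for line in f if line.strip()}

    def is_message_allowed(message: str) -> bool:
        """Accept the message only if every word appears in the whitelist."""
        words = (w.strip(".,!?") for w in message.lower().split())
        return all(w in ALLOWED for w in words)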
Thanks.
I do not believe that a simple whitelist of words will cut it. There are quite a few euphemisms out there that a whitelist would never block (e.g. "he is growing like a weed" is fine, "he is growing weed" is NOT). And let's not even mention the basic "would you like to meet?", which would be fine if the meeting were to happen in-game, but very dangerous if it were to happen outside of it. Then there is also the issue of blocking rare, foreign or mistyped words, which might make your chat system frustrating enough that it would not be used.
In my opinion, there is absolutely no way you could ever match the security offered by an active and competent human moderator. Of course, depending on the volume of chat traffic and any real-time requirements, there are quite a few practical issues with using humans for this. Considering that your application is targeted at children, however, human moderation might be quite acceptable, despite its much higher cost.
A second choice, but one very far from the abilities of human moderation, is to use a statistical filter such as Bogofilter, which will happily sort arbitrary text if you train it well. A blacklist would also help to immediately reject messages containing words that little kids should not (but usually do) know. You would also need a bunch of filters to catch messages containing things like telephone numbers, email and street addresses, and web links.
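For that last part, the pre-filters can be simple pattern checks that reject a message before it ever reaches the statistical filter or a moderator. A rough sketch (the patterns are illustrative only and nowhere near exhaustive):

    import re

    # Illustrative patterns only -- real ones would need far more care.
    BLOCK_PATTERNS = [
        re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),      # phone-number-like digit runs
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),           # email addresses
        re.compile(r"(https?://|www\.)\S+", re.IGNORECASE),    # web links
    ]

    def contains_contact_info(message: str) -> bool:
        """Reject messages that look like they contain contact details or links."""
        return any(p.search(message) for p in BLOCK_PATTERNS)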
Perhaps the method with the best effectiveness/cost ratio would be to use human moderators assisted by multiple statistical filters to make better use of their time. Keep in mind, however, that if there are malicious users (i.e. anyone other than same-age kids in a classroom), there is no way to make sure that nothing questionable or dangerous ever gets through.
You can try the standard Unix dictionary, /usr/share/dict/words, but you'll have to modify it to remove the naughty words.
http://en.wikipedia.org/wiki/Words_%28Unix%29
http://www.openwall.com/wordlists/
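Building the whitelist is then just a set difference; a quick sketch (the blacklist file is something you would have to curate yourself):

    # Build a whitelist from the system dictionary minus a hand-maintained blacklist.
    with open("/usr/share/dict/words", encoding="utf-8") as f:
        dictionary = {line.strip().lower() for line in f if line.strip()}

    with open("blacklist.txt", encoding="utf-8") as f:   # placeholder file
        blacklist = {line.strip().lower() for line in f if line.strip()}

    with open("whitelist.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(dictionary - blacklist)))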
While this doesn't exactly answer your question, Runescape uses a whitelist of phrases rather than words.
The implementation in Runescape is awkward, because there are so many phrases to choose from. You have to go through 3 or 4 menus sometimes to get to the phrase you want.
If you can come up with a better organization of phrases, then this might work for you.
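As a very rough sketch of what I mean by organization: group phrases into categories so players only ever drill down one short menu at a time (the categories and phrases here are placeholders):

    # Hypothetical phrase catalogue; real categories and phrases would come from design.
    PHRASES = {
        "Greetings": ["Hello!", "Good game!", "See you later!"],
        "Questions": ["Want to trade?", "Can you help me with this quest?"],
        "Reactions": ["Nice one!", "Oops!", "That was close!"],
    }

    def menu_options(category=None):
        """With no argument, return the categories; with a category, return its phrases."""
        return list(PHRASES) if category is None else PHRASES[category]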
I have a serious question. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind:
If someone puts a web site up, they're expecting some visits. Granted, web crawlers use bandwidth without clicking on the ads that may support the site, but the site owner is putting their site on the web, right? So how reasonable is it for them to expect that they'll never get visited by a bot?
Some sites apparently use a robots.txt exactly in order to keep their site from being crawled by Google or some other utility that might grab prices and therefore allow people to do price comparisons easily. They have private search engines on the site so they obviously want people to be able to search the site; apparently they just don't want people to be able to easily compare their information with other vendors.
As I said, I'm not trying to be argumentative; I would just like to know if anyone has ever come up with a case where it's ethically permissible to ignore the presence of a robots.txt file? I cannot think of a case where it's permissible to ignore the robots.txt mainly because people (or businesses) are paying money to put up their web sites so they should be able to tell the Googles/Yahoos/Other SE's of the world that they don't want to be on their indices.
To put this discussion in context, I'd like to create a price comparison website and one of the major vendors has a robots.txt that basically prevents anyone from grabbing their prices. I'd like to be able to get their information but, as I said, I can't justify simply ignoring the wishes of the site owner.
I have seen some very sharp discussion here and that's why I would like to hear the opinions of developers that follow Stack Overflow.
By the way, there is some discussion of this topic on a Hacker News question but they seem to mainly focus on the legal aspects of this.
Arguments:
A robots.txt file is an implied license, especially since you are aware of it. Thus, continuing to scrape their site could be seen as unauthorized access (i.e., hacking). Sucks, but arguments like this have been made in other legal cases recently (not directly related to robots.txt, but in relation to other "passive controls").
Grabbing prices violates no copyright law, including DMCA, since copyright does not include factual information, only creative.
Ethically, you should not grab prices because the vendor should be able to change prices without worrying about being accused of a bait-and-switch by people coming from your site.
Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
In this economy, if a company doesn't want to do everything possible to advertise themselves, it's their own fault that you don't include them.
The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".
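For example, something like this (the paths are made up) keeps well-behaved spiders out of the endless parts of a site while leaving the rest crawlable:

    User-agent: *
    Disallow: /calendar/
    Disallow: /search/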
Many people have tried to build businesses off building "price comparison" engines that scraped major sites.
Once you start getting any sort of traffic/revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds, of projects. I even worked on a small project that received a C&D from Craigslist.
You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.
If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.
Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you to add their data.
One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. Protects both sides.
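The polite-crawler side of that bargain is cheap to implement; Python, for instance, ships a robots.txt parser in the standard library (the site and user agent below are placeholders):

    from urllib import robotparser

    # Check the site's rules before fetching anything.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/products/123"
    if rp.can_fetch("MyPriceBot/1.0", url):
        print("Allowed to fetch", url)
    else:
        print("robots.txt says no -- skipping", url)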
Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?
If too many people violate robots.txt we might get something worse.
"No" means "no".
To answer the narrow question: for the price comparison website, you're probably best off grabbing the price in real time rather than scraping the database in advance. It's hard to imagine that being a problem.
An interesting IRL version of this story, involving The Harvard Coop:
Coop Calls Cops On ISBN Copiers.
Short answer: No.
On the narrow issue: If a seller says that their prices are secret, I think you have to respect that. I'd contact them and ask if they really don't want price comparison engines like yours to include them, or if the "no trespassing" sign is for technical reasons. If the latter, perhaps they'll provide you with an alternative. If the former, then I'd say too bad, they don't get included, they lose some business, and it's their problem.
Tangential rant: Personally, I get pretty annoyed with companies that make me jump through hoops to find out the price of their products, places that make me call and talk to a salesman so he can give me a hard-sell pitch, or worse, make me give them my phone number so their salesman can call and harass me. I figure that if they're afraid to tell me the price, it probably means that it's too high.
In general: A robots.txt file is like a "No Trespassing" sign. It's the owner's right to say who is allowed on their property. If you think their reasons are dumb, you can politely suggest they take the sign down. But you don't have the right to disregard their wishes. If someone puts a No Trespassing sign on his yard, and I say, "Hey, I just want to take a quick short cut, what's the big deal?" -- Maybe I'm stepping on his prized Bulgarian violet bulbs and destroying a valuable investment. Maybe I'm crossing his people's sacred burial ground and offending their religious sensibilities. Or maybe he's just an ornery jerk. But it's still his property and his right. Oh, and if I fall into the dangerous sinkhole after ignoring the No Trespassing sign, who's to blame? (In America, I could probably still sue him for all he's worth despite the fact that he warned me, but is that right?)
I'm showing some ignorance here, but I always thought a bot was something only sent out by a search engine. Like Google or Yahoo.
Thus, if you wrote an application that searched content on the internet, I wouldn't consider that a search engine bot, which to my knowledge is what robots.txt is trying to block.
But this may just be selective ignorance, because I might do it until the webmaster of that site contacted me and asked me to stop :)
If people make it available to public access, they shouldn't try to put limits on it. Adding a robots.txt file to your site is the equivalent to putting a sign on your lawn that says "Please don't look at me."
Yes, I realize this question was asked and answered, but I have specific questions about this that I feel were not clear on that thread and I'd prefer not to get lost in the shuffle on another thread as well.
Previous threads said that rendering the email address to an image the way Facebook does is overkill and an unprofessional user experience for business/professional websites. And it seems that the general consensus is to use a JavaScript document.write solution using HTML entities or some other method that breaks up and/or makes the string unreadable by a simple bot. The application I'm building doesn't even need the "mailto:" functionality; I just need to display the email address. Also, this is a business web application, so it needs to look/act as professional as possible. Here are my questions:
If I go the document.write route and pass the html entity version of each character, are there no web crawlers sophisticated enough to execute the javascript and pull the rendered text anyway? Or is this considered best practice and completely (or almost completely) spammer proof?
What's so unprofessional about the image solution? If Facebook is one of the highest trafficked applications in the world and not at all run by amateurs, why is their method completely dismissed in the other thread about this subject?
If your answer (as in the other thread) is to not bother myself with this issue and let the users' spam filters do all the work, please explain why you feel this way. We are displaying our users' email addresses that they have given us, and I feel responsible to protect them as much as I can. If you feel this is unnecessary, please explain why.
Thanks.
It is not spammer proof. If someone looks at the code for your site and determines the pattern that you are using for your email addresses, then specific code can be written to try and decipher that.
I don't know that I would say it is unprofessional, but it prevents copy-and-paste functionality, which is quite a big deal. With images, you simply don't get that functionality. What if you want to copy a relatively complex email address to your address book in Outlook? You have to resort to typing it out which is prone to error.
Moving the responsibility to the users' spam filters is really a poor response. While I believe that users should be diligent in guarding against spam, that doesn't absolve the person publishing the address of responsibility.
That said, trying to do this in an absolutely secure manner is nearly impossible. The only way to do that would be to have a shared secret which the code uses to decipher the encoded email address. The problem with this is that because the JavaScript is interpreted on the client side, there isn't anything you can keep secret from scrapers.
Encoders for email addresses nowadays generally work because most email harvester bots aren't going to bother coding specifically for every site. They are going to try to use a minimal algorithm that gets maximum results (the payoff isn't worth it otherwise). Because of this, simple encoders will defeat most bots. But if someone REALLY wants to get at the emails on your site, then they can, and probably quite easily, since the code that writes the addresses is publicly available.
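To illustrate what one of those simple encoders does, here is a rough sketch of generating the entity-encoded string that a page-side document.write would then emit (the address is a placeholder):

    def entity_encode(address):
        """Turn each character into a decimal HTML entity, e.g. 'a' -> '&#97;'."""
        return "".join("&#{};".format(ord(ch)) for ch in address)

    # Placeholder address; the page script would document.write() the result.
    print(entity_encode("someone@example.com"))
    # &#115;&#111;&#109;&#101;&#111;&#110;&#101;&#64;...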
Taking all this into consideration, it makes sense that Facebook went the image route. Because they can alter the image to make OCR all but impossible, they can virtually guarantee that email addresses won't be harvested. Given that they are probably one of the largest email address repositories in the world, it could be argued that they carry a heavier burden than any of us, and while inconvenient, are forced down that route to ensure security and privacy for their vast user base.
There are quite a few reasons JavaScript is a good solution for now (though that may change as the landscape evolves).
JavaScript obfuscation is a better mousetrap for now.
You just need to outrun the others. As long as there is low-hanging fruit, spammers will go for that. So unless everyone starts moving to JavaScript, you're okay, for now at least.
Most spammers use HTTP-based scripts which GET pages and parse them with regexes. Using a JavaScript engine to parse is certainly possible, but it will slow things down.
Regarding the Facebook solution, I don't consider it unprofessional, but I can clearly see why purists may disagree.
It breaks accessibility standards (it cannot be parsed by browsers or screen readers, or be clicked).
It breaks the semantic structure (it's an image, not a mailto link anymore).
It breaks the presentational layer: if you increase the browser's default font size or use high-contrast custom CSS, it won't apply to the email.
Here is a nice blog post comparing a few methods, with benchmarks.
http://techblog.tilllate.com/2008/07/20/ten-methods-to-obfuscate-e-mail-addresses-compared/
As I'm starting to develop for the web, I'm noticing that having a document between the client and myself that clearly lays out what they want would be very helpful for both parties. After reading some of Joel's advice, I've concluded that doing anything without a spec is a headache, unless of course you're billing hourly ;)
For those who have had experience:
What is a good way to extract all the information possible from the client about what they want their website to do and how it should look? What are good ways to avoid feature creep?
What web-specific requirements should I be aware of (graphic design, perhaps)?
What do you use to write your specs in?
Anything else one should know?
Thanks!
PS: To the "StackOverflow purists": if my question sucks, I'm open to feedback on how to improve it rather than downvotes and "your question sucks" comments.
It depends on the goal of the website. If it is a site to market a new product being released by the client, it is easier to narrow down the spec; if it's a general site, then it's a lot of back and forth.
Outline the following:
What is the goal of the site / re-design.
What is the expected increase in the customer base?
What is the customer retention goal?
What is the target demographic?
Outline from the start all the interactive elements - flash / movies / games.
Outline the IA (information architecture): sit down with the client and outline all the sections they want. Think about how to organize it and bring it back to them.
Get all changes in writing.
Do all spec preparation before starting development to avoid last minute changes.
Some general pointers
Be polite, but don't be too easy-going. If the client is asking for something impossible, let them know that in a polite way. Don't say YOU can't do it, say it is not possible to accomplish that in the allotted time and budget.
Avoid making comparisons between your ideas and big-name company websites. Don't say your search function will be like Google's, because that sets a standard for your program that the user is already used to.
Follow standards in whatever area of work you are in. This will not only make the code easier to maintain later, but also reduce the chances of bugs.
Stress accessibility to yourself and the client; it is a big thing.
More stuff:
Do not be afraid to voice your opinion. Of course, the client has the money and the decision at hand whether to work with you - so be polite. But don't be a push-over, you have been in the industry and you know how it works, so let them know what will work and what won't.
If the client stumbles on your technical explanations, don't assume they are stupid, they are just in another industry.
Steer the client away from cliches and buzz words. Avoid throwing words like 'ajax' and 'web 2.0' around, unless you have the exact functionality in mind.
Make sure to plan everything before you start work, as I have said above. If the site is interactive, you have to make sure everything meshes together. When a site is thought up piece by piece, trust me, it is noticeable.
One piece of advice that I've seen in many software design situations (not just web site design) relates to user expectations. Some people manage them well by giving the user something to see, while making sure that the user doesn't believe that the thing they're seeing can actually work.
Paper prototyping can help a lot for this type of situation: http://en.wikipedia.org/wiki/Paper_prototyping
I'm with the paper prototyping, but I use iplotz.com for it, which is working out fine so far for us.
It makes you think about how the application should work in more detail, and thus makes it less likely that you'll miss things you need to build, and it makes it much easier to explain to the client what you are thinking of.
You can also ask the client to use iplotz to explain their requirements to you, or to collaborate in it.
I also found that searching for client questionnaires on Google is a good way to generate some more ideas:
Google: web client questionnaire
There are dozens of PDFs and other forms to learn from.
We have a very small, specialized user base.
No community.
My boss wants to find out who is using it, and his approach is to simply make a hidden connection, perhaps an auto-update function that is enabled by default and shows NO notification when there is no update ...
I don't really like the idea and am trying to come up with something different.
There is a registration, and then you can download a free trial. There are no limitations other than the time limit.
Sold licenses are usable across an IP range (universities).
So registration and licensing themselves are no indicator of usage, not to mention that the devs have no feedback about sold licenses whatsoever.
I would like some advice on how you would approach, or better, how you actually have approached, a problem like this.
"Simply call home" to notify you that someone is using your software is probably not a good idea, indeed : users don't tend to like that. And it can be bad for the reputation of your company/software.
A solution would be to have some kind of good reason to "call home" ;-)
For instance, what about some kind of auto-update mechanism? Users could disable it, of course, if they want (so it's not 100% effective), but most won't.
And it's a genuinely good reason to make a request to your server :-)
Just don't send anything that could identify the user; maybe some unique ID (to distinguish between installs), but one that cannot be traced back to a person?
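Roughly something like this (the endpoint is hypothetical, and the ID is random rather than derived from anything about the machine):

    import json
    import uuid
    import urllib.request

    # A random ID distinguishes installs without identifying anyone; in a real
    # build you would generate it once and store it next to the program.
    INSTALL_ID = str(uuid.uuid4())

    def check_for_updates(current_version):
        """Ask a (hypothetical) update server for the latest version."""
        payload = json.dumps({"install_id": INSTALL_ID,
                              "version": current_version}).encode()
        req = urllib.request.Request(
            "https://updates.example.com/check",   # placeholder endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)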
I don't like it when software I use says to people "hello, this guy is using me!", but I really like the auto-update feature in Firefox, for instance... even if it tells them I'm using the software ;-)
This is extremely subjective, and I'd strongly suggest you go and ask some of your actual users how they feel about it, instead of a bunch of opinionated programmers (unless your program is oriented towards programmers who frequent stackoverflow.com). If you make clear it's anonymous and lightweight, and your users like your program to begin with, maybe they'll be just fine with contributing data to build a better version. But there's no other way to know than to simply ask them.
Concealing (to use a loaded phrase) your activities under some unrelated pretext seems highly disingenuous.
If you're selling anywhere that might have a competent IT setup, like a university, then I wouldn't even think about a sneaky don't-tell-them route. If you do, you're lining yourself up for bad publicity as soon as someone's firewall spots the unexpected connections.
I start almost all my programs with a shell script that emails me who is using the program, what version they're running, and some other stuff. If nothing else, it's useful for the bean counters who want to track software usage to see if your job is worth keeping.
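The same idea in Python rather than a shell script, purely as a sketch (the addresses and mail server are placeholders):

    import getpass
    import platform
    import smtplib
    from email.message import EmailMessage

    def report_usage(program, version):
        """Email a one-line usage report at startup."""
        msg = EmailMessage()
        msg["Subject"] = "{} {} started by {}".format(program, version, getpass.getuser())
        msg["From"] = "usage-report@example.com"
        msg["To"] = "devs@example.com"
        msg.set_content("Host: {}\nPlatform: {}".format(platform.node(), platform.platform()))
        with smtplib.SMTP("mail.example.com") as smtp:   # placeholder mail server
            smtp.send_message(msg)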
My software has a licensing scheme where each installed copy generates a unique product ID, and I then email the customer a matching code that unlocks the full program. So I know exactly how many (paying) customers I have.
This doesn't count people using cracked versions, but I'd rather not know how many of them there are anyway.
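Not my exact scheme, but one way such a product-ID/unlock-code pairing could work, sketched with hypothetical names (note that anything shipped in the client, including the secret, can ultimately be reverse-engineered):

    import hashlib
    import hmac
    import uuid

    SECRET = b"vendor-side secret"   # hypothetical; ideally known only to the vendor

    def generate_product_id():
        """Run on the customer's machine at install time."""
        return uuid.uuid4().hex

    def unlock_code_for(product_id):
        """Run on the vendor's side; the result is emailed back to the customer."""
        return hmac.new(SECRET, product_id.encode(), hashlib.sha256).hexdigest()[:16]

    def is_unlocked(product_id, code):
        """The installed copy compares the code the customer typed in."""
        return hmac.compare_digest(code, unlock_code_for(product_id))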
Since you can have multiple users on a single license, the only thing you can really do is add something to your software that sends a notification to your server every time the application starts. Obviously this won't catch people who aren't connected to the Intertubes, but there's no way to measure them anyway (short of calling them, as you've mentioned already).
Well I would certainly hope you know who you're selling your software to if you're keeping licenses like that, and that you have their phone numbers. Give them a call, sit on the phone with them for a while, ask them what they'd change, what they don't like, what bothers them.
That would truly be going the extra mile, and would most likely impress whoever is using the software. When you call, make sure to let them know you aren't some 3rd party calling on your company's behalf, let them know you actually work on the software that they're using, and that you really want to know what they think, and that their opinions have some form of influence on future versions and features.
You could also send out a mass e-mail to do the same thing, but that's lazy, imo.
Auto-Update w/ usage stats is a GREAT idea.
I am interested in choosing a good structure for an online message board-type application. I will use SO as an example, as I think it's an example that we are all familiar with, but my question is more general; it is about how to achieve the right balance between organization and flexibility in online message boards.
The questions page is a load of random stuff. It moves quickly (some might say, too quickly) and contains a huge number of questions that I'm not interested in.
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
Discrete subforums are in a sense less flexible, since they generally force you to pick a single category even when a question might fit into two (if SO had, say, areas for "Web Development", "Games Development", "Computer Science", "Systems Programming", "Databases", etc., then sure, some people might want to post about developing web-based games, for example). But is it worth sacrificing some of that flexibility in order to make it easier to find the content you are interested in, and hide the content you are not interested in?
Is there any way with a pure tagging system to achieve the greater ease of use that subforums provide?
The real problem with subforums comes when you guess wrong about which topics have enough interest to get their own subforums. While some topics end up with their own vibrant subcommunities, others end up as empty ghettos, with little activity or feeling of community. Topics that might flourish as occasional subjects in a larger forum end up fragmented among many subforums, none of which has the critical mass of people necessary to sustain an active, vibrant community.
Though I think that tagging is superior to grouping, people tend to think hierarchically.
In general it depends on the target group for the forum.
Maybe you can go with a mixture: use tagging and later use tag groups to organize the posts. Delicious uses this, for example, and I find it rather helpful.
If you're worried about the divide between specific forums and open tag-based systems like Stack Overflow, consider building a query system that allows somewhat more complex queries than just the AND operator used here on Stack Overflow.
I cannot make a query here that will give me all questions in .NET, SQL or C#, combined, and that is the biggest irritation I have with the tags. With such a query system, you can create virtual forums at least.
Other than that, I don't really have a good opinion. I like both, and I haven't yet decided which one is best.
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
While it's currently the case that you can't use tags to hide content, it shouldn't be impossible. Using SO as an example again, there's no reason that a system similar to the ignore function on a forum couldn't be built for the tag system. By adding a right-click context menu or a small "X" link somewhere in the tag display, tags could be marked as ignored. This would also allow the current tag feature to function: seeing everything (minus your ignore list), or clicking a tag to see only questions with that tag.
Ignored tags could be managed in your profile if you should later develop an interest in PHP or INTERCAL that you lacked before.
The real question is one of performance. In my head it's as simple as replacing a SELECT [stuff] WHERE Tag = 'buffer-overflow' with SELECT [stuff] WHERE Tag NOT IN ('php','offtopic','funny-hat-friday'), but I've not put together any DB-backed sites that get absolutely pounded on by thousands of people.
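As a rough sketch of what that could look like against a simple question/tag join table (the schema and the use of sqlite are hypothetical, purely for illustration):

    import sqlite3

    # Hypothetical schema: questions(id, title) and question_tags(question_id, tag).
    conn = sqlite3.connect("forum.db")
    ignored = ["php", "offtopic", "funny-hat-friday"]

    placeholders = ",".join("?" for _ in ignored)
    rows = conn.execute(
        "SELECT q.id, q.title FROM questions q "
        "WHERE q.id NOT IN (SELECT question_id FROM question_tags "
        "                   WHERE tag IN ({}))".format(placeholders),
        ignored,
    ).fetchall()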