Censor certain words and images - iphone

My question is: how can I censor bad language and nude images in my application? Is there some sort of framework that can filter the content submitted by the user? What are you, iOS experts, using at the moment to solve this problem?

There are two parts to your question: 1) Censoring text, 2) Censoring images.
In the case of text, you could store a dictionary and match user input against the words that you want to censor. However, how do you define bad language? Words take their meaning from context, and software can't reliably determine context. For example, should saying that someone likes to eat watermelon be censored? It could be considered racist if applied to certain groups of people, and that's something your dictionary can't tell.
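For what it's worth, the dictionary approach can be sketched in a few lines. This is only an illustration in Python (the same idea ports to any language); the word list and the censor function are made up for the example, and it has exactly the context blindness described above.

```python
import re

# Hypothetical blocklist; a real one would be much larger and curated by hand.
BLOCKED_WORDS = {"darn", "heck"}

def censor(text, blocked=BLOCKED_WORDS):
    """Replace each blocked word with asterisks of the same length."""
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in blocked else word
    # \w+ matches whole words, so "heckler" is not censored for containing "heck".
    return re.sub(r"\w+", mask, text)

print(censor("Well heck, that darn heckler left."))
# prints "Well ****, that **** heckler left."
```

Matching whole words avoids censoring innocent words that merely contain a blocked one, but it cannot tell whether an allowed word is being used offensively.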
In the case of images, there is no reliable method to discern nudity via an algorithm. In fact, all websites that do image censoring use humans to categorize and censor user-supplied images for that reason (and from what I have read, it's not the best of jobs to have). And even the humans make mistakes. Recently Facebook rejected an image of a woman in a bath because they mistook her elbow for a naked breast.

Related

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here does.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way I have multiple spellings of various permutations of a parent level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records of multiple spellings and short spellings (e.g. apple, appls) and I want to group them together in some way. In this example by fruits and vegetables.
Clustering would be challenging since each entry is most often one or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table created by a human (not likely with millions of rows), is there any approach I can take with this problem?
You will first need to solve the spelling problem, unless you have Google-scale data that would let you learn spelling corrections from statistics alone.
Then you will still have the problem that "Apple" could be a fruit or a computer. And "Apple" and "Granny Smith" look completely different as strings, even though they belong together. Your best guess at this second stage is something like word2vec trained on massive data. Then you get high-dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.
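As a rough illustration of the spelling step only, here is a minimal Python sketch that maps raw values to the closest entry in a small seed vocabulary using the standard library's difflib. The vocabulary, the parent classes, and the 0.6 cutoff are assumptions made for the example, not something from the question's data.

```python
import difflib

# Hypothetical seed vocabulary: canonical term -> parent class.
CANONICAL = {
    "apple": "fruits", "orange": "fruits", "pear": "fruits", "fruits": "fruits",
    "cauliflower": "vegetables", "broccoli": "vegetables",
    "leeks": "vegetables", "vegetables": "vegetables",
}

def classify(raw, cutoff=0.6):
    """Map a raw 'type' value to a parent class, or None if nothing is close."""
    term = raw.strip().lower()
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(term, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[matches[0]] if matches else None

for value in ["aple", "appls", "ornge", "colifower", "brocli", "veg"]:
    print(value, "->", classify(value))
```

Misspellings such as "aple" or "colifower" land on the right class, but an abbreviation like "veg" falls below the cutoff and comes back as None, so short forms would still need aliases added by hand. It also does nothing for the "Apple the fruit vs. Apple the computer" problem.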

What's the correct RESTful way to structure A Website's Parent/Child Views

NOTE: This is not specifically for an API.
I have three entities: Building, Unit, Person.
These are simple, exclusive 1:M relationships:
A Person can only live in (1) unit
A Unit can only exist in (1) building
The Building is essentially the parent.
The view mode is pretty easy.
Should I have URLs like:
/buildings //Show all buildings
/buildings/[id] //Show one building
/buildings/[id]/units //Show all units in a building
/buildings/[id]/units/[id]/people //Show all people in a unit
However, this seems kind of verbose. While those URLs may work for PUTs and POSTs which redirect to a GET, if I want to show all the units and people in a building, should I be using a route like buildings/[id]/details, or is there some other standard convention?
Also, when I want to display a form to edit the values, is there a standard URL path like buildings/[id]/edit? A POST and a PUT in this case will essentially be using the same form (but with the PUT having the fields filled out).
I think your question may attract some opinionated answers, but it'd be good to hear about other peoples' practices regarding RESTful API designs.
You say your paths seem kind of verbose, and you may feel that way if your IDs are auto incremented integers and the only way to specify buildings, units, etc is with paths like
buildings/1/units/4/tenants
buildings/1/units/4/tenants/5
To me these are clear interfaces. If I had to maintain your code, I'd think it's pretty obvious what's going on here. If I had to criticize something, I would say you seem to be developing in a way that allows for all or one selection. It's your design choice, though. Maybe that's exactly what you need in this case. Here are some examples that come to mind.
update one tenant
PUT buildings/1/units/4/tenants/2
create three units
POST buildings/2/units //carries message body for SQL in back end
read tenants with certain criteria
GET buildings/1/tenants?params= //GET can't carry a message body
delete tenants with certain criteria
DELETE buildings/5/tenants?criteria= //params needed?
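To make the nested paths concrete, here is a minimal sketch of a route table in Python, independent of any web framework. The handler names and the resolve helper are hypothetical; the point is only that each nested path maps cleanly to one handler plus the IDs it carries.

```python
import re

# Hypothetical route table: (HTTP method, URL pattern, handler name).
ROUTES = [
    ("GET", r"^/buildings$", "list_buildings"),
    ("GET", r"^/buildings/(?P<building_id>\d+)$", "show_building"),
    ("GET", r"^/buildings/(?P<building_id>\d+)/units$", "list_units"),
    ("GET", r"^/buildings/(?P<building_id>\d+)/units/(?P<unit_id>\d+)/tenants$",
     "list_tenants"),
    ("PUT", r"^/buildings/(?P<building_id>\d+)/units/(?P<unit_id>\d+)"
            r"/tenants/(?P<tenant_id>\d+)$", "update_tenant"),
]

def resolve(method, path):
    """Return (handler name, captured IDs) for the first matching route, else None."""
    for route_method, pattern, handler in ROUTES:
        match = re.match(pattern, path)
        if route_method == method and match:
            return handler, {key: int(value) for key, value in match.groupdict().items()}
    return None

print(resolve("PUT", "/buildings/1/units/4/tenants/2"))
# prints ('update_tenant', {'building_id': 1, 'unit_id': 4, 'tenant_id': 2})
```

Most frameworks give you this mapping declaratively, so the verbosity of the URLs rarely costs much on the server side.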

Can I use Apache Mahout Taste for User Preferences matching?

I am trying to match objects based on predefined user preferences. A simple example would be finding the best-matching vehicle.
Let's say a user 'Tom' is offered a rental vehicle for travel based on his predefined preferences. In this case, the predefined user preferences will be -
** Pre-defined user preferences for Tom:
PreferredVehicle (Make='ANY', Type='3-wheeler/4-wheeler',
Category='Sedan/Hatchback', AC/Non-AC='AC')
** while the 10 available vehicles are -
Vehicle1(Make='Toyota', Type='4-wheeler', Category='Hatchback', AC/Non-AC='AC')
Vehicle2(Make='Tata', Type='3-wheeler', Category='Transport', AC/Non-AC='Non-AC')
Vehicle3(Make='Honda', Type='4-wheeler', Category='Sedan', AC/Non-AC='AC')
;
;
and so on up to 'Vehicle10'
All I want to do is - choose a vehicle for Tom that best matches his preferences and also probably give him choices in order, i.e. best match first.
Questions I have :
Can this be done with Mahout Taste?
If yes, can someone please point me to some example code where I can start quickly?
A recommender may not be the best tool for the job here, for a few reasons. First, I don't expect that the best answers are all that personal in this domain. If I wanted a Ford Focus, the best alternative you have is likely about the same for most every user. Second, there is not much of a discovery problem here. I'm searching for a vehicle that meets certain needs; I don't particularly want or need to find new and unknown vehicles, like I would for music. Finally you don't have much data per user; I assume most users have never rented before, and very few have even 3+ rentals.
Can you throw this data at a recommender anyway? Sure, try Mahout Taste (I'm the author). If you have the book Mahout in Action it will walk you through it. Since it's non-rating data, I can also recommend the successor project, Myrrix (http://myrrix.com) as it will be easier to set up and run. You can at least evaluate the results to see if it's anywhere near useful.
Either way, your work will just be to make a CSV file of "userID,vehicleID" pairs from your data and feed it in. Then it will give you vehicle IDs as recommendations for any user ID.
But I imagine you will do much better to analyze what people picked when the car they wanted wasn't available, look at the difference, learn which attributes they are most and least likely to sacrifice, and learn to score the alternatives that way. This is entirely feasible since this data set is small, and because you have rich item attribute data.
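A minimal sketch of that attribute-scoring idea, in Python rather than Mahout, might look like the following. The weights are invented for illustration; in practice they would be learned from which attributes people actually give up.

```python
# Tom's preferences from the question. A set lists the acceptable values,
# and "ANY" means the attribute does not matter.
PREFERENCES = {
    "make": "ANY",
    "type": {"3-wheeler", "4-wheeler"},
    "category": {"Sedan", "Hatchback"},
    "ac": {"AC"},
}

# Hypothetical weights: how much a match on each attribute is worth.
WEIGHTS = {"make": 1.0, "type": 2.0, "category": 3.0, "ac": 2.0}

VEHICLES = {
    "Vehicle1": {"make": "Toyota", "type": "4-wheeler", "category": "Hatchback", "ac": "AC"},
    "Vehicle2": {"make": "Tata", "type": "3-wheeler", "category": "Transport", "ac": "Non-AC"},
    "Vehicle3": {"make": "Honda", "type": "4-wheeler", "category": "Sedan", "ac": "AC"},
}

def score(vehicle, prefs=PREFERENCES, weights=WEIGHTS):
    """Sum the weights of every attribute that satisfies the preference."""
    total = 0.0
    for attr, wanted in prefs.items():
        if wanted == "ANY" or vehicle[attr] in wanted:
            total += weights[attr]
    return total

def rank(vehicles):
    """Return vehicle names ordered from best match to worst."""
    return sorted(vehicles, key=lambda name: score(vehicles[name]), reverse=True)

print(rank(VEHICLES))
```

With these weights Vehicle1 and Vehicle3 tie on every attribute and Vehicle2 comes last, since it misses on both category and AC.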

how to guess the nationality of a person from the surname?

What approach can I use to predict the nationality of a person from the surname?
I have a huge list of texts and the surnames of their authors. I would like to identify which texts have been written by Latin-language speakers and which have been written by native English speakers, in order to study whether certain writing style patterns differ between the two groups.
I have looked on Google and in PubMed for a database of surnames, but I could not find any that is freely accessible. Another approach is to use some regexes, for example ".*ez" to identify some Hispanic surnames such as 'rodriguez', but it doesn't get me very far.
Do you have any suggestion? Since I will manually revise all the associations after making the prediction, I don't need a great accuracy, but any help or idea will be welcome.
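For reference, the regex idea can be sketched like this in Python. The suffix list is an illustrative assumption, far from complete, and a match only hints at where a name may come from, not at the language the author grew up speaking.

```python
import re

# Hypothetical suffix patterns; rough hints about name origin only.
SUFFIX_HINTS = [
    (re.compile(r"ez$", re.IGNORECASE), "Spanish"),
    (re.compile(r"(ini|elli)$", re.IGNORECASE), "Italian"),
    (re.compile(r"escu$", re.IGNORECASE), "Romanian"),
]

def guess_origin(surname):
    """Return the first matching origin hint, or None if no pattern applies."""
    name = surname.strip()
    for pattern, label in SUFFIX_HINTS:
        if pattern.search(name):
            return label
    return None

for surname in ["rodriguez", "Rossini", "Popescu", "Smith"]:
    print(surname, "->", guess_origin(surname))
```

Anything that comes back as None still needs a manual look, and so does every hit, since a surname says nothing certain about its bearer.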
I don't think you can do this with any degree of reliability. A Rodriguez may well have a Spanish origin name, but could well have been born and brought up anywhere. They could be second generation British, and never have had Spanish spoken around them, and so come into the category of Native English speaker.
If these are actual authors, then maybe you can spider Amazon and check their 'Author information' details?
I don't think you can guess. Take Irish last names, for example: there are an estimated 80,000,000 people with Irish heritage, but only 4.5 million of them live in Ireland or went through Irish education.
There is no meaningful way to do this. There is no reason why people with Hispanic names cannot be native English speakers.
If you are going to revise it anyway, why not use the data you have?
Assuming you are intending on doing a programmatic comparison of the texts, you have to manually categorize the texts. Incorrect guesses would likely lead you to build a broken algorithm for textual analysis. This will be especially problematic with machine learning, such as artificial neural networks.

How to address semantic issues with tag-based web sites

Tag-based web sites often suffer from the delicacy of language such as synonyms, homonyms, etc. For programmers looking for information, say on Stack Overflow, concrete examples are:
Subversion or SVN (or svn, with case-sensitive tags)
.NET or Mono
[Will add more]
The problem is that we do want to preserve our delicacy of language and make the machine deal with it as well as possible.
A site like del.icio.us sees its tag base grow a lot, thus probably hindering usage or search. Searching for SVN-related entries will probably list a majority of entries with both subversion and svn tags, but I can think of three issues:
A search is incomplete as many entries may not have both tags (which are 'synonyms').
A search is less useful as Q/A often lead to more Qs! Notably for newbies on a given topic.
Tagging a question (note: or an answer separately, sounds useful) becomes philosophical: 'Did I Tag the Right Way?'
One way to address these issues is to create semantic links between tags, so that subversion and SVN are automatically bound by the system, not by poor users.
Is it an approach that sounds good/feasible/attractive/useful? How to implement it efficiently?
Recognizing synonyms and semantic connections is something that humans are good at; a solution to organizing an open-ended taxonomy like what SO is featuring would probably be well served by finding a way to leave the matching to humans.
One general approach: someone (or some team) reviews new tags on a daily basis. New synonyms are added to synonym groups. Searches hit synonym groups (or, more nuanced, hit either literal matches or synonym group matches according to user preference).
This requires support for synonym groups on the back end (work for the dev team). It requires a tag wrangler or ten (work for the principals or for trusted users). It doesn't require constant scaling, though: after the initial Here Comes Everybody bump of the open beta, the rate at which the total tag pool grows will in all likelihood decrease over time, as any organic lexicon's growth rate does.
Synonymy strikes me as the go-to issue. Hierarchical mapping is an ambitious and more complicated issue; it may be worth it or it may not be, but given the relative complexity of defining the hierarchy it'd probably be better left as a Phase 2 to any potential synonym project's Phase 1.
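The synonym-group idea above could be sketched roughly as follows in Python. The groups, posts, and function names are all made up for the example; the groups themselves are assumed to be curated by human reviewers.

```python
# Hypothetical, hand-curated synonym groups maintained by tag reviewers.
SYNONYM_GROUPS = [
    {"subversion", "svn"},
    {"ruby-on-rails", "rubyonrails", "rails"},
]

# Index every tag to its group so lookups take constant time.
TAG_TO_GROUP = {tag: frozenset(group) for group in SYNONYM_GROUPS for tag in group}

def expand(tag):
    """Return the synonym group for a tag, or just the tag itself if ungrouped."""
    tag = tag.lower()
    return TAG_TO_GROUP.get(tag, frozenset({tag}))

def search(posts, tag, literal=False):
    """Return titles whose tags match literally or via the synonym group."""
    wanted = {tag.lower()} if literal else expand(tag)
    return [title for title, tags in posts if wanted & {t.lower() for t in tags}]

POSTS = [
    ("How do I merge a branch?", ["SVN"]),
    ("Repository layout advice", ["subversion"]),
    ("Generics question", ["java"]),
]

print(search(POSTS, "svn"))                # both version-control posts
print(search(POSTS, "svn", literal=True))  # only the post tagged SVN
```

The literal flag corresponds to the user preference mentioned above: match the exact tag, or match anything in its synonym group.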
The way the software on blogspot.com is set up, is that there is an ajax-autocomplete-thingie on the box where you write the name of the tags. This searches all your previous posts for tags that start with the same letters. At least that way you catch different casings and spellings (but not synonyms).
How would the system know which tags to semantically link? Would it keep an ever-growing map of tags? I can't see that working. What if someone typed sbversion instead? How would that get linked?
I think that asking the user when they submit tags could work. For example: "You've entered the following tags: sbversion, pascal and bindings. Did you mean Subversion, Pascal and Bindings?"
Obviously the system would have to have a fairly smart matching system for that to work. Doing it this way would be extra input for the user (which'd probably annoy them) but the human input would, if done correctly, make for fewer duplicate tags.
In fact, having said all that, the system could use the results of the user's input as a basis for automatic tag matching. From the previous example, someone creates a tag of "sbversion" and when prompted changes it to "Subversion" - the system could learn that and do it automatically next time.
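A rough Python sketch of that suggest-then-learn loop, using difflib from the standard library for the fuzzy match. The tag list, the 0.8 cutoff, and the function names are assumptions for illustration.

```python
import difflib

# Hypothetical list of tags that already exist on the site.
KNOWN_TAGS = ["subversion", "pascal", "bindings", "svn"]

# Corrections that users confirmed earlier: typed tag -> accepted tag.
learned = {}

def suggest(tag):
    """Return an existing tag to offer for this input, or None for a new tag."""
    tag = tag.lower()
    if tag in KNOWN_TAGS:
        return tag
    if tag in learned:
        return learned[tag]
    matches = difflib.get_close_matches(tag, KNOWN_TAGS, n=1, cutoff=0.8)
    return matches[0] if matches else None

def confirm(typed, accepted):
    """Remember that a user replaced `typed` with `accepted` when prompted."""
    learned[typed.lower()] = accepted.lower()

print(suggest("sbversion"))   # close enough to "subversion" to be suggested
print(suggest("subv"))        # too far from anything known, so None
confirm("subv", "subversion")
print(suggest("subv"))        # now resolved from the learned corrections
```

The learned map is where the human input pays off: abbreviations that no string-similarity measure would catch get resolved once someone has confirmed them.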
Part of the issue you're looking at is that English is rife with synonyms - are the following different: build-management, subversion, cvs, source-control?
Maybe, maybe not. Having a system, like the one [now] in use on SO that brings up the tag you probably meant is extremely helpful. But it doesn't stop people from bulling-through the tagging process.
Maybe you could refuse to accept "new" tags without a user-interaction? Before you let 'sbversion' go in, force a spelling check?
This is definitely an interesting problem. I asked an open question similar to this on my blog last year. A couple of the responses were quite insightful.
I completely agree about the mass of tags that we have currently. I don't participate in other tag-based sites. However, having a hierarchy of tags would be very helpful, instead of ruby, rails, ruby-on-rails, rubyonrails, etc.
Tags are basically our admission that search algorithms aren't up to snuff. If we can get a computer to be smart enough to identify that things tagged "Subversion" have similar content to things tagged "svn", presumably we can parse the contents, so why not skip tags altogether, and match a search term directly to the content (i.e., autotagging, which is basically mapping keywords to results)?!
The problem is to make the search engine use the fact that 'subversion' and 'svn' are very similar to the point that they mean the same 'thing'.
It might be attractive to compute a simple similarity between tags based on frequency: 'subversion' and 'svn' appear very often together, so requesting 'svn' would return SVN-related questions, but also the rare questions only tagged 'subversion' (and vice versa). However, 'java' and 'c#' also appear often together, but for very different reasons (they are not synonyms). So similarity based on frequency is out.
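To see why, here is a small Python sketch that scores tag pairs by co-occurrence (Jaccard similarity: questions carrying both tags divided by questions carrying either). The question data is invented purely to illustrate the point.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag sets, one per question.
QUESTIONS = [
    {"subversion", "svn"},
    {"subversion", "svn", "merge"},
    {"svn"},
    {"java", "c#"},
    {"java", "c#", "generics"},
    {"java"},
]

def jaccard_by_pair(questions):
    """Score each co-occurring tag pair: |both| / |either|."""
    tag_counts = Counter(tag for tags in questions for tag in tags)
    pair_counts = Counter(
        frozenset(pair)
        for tags in questions
        for pair in combinations(sorted(tags), 2)
    )
    return {
        pair: both / (sum(tag_counts[tag] for tag in pair) - both)
        for pair, both in pair_counts.items()
    }

scores = jaccard_by_pair(QUESTIONS)
print(scores[frozenset({"subversion", "svn"})])
print(scores[frozenset({"java", "c#"})])
```

On this data the two pairs get exactly the same score, even though only one of them is a pair of synonyms, so co-occurrence alone cannot separate the two cases.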
An answer to this problem might be a mix of mechanisms, as the ones suggested in this Q/A thread:
Filtering out typos by suggesting tags when the user inputs them.
Maintaining a user-generated map of synonyms. This map may not be that big if it just targets synonyms.
Allowing multi-tag search, such that the user can put 'subversion svn' or 'subversion && svn' (well, from programmers to programmers) in the search box and get both. This would be quite practical as many users may actually try such approach when they do not know which term is the most meaningful.
#Nick: Agreed. The question is not meant to argue against tags. Tags have great potential, but users will face a growing issue if one cannot search 'across' tags.
#Steve: Maintaining an ever-growing map of tags is definitely not practical. As SO is accumulating an ever-growing bag of tags, how could we shed some light on this bag to make search of Q/A tags even more useful, in a convenient way?
#Espo: 'Ajax-powered' tag suggestions based on existing tags is apparently available on SO when creating a question. This is by the way very helpful to choose tags and appropriate spelling (avoiding the 'subversion' vs. 'sbversion' issue from Steve).