Identify groups that differ statistically in gtsummary

I just recently discovered gtsummary and I'm impressed by the ease of use and the amount of work that our team won't have to do creating summary tables of our results. Thanks!
My question: With more than two groups, the p-value included with add_p() refers to the global test. How can I obtain information about the post hoc comparisons between groups? In some journals, we see the use of superscripted letters.
I looked at the add_difference() option, but it only reports the difference between two groups. I also wondered whether add_q() would help me.
Thanks in advance!

There is no built-in function for pairwise differences, but this post illustrates how to incorporate them:
How to calculate SMD between 3 groups or more?

I have tried to summarize the what, how, and why (not) of the "Compact Letter Display" you are talking about in this chapter: https://schmidtpaul.github.io/dsfair_quarto/summaryarticles/compactletterdisplay.html
Does that help?

Related

Classification of gender for given names

After some research, I have not yet been able to find a suitable open-source library or piece of software that I can use to classify a long table of first names by most likely gender.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize, I could get only 1/8 of the names classified, while the rest were labeled as Unknown (I made sure that the format was correct, with no lower/upper-case ambiguity, etc.).
Any advice would be appreciated. Thank you in advance!
For the record, the best I could do was really just to look names up manually on Google or on dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case, mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names of my own nationality)
Many names were truncated to form nicknames or had two adjacent letters swapped: here a lookup-table approach would fail, and one would instead need a score from a model
Several names were not based on the Roman alphabet, and I suspect the mapping into Roman characters introduced some ambiguities
For those curious of the original dataset, this is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here does.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way, I have multiple spellings and permutations of a parent-level variable (fruits or vegetables in this example), and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records with multiple spellings and shortened spellings (e.g. apple, appls), and I want to group them together in some way, in this example by fruits and vegetables.
Clustering would be challenging, since each entry is most often one or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table created by a human (not likely with millions of rows), is there any approach I can take with this problem?
You will first need to solve the spelling problem, unless you have Google-scale data that would let you learn spelling correction from Google-scale statistics.
Then you will still have the problem that "Apple" could be a fruit or a computer, and that Apple and "Granny Smith" will look completely different. Your best bet at this second stage is something like word2vec trained on massive data. Then you get high-dimensional word vectors and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.
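As a rough illustration of the spelling stage on a small scale, here is a minimal sketch using only the Python standard library; the canonical vocabulary, the "veg" alias, and the fuzzy-match cutoff are assumptions for illustration, not something the question provides. Word2vec-style embeddings would only come in after this kind of cleanup.

```python
import difflib

# Hypothetical canonical vocabulary mapping clean spellings to parent categories
# (an assumption for illustration; in practice this small list would be curated by hand).
CANONICAL = {
    "apple": "fruits", "orange": "fruits", "pear": "fruits", "fruit": "fruits",
    "cauliflower": "vegetables", "broccoli": "vegetables", "leek": "vegetables",
    "veg": "vegetables", "vegetable": "vegetables",
}

def parent_category(raw_type: str) -> str:
    """Map a noisy 'type' value to a parent category, or 'unknown'."""
    token = raw_type.strip().lower().rstrip("s")  # crude singularisation
    # Fuzzy-match against the canonical spellings to absorb typos like 'brocli'.
    match = difflib.get_close_matches(token, list(CANONICAL), n=1, cutoff=0.7)
    return CANONICAL[match[0]] if match else "unknown"

for value in ["aple", "appls", "ornge", "orange z", "colifower", "brocli", "veg"]:
    print(value, "->", parent_category(value))
```

Anything that falls through to "unknown" is a candidate for the later embedding-and-clustering stage rather than the lookup stage.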

Tool or technique to compare and group diffs by similarity

I have developed a system that allows visitors to submit typo corrections for my blog. It works by having a small client-side app which then sends unified diffs to a server. Behind that, I have an interface which allows me to see all diffs in a nice graphical way, sort them, etc.
However, I expect that as time passes, many visitors will submit corrections for the same things before I have time to fix them, so I will need a way to group similar or identical diffs together.
Identical diffs are easy enough. But there might be people who fix errors differently, e.g. using American or British spellings, different rules for punctuation, varying understandings of unclear phrases, that kind of thing. Grouping similar diffs would be tremendously helpful.
Are there techniques, algorithms, or tools that are specifically designed or can be used to compute the similarity of diffs?
I believe you have two problems to solve: 1. recognizing fixes for the same text (e.g. the same typo location); 2. potentially removing those with the same or nearly equal solutions, or at least grouping together all the patches that relate to that location.
Problem 1: The unified diff format is somewhat OK since it gives you the lines, but a word-level or character-level diff (for example, counting each word as a line, as wdiff does) might be more precise and help you group the patches more accurately.
Problem 2: If the patches are identical, it is trivial, as you noted; if they differ, solving problem 1 has already done much of the work. You can of course apply normalization, such as stripping inflected word endings (removing 's', 'ing', and so on) or lower-casing, before comparing the replacement parts of the unified diffs, thus helping to group nearly identical solutions together.
Problem 1 is the problem posed by the integration or merging of patches; problem 2 is more relevant to your particular case.
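A minimal sketch of that grouping idea, assuming each submission arrives as a plain unified-diff string; the hunk parsing and the normalization rules here are deliberately simplified assumptions.

```python
import re
from collections import defaultdict

# Matches a unified-diff hunk header such as "@@ -12,3 +12,3 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")

def normalize(lines):
    """Lower-case and strip punctuation so near-identical fixes collapse together."""
    text = " ".join(lines).lower()
    return re.sub(r"[^\w\s]", "", text).strip()

def hunk_keys(unified_diff: str):
    """Yield one (start_line, normalized_replacement) key per hunk of a unified diff."""
    start, added = None, []
    for line in unified_diff.splitlines():
        m = HUNK_RE.match(line)
        if m:
            if start is not None:
                yield start, normalize(added)
            start, added = int(m.group(1)), []
        elif line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
    if start is not None:
        yield start, normalize(added)

def group_patches(patches):
    """Group raw diff strings by (location, normalized replacement text)."""
    groups = defaultdict(list)
    for patch in patches:
        for key in hunk_keys(patch):
            groups[key].append(patch)
    return groups
```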
Maybe you could adopt the Damerau-Levenshtein algorithm; it is used to calculate the edit distance between two strings.
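For the "nearly identical" case, here is a small sketch of the Damerau-Levenshtein idea (the restricted, optimal-string-alignment variant), which you could apply to the normalized replacement text of two patches:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("recieve", "receive"))  # 1: a single transposition
```

Two patches whose normalized replacements are within a small distance of each other can then be put in the same group.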

How to guess the nationality of a person from the surname?

What approach can I use to predict the nationality of a person from the surname?
I have a huge list of texts and authors' surnames. I would like to identify which texts were written by Latin-language speakers and which were written by native English speakers, in order to study whether certain writing-style patterns differ between the two groups.
I have looked on Google and PubMed for a database of surnames, but I could not find any that is freely accessible. Another approach is to use some regexes, for example ".*ez" to identify Hispanic surnames such as 'rodriguez', but that doesn't get me very far.
Do you have any suggestions? Since I will manually revise all the associations after making the prediction, I don't need great accuracy, but any help or idea is welcome.
I don't think you can do this with any degree of reliability. A Rodriguez may well have a name of Spanish origin, but could have been born and brought up anywhere. They could be second-generation British, never have had Spanish spoken around them, and so fall into the category of native English speakers.
If they are actual published authors, then maybe you can spider Amazon and check their 'Author information' details?
I don't think you can guess. Take Irish last names, for example: there are an estimated 80,000,000 people with Irish heritage, yet only about 4.5 million of them live in Ireland or went through Irish education.
There is no meaningful way to do this. There is no reason why people with Hispanic names cannot be native English speakers.
If you are going to revise it anyway, why not use the data you have?
Assuming you intend to do a programmatic comparison of the texts, you will have to categorize the texts manually. Incorrect guesses would likely lead you to build a broken algorithm for the textual analysis. This will be especially problematic with machine learning methods such as artificial neural networks.

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter common non-descriptive words out like "them", "I", "this", "these", "their" etc.
Once all the common words were filtered out, the only thing left was the tag-worthy words.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by a space. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first and last name together.
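For concreteness, a minimal sketch of the approach described above, with bigram counting added so that names like "John Doe" survive as one unit; the tiny stop-word set is a placeholder, not a serious list:

```python
import re
from collections import Counter

# A tiny hand-rolled stop-word list; a real one would be much longer.
STOPWORDS = {"the", "i", "this", "these", "their", "them", "a", "an",
             "of", "and", "to", "in", "is", "was", "that"}

def candidate_tags(text: str, top_k: int = 10):
    """Count single words and adjacent word pairs (bigrams), so that names
    like 'john doe' are counted as one unit instead of collapsing to 'john'."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    counts.update(f"{a} {b}" for a, b in zip(words, words[1:])
                  if a not in STOPWORDS and b not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_k)]
```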
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of the "words split by a space" problem. Toolkits like NLTK have built-in functions for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then test how the algorithm tags the remaining articles, to see how well it works.
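A rough sketch along those lines with NLTK; the download names and the choice of the Porter stemmer are assumptions on my part, so adjust them to your NLTK version:

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

# One-off downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("stopwords")
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def extract_terms(article_text: str):
    """Tokenize, drop stop words and punctuation, stem, then count unigrams and bigrams."""
    tokens = [t.lower() for t in nltk.word_tokenize(article_text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP]
    stems = [stemmer.stem(t) for t in tokens]
    counts = Counter(stems)
    counts.update(" ".join(g) for g in ngrams(stems, 2))
    return counts.most_common(15)
```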
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
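For reference, a minimal, self-contained sketch of exactly that scoring on a toy corpus; real use would plug in the tokenization and stemming discussed above rather than the crude regex used here:

```python
import math
import re
from collections import Counter

def tfidf_tags(documents, k=3):
    """Score terms by tf(t, D) / log(df(t) + 1) and return the top-k terms per document."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    df = Counter()                      # number of documents each term appears in
    for tokens in tokenized:
        df.update(set(tokens))
    tags = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] / math.log(df[t] + 1) for t in tf}
        tags.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return tags

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply today"]
print(tfidf_tags(docs))
```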
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I am might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
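If you decide to try LDA, a rough sketch with the gensim library looks something like the following; the toy corpus, the number of topics, and the pass count are arbitrary placeholders, and this is not a claim that gensim is the best of the available implementations:

```python
from gensim import corpora, models

# Toy corpus: in practice, feed in your tokenized, stop-word-filtered articles.
texts = [
    ["mavericks", "win", "playoff", "basketball", "game"],
    ["senate", "votes", "presidential", "budget", "bill"],
    ["economy", "slows", "ahead", "presidential", "election"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Learn a small topic model; num_topics is something you would tune.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Fractional membership: each article gets a distribution over topics.
for bow in corpus:
    print(lda.get_document_topics(bow))

# Inspect the learned 'classes' (note that they will not have human-friendly names).
print(lda.print_topics())
```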
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging, and if the count of instances of a word/phrase is greater than a threshold (probably based on the length of the article), then include the tag.
Use a part-of-speech tagging algorithm to help reduce the article to a sensible set of phrases, and use a sensible method to extract tags from it. Once you have reduced the articles using such an algorithm, you should be able to identify some good candidate words/phrases to add to your keyword/phrase list for method 1.
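A rough sketch of combining the two methods with NLTK's part-of-speech tagger; the keyword list, the threshold, and the noun-only filter are placeholder assumptions for illustration:

```python
from collections import Counter

import nltk

# One-off downloads (names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Hypothetical curated keyword list for method 1.
KNOWN_TAGS = {"economy", "election", "basketball", "senate"}

def tags_for(article_text: str, threshold: int = 3):
    tokens = nltk.word_tokenize(article_text.lower())
    counts = Counter(tokens)
    # Method 1: known keywords that occur often enough in the article.
    tags = {t for t in KNOWN_TAGS if counts[t] >= threshold}
    # Method 2: frequent nouns found by the part-of-speech tagger become candidates.
    nouns = [w for w, pos in nltk.pos_tag(tokens) if pos.startswith("NN")]
    tags.update(w for w, c in Counter(nouns).most_common(5) if c >= threshold)
    return tags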
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the blog article above, I list the latest research papers to illustrate the solutions. Some of them even include a demo site and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
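Roughly, the percolator reverses the usual search: you index one query per tag, then "search" with the document and collect the tags of the stored queries that match. A sketch against the REST API; the index name, field names, and the example query are all made up for illustration:

```python
import requests

ES = "http://localhost:9200"

# 1. Create an index whose documents are queries (one per tag).
requests.put(f"{ES}/article-tags", json={
    "mappings": {"properties": {
        "body": {"type": "text"},
        "query": {"type": "percolator"},
    }}
})

# 2. Register one query per pre-defined tag.
requests.post(f"{ES}/article-tags/_doc?refresh=true", json={
    "tag": "sports",
    "query": {"match": {"body": "basketball playoff game goal"}},
})

# 3. Percolate an article: every stored query that matches yields a tag.
resp = requests.post(f"{ES}/article-tags/_search", json={
    "query": {"percolate": {
        "field": "query",
        "document": {"body": "The Mavericks won the playoff game last night"},
    }}
})
print([hit["_source"]["tag"] for hit in resp.json()["hits"]["hits"]])
```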
Are you talking about named-entity recognition? If so, Anupam Jain is right: it is a research problem, typically tackled with deep learning and CRFs. As of 2017, work on named-entity recognition focuses on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below is about key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf