Likelihood Ratio Test results in a gtsummary table - gtsummary

I am looking to make a table in gtsummary which displays the LRT results of a sequence of nested models:
library(survival)
model1 <- coxph(Surv(tte, dead) ~ histology + modality, data)
model2 <- coxph(Surv(tte, dead) ~ histology + modality + age_cat + group, data)
model3 <- coxph(Surv(tte, dead) ~ histology + modality + age_cat + group + gender, data)
anova(model1, model2, model3) outputs:
Analysis of Deviance Table
Cox model: response is Surv(tte, dead)
Model 1: ~ histology + modality
Model 2: ~ histology + modality + age_cat + group
Model 3: ~ histology + modality + age_cat + group + gender
   loglik   Chisq Df P(>|Chi|)
1 -270.53
2 -257.44 26.1794  6 0.0002061 ***
3 -256.99  0.9107  1 0.3399197
How do I get this table in a nice format for a report?


How to implement a Slope graph in R for two variables

I'm analyzing how many users have used a particular hashtag and how they have contributed to the total number of tweets. My results are:
Data:
20.68% of tweets related to #HashtagX are created by 20 users. Now, these 20 users only represent 0.001% of the total of 14,432 users who have ever used the hashtag #HashtagX.
What happens if we take the top 100 users by number of tweets? 44% of tweets are created by the top 100 users.
If we extend to the top 500 users by number of tweets, we see that 72% of tweets are created by the top 500.
I am wondering how to implement a slope graph, because I think it is a good way to show the relationship between both variables, but it is not a default graph provided by any library.
One of the ways to show the relationship between both variables ("Users" vs "Tweets") is a Slope Chart.
Visualization obtained (solved graph for the question):
[slope chart image]
1) Libraries
library(ggplot2)
library(scales)
library(ggrepel)
theme_set(theme_classic())
2) Data example
Country = c('20 accounts', '50 accounts', '100 accounts','200 accounts','300 accounts',
'500 accounts','1000 accounts','14.443 accounts')
January = c(0.14, 0.34, 0.69,1.38,2.07,3.46,6.92,100)
April = c(20.68, 33.61, 44.94, 57.49,64.11,72,80,100)
Tweets_N = c(26797, 43547, 58211, 74472,83052,93259,103898,129529)
a = data.frame(Country, January, April)
left_label <- paste(a$Country, paste0(a$January,"%"),sep=" | ")
right_label <- paste(paste0(round(a$April),"%"),paste0(Tweets_N," tweets"),sep=" | ")
a$color_class <- "green"
3) Plot
p <- ggplot(a) +
  geom_segment(aes(x = 1, xend = 2, y = January, yend = April, col = color_class),
               size = .25, show.legend = F) +
  geom_vline(xintercept = 1, linetype = "dashed", size = .1) +
  geom_vline(xintercept = 2, linetype = "dashed", size = .1) +
  scale_color_manual(labels = c("Up", "Down"),
                     values = c("blue", "red")) +
  labs(x = "", y = "Percentage") +
  xlim(.5, 2.5) + ylim(0, 1.1 * max(a$January, a$April)) # X and Y axis limits
# Add texts
p <- p + geom_text_repel(label = left_label, y = a$January, x = rep(1, NROW(a)), hjust = 1.1, size = 3.5, direction = "y")
p <- p + geom_text(label = right_label, y = a$April, x = rep(2, NROW(a)), hjust = -0.1, size = 3.5)
# Column titles
p <- p + geom_text(label = "Accounts", x = 1, y = 1.1 * max(a$January, a$April), hjust = 1.2, size = 4, check_overlap = TRUE)
p <- p + geom_text(label = "Tweets (% of Total)", x = 2, y = 1.1 * max(a$January, a$April), hjust = -0.1, size = 4, check_overlap = TRUE)
# Minify theme
p + theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.text.x = element_blank(),
          panel.border = element_blank(),
          plot.margin = unit(c(1, 2, 1, 2), "cm"))

Spark GraphX : Filtering by passing a vertex value in triplet

I am using Spark 2.1.0 on Windows 10. Since I am new to Spark, I am following this tutorial.
In the tutorial, the author prints all the triplets of the graph using the following code:
graph.triplets.sortBy(_.attr, ascending=false).map(triplet =>
"There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(10)
Problem: I would like to give an input ("ATL", for example) and see all the outbound flights from ATL and their counts, as shown below:
res60: Array[String] = Array(There were 1388 flights from ATL to LAX.,
There were 1330 flights from ATL to SFO., There were 1283 flights from ATL to HNL.,
There were 1205 flights from ATL to BOS., There were 1229 flights from ATL to LGA.,
There were 1214 flights from ATL to OGG., There were 1173 flights from ATL to LAS.,
There were 1157 flights from ATL to SAN.)
The following is the code:
import scala.util.hashing.MurmurHash3
// Selecting the desired airport
val input = "ATL"
// Filtering the edges of the desired airport (here "ATL") from the `graph` (which is built on the full data);
// the input is hashed with MurmurHash3 because the vertex IDs were built from the same hash
val TEMPEdge = graph.edges.filter { case Edge(src, dst, prop) => src == MurmurHash3.stringHash(input) }
// Creating a new graph with the filtered edges
val TEMPGraph = Graph(airportVertices, TEMPEdge, defaultAirport)
// Printing the top 10
TEMPGraph.triplets.sortBy(_.attr, ascending=false).map(triplet => "There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + "\n").take(10)
Or, we can filter the triplets directly (on srcAttr, since we want the outbound flights from the input airport):
graph.triplets.sortBy(_.attr, ascending=false).filter { _.srcAttr == input }.map(triplet => "There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + "\n").take(3)

Spark UDF optimization for Graph Database (Neo4j) inserts

This is the first issue I am posting, so apologies if I miss some info or the formatting is mediocre. I can update if required.
I will try to add as many details as possible. I have a not-so-optimized Spark job which converts RDBMS data to graph nodes and relations in Neo4j.
Here are the steps I follow:
Create a denormalized DataFrame 'data' with Spark SQL and joins.
For each row in 'data', run a graphInsert function which does the following:
a. Read the contents of the row.
b. Formulate a Neo4j Cypher query (we use the MERGE command so that only one City, e.g. Chicago, is created in Neo4j even when Chicago is present in multiple rows of the RDBMS table).
c. Connect to Neo4j.
d. Execute the query.
e. Disconnect from Neo4j.
Here is the list of problems I am facing:
Inserts are slow.
I know a MERGE query is slower than CREATE, but is there another way to do this instead of connecting and disconnecting for every record? This was my first draft of the code, and maybe I am struggling with how to use one connection to insert from multiple threads on different Spark worker nodes. Hence connecting and disconnecting for every record.
The job is not scalable. It only runs fine with 1 core. As soon as I run the job with 2 Spark cores, I suddenly get 2 cities with the same name, even though I am running MERGE queries. E.g. there are 2 Chicago cities, which violates the use of MERGE. I am assuming that MERGE functions something like "create if not exists".
I don't know if my implementation is wrong in the Neo4j part or the Spark part. If anyone can direct me to documentation that helps me implement this at a better scale, it would be helpful, as I have a big Spark cluster which I need to utilize at full potential for this job.
If you are interested in looking at the code instead of the algorithm, here is the graphInsert implementation in Scala:
class GraphInsert extends Serializable{
var case_attributes = new Array[String](4)
var city_attributes = new Array[String](2)
var location_attributes = new Array[String](20)
var incident_attributes = new Array[String](20)
val prop = new Properties()
prop.load(getClass().getResourceAsStream("/GraphInsertConnection.properties"))
// properties Neo4j
val url_neo4j = prop.getProperty("url_neo4j")
val neo4j_user = prop.getProperty("neo4j_user")
val neo4j_password = prop.getProperty("neo4j_password")
def graphInsert(data : Row){
val query = "MERGE (d:CITY {name:city_attributes(0)})\n" +"MERGE (a:CASE { " + case_attributes(0) + ":'" +data(11) + "'," +case_attributes(1) + ":'" +data(13) + "'," +case_attributes(2) + ":'" +data(14) +"'}) \n" +"MERGE (b:INCIDENT { " + incident_attributes(0) + ":" +data(0) + "," +incident_attributes(1) + ":" +data(2) + "," +incident_attributes(2) + ":'" +data(3) + "'," +incident_attributes(3) + ":'" +data(8)+ "'," +incident_attributes(4) + ":" +data(5) + "," +incident_attributes(5) + ":'" +data(4) + "'," +incident_attributes(6) + ":'" +data(6) + "'," +incident_attributes(7) + ":'" +data(1) + "'," +incident_attributes(8) + ":" +data(7)+"}) \n" +"MERGE (c:LOCATION { " + location_attributes(0) + ":" +data(9) + "," +location_attributes(1) + ":" +data(10) + "," +location_attributes(2) + ":'" +data(19) + "'," +location_attributes(3) + ":'" +data(20)+ "'," +location_attributes(4) + ":" +data(18) + "," +location_attributes(5) + ":" +data(21) + "," +location_attributes(6) + ":'" +data(17) + "'," +location_attributes(7) + ":" +data(22) + "," +location_attributes(8) + ":" +data(23)+"}) \n" +"MERGE (a) - [r1:"+relation_case_incident+"]->(b)-[r2:"+relation_incident_location+"]->(c)-[r3:belongs_to]->(d);"
println(query)
try{
var con = DriverManager.getConnection(url_neo4j, neo4j_user, neo4j_password)
var stmt = con.createStatement()
var rs = stmt.executeQuery(query)
con.close()
}catch{
case ex: SQLException =>{
println(ex.getMessage)
}
}
}
def operations(sqlContext: SQLContext){
....
// Get 'data' before this step
city_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_city").map(x =>x.getString(5)).collect()
case_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_case_number").map(x =>x.getString(5)).collect()
location_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_location").map(x =>x.getString(5)).collect()
incident_attributes= entity_metadata.filter(entity_metadata("source_name") === "tb_incident").map(x =>x.getString(5)).collect()
data.foreach(graphInsert)
}
}
object GraphObject {
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("GraphNeo4j")
.setMaster("xyz")
.set("spark.cores.max","2")
.set("spark.executor.memory","10g")
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val graph = new GraphInsert()
graph.operations(sqlContext)
}
}
Whatever you write inside the closure, i.e. whatever needs to be executed on the workers, gets distributed to them.
You can read more about it here: http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
And as you increase the number of cores, I think it should not affect the application, because if you do not specify it, it takes the greedy approach anyway. I hope this document helps.
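As a rough illustration of the closure point (this mirrors the counter example from the guide linked above; it is a generic sketch, not part of the original job):
// Everything referenced inside the function passed to foreach is serialized
// and shipped to the executors, so it runs there rather than on the driver.
var counter = 0                // lives on the driver
data.foreach { row =>
  counter += 1                 // updates a serialized copy on an executor
  println(row)                 // appears in the executor's stdout, not the driver console
}
println(counter)               // on a cluster this typically still prints 0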
I am done improving the process, but nothing could make it as fast as the LOAD command in Cypher.
Hope this helps someone though:
Using foreachPartition instead of foreach gives a significant gain while doing such a process. Also, add periodic commit using Cypher. A rough sketch of the foreachPartition idea follows.
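A minimal sketch of that foreachPartition change, reusing the connection fields from the GraphInsert class above; buildMergeQuery is a hypothetical helper standing in for the MERGE string built in graphInsert:
data.foreachPartition { rows =>
  // One connection and one statement per partition instead of one per row
  val con = DriverManager.getConnection(url_neo4j, neo4j_user, neo4j_password)
  val stmt = con.createStatement()
  try {
    rows.foreach { row =>
      stmt.executeQuery(buildMergeQuery(row)) // hypothetical helper wrapping the MERGE query above
    }
  } finally {
    stmt.close()
    con.close()
  }
}
This removes the per-row connect/disconnect overhead described in the question; the MERGE statements themselves are unchanged.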

Simplifying boolean algebra (XOR)

I know how to convert first and second term to the first term of the simplified expression, but I don't know how to convert the rest.
By simplifying, I can get rid of A_bar in the third term and A in the fifth term and get B*C_bar.
How is it that B*C_bar + the fourth term becomes XOR(B,C)?
The two expressions are clearly the same. This can be easily proven by truth tables.
The first one is: [truth table for the first expression]
And the second one: [truth table for the second expression]
However, this does not fully answer your question.
B*C_bar + the fourth term becomes XOR(B,C)
This is clearly true if A is true, since by definition B XOR C = B_bar*C + B*C_bar.
If A is false, these terms are always false, and you cannot simplify these two to B XOR C! They are not equal!
Note: Tables generated with http://web.stanford.edu/class/cs103/tools/truth-table-tool/
Note 2: ∧ = AND, ∨ = OR, ¬ = NOT
Let's play a game.
Let a = not(A), b = not(B), c = not(C), and let * denote XOR.
Y = ab + (B*C)
Y = ab + Bc + bC
Y = ab(1) + Bc(1) + bC(1)
Y = ab(c+C) + Bc(a+A) + bC(a+A)
Y = abc + abC + Bca + BcA + bCa + bCA
Y = abc + abC + aBc + ABc + abC + AbC
Y = abc + abC + aBc + ABc + AbC   (the duplicate abC is dropped, since p + p = p)
That is the first equation.

How to solve this boolean algebra expression

I would like help simplifying this boolean algebra expression:
B*C + ~A*~B + ~A*~C => A*B*C + ~A
I need to know the steps of how to simplify it to ABC + ~A.
'*' indicates "AND"
'+' indicates "OR"
"~A" indicates "A NOT"
Any help would be appreciated!
Thank you!
For a better view, I'll omit * for conjunction and use ' for negation.
First, expand the two-literal terms B*C, A'*B' and A'*C' so that every product contains all three variables:
1) (A + A')BC + A'B'(C + C') + A'(B + B')C'
now distribute the parentheses.
2) ABC + A'BC + A'B'C + A'B'C' + A'BC' + A'B'C'
The fourth term and the last term are the same, A'B'C', so drop one of them, since p + p = p (and more generally p + p + ... + p = p, which can also be used in the other direction to duplicate a term when a situation calls for it).
3) Now, let's look for common factors. Take the 2nd and 5th terms, A'BC and A'BC', and factor: A'B(C + C') => A'B.
Do the same for the 3rd and 4th terms, A'B'C and A'B'C': A'B'(C + C') => A'B', since X + X' = 1.
Now we have:
ABC + A'B + A'B'
4) Factor again, from the 2nd and 3rd terms: A'B + A'B' = A'(B + B') = A'.
There you have it:
BC + A'B' + A'C' => ABC + A'
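As a quick sanity check, both sides agree on every row of the truth table (1 = true, 0 = false):
\begin{array}{ccc|c|c}
A & B & C & BC + A'B' + A'C' & ABC + A' \\ \hline
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1
\end{array}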