Filtering Data From Scraped Tweets Using rtweet Package

meta_mueller <- search_tweets("mueller", n = 250000, retryonratelimit = TRUE)
The data frame has a column "geo_coords". On a visual scan, most of its values are c(NA, NA).
I have dplyr installed (other packages are fine, too) and I want to identify any rows whose value is not c(NA, NA).
filter(!is.na(meta_mueller(geo_coords))
This did not work.

Solution:
meta_mueller_location = select(meta_mueller, place_full_name)
meta_mueller_location_filter = filter(meta_mueller_location,
place_full_name != "NA")
Instead of geo_coords, I ran the filter on the "place_full_name" column, whose missing values are a single NA rather than c(NA, NA). This was a better fit for my needs.

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file, I came up with the two approaches below.
Approach1:
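def keywordExists(line):
    # Return 1 if the line contains the keyword, otherwise 0.
    if line.find("my_keyword") > -1:
        return 1
    return 0
lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
# reduce needs a two-argument function; the built-in sum does not fit here.
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")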
Approach2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first approach maps and then reduces, whereas the second filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can stop as soon as one occurrence of the keyword is found.
See also: Spark: Efficient way to test if an RDD is empty
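The difference between counting every match and stopping at the first one can be illustrated without a cluster. The sketch below is plain Python, not Spark: `counting_lines`, `found_by_count` and `found_by_first_match` are hypothetical names, and an in-memory list stands in for the text file. It only shows the general idea of short-circuiting, not how Spark schedules work across partitions.

```python
def counting_lines(lines, counter):
    """Yield lines one by one, recording how many were actually read."""
    for line in lines:
        counter["read"] += 1
        yield line


def found_by_count(lines, keyword):
    # Analogous to filter(...).count > 0: every line is examined.
    return sum(1 for line in lines if keyword in line) > 0


def found_by_first_match(lines, keyword):
    # Analogous to !filter(...).isEmpty(): stops at the first matching line.
    return any(keyword in line for line in lines)


data = ["alpha", "my_keyword here", "beta", "gamma"]

count_stats = {"read": 0}
count_result = found_by_count(counting_lines(data, count_stats), "my_keyword")

first_stats = {"read": 0}
first_result = found_by_first_match(counting_lines(data, first_stats), "my_keyword")

print(count_result, count_stats["read"])
print(first_result, first_stats["read"])
```

Both functions give the same answer, but the second one reads fewer lines whenever a match appears before the end of the input.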

Spark, Scala, Databricks, combine and add columns

I am using Spark/Scala to attempt a "simple" query. I have a file which, after line 1 of the code below runs, looks like this:
EmpReg,EmpOT,RegPay,OTPay
Alice,Alice,400,20
Bob,Bob,300,0
Carol,Carol,450,120
Dan,Dan,400,200
Ellen,Ellen,360,40
The first and third columns (EmpReg, RegPay) come from one source and the second and fourth columns (EmpOT, OTPay) come from a second source. My objective is output that looks like this.
Emp,Pay
Alice,420
Bob,300
Carol,570
Dan,600
Ellen,400
Here is the code that I have been trying, at least what I have saved.
var q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
//q2 = q2.select("EmpReg", ($"RegPay" + $"OTPay"))
//q2 = q2.groupBy($"EmpReg".sum($"RegPay" + $"OTPay"))
var add = q2.select(($"RegPay" + $"OTPay"))
//q2 = q2.sum("RegPay", "OTPay")
//q2 = q2.groupBy("EmpReg", "EmpOT")
//var q2 = q.join(q1).where("EmpReg") === "EmpOT"))
//q2 = q2.select("EmpReg").sum("RegPay", "OTPay")
//q2.show
add.show
[q] is the first file, which represents regular pay. [q1] is the second file, which represents overtime pay. [q2] is the combination shown in the first example above. The primary keys are [EmpReg] and [EmpOT]. I don't really need to combine [EmpReg] and [EmpOT] since they are the same, and it doesn't make any difference which one I use.
I really need to add [RegPay] and [OTPay] to get [Pay], but for the life of me I can't get it to work. The commented-out lines return various errors. I can add the two pay columns, and I can select an appropriate employee column, but I can't seem to do both in one query. I am constrained to use Scala on Databricks. Otherwise, I might do something like this.
select q.EmpReg as Emp, (q.RegPay + q1.OTPay) as Pay
from q join q1 on q.EmpReg = q1.EmpOT
(Why can't things ever be simple?)
You can use a similar approach as in your SQL query:
val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(q("EmpReg").as("Emp"), (q("RegPay") + q1("OTPay")).as("Pay"))
Your code has this line
q2.select("EmpReg", ($"RegPay" + $"OTPay"))
which should work if you add $ before "EmpReg". In Scala you can't mix strings and columns in the same select call; this works in Python but not in Scala.
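One thing to watch with a full outer join: an employee present in only one source gets a null in the other pay column, and in Spark SQL adding a number to null gives null. The sample data has every employee on both sides, so it does not show up here. The sketch below is plain Python over dictionaries, not Spark code; `total_pay` is a hypothetical helper, and treating a missing side as 0 is an assumption about what you would want. In Spark the comparable guard would be wrapping each pay column in `coalesce(..., lit(0))` before adding.

```python
# Sample data from the question, keyed by employee name.
reg_pay = {"Alice": 400, "Bob": 300, "Carol": 450, "Dan": 400, "Ellen": 360}
ot_pay = {"Alice": 20, "Bob": 0, "Carol": 120, "Dan": 200, "Ellen": 40}


def total_pay(reg, ot):
    """Full outer join on employee name, then add the two pay columns."""
    employees = sorted(set(reg) | set(ot))
    # A missing side counts as 0, so the sum never ends up as "no value".
    return {emp: reg.get(emp, 0) + ot.get(emp, 0) for emp in employees}


pay = total_pay(reg_pay, ot_pay)
print("Emp,Pay")
for emp, amount in pay.items():
    print(f"{emp},{amount}")
```

With the question's data this prints the Emp,Pay table the question asks for.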

save page rank output in neo4j

I am running the Pregel PageRank algorithm on Twitter data in Spark using Scala. The algorithm runs fine and gives the correct output, finding the highest PageRank score. But I am unable to save the graph to Neo4j.
The inputs and outputs are mentioned below.
Input file: (The numbers are twitter userIDs)
86566510 15647839
86566510 197134784
86566510 183967095
15647839 11272122
15647839 10876852
197134784 34236703
183967095 20065583
11272122 197134784
34236703 18859819
20065583 91396874
20065583 86566510
20065583 63433165
20065583 29758446
Output of the graph vertices:
(11272122,0.75)
(34236703,1.0)
(10876852,0.75)
(18859819,1.0)
(15647839,0.6666666666666666)
(86566510,0.625)
(63433165,0.625)
(29758446,0.625)
(91396874,0.625)
(183967095,0.6666666666666666)
(197134784,1.1666666666666665)
(20065583,1.0)
Using the Scala code below I try to save the graph, but it doesn't work. Please help me solve this.
Neo4jGraph.saveGraph(sc, pagerankGraph, nodeProp = "twitterId", relProp = "follows")
Thanks.
Did you load the graph originally from Neo4j? Currently saveGraph saves the graph data back to Neo4j nodes via their internal ids.
It actually runs this statement:
UNWIND {data} as row
MATCH (n) WHERE id(n) = row.id
SET n.$nodeProp = row.value return count(*)
As a short-term mitigation I added optional labelIdProp parameters that are used instead of the internal ids, and a match/merge flag. You'll have to build the library yourself to use that, though. I'm going to push the update in the next few days.
Something you can try is Neo4jDataFrame.mergeEdgeList
Here is the test code for it.
You basically have a dataframe with the data and it saves it to a Neo4j graph (including relationships though).
val rows = sc.makeRDD(Seq(Row("Keanu", "Matrix")))
val schema = StructType(Seq(StructField("name", DataTypes.StringType), StructField("title", DataTypes.StringType)))
val df = new SQLContext(sc).createDataFrame(rows, schema)
Neo4jDataFrame.mergeEdgeList(sc, df, ("Person",Seq("name")),("ACTED_IN",Seq.empty),("Movie",Seq("title")))
val edges : RDD[Edge[Long]] = sc.makeRDD(Seq(Edge(0,1,42L)))
val graph = Graph.fromEdges(edges,-1)
assertEquals(2, graph.vertices.count)
assertEquals(1, graph.edges.count)
Neo4jGraph.saveGraph(sc,graph,null,"test")
val it: ResourceIterator[Long] = server.graph().execute("MATCH (:Person {name:'Keanu'})-[:ACTED_IN]->(:Movie {title:'Matrix'}) RETURN count(*) as c").columnAs("c")
assertEquals(1L, it.next())
it.close()

Is it possible to return a map of key values using gremlin scala

Currently I have two Gremlin queries that fetch two different values, and I populate a map from them.
Scenario : A->B , A->C , A->D
My queries are below.
graph.V().has(ID,A).out().label().toList()
Fetch the list of outE labels of A .
Result : List(B,C,D)
graph.traversal().V().has("ID",A).outE("interference").as("x").otherV().has("ID",B).select("x").values("value").headOption()
Given A and B, get the edge property value (A->B)
Return : 10
Is it possible to combine both of these queries to get a result like Map[(B,10)(C,11)(D,12)]?
I am facing a performance issue with two separate queries; it is taking more time.
There is probably a better way to do this but I managed to get something with the following traversal:
gremlin> graph.traversal().V().has("ID","A").outE("interference").as("x").otherV().has("ID").label().as("y").select("x").by("value").as("z").select("y", "z").select(values);
==>[B,1]
==>[C,2]
I would wait for more answers though as I suspect there is a better traversal out there.
The following works in Scala:
val b = StepLabel[Edge]()
val y = StepLabel[Label]()
val z = StepLabel[Integer]()
graph.traversal().V().has("ID",A).outE("interference").as(b)
.otherV().label().as(y)
.select(b).values("name").as(z)
.select((y,z)).toMap[String,Integer]
This will return Map[String,Int]

Spark Iterating RDD over another RDD with filter conditions Scala

I want to iterate over one big RDD against a small RDD with some additional filter conditions. The code below works, but the processing runs only on the driver and is not spread across the nodes. Can you suggest another approach?
val cross = titlesRDD.cartesian(brRDD).cache()
val matching = cross.filter { case (x, br) =>
  br._1 == "0" &&
  br._2 == x._4 &&
  ((br._3 exists (x._5)) || br._3.head == "")
}
Thanks,
madhu
You probably don't want to cache cross. Not caching it will, I believe, let the cartesian product happen "on the fly" as needed for the filter, instead of instantiating the potentially large number of combinations resulting from the cartesian product in memory.
Also, you can do brRDD.filter(_._1 == "0") before doing the cartesian product with titlesRDD, e.g.
val cross = titlesRDD.cartesian(brRDD.filter(_._1 == "0"))
and then modify the filter used to create matching appropriately.
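The effect of filtering the small side before the cartesian product can be seen on a toy example. This is a plain-Python sketch, not Spark: the `titles` and `rules` tuples are hypothetical, much simpler than the real records, and `matches` only imitates the first two conditions of the original filter.

```python
from itertools import product

# Hypothetical data: (title id, category) and (flag, category).
titles = [("t1", "news"), ("t2", "sport"), ("t3", "news")]
rules = [("0", "news"), ("1", "news"), ("0", "sport"), ("1", "sport")]


def matches(title, rule):
    # The rule must be flagged "0" and share the title's category.
    return rule[0] == "0" and rule[1] == title[1]


# Filter after the cartesian product: every pair is generated first.
all_pairs = list(product(titles, rules))
late = [pair for pair in all_pairs if matches(*pair)]

# Filter the small side first: only rules flagged "0" take part.
active_rules = [rule for rule in rules if rule[0] == "0"]
fewer_pairs = list(product(titles, active_rules))
early = [(t, r) for t, r in fewer_pairs if r[1] == t[1]]

print(len(all_pairs), len(fewer_pairs), late == early)
```

The matches are identical either way, but the pre-filtered version builds fewer pairs, and the saving grows with the size of the large side.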