save page rank output in neo4j - scala

I am running the Pregel PageRank algorithm on Twitter data in Spark using Scala. The algorithm runs fine and correctly finds the highest PageRank score, but I am unable to save the graph to Neo4j.
The inputs and outputs are mentioned below.
Input file (the numbers are Twitter user IDs):
86566510 15647839
86566510 197134784
86566510 183967095
15647839 11272122
15647839 10876852
197134784 34236703
183967095 20065583
11272122 197134784
34236703 18859819
20065583 91396874
20065583 86566510
20065583 63433165
20065583 29758446
Output of the graph vertices:
(11272122,0.75)
(34236703,1.0)
(10876852,0.75)
(18859819,1.0)
(15647839,0.6666666666666666)
(86566510,0.625)
(63433165,0.625)
(29758446,0.625)
(91396874,0.625)
(183967095,0.6666666666666666)
(197134784,1.1666666666666665)
(20065583,1.0)
Using the Scala code below I try to save the graph, but it doesn't work. Please help me solve this.
Neo4jGraph.saveGraph(sc, pagerankGraph, nodeProp = "twitterId", relProp = "follows")
Thanks.

Did you load the graph originally from Neo4j? Currently saveGraph saves the graph data back to Neo4j nodes via their internal ids.
It actually runs this statement:
UNWIND {data} as row
MATCH (n) WHERE id(n) = row.id
SET n.$nodeProp = row.value return count(*)
As a short-term mitigation I added optional labelIdProp parameters that are used instead of the internal ids, and a match/merge flag. You'll have to build the library yourself to use that, though. I'm going to push the update in the next few days.
Something you can try is Neo4jDataFrame.mergeEdgeList.
Here is the test code for it.
You basically have a DataFrame with the data and it saves it to a Neo4j graph (including relationships).
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.junit.Assert.assertEquals
import org.neo4j.graphdb.ResourceIterator
import org.neo4j.spark.{Neo4jDataFrame, Neo4jGraph}

// Build a one-row DataFrame and merge it into Neo4j as (:Person)-[:ACTED_IN]->(:Movie)
val rows = sc.makeRDD(Seq(Row("Keanu", "Matrix")))
val schema = StructType(Seq(StructField("name", DataTypes.StringType), StructField("title", DataTypes.StringType)))
val df = new SQLContext(sc).createDataFrame(rows, schema)
Neo4jDataFrame.mergeEdgeList(sc, df, ("Person", Seq("name")), ("ACTED_IN", Seq.empty), ("Movie", Seq("title")))

// Round-trip a small GraphX graph and verify the result with Cypher ("server" is the embedded test server)
val edges: RDD[Edge[Long]] = sc.makeRDD(Seq(Edge(0, 1, 42L)))
val graph = Graph.fromEdges(edges, -1)
assertEquals(2, graph.vertices.count)
assertEquals(1, graph.edges.count)
Neo4jGraph.saveGraph(sc, graph, null, "test")
val it: ResourceIterator[Long] = server.graph().execute("MATCH (:Person {name:'Keanu'})-[:ACTED_IN]->(:Movie {title:'Matrix'}) RETURN count(*) as c").columnAs("c")
assertEquals(1L, it.next())
it.close()
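Applied to the question above, here is a rough, untested sketch of shaping the PageRank graph's edge list into such a DataFrame. The column names, node label and relationship type are my assumptions, and how the key columns map onto node properties follows the test above, so check it against your connector version.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// one row per "follows" edge of the asker's PageRank graph
val edgeRows = pagerankGraph.edges.map(e => Row(e.srcId, e.dstId))
val edgeSchema = StructType(Seq(StructField("src", LongType), StructField("dst", LongType)))
val edgeDf = new SQLContext(sc).createDataFrame(edgeRows, edgeSchema)

// Merge (:User)-[:FOLLOWS]->(:User) pairs. Note that the key column name becomes the
// node property name here, so the same user id arriving once as "src" and once as "dst"
// may end up on two different nodes unless the keys are reconciled afterwards
// (for example with a post-processing Cypher MERGE).
Neo4jDataFrame.mergeEdgeList(sc, edgeDf,
  ("User", Seq("src")),
  ("FOLLOWS", Seq.empty),
  ("User", Seq("dst")))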

Related

Faster way to get single cell value from Dataframe (using just transformation)

I have the following code where I want to get DataFrame dfDateFiltered from dfBackendInfo, containing all rows with RowCreationTime greater than the timestamp "latestRowCreationTime":
val latestRowCreationTime = dfVersion.agg(max("BackendRowCreationTime")).first.getTimestamp(0)
val dfDateFiltered = dfBackendInfo.filter($"RowCreationTime" > latestRowCreationTime)
The problem I see is that the first line triggers a job on the Databricks cluster, making it slower.
Is there a better way to filter (for example, using just a transformation instead of an action)?
Below are the schemas of the 2 Dataframes:
case class Version(BuildVersion: String,
                   MainVersion: String,
                   Hotfix: String,
                   BackendRowCreationTime: Timestamp)

case class BackendInfo(SerialNumber: Integer,
                       NumberOfClients: Long,
                       BuildVersion: String,
                       MainVersion: String,
                       Hotfix: String,
                       RowCreationTime: Timestamp)
The below code worked:
val dfLatestRowCreationTime1 = dfVersion.agg(max($"BackendRowCreationTime").as("BackendRowCreationTime")).limit(1)
// fall back to DefaultTime when the aggregate is null, otherwise keep the computed max
val latestRowCreationTime = dfLatestRowCreationTime1.withColumn("BackendRowCreationTime",
  when($"BackendRowCreationTime".isNull, DefaultTime).otherwise($"BackendRowCreationTime"))
val dfDateFiltered = dfBackendInfo.join(latestRowCreationTime,
  dfBackendInfo.col("RowCreationTime").gt(latestRowCreationTime.col("BackendRowCreationTime")))
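For reference, a roughly equivalent variant that also stays purely in transformations (this is an addition, not part of the original post; it assumes Spark 2.x, where crossJoin is available, and the same implicits as above):

import org.apache.spark.sql.functions.max

val maxTime = dfVersion.agg(max($"BackendRowCreationTime").as("maxTime"))
val dfDateFiltered = dfBackendInfo
  .crossJoin(maxTime)                          // one-row DataFrame, so the cross join is cheap
  .filter($"RowCreationTime" > $"maxTime")
  .drop("maxTime")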

How to use forEachPartition on pyspark dataframe?

I am trying to use the foreachPartition() method with PySpark on an RDD that has 8 partitions. My custom function tries to generate a string output for a given string input. Here is the code:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import pandas as pd
import datetime
def compute_sentiment_score(text):
    client = language.LanguageServiceClient()
    document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT, language='en')
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return str(sentiment.score)

def compute_sentiment_magnitude(text):
    client = language.LanguageServiceClient()
    document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT, language='en')
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return str(sentiment.magnitude)
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/path-to-file.json"
imdb_reviews = pd.read_csv('imdb_reviews.csv', header=None, names=['input1', 'input2'], encoding= "ISO-8859-1")
imdb_reviews.head()
input1 input2
0 first think another Disney movie, might good, ... 1
1 Put aside Dr. House repeat missed, Desperate H... 0
2 big fan Stephen King's work, film made even gr... 1
3 watched horrid thing TV. Needless say one movi... 0
4 truly enjoyed film. acting terrific plot. Jeff... 1
spark_imdb_reviews = spark.createDataFrame(imdb_reviews) # create spark dataframe
spark_imdb_reviews.printSchema()
root
|-- input1: string (nullable = true)
|-- input2: long (nullable = true)
Here is how I try to use the foreachPartition() method -
create_rdd = spark_imdb_reviews.select("input1").rdd # create RDD
print(create_rdd.getNumPartitions()) # print the partitions
print(create_rdd.take(1)) # display data
new_rdd = create_rdd.foreachPartition(compute_sentiment_score) # compute score
Which gives this output and an error -
8
[Row(input1="first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play part better. Christopher Lloyd hilarious perfect part. Tony Danza believable Mel Clark. can't help, enjoy movie! give 10/10!")]
File "<ipython-input-106-e3fd65ce75cc>", line 3, in compute_sentiment_score
TypeError: <itertools.chain object at 0x11ab7f198> has type itertools.chain, but expected one of: bytes, unicode
There are two similar functions:
RDD.foreachPartition and
RDD.mapPartitions
Both functions expect another function as a parameter (here compute_sentiment_score). That function gets the content of a partition passed in as an iterator. The text parameter in the question is therefore actually an iterator that can be used inside compute_sentiment_score.
The difference between foreachPartition and mapPartitions is that foreachPartition is a Spark action while mapPartitions is a transformation. This means the code called by foreachPartition is executed immediately and the RDD remains unchanged, while mapPartitions can be used to create a new RDD. To store the calculated sentiment score, mapPartitions should be used.
def compute_sentiment_score(itr_text):
    # set up the things that are expensive and should be prepared only once per partition
    client = language.LanguageServiceClient()
    # run the loop for each row of the partition
    for text in itr_text:
        document = types.Document(content=text.value, type=enums.Document.Type.PLAIN_TEXT, language='en')
        sentiment = client.analyze_sentiment(document=document).document_sentiment
        yield (text.value, sentiment.score)

df_with_score = df.rdd.mapPartitions(compute_sentiment_score)
df_with_score.foreach(print)
In this example client = language.LanguageServiceClient() is called once per partition. The number of partitions probably has to be reduced, for example with coalesce.

Spark flushing Dataframe on show / count

I am trying to print the count of a DataFrame, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the DataFrame becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint Spark that the function shouldn't be re-executed), it won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go: Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on such an object.
Unfortunately neither component is salvageable. You'll have to go back to the planning phase and rethink your design.
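For illustration only (this is my addition, not part of the answer above): a minimal sketch of a stateless variant, where the transformation depends only on the row's own data, so every action sees the same result. The column name, multiplier and cast come from the question; applying the scaling to the whole 10-row sample is an assumption and may not match the original intent.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def funcADeterministic(inputDataframe: DataFrame): DataFrame =
  inputDataframe
    .orderBy("colA")  // order before limit so the 10-row sample is stable across recomputations
    .limit(10)
    .withColumn("colA", (col("colA") * lit(1000)).cast("decimal(38,10)"))

If only some rows must be scaled, that condition has to be expressed as a deterministic function of the row's columns rather than a global counter.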
Your DataFrame is a distributed dataset, and every action such as count() recomputes it from its lineage, so the results can differ between runs when the computation is not deterministic. Read the documentation about RDDs below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

Is it possible to return a map of key values using gremlin scala

Currently I have two Gremlin queries which fetch two different values, and I am populating them into a map.
Scenario: A->B, A->C, A->D
My queries are below.
graph.V().has(ID,A).out().label().toList()
This fetches the list of out labels of A.
Result: List(B,C,D)
graph.traversal().V().has("ID",A).outE("interference").as("x").otherV().has("ID",B).select("x").values("value").headOption()
Given A and B, this gets the edge property value (A->B).
Returns: 10
Is it possible to combine both of these queries to get a result like Map[(B,10)(C,11)(D,12)]?
I am facing a performance issue when I have two separate queries; it takes more time.
There is probably a better way to do this but I managed to get something with the following traversal:
gremlin> graph.traversal().V().has("ID","A").outE("interference").as("x").otherV().has("ID").label().as("y").select("x").by("value").as("z").select("y", "z").select(values);
==>[B,1]
==>[C,2]
I would wait for more answers though as I suspect there is a better traversal out there.
Below is working Scala:
val b = StepLabel[Edge]()
val y = StepLabel[Label]()
val z = StepLabel[Integer]()
graph.traversal().V().has("ID",A).outE("interference").as(b)
.otherV().label().as(y)
.select(b).values("name").as(z)
.select((y,z)).toMap[String,Integer]
This will return Map[String,Int]

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained dataset to filter another dataset using category similarity.
I am trying to train a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So I am now trying to use TFIDF, passing that output into a RowMatrix and computing the similarities. But I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus : RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right until here). I tried filtering the similarities i and j down to the parts of my query's TFIDF and ended up with an empty collection.
The gist is that I want to train on a corpus of data and find what category it falls into. The above code is at least trying to get it down to one category and checking whether I can get a prediction from that...
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand what you want is a similarity between the query and the corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
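(For completeness, a sketch of one way such a model and the corpus TF-IDF vectors could be produced; corpusTokens is a hypothetical, tokenized training corpus that is not part of the original answer:)

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

val corpusTokens: RDD[Seq[String]] = ???             // tokenized training documents
val corpusTf = new HashingTF().transform(corpusTokens)
val idfModel = new IDF().fit(corpusTf)               // this is the "idf" the answer assumes
val rddOfTfidfFromCorpus = idfModel.transform(corpusTf)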
We'll also need a small helper:
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
  rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
).toCoordinateMatrix.toBlockMatrix
First let's convert the query to an RDD (a single tokenized document here) and compute its TF:
val query: Seq[String] = ???
val queryTf = new HashingTF().transform(sc.makeRDD(Seq(query)))
Next we can apply the IDF model and convert the result to a matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply both, you get a matrix with the number of rows equal to the number of documents in the query and the number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by the product of the magnitudes, but I'm sure you can handle that.
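For illustration, a sketch of that normalization step (an addition on my part, not from the original answer), dividing each dot product by the L2 norms of the corresponding query and corpus rows:

import org.apache.spark.mllib.linalg.Vectors

// L2 norms keyed by row index, so they can be joined with the matrix entries
val queryNorms = queryTfidf.zipWithIndex.map { case (v, i) => (i, Vectors.norm(v, 2.0)) }
val corpusNorms = rddOfTfidfFromCorpus.zipWithIndex.map { case (v, j) => (j, Vectors.norm(v, 2.0)) }

// dotProducts is a BlockMatrix: walk its entries and divide entry (i, j) by norm_i * norm_j
val cosine = dotProducts.toCoordinateMatrix.entries
  .map(e => (e.i, (e.j, e.value)))
  .join(queryNorms)
  .map { case (i, ((j, dot), qNorm)) => (j, (i, dot, qNorm)) }
  .join(corpusNorms)
  .map { case (j, ((i, dot, qNorm), cNorm)) => ((i, j), dot / (qNorm * cNorm)) }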
There are two problems with this approach. First of all, it is rather expensive. Moreover, I am not sure it is really useful. To reduce the cost you can apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement:
NaiveBayes (...) seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)
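To address the original concern that a model "will always return something", here is a sketch I'm adding (not part of the original answer): look at the distance to the predicted cluster center and reject queries that are far from every cluster. The threshold is a placeholder value to tune on your data.

import org.apache.spark.mllib.linalg.Vectors

val threshold = 1.0  // hypothetical cutoff, tune it on your data
val scored = queryTfidf.map { v =>
  val cluster = model.predict(v)
  val distance = Vectors.sqdist(v, model.clusterCenters(cluster))
  (cluster, distance)
}
// keep only queries that are reasonably close to their assigned cluster
val accepted = scored.filter { case (_, distance) => distance <= threshold }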