Performance issue with Spark DataFrame, each iteration takes longer - Scala

I'm using Spark 2.2.1 with Scala 2.11.12 to implement a recursive algorithm. I first wrote a version using RDDs, but it took too long when I used a lot of data. I then made a new version using DataFrames, but even with very little data it takes too long, despite each iteration handling less data than the previous one.
I have tried caching variables in different ways (including different persistence levels), checkpointing at different moments, and using the repartition method with different values and in different functions, but nothing works.
The code starts by looking for the minimum distance between the points that make up the matrix (matrix is a DataFrame):
println("Finding minimum:")
val minDistRes = matrix.select(min("dist")).first().getFloat(0)
val clusterRes = matrix.where($"dist" === minDistRes)
println(s"New minimum:")
clusterRes.show(1)
Then it saves the coordinates of the points for later calculations:
val point1 = clusterRes.first().getInt(0)
val point2 = clusterRes.first().getInt(1)
After that, it builds several filtered DataFrames to use for the new points generated in the next iteration (the broadcast variable is created so this data can be accessed inside a later map):
matrix = matrix.where("!(idW1 == " + point1 +" and idW2 ==" + point2 + " )").cache()
val dfPoints1 = matrix.where("idW1 == " + point1 + " or idW2 == " + point1).cache()
val dfPoints2 = matrix.where("idW1 == " + point2 + " or idW2 == " + point2).cache()
val dfPoints2Broadcast = spark.sparkContext.broadcast(dfPoints2)
val dfUnionPoints = dfPoints1.union(dfPoints2).cache()
val matrixSub = matrix.except(dfUnionPoints).cache()
It continues with the calculation of the new points and returns the matrix that will be used recursively by the algorithm:
val newPoints = dfPoints1.map { r =>
  val distAux = dfPoints2Broadcast.value.where("idW1 == " + r.getInt(0) +
    " or idW1 == " + r.getInt(1) + " or idW2 == " + r.getInt(0) + " or idW2 == " +
    r.getInt(1)).first().getFloat(2)
  (newIndex.toInt, filterDF(r.getInt(0), r.getInt(1), point1, point2), math.min(r.getFloat(2), distAux))
}.asInstanceOf[Dataset[Row]]
matrix = matrixSub.union(newPoints)
Each iteration ends by caching the matrix variable and taking a checkpoint every few iterations:
matrix.cache()
if (a % 5 == 0)
  matrix.checkpoint()
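One thing worth checking in the snippet above: cache() can be called for its side effect, but as far as I know checkpoint() does not modify matrix in place; it returns a new Dataset with a truncated lineage, so the result needs to be assigned back (and a checkpoint directory set with sparkContext.setCheckpointDir) for it to stop the plan from growing. A minimal sketch of how the end of an iteration could look, assuming that setup:
matrix = matrix.cache()
if (a % 5 == 0) {
  // checkpoint() is eager by default and returns a NEW Dataset with a cut lineage;
  // it has to be reassigned to matrix to have any effect on later iterations
  matrix = matrix.checkpoint()
}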

Related

Optimizing Spark/Scala speed

I have a Spark script that establishes a connection to Hive, reads data from different databases, and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to get between 400,000 and 800,000 rows per database.
My question is: is 18 hours normal for such jobs? If not, how can I optimize it? This is what my main does:
// This is a list of the first ten databases used:
var use_database_sigma = List(Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm,
  Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne,
  Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi,
  Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu,
  Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)

val grc = Tables.getGRC(spark) // This creates the first DataFrame
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates the other DataFrame, which becomes the union of the ten databases' DataFrames
for (i <- 1 until use_database_sigma.length) {
  if (use_database_sigma(i) != "") {
    sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
  }
}
// writing into csv file
val grc_sigma=sigma.union(grc) // union of the 2 dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea.
val frames = Seq("table1", "table2").map { table =>
  spark.read.table(table).cache()
}

frames
  .reduce(_.union(_)) // or unionByName() if the columns aren't in the same order
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .options(Map("header" -> "true", "delimiter" -> "|"))
  .save("filePathName")
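Applying the same pattern with the question's own helpers (Tables.getSIGMA and use_database_sigma are the asker's names, not a library API), the explicit loop could be collapsed into one expression; as far as I can tell it builds the same union plan, just more compactly:
val sigma = use_database_sigma
  .filter(_.nonEmpty)                    // skip empty database names, like the if in the loop
  .map(db => Tables.getSIGMA(spark, db)) // one DataFrame per database
  .reduce(_ union _)                     // assumes the list is non-empty, as in the original code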

Spark UDF optimization for Graph Database (Neo4j) inserts

This is the first issue I am posting, so apologies if I miss some info or the formatting is mediocre. I can update if required.
I will try to add as many details as possible. I have a not-so-optimized Spark job which converts RDBMS data to graph nodes and relationships in Neo4j.
To do this, here are the steps I follow:
Create a denormalized DataFrame 'data' with Spark SQL and joins.
For each row in 'data', run a graphInsert function which does the following:
a. read the contents of the row
b. formulate a Neo4j Cypher query (we use the MERGE command so that only one City, e.g. Chicago, is created in Neo4j even when Chicago is present in multiple lines of the RDBMS table)
c. connect to Neo4j
d. execute the query
e. disconnect from Neo4j
Here is the list of problems I am facing.
Inserts are slow.
I know a MERGE query is slower than CREATE, but is there another way to do this instead of connecting and disconnecting for every record? This was my first draft of the code, and maybe I am struggling with how to use one connection to insert from multiple threads on different Spark worker nodes. Hence connecting and disconnecting for every record.
The job is not scalable. It only runs fine with 1 core. As soon as I run the job with 2 Spark cores I suddenly get 2 cities with the same name, even though I am running MERGE queries. E.g. there are 2 Chicago cities, which violates the purpose of MERGE. I am assuming that MERGE works something like "create if not exists".
I don't know if my implementation is wrong in the Neo4j part or the Spark part. If anyone can direct me to documentation that would help me implement this at a better scale, it would be helpful, as I have a big Spark cluster which I need to utilize to its full potential for this job.
If you are interested in looking at code instead of the algorithm, here is the graphInsert implementation in Scala:
class GraphInsert extends Serializable {
  var case_attributes = new Array[String](4)
  var city_attributes = new Array[String](2)
  var location_attributes = new Array[String](20)
  var incident_attributes = new Array[String](20)
  val prop = new Properties()
  prop.load(getClass().getResourceAsStream("/GraphInsertConnection.properties"))
  // Neo4j connection properties
  val url_neo4j = prop.getProperty("url_neo4j")
  val neo4j_user = prop.getProperty("neo4j_user")
  val neo4j_password = prop.getProperty("neo4j_password")

  def graphInsert(data: Row) {
val query = "MERGE (d:CITY {name:city_attributes(0)})\n" +"MERGE (a:CASE { " + case_attributes(0) + ":'" +data(11) + "'," +case_attributes(1) + ":'" +data(13) + "'," +case_attributes(2) + ":'" +data(14) +"'}) \n" +"MERGE (b:INCIDENT { " + incident_attributes(0) + ":" +data(0) + "," +incident_attributes(1) + ":" +data(2) + "," +incident_attributes(2) + ":'" +data(3) + "'," +incident_attributes(3) + ":'" +data(8)+ "'," +incident_attributes(4) + ":" +data(5) + "," +incident_attributes(5) + ":'" +data(4) + "'," +incident_attributes(6) + ":'" +data(6) + "'," +incident_attributes(7) + ":'" +data(1) + "'," +incident_attributes(8) + ":" +data(7)+"}) \n" +"MERGE (c:LOCATION { " + location_attributes(0) + ":" +data(9) + "," +location_attributes(1) + ":" +data(10) + "," +location_attributes(2) + ":'" +data(19) + "'," +location_attributes(3) + ":'" +data(20)+ "'," +location_attributes(4) + ":" +data(18) + "," +location_attributes(5) + ":" +data(21) + "," +location_attributes(6) + ":'" +data(17) + "'," +location_attributes(7) + ":" +data(22) + "," +location_attributes(8) + ":" +data(23)+"}) \n" +"MERGE (a) - [r1:"+relation_case_incident+"]->(b)-[r2:"+relation_incident_location+"]->(c)-[r3:belongs_to]->(d);"
    println(query)
    try {
      var con = DriverManager.getConnection(url_neo4j, neo4j_user, neo4j_password)
      var stmt = con.createStatement()
      var rs = stmt.executeQuery(query)
      con.close()
    } catch {
      case ex: SQLException =>
        println(ex.getMessage)
    }
  }
  def operations(sqlContext: SQLContext) {
    // ... the denormalized 'data' DataFrame is built before this step
    city_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_city").map(x => x.getString(5)).collect()
    case_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_case_number").map(x => x.getString(5)).collect()
    location_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_location").map(x => x.getString(5)).collect()
    incident_attributes = entity_metadata.filter(entity_metadata("source_name") === "tb_incident").map(x => x.getString(5)).collect()
    data.foreach(graphInsert)
  }
}
object GraphObject {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("GraphNeo4j")
      .setMaster("xyz")
      .set("spark.cores.max", "2")
      .set("spark.executor.memory", "10g")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val graph = new GraphInsert()
    graph.operations(sqlContext)
  }
}
Whatever you write inside the closure, i.e. whatever needs to be executed on the workers, gets distributed.
You can read more about it here: http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
As for increasing the number of cores, I think it should not affect the application, because if you do not specify the number of cores Spark takes the greedy approach and uses whatever is available. I hope this document helps.
I am done improving the process, but nothing could make it as fast as the LOAD command in Cypher.
Hope this helps someone though:
using foreachPartition instead of foreach gives a significant gain while doing such a process, as does adding periodic commit in Cypher.
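As a rough illustration of the foreachPartition idea (a sketch only, not the code from the post; buildQuery is a hypothetical helper standing in for the MERGE string built inside graphInsert above), the point is to open one connection per partition and reuse it for every row in that partition:
data.foreachPartition { rows =>
  // one connection and statement per partition, via java.sql.DriverManager as in the original code
  val con = DriverManager.getConnection(url_neo4j, neo4j_user, neo4j_password)
  val stmt = con.createStatement()
  try {
    rows.foreach { row =>
      stmt.execute(buildQuery(row)) // buildQuery(row): hypothetical, returns the Cypher MERGE string for this row
    }
  } finally {
    stmt.close()
    con.close()
  }
}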

How to see ELKI DBSCAN clustering result

I am using ELKI for DBSCAN clustering of some ~14,000 GPS points. It's running fine, but I want to see information about the clusters, like how many points are in each cluster.
If you use the -resulthandler ResultWriter and output to text, the cluster sizes will be at the top of each cluster file.
The visualizer currently doesn't seem to show cluster sizes.
Also, if you want to merge all those results into a single file, here is a python script that works:
import os

clusterout_path = "path/to/where/files/all/go/"
finalout_path = "/path/for/single/merged/file/"
consol_filename = "single_merged_file.txt"

cll_file = open(finalout_path + consol_filename, "a")
cll_file.write("ClusterID" + "\t" + "Lon" + "\t" + "Lat" + "\n")

def readFile(file):
    f = open(clusterout_path + file)
    counter = 0
    cluster = ""
    lon = ""
    lat = ""
    for line in f.readlines():
        counter += 1
        if counter == 1:
            cluster = line.split(":")[1].strip().lower()
        if counter > 4 and line.startswith("ID"):
            arr = line.split(" ")
            lon = arr[1]
            lat = arr[2]
            cll_file.write(cluster + "\t" + lon + "\t" + lat + "\n")
    f.close()

listing = os.listdir(clusterout_path)
for infile in listing:
    print "Processing file: " + infile
    readFile(infile)
cll_file.close()

In a recursive function, how does a series of returns get assembled back into a single result?

I'm trying to understand what's going on in this recursive function. It reverses a String, but I don't quite get how these separate return calls get assembled into one string at the end.
def reverse(string: String): String = {
  if (string.length() == 0)
    return string
  return reverse(string.substring(1)) + string.charAt(0)
}
I've analysed the function by adding in print statement, and while I kind of understand how it works (conceptually), I don't understand, well... how it works.
For instance, I know that each cycle of recursion pushes things into the stack.
So I would expect reverse("hello") to become a stack of
o
l
l
e
h
But it must be more complex than that, as the recursive call is return reverse(string.substring(1)) + string.charAt(0). So is the stack actually
o,
l, o
l, lo
e, llo
h, ello
?
How does that get turned into the single string we expect?
The stack contains all local variables, as well as any temporary results of the expression in which the recursion appears (those are pushed on the stack even without recursion, because the JVM is a stack machine) and, of course, the point where code execution should resume on return.
In this case, the recursive call comes first in the expression (that is, nothing is computed before reverse in the expression where it appears). So the only thing on the stack besides the code pointer is string. At the deepest level of recursion, the stack will look like this:
level string
5 (empty string)
4 o
3 lo
2 llo
1 ello
0 hello
So when the call to level 5 returns, level 4 will finish computing the expression that reverse is a part of, reverse(string.substring(1)) + string.charAt(0). The value of reverse(string.substring(1)) is the empty string, and the value of string.charAt(0) is o (since the value of string on level 4 is o). The result is o, which is returned.
On level 3, it concatenates the return value from level 4 (o) with string.charAt(0) for string equal to lo, which is l, resulting in ol.
On level 2, it concatenates ol with l, giving oll.
Level 1 concatenates oll with e, returning olle.
Level 0, finally, concatenates olle with h, returning olleh to its caller.
On a final note, when a call is made, what is pushed into the stack is the return point for the code and the parameters. So hello is the parameter to reverse, which is pushed on the stack by reverse's caller.
Use the substitution model to work through the problem:
reverse("hello") =
(reverse("ello") + 'h') =
((reverse("llo") + 'e') + 'h') =
(((reverse("lo") + 'l') + 'e') + 'h') =
((((reverse("o") + 'l') + 'l') + 'e') + 'h') =
(((((reverse("") + 'o') + 'l') + 'l') + 'e') + 'h') =
((((("" + 'o') + 'l') + 'l') + 'e') + 'h') =
(((("o" + 'l') + 'l') + 'e') + 'h') =
((("ol" + 'l') + 'e') + 'h') =
(("oll" + 'e') + 'h') =
("olle" + 'h') =
"olleh"
To add a couple of tips on how to make your code better:
Don't use return in functions: a function automatically returns the result of its last evaluated expression.
A String behaves like a sequence of Chars, so you can replace string.substring(1) with string.tail and string.charAt(0) with string.head.
If you call a parameterless method with no side effects, like length or size, you can omit the parentheses: string.length() === string.length.
The last version of your function as a single line:
def reverse(s: String): String = if (s.size == 0) s else reverse(s.tail) + s.head
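For contrast, here is a tail-recursive variant (a sketch, not from the answers above) that builds the result in an accumulator on the way down instead of assembling it from the returns on the way back up, so the stack does not grow with the input:
import scala.annotation.tailrec

def reverse(s: String): String = {
  @tailrec
  def loop(rest: String, acc: String): String =
    if (rest.isEmpty) acc
    else loop(rest.tail, s"${rest.head}$acc") // prepend the head, so acc is already reversed
  loop(s, "")
}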

Nested 'if' statement inside 'for' loop not working - MATLAB

for Temp = 1000:10:6000
    cp_CO2 = ((2e-18)*Temp.^5) - ((4e-14)*Temp.^4) + ((3e-10)*Temp.^3) - ((8e-07)*Temp.^2) + (0.0013*Temp) + 0.5126;
    cp_CO = ((5e-12)*Temp.^3) - ((7e-08)*Temp.^2) + (0.0003*Temp) + 0.9657;
    cp_H2O = ((7e-12)*Temp.^3) - ((1e-07)*Temp.^2) + (0.0008*Temp) + 1.6083;
    cp_N2 = ((-1e-18)*Temp.^5) + ((2e-14)*Temp.^4) - ((8e-11)*Temp.^3) + ((1e-07)*Temp.^2) + (0.0001*Temp) + 0.9985;
    D_H = (y(1)*cp_CO2*44*(25-Temp)) + (y(2)*cp_CO*28*(25-Temp)) + (y(3)*cp_H2O*18*(25-Temp)) + (percent_air*x_final(2)*3.76*28*(25-Temp));
    DELTA_H = round(D_H);
    if DELTA_H == delta_h
        break
    end
end
The 'for' loop in my code is above; the variables delta_h, y and percent_air have been defined and calculated/input earlier. If I work on the loop as a cell and manually increase Temp, then the values of D_H etc. all change. But for some reason, when I execute the loop, the 'if' statement never seems to trigger, and the final values at Temp = 6000 are displayed in the workspace instead of the value of Temp that produces a DELTA_H equal to delta_h. It's the first time I've used MATLAB in about 2 years (I'm a 3rd-year Mech Eng student), so please forgive me if it's a simple error to fix.
If either of the variables is floating-point, an exact comparison like that is problematic. A <= or >= comparison, or a tolerance check such as abs(DELTA_H - delta_h) <= some small tol, might work better.