SCALA - How could a function be called in a struct? - scala

I build a dataframe, which is going to have a String column that is actuall a structure, turned to a JSON string.
val df = df_helper.select(
lit("some data").as("id"),
to_json(
struct(
col("id"),
col("type"),
col("path")
)
)).as("content")
I also built a function, that takes the identifier id:String as a parameter and spits out a string list.
def buildHierarchy(id:String) : List[String] = {
val check_df = hierarchy_df.select(
$"parent_id",
$"id"
).where($"id" === id)
val pathArray = List(id)
val parentString = check_df.select($"parent_id").first.getString(0)
if (parentString == null) {
return pathArray
}
else {
val pathList = buildHierarchy(parentString)
val finalList: List[String] = pathList ++ pathArray
return finalList
}
}
I want to call this function and replace the path column with the result of the function. Is this possible, or is there a workaround?
Thank you in advance!

With the help of the following blog post I built up the hierarchy:
https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/
I included all the necessary information in my final dataframe, that I need to build up the details for each elements for the structure, along with the hierarchy path for each elements. Then as a new column I created a structure for each elements with the details.
val content_df = hierarchy_df.select($"path")
.withColumn("content",
struct( col("id"),
col("type"),
col("path")
))
I exploded the path to reach each identifiers in it, but preserve the position order, to join on the content for each level in the path.
val exploded_df = content_df.select(
$"*",posexplode($"path"),
$"pos".as("exploded_level"),
$"col".as("exploded_id"))
Then finally, join the content with the path, and aggregate the content into a path that contains all the content for each level.
val level_content_df = exploded_df.as("e")
.join(content_df.as("ouc"), col("e.col") === col("ouc.id"), "left")
.select($"e.*", $"ouc.content".as("content"))
val pathFull_df = level_content_df
.groupBy(col("id").as("id"))
.agg(sort_array(collect_list("content")).as("pathFull"))
And then finally I just put the whole thing together again :)
val content_with_path_df = content_df.as("ou")
.join(pathFull_df.as("opf"), col("ou.id") === col("opf.id"), "left")
.select($"ou.*", $"opf.pathFull".as("pathFull"))
Feel free to reach out if it doesn't make sense, it took a while for me to make it happen! :D

Related

How can I construct a String with the contents of a given DataFrame in Scala

Consider I have a dataframe. How can I retrieve the contents of that dataframe and represent it as a string.
Consider I try to do that with the below example code.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
println("x = ", x)
sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final stringbuilder contains an empty string.
Any thoughts how to retrieve a String for a given dataframe in Scala?
Many thanks
UPD: as mentioned by #user8371915, solution below will work only in single JVM in development (local) mode. In fact we cant modify broadcast variables like globals. You can use accumulators, but it will be quite inefficient. Also you can read an answer about read/write global vars here. Hope it will help you.
I think you should read topic about shared variables in Spark. Link here
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
println("x = ", x)
broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback and for understanding this slightly better.
The combination of responses result in the below. The requirements have changed slightly in that I represent my df as a list of jsons. The code below does this, without the use of the broadcast.
class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
val jsons = df.limit(limit).collect.map(rowToJson(_))
def rowToJson(r: org.apache.spark.sql.Row) : JSONObject = {
try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
catch { case t: Throwable =>
JSONObject.apply(Map("Row with error" -> t.toString))
}
}
}
The class I use here...
val jsons = new HandleDf(df, 100).jsons

not able to store result in hdfs when code runs for second iteration

Well I am new to spark and scala and have been trying to implement cleaning of data in spark. below code checks for the missing value for one column and stores it in outputrdd and runs loops for calculating missing value. code works well when there is only one missing value in file. Since hdfs does not allow writing again on the same location it fails if there are more than one missing value. can you please assist in writing finalrdd to particular location once calculating missing values for all occurrences is done.
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}``
It is not clear how (Coddition check) works in your example.
In any case function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
.map(row => row.mkString)
.collect() // perhaps '.collect()' is redundant
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

scala java.lang.NullPointerException

The following code is causing java.lang.NullPointerException.
val sqlContext = new SQLContext(sc)
val dataFramePerson = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema1).load("c:\\temp\\test.csv")
val dataFrameAddress = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema2).load("c:\\temp\\test2.csv")
val personData = dataFramePerson.map(data => {
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
var address:Address = null;
if (addressData != null) {
val addressRow = addressData.first;
address = addressRow.asInstanceOf[Address];
}
Person(data.getAs("Name"),data.getAs("Phone"),address)
})
I narrowed it down to the following line of that is causing the exception.
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
Can someone point out what the issue is?
Your code has a big structural flaw, that is, you can only refer to dataframes from the code that executes in the driver, but not in the code that is run by the executors. Your code contains a reference to another dataframe from within a map, that is executed in executors. See this link Can I use Spark DataFrame inside regular Spark map operation?
val personData = dataFramePerson.map(data => { // WITHIN A MAP
val addressData = dataFrameAddress.filter(i => // <--- REFERRING TO OTHER DATAFRAME WITHIN A MAP
i.getAs("ID") == data.getAs("ID"));
var address:Address = null;
if (addressData != null) {
What you want to do instead is a left outer join, then do further processing.
dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")
Note also than when using getAs you want to specify the type, like getAs[String]("ID")
The only thing that can be said is that either dataFrameAddress, or i, or data is null. Use your favorite debugging technique to know which one actually is e.g., debugger, print statements or logs.
Note that if you see the filter call in the stacktrace of your NullPointerException, it would mean that only i, or data could be null. On the other hand, if you don't see the filter call, it would rather mean that it is dataFrameAddress that is null.

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in first should be replaced with StudentName and Course in the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with RDD API, looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; Using an if without an else means that for some inputs there's no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
val details = studentB.value(student.StudentId)
Array(details.StudentName, details.Course, student.City)
}
.map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for join and write to file. Join only takes 1 line of code and write also only takes 1.
Hand write those code can be error proven, hard to read and most likely super slow.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("id", "city", "name")
dfResult.repartition(1).write.csv("hello.csv")
There will be a directory created. There is only 1 file in the directory which is the finally result.

Iterating through files in scala to create values based on the file names

I think there may be a simple solution to this, I was wondering if anybody knew how to iterate over a set of files and output a value based on the files name.
My problem is, I want to read in a set of graph edges for each month, and then create a seperate monthly graphs.
Currently I've done this the long way, which is fine for doing one years worth, but I'd like a way to automate it.
You can see my code below which hopefully clearly shows what I am doing.
//Load vertex data
val vertices= (sc.textFile("D:~vertices.csv")
.map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))
//Define function for creating edges from csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
file.flatMap { line =>
if (!line.isEmpty && line(0) != '#') {
val lineArray = line.split(",")
if (lineArray.length < 0) {
None
} else {
val srcId = lineArray(0).toInt
val dstId = lineArray(1).toInt
val ID = lineArray(2).toString
(Array(Edge(srcId.toInt, dstId.toInt, ID)))
}
} else {
None
}
}
}
//make graphs -This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)
Any help or pointers on this would be much appreciated.
you can use Spark Context wholeTextFiles to map the filename, and use the String for naming/calling/filtering/etc your values/output/etc
val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }
The Spark Context textFile only reads the data, but does not keep the file name.
----EDIT----
Sorry I seem to have mis-understood the question; you can load multiple files using
sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }
the asterisk * should load in all files in the path assuming they are all supported input file types.
This read will concatenate all your files into 1 single large RDD to avoid multiple calling (because each call, you have to specify the path and filename which is what you want to avoid I think).
Reading with the filename allows you to GroupBy the file name and apply your graph function to each group.