I have a csv file and I want to load into a Breeze DenseMatrix[Double]
This code eventually will work but I think it's not the scala way of doing things:
val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.toArray
val numRows: Int = tmp.size
val numCols: Int = tmp(0).split(",").size
val m = DenseMatrix.zeros[Double](numRows, numCols)
//Now do some for loops and fill the matrix
Is there a more elegant and functional way of doing this?
val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.map(l => l.split(",").map(str => str.toDouble)).toList
val m = DenseMatrix(tmp:_*)
much better
Consider I have a dataframe. How can I retrieve the contents of that dataframe and represent it as a string.
Consider I try to do that with the below example code.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
println("x = ", x)
sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final stringbuilder contains an empty string.
Any thoughts how to retrieve a String for a given dataframe in Scala?
Many thanks
UPD: as mentioned by #user8371915, solution below will work only in single JVM in development (local) mode. In fact we cant modify broadcast variables like globals. You can use accumulators, but it will be quite inefficient. Also you can read an answer about read/write global vars here. Hope it will help you.
I think you should read topic about shared variables in Spark. Link here
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
println("x = ", x)
broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback and for understanding this slightly better.
The combination of responses result in the below. The requirements have changed slightly in that I represent my df as a list of jsons. The code below does this, without the use of the broadcast.
class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
val jsons = df.limit(limit).collect.map(rowToJson(_))
def rowToJson(r: org.apache.spark.sql.Row) : JSONObject = {
try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
catch { case t: Throwable =>
JSONObject.apply(Map("Row with error" -> t.toString))
}
}
}
The class I use here...
val jsons = new HandleDf(df, 100).jsons
I'm trying out my first Scala program to sort the following output such that when the value is identical, words are sorted alphabetically.
cookie 8
document 6
function 5
name 5
start 5
My current code is as follows:
object Problem1{
def main(args: Array[String]){
val inputFile = args(0)
val outputFolder = args(1)
val kValue = args(2)
val conf = new SparkConf().setAppName("Problem1").setMaster("local")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.toLowerCase().split( [\\s*&#^'''\\,..:;?!\\[\\](){}<>~\\-_]+"))
.filter(x => x.matches("[A-Za-z]+")&& x.length >2)
.map(word => (word,1)).reduceByKey(_+_).map(_.swap)
val freq = words.sortByKey(false,1).map(_.swap).take(kValue.toInt)
val topKrdd = sc.parallelize(freq)
val tabSeperated = topKrdd.map(f => f._1 +"\t" + f._2)
tabSeperated.saveAsTextFile(outputFolder)
}
}
Can someone help me with the alphabetical sort for the lines where the numerical value is identical?
Usually Scala provides and uses an implicit Ordering for methods like sortByKey, but you can also construct a custom one and pass it in explicitly. The Ordering trait and companion object have a fair few helpful methods for this. You could do this:
val ord = Ordering.Tuple2(Ordering[Int].reverse, Ordering[String])
val freq = words.takeOrdered(kValue.toInt)(ord).map(_.swap)
A follow-up to Are typed literals "tricky" in RDF4J?
I have some triples abut the weight of dump trucks, using literal objects with different data types. I'm only interested in the integer values, so I want to filter based on the data type. Jeen Broekstra sent a Java solution about a week ago, and I'm having trouble converting it into Scala, my team's preferred language.
This is what I have so far. Eclipse is complaining
not found: value l
val rdf4jServer = "http://host.domain:7200"
val repositoryID = "trucks"
val MyRepo = new HTTPRepository(rdf4jServer, repositoryID)
MyRepo.initialize()
var con = MyRepo.getConnection()
val f = MyRepo.getValueFactory()
val DumpTruck = f.createIRI("http://example.com/dumpTruck")
val Weight = f.createIRI("http://example.com/weight")
val m = QueryResults.asModel(con.getStatements(DumpTruck, Weight, null))
val intValuesStream = Models.objectLiterals(m).stream()
# OK up to here
# errors start below
val intValuesFiltered =
intValuesStream.filter(l -> l.getDatatype().equals(XMLSchema.INTEGER))
val intValues = intValuesFiltered.collect(Collectors.toList())
Replace the -> with =>:
val intValuesFiltered = intValuesStream.filter(l => l.getDatatype().equals(XMLSchema.INTEGER))
I think there may be a simple solution to this, I was wondering if anybody knew how to iterate over a set of files and output a value based on the files name.
My problem is, I want to read in a set of graph edges for each month, and then create a seperate monthly graphs.
Currently I've done this the long way, which is fine for doing one years worth, but I'd like a way to automate it.
You can see my code below which hopefully clearly shows what I am doing.
//Load vertex data
val vertices= (sc.textFile("D:~vertices.csv")
.map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))
//Define function for creating edges from csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
file.flatMap { line =>
if (!line.isEmpty && line(0) != '#') {
val lineArray = line.split(",")
if (lineArray.length < 0) {
None
} else {
val srcId = lineArray(0).toInt
val dstId = lineArray(1).toInt
val ID = lineArray(2).toString
(Array(Edge(srcId.toInt, dstId.toInt, ID)))
}
} else {
None
}
}
}
//make graphs -This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)
Any help or pointers on this would be much appreciated.
you can use Spark Context wholeTextFiles to map the filename, and use the String for naming/calling/filtering/etc your values/output/etc
val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }
The Spark Context textFile only reads the data, but does not keep the file name.
----EDIT----
Sorry I seem to have mis-understood the question; you can load multiple files using
sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }
the asterisk * should load in all files in the path assuming they are all supported input file types.
This read will concatenate all your files into 1 single large RDD to avoid multiple calling (because each call, you have to specify the path and filename which is what you want to avoid I think).
Reading with the filename allows you to GroupBy the file name and apply your graph function to each group.