I apologise for what will probably be a simple question, but I'm struggling to get to grips with parsing RDDs in Scala/Spark. I have an RDD created from a CSV, read in with
val partitions: RDD[(String, String, String, String, String)] = withoutHeader.mapPartitions(lines => {
  val parser = new CSVParser(',')
  lines.map(line => {
    val columns = parser.parseLine(line)
    (columns(0), columns(1), columns(2), columns(3), columns(4))
  })
})
When I output this to a file with
partitions.saveAsTextFile(file)
I get the output with the parentheses on each line. I don't want these parentheses. I'm struggling in general to understand what is happening here. My background is with low level languages and I'm struggling to see through the abstractions to what it's actually doing. I understand the mappings but it's the output that is escaping me. Can someone either explain to me what is going on in the line (columns(0), columns(1), columns(2), columns(3), columns(4)) or point me to a guide that simply explains what is happening?
My ultimate goal is to be able to manipulate files that are on HDFS in Spark to put them into formats suitable for MLlib. I'm unimpressed with the Spark and Scala guides, as they look like they have been produced from poorly annotated Javadocs and don't really explain anything.
Thanks in advance.
Dean
I would just convert your tuple to the string format you want. For example, to create |-delimited output:
partitions.map{ tup => s"${tup._1}|${tup._2}|${tup._3}|${tup._4}|${tup._5}" }
or using pattern matching (which incurs a little more runtime overhead):
partitions.map{ case (a,b,c,d,e) => s"$a|$b|$c|$d|$e" }
I'm using the string interpolation feature of Scala (note the s"..." format).
Side note, you can simplify your example by just mapping over the RDD as a whole, rather than the individual partitions:
val parser = new CSVParser(',')
val partitions: RDD[(String, String, String, String, String)] =
  withoutHeader.map { line =>
    val columns = parser.parseLine(line)
    (columns(0), columns(1), columns(2), columns(3), columns(4))
  }
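Putting the pieces together, a minimal sketch of the save step (partitions and file are the names from the question above):
// Format each tuple as a pipe-delimited line before saving,
// so the output file contains no tuple parentheses.
partitions
  .map { case (a, b, c, d, e) => s"$a|$b|$c|$d|$e" }
  .saveAsTextFile(file)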
Related
I am reading in a file that has many spaces and need to filter out the spaces. Afterwards we need to convert it to a DataFrame. Example input below.
2017123 ¦ ¦10¦running¦00000¦111¦-EXAMPLE
My solution to this was the following function, which strips out all spaces and trims the file.
def truncateRDD(fileName: String): RDD[String] = {
  val example = sc.textFile(fileName)
  example.map(lines => lines.replaceAll("""[\t\p{Zs}]+""", ""))
}
However, I am not sure how to get it into a DataFrame. sc.textFile returns an RDD[String]. I tried the case class way, but the issue is that we have an 800-field schema, and a case class cannot go beyond 22 fields.
I was thinking of somehow converting RDD[String] to RDD[Row] so I can use the createDataFrame function.
val DF = spark.createDataFrame(rowRDD, schema)
Any suggestions on how to do this?
First, split/parse your strings into their fields.
rdd.map(line => parse(line)), where parse is some parsing function. It could be as simple as split, but you may want something more robust. This will get you an RDD[Array[String]] or similar.
You can then convert to an RDD[Row] with rdd.map(a => Row.fromSeq(a)).
From there you can convert to a DataFrame using sqlContext.createDataFrame(rdd, schema), where rdd is your RDD[Row] and schema is your StructType schema.
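A minimal end-to-end sketch of that recipe, reusing the truncateRDD helper from the question; the broken-bar delimiter, the 800-column count, and the all-string column types are assumptions taken from the sample data, so adjust them to your real schema:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Strip the whitespace as in the question, then split each line into its fields.
val fields: RDD[Array[String]] = truncateRDD("yourfilename").map(_.split("¦"))

// Convert each Array[String] into a Row.
val rowRDD: RDD[Row] = fields.map(a => Row.fromSeq(a))

// Build the wide schema programmatically instead of with a case class;
// here every column is assumed to be a string.
val schema = StructType((1 to 800).map(i => StructField(s"col$i", StringType, nullable = true)))

val df = sqlContext.createDataFrame(rowRDD, schema)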
In your case, a simple way is:
val RowOfRDD = truncateRDD("yourfilename").map(r => Row.fromSeq(r.split("¦"))) // split each line into its fields first; the delimiter is taken from the sample data
How do you solve the productArity issue if you are using Scala 2.10?
However, I am not sure how to get it into a DataFrame. sc.textFile returns an RDD[String]. I tried the case class way, but the issue is that we have an 800-field schema, and a case class cannot go beyond 22 fields.
Yes, there are some limitations like productArity, but we can overcome them...
You can do it like the example below for Scala versions earlier than 2.11:
prepare a class which extends Product and overrides the following methods:
productArity(): Int: returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any: given an index, this returns the attribute at that position. As protection, we also add a default case, which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean: the last of the three functions; it serves as a boundary condition when an equality check is being done against the class.
For an example implementation, you can refer to this Student example, which has 33 fields in it; an example student dataset description is here. A minimal sketch of the pattern is shown below.
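Here is that sketch with only three placeholder fields for brevity (the Record name and its fields are illustrative; the real class would enumerate all of its columns the same way):
class Record(val id: String, val status: String, val label: String)
  extends Product with Serializable {

  // Number of attributes; the real class from the answer would return 33.
  override def productArity: Int = 3

  // Return the attribute at the given index, guarding against bad indices.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => status
    case 2 => label
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // Boundary condition for equality checks against this class.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}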
I am trying to convert a DataFrame of multiple case classes to an RDD of these case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I am having the following:
case class randomClass(a: String, b: Double)
case class randomClass2(a: String, b: Seq[randomClass])
case class randomClass3(a: String, b: String)

val anRDD = sc.parallelize(Seq(
  (randomClass2("a", Seq(randomClass("a1", 1.1), randomClass("a2", 1.1))), randomClass3("aa", "aaa")),
  (randomClass2("b", Seq(randomClass("b1", 1.2), randomClass("b2", 1.2))), randomClass3("bb", "bbb")),
  (randomClass2("c", Seq(randomClass("c1", 3.2), randomClass("c2", 1.2))), randomClass3("cc", "Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have aDF, how can I get anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
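Putting it together, a minimal sketch; the implicits import is an assumption (spark.implicits._ on Spark 2.x, or sqlContext.implicits._ on 1.6), and it must already be in scope for toDF() and $ to work in the question's code:
import spark.implicits._   // or sqlContext.implicits._ on older versions

// Second column only: RDD[randomClass3]
val rdd3 = aDF.select($"_2.*").as[randomClass3].rdd

// The full tuple: RDD[(randomClass2, randomClass3)]
val fullRdd = aDF.as[(randomClass2, randomClass3)].rdd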
I don't know the Scala API well, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }
As you know, if you use saveAsTextFile on an RDD[(String, Int)], the output looks like this:
(T0000036162,1747)
(T0000066859,1704)
(T0000043861,1650)
(T0000075501,1641)
(T0000071951,1638)
(T0000075623,1638)
(T0000070102,1635)
(T0000043868,1627)
(T0000094043,1626)
You may want to use this file in Spark again, so what is the best practice for reading and parsing it? Should it be something like the following, or is there a more elegant way?
val lines = sc.textFile("result/hebe")
case class Foo(id: String, count: Long)
val parsed = lines
  .map(l => l.stripPrefix("(").stripSuffix(")").split(","))
  .map(l => new Foo(id = l(0), count = l(1).toLong))
It depends what you're looking for.
If you want something pretty, I'd consider adding a factory apply method to Foo's companion object which takes a single string, so you could have something like
lines.map(Foo(_))
And Foo would look like
case class Foo(id: String, count: Long)

object Foo {
  def apply(l: String): Foo = {
    val split = l.stripPrefix("(").stripSuffix(")").split(",")
    Foo(split(0), split(1).toLong)
  }
}
If you have no requirement to output the data in that format, then I'd consider saving it as a sequence file instead.
If performance isn't an issue then it's fine. I'd just say the most important thing is to isolate the text parsing, so that later you can unit test it and come back and easily edit it.
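For the sequence-file option, a minimal sketch; the pair values are taken from the sample output above and the output path is a placeholder, and on older Spark versions you may also need import org.apache.spark.SparkContext._ for the pair-RDD implicits:
// Illustrative pair RDD standing in for the one from the question.
val pairs = sc.parallelize(Seq(("T0000036162", 1747L), ("T0000066859", 1704L)))

// Write as a Hadoop sequence file; no text formatting or parsing round trip.
pairs.saveAsSequenceFile("result/hebe-seq")

// Read it back with the key/value types preserved.
val back = sc.sequenceFile[String, Long]("result/hebe-seq")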
You should either save it as a DataFrame, which will use the case class as a schema (allowing you to easily parse it back into Spark), or you should map out the individual components of your RDD before saving (i.e. remove the brackets, since they only make the file larger):
yourRDD.toDF("id","count").saveAsParquetFile(path)
When you load the DF back in, you can map its rows to get it back into an RDD if you want:
val RDDInput = input.map(x => (x.getAs[String]("id"), x.getAs[Int]("count")))
If you prefer to store as a text file, you could consider mapping the elements without the brackets:
yourRDD.map(x => s"${x._1}, ${x._2}")
The best way is to write a DataFrame to the file directly, instead of the RDD.
Code that writes the file:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
df.write.parquet("dir")
Code that reads the file:
val rdd = sqlContext.read.parquet("dir").rdd.map(row => (row.getString(0), row.getLong(1)))
Before calling saveAsTextFile, map each tuple to a comma-delimited string with map(x => x.productIterator.mkString(",")).
rdd.map(x => x.productIterator.mkString(",")).saveAsTextFile(path). The output will not have brackets.
The output of this will be:
T0000036162,1747
T0000066859,1704
In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff:
def foo(r: Row) = {
  val ix = (0 until r.schema.length).map(i => r.schema(i).name -> i).toMap
  val field1 = r.getString(ix("field1"))
  val field2 = r.getLong(ix("field2"))
  ...
}
dataframe.map(foo)
I figure there must be a better way - this is pretty verbose, it requires creating this extra structure, and it also requires knowing the types explicitly, which if incorrect, will produce a runtime exception rather than a compile-time error.
You can use "getAs" from org.apache.spark.sql.Row
r.getAs("field1")
r.getAs("field2")
Know more about getAs(java.lang.String fieldName)
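For example, the foo from the question could be rewritten as follows (a sketch, reusing the field names and types from the question):
def foo(r: Row) = {
  val field1 = r.getAs[String]("field1")
  val field2 = r.getAs[Long]("field2")
  (field1, field2)
}
dataframe.map(foo)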
This is not supported at this time in the Scala API. The closest you have is this JIRA titled "Support converting DataFrames to typed RDDs"
I have been trying to get the code working for inference from a trained Labeled LDA model and pLDA using the TMT toolbox (Stanford NLP Group).
I have gone through the examples provided in the following links:
http://nlp.stanford.edu/software/tmt/tmt-0.3/
http://nlp.stanford.edu/software/tmt/tmt-0.4/
Here is the code I'm trying for labeled LDA inference
val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7");
val model = LoadCVB0LabeledLDA(modelPath);
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val text = {
  source ~>                           // read from the source file
  Column(4) ~>                        // select column containing text
  TokenizeWith(model.tokenizer.get)   // tokenize with model's tokenizer
}
val labels = {
  source ~>                           // read from the source file
  Column(2) ~>                        // take column two, the year
  TokenizeWith(WhitespaceTokenizer())
}
val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
val dataset = LabeledLDADataset(text,labels,model.termIndex,model.topicIndex);
val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset);
val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
TSVFile(outputPath + "-word-topic-distributions.tsv").write({
  for ((terms, (dId, dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield {
    require(terms.id == dId);
    (terms.id,
      for ((term, dist) <- (terms.value zip dists)) yield {
        term + " " + dist.activeIterator.map({
          case (topic, prob) => model.topicIndex.get.get(topic) + ":" + prob
        }).mkString(" ");
      });
  }
});
Error
found : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
I understand it's a type mismatch error, but I don't know how to resolve it in Scala.
Basically, I don't understand how I should extract
1. the per-document topic distribution
2. the per-document label distribution
after the output of the infer command.
Please help.
The same applies in the case of pLDA: I reach the inference command, and after that I am clueless about what to do with it.
Scala's type system is much more complex than Java's, and understanding it will make you a better programmer. The problem lies here:
val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
because either model, dataset, or perDocTopicDistributions is of type:
scalanlp.collection.LazyIterable[(String, Array[Double])]
while EstimateLabeledLDAPerWordTopicDistributions.apply expects a
Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
The best way to investigate these type errors is to look at the ScalaDoc (for example, the one for TMT is here: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package ), and if you cannot easily find out where the problem lies, you should make the types of your variables explicit in your code, like the following:
val perDocTopicDistributions: LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
If we look together at the ScalaDoc of edu.stanford.nlp.tmt.stage:
def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]
def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]
It should now be clear to you that the return value of InferCVB0LabeledLDADocumentTopicDistributions cannot be used directly to feed EstimateLabeledLDAPerWordTopicDistributions.
I have never used Stanford NLP, but this is by design how the API works, so you only need to convert your scalanlp.collection.LazyIterable[(String, Array[Double])] into an Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.
If you look at the ScalaDoc for how to do this conversion, it's pretty simple. Inside the stage package, in package.scala, I can read import scalanlp.collection.LazyIterable;
So I know where to look, and in fact at http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable there is a toIterable method which turns a LazyIterable into an Iterable; you still have to transform the internal array into a SparseArray.
Again, I look into the package.scala for the stage package inside TMT and I see import scalala.collection.sparse.SparseArray; and I look at the Scalala documentation:
http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray
The constructors seem complicated to me, so it sounds like I would have to look into the companion object for a factory method. It turns out that the method I am looking for is there, and it's called apply, as usual in Scala.
def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]
By using this, you can write a function with the following signature:
def f: Array[Double] => SparseArray[Double]
Once this is done, you can turn the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy iterable of sparse arrays with one line of code:
result.toIterable.map { case (name, values) => (name, f(values)) }
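A sketch of how the pieces might fit together, using the apply factory quoted above (here result from the line above is the perDocTopicDistributions value returned by InferCVB0LabeledLDADocumentTopicDistributions in the question's code):
import scalanlp.collection.LazyIterable
import scalala.collection.sparse.SparseArray

// Turn a dense Array[Double] into a SparseArray[Double] via the companion-object factory.
def f(values: Array[Double]): SparseArray[Double] = SparseArray(values: _*)

// Non-lazy, SparseArray-valued iterable matching the expected parameter type.
val converted: Iterable[(String, SparseArray[Double])] =
  perDocTopicDistributions.toIterable.map { case (name, values) => (name, f(values)) }

val perDocTermTopicDistributions =
  EstimateLabeledLDAPerWordTopicDistributions(model, dataset, converted)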