How to get the alias of a Spark Column as String? - scala

If I declare a Column in a val, like this:
import org.apache.spark.sql.functions._
val col: org.apache.spark.sql.Column = count("*").as("col_name")
col is of type org.apache.spark.sql.Column. Is there a way to access its name ("col_name")?
Something like:
col.getName() // returns "col_name"
In this case, col.toString returns "count(1) AS col_name"

Try below code.
scala> val cl = count("*").as("col_name")
cl: org.apache.spark.sql.Column = count(1) AS `col_name`
scala> cl.expr.argString
res14: String = col_name
scala> cl.expr.productElement(1).asInstanceOf[String]
res24: String = col_name
scala> val cl = count("*").cast("string").as("column_name")
cl: org.apache.spark.sql.Column = CAST(count(1) AS STRING) AS `column_name`
scala> cl.expr.argString
res113: String = column_name
Note that the order of .as and .cast matters in the above code: if you apply .cast after .as, argString will no longer give you the alias.
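If you prefer not to rely on argString or productElement positions, a sketch of a more direct approach is to pattern match on the underlying Catalyst expression. Alias lives in org.apache.spark.sql.catalyst.expressions, which is an internal API and may change between Spark versions; the helper name below is made up for illustration.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Alias

// Returns the alias if the column's top-level expression is an Alias.
// Alias is a Catalyst internal, so this is version-dependent.
def aliasName(c: Column): Option[String] = c.expr match {
  case a: Alias => Some(a.name)
  case _        => None
}

aliasName(count("*").as("col_name"))                 // Some(col_name)
aliasName(count("*").cast("string").as("col_name"))  // Some(col_name)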
You can also use json4s to extract name from expr.toJSON
scala> import org.json4s._
import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.JsonMethods._
scala> implicit val formats = DefaultFormats
formats: org.json4s.DefaultFormats.type = org.json4s.DefaultFormats$@16cccda5
scala> val cl = count("*").as("column_name").cast("string") // Used cast last.
cl: org.apache.spark.sql.Column = CAST(count(1) AS `column_name` AS STRING)
scala> (parse(cl.expr.toJSON) \\ "name").extract[String]
res104: String = column_name

Another easy way: in the column's toString output, the alias is always wrapped in backtick (`) characters, so you can either use a regex or split the string and take the element at index 1.
with split,
col.toString.split("`")(1)
with regex,
val pattern = "`(.*)`".r
pattern.findFirstMatchIn(col.toString).get.group(1)
The advantage of this approach is that it still works even if you add something like .cast("string") to your column.
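For example, a quick check sketch (assuming the same functions import as in the question) that the backtick split survives an added cast; expected output is shown as comments, matching the toString format seen above:
val c = count("*").cast("string").as("col_name")
c.toString                 // CAST(count(1) AS STRING) AS `col_name`
c.toString.split("`")(1)   // col_name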

Related

Why does play-json lose precision while reading/parsing?

In the following example (scala 2.11 and play-json 2.13)
val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
println((Json.parse(j) \ "t").as[BigDecimal].compare(BigDecimal("2.2599999999999997868371792719699442386627197265625")))
The output is -1. Shouldn't they be equal ? On printing the parsed value, it prints rounded off value:
println((Json.parse(j) \ "t").as[BigDecimal]) gives 2.259999999999999786837179271969944
The problem is that by default play-json configures the Jackson parser with the MathContext set to DECIMAL128. You can fix this by setting the play.json.parser.mathContext system property to unlimited. For example, in a Scala REPL that would look like this:
scala> System.setProperty("play.json.parser.mathContext", "unlimited")
res0: String = null
scala> val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
j: String = {"t":2.2599999999999997868371792719699442386627197265625}
scala> import play.api.libs.json.Json
import play.api.libs.json.Json
scala> val res = (Json.parse(j) \ "t").as[BigDecimal]
res: BigDecimal = 2.2599999999999997868371792719699442386627197265625
scala> val expected = BigDecimal("2.2599999999999997868371792719699442386627197265625")
expected: scala.math.BigDecimal = 2.2599999999999997868371792719699442386627197265625
scala> res.compare(expected)
res1: Int = 0
Note that setProperty should happen first, before any reference to Json. In normal (non-REPL) use you'd set the property via -D on the command line or in your launch configuration.
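If you can't change the system property at all, a rougher sketch is to round the expected value with the same DECIMAL128 context the default parser uses, so both sides lose precision identically (this only makes the comparison consistent, it does not recover the lost digits):
import java.math.MathContext

// Round the expected value the same way the default Jackson/play-json parser does.
val expected128 = BigDecimal("2.2599999999999997868371792719699442386627197265625", MathContext.DECIMAL128)
// With play-json's default settings (no system property set), the parsed value is
// also DECIMAL128-rounded, so comparing it against expected128 would give 0.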
Alternatively you could use Jawn's play-json parsing support, which just works as expected off the shelf:
scala> val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
j: String = {"t":2.2599999999999997868371792719699442386627197265625}
scala> import org.typelevel.jawn.support.play.Parser
import org.typelevel.jawn.support.play.Parser
scala> val res = (Parser.parseFromString(j).get \ "t").as[BigDecimal]
res: BigDecimal = 2.2599999999999997868371792719699442386627197265625
Or for that matter you could switch to circe:
scala> import io.circe.Decoder, io.circe.jawn.decode
import io.circe.Decoder
import io.circe.jawn.decode
scala> decode(j)(Decoder[BigDecimal].prepare(_.downField("t")))
res0: Either[io.circe.Error,BigDecimal] = Right(2.2599999999999997868371792719699442386627197265625)
…which handles a range of number-related corner cases more responsibly than play-json in my view. For example:
scala> val big = "1e2147483648"
big: String = 1e2147483648
scala> io.circe.jawn.parse(big)
res0: Either[io.circe.ParsingFailure,io.circe.Json] = Right(1e2147483648)
scala> play.api.libs.json.Json.parse(big)
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:491)
at java.math.BigDecimal.<init>(BigDecimal.java:824)
at scala.math.BigDecimal$.apply(BigDecimal.scala:287)
at play.api.libs.json.jackson.JsValueDeserializer.parseBigDecimal(JacksonJson.scala:146)
...
But that's out of scope for this question.
To be honest I'm not sure why play-json defaults to DECIMAL128 for the MathContext, but that's a question for the play-json maintainers, and is also out of scope here.

How to create DataFrame: The number of columns doesn't match

I get the error:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (4): _1, _2, _3, _4
New column names (1): 'srcId', 'srcLabel', 'dstId', 'dstLabel'
in this code:
val columnNames = """'srcId', 'srcLabel', 'dstId', 'dstLabel'"""
import spark.sqlContext.implicits._
var df = Seq.empty[(String, String, String, String)]
.toDF(columnNames)
The problem with your approach is that columnNames is a single String, while toDF expects one name per column of your Tuple4, i.e. four names. So you have to split the columnNames string into four strings and pass them to toDF.
Correct way is to do it as following
val columnNames = """'srcId', 'srcLabel', 'dstId', 'dstLabel'"""
var df = Seq.empty[(String, String, String, String)]
.toDF(columnNames.split(","): _*)
which should give you an empty dataframe as
+-------+-----------+--------+-----------+
|'srcId'| 'srcLabel'| 'dstId'| 'dstLabel'|
+-------+-----------+--------+-----------+
+-------+-----------+--------+-----------+
I hope the answer is helpful
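If you also want plain column names without the quotes and stray spaces, a small cleanup sketch on top of that split (keeping the original columnNames string from the question):
val cleaned = columnNames.split(",").map(_.trim.stripPrefix("'").stripSuffix("'"))
val df = Seq.empty[(String, String, String, String)].toDF(cleaned: _*)
// df: org.apache.spark.sql.DataFrame = [srcId: string, srcLabel: string ... 2 more fields]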
scala> val columnNames = Seq("srcId", "srcLabel", "dstId", "dstLabel")
columnNames: Seq[String] = List(srcId, srcLabel, dstId, dstLabel)
scala> var d = Seq.empty[(String, String, String, String)].toDF(columnNames: _*)
d: org.apache.spark.sql.DataFrame = [srcId: string, srcLabel: string ... 2 more fields]

difference between pipe and comma delimiter in spark-scala

Can someone tell me why we have two separate ways of representing pipe (|) and comma (,)? Like
sc.textFile(file).map( x => x.split(","))
for comma, and
sc.textFile(file).map( x => x.split('|'))
for pipe.
If I keep both in double quotes, it fails with the pipe but gives me the correct result with the comma.
Below is the full code which I am running
package com.rakesh.singh
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
object MPMovie {
  def namex(x: String) = {
    val fields = x.split('|')
    val id = fields(0).toInt
    val name = fields(1).toString
    (id, name)
  }

  def main(rakesh: Array[String]) = {
    Logger.getLogger("yoyo").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("Movies").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("F:/Raakesh/ml-100k/movies.data")
    val names = sc.textFile("F:/Raakesh/ml-100k/names.data")
    val mappednames = names.map(namex)
    val splited = rdd.map(x => (x.split("\t")(1).toInt, 1))
    //.map(x => (x, 1))
    val counteachmovie = splited.reduceByKey((a, b) => a + b).map(x => (x._2, x._1))
    val mpm = counteachmovie.max()
    println(s"the final value of mpm is $mpm")
    mappednames.foreach(println)
    val finalname = mappednames.lookup(mpm._2)(0)
    println(s"the final value of mpm is $finalname")
  }
}
and data files are
movies.data
196 101 3 881250949
186 101 3 891717742
22 103 1 878887116
244 102 2 880606923
names.data
101|Sajan
102|Mela
103|Hum
There are two different split methods:
The split(",") method comes originally from String.split(regex: String), it works with arbitrary regexes as separators, e.g.
scala> "helloABCworldCABfooBBACCAbar".split("[ABC]+")
res0: Array[String] = Array(hello, world, foo, bar)
The other split('|') comes from StringOps.split(separator: Char) and is rather like a generic Scala-collection operation. It doesn't interpret its argument as a regex, but it works on all StringLike collections, for example on StringBuilders:
scala> val b = new StringBuilder
b: StringBuilder =
scala> b ++= "hello|"
res2: b.type = hello|
scala> b ++= "world"
res3: b.type = hello|world
scala> b.split('|')
res4: Array[String] = Array(hello, world)
The "|" doesn't work with the first method, because it's a nonsensical "OR"-regex. In order to use the pipe | with the split(regex: String) version, you either have to escape it like this "\\|" or (often easier) to enclose it into "[|]"-character class.

How to iterate scala wrappedArray? (Spark)

I perform the following operations:
val tempDict = sqlContext.sql("""select words.pName_token, collect_set(words.pID) as docids
                                 from words
                                 group by words.pName_token""").toDF()
val wordDocs = tempDict.filter(newDict("pName_token") === word)
val listDocs = wordDocs.map(t => t(1)).collect()
listDocs: Array[Any] = Array(WrappedArray(123, 234, 205876618, 456))
My question is how do I iterate over this wrapped array or convert this into a list?
The options I get for the listDocs are apply, asInstanceOf, clone, isInstanceOf, length, toString, and update.
How do I proceed?
Here is one way to solve this.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val df = sqlContext.createDataFrame(data)
val first = df.first
// getAs lets you read the value with the expected collection type
val mapped = first.getAs[WrappedArray[Int]](0)
// now we can use it like normal collection
mapped.mkString("\n")
// collect the rows and pattern match on the array columns
val rows = df.collect.map {
  case Row(a: Seq[Any], b: Seq[Any], c: Seq[Any]) =>
    (a, b, c)
}
rows.mkString("\n")
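Applied to the collected result from the question, a sketch using its listDocs value (Any is used as the element type since the actual type isn't shown):
import scala.collection.mutable.WrappedArray

// each element of listDocs is a WrappedArray; cast it and convert to an ordinary List
val docIdLists = listDocs.map(_.asInstanceOf[WrappedArray[Any]].toList)
docIdLists.foreach(_.foreach(println))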

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates the expression for its side effects and returns Unit; it does not collect any results while iterating. You probably need map or flatMap (see docs here).
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which desugars in this case into map),
for (s <- head.slice(1,11)) yield splitData(s)
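A minimal plain-Scala sketch (no Spark needed) of the difference in result types:
val xs = Array("a,b", "c,d")
val r1 = xs.foreach(_.split(","))   // r1: Unit -- the split results are thrown away
val r2 = xs.map(_.split(","))       // r2: Array[Array[String]]
val r3 = xs.flatMap(_.split(","))   // r3: Array[String] = Array(a, b, c, d)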
Note also that Scala strings are equipped with ordered-collection methods, thus
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
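Likewise, the split8 line, splits(8).substring(0, splits(8).length - 1), could be written as
splits(8).dropRight(1)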