So I have defined a class here:
class CsvEntry(val organisation: String, val yearAndQuartal : String, val medKF:Int, val trueOrFalse: Int, val name: String, val money:String){
override def toString = s"$organisation, $yearAndQuartal, $medKF, $trueOrFalse, $name, $money"
}
Let's assume the value of "medKF" is either 2, 4, or 31.
How can I filter all CsvEntries so that only those where "medKF" is 2 are printed (with foreach println)?
Creating two random entries:
val a = new CsvEntry("s", "11", 12, 0, "asdasd", "1")
val b = new CsvEntry("s", "11", 2, 0, "asdasd", "234,123.23")
Then filter:
List(a,b).withFilter(_.medKF == 2).foreach(println)
To sum the medKF values of the entries:
List(a,b).map(_.medKF).sum
To add the money:
def moneyToCent(money: String): Long = (money.replace(",","").toDouble*100).toLong
List(a,b).withFilter(_.medKF == 2).map(x => moneyToCent(x.money)).sum
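As a quick sanity check of moneyToCent (a sketch assuming the definition above; note that going through Double truncates, so some inputs can lose a cent, and BigDecimal would be the safer choice):
moneyToCent("1")          // 100
moneyToCent("10.50")      // 1050
moneyToCent("0.29")       // 28, because 0.29 * 100 is 28.999... as a Double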
I want to parse a column into split values, using a function that takes a Seq of objects.
case class RawData(rawId: String, rawData: String)
case class SplitData(
rawId: String,
rawData: String,
split1: Option[Int],
split2: Option[String],
split3: Option[String],
split4: Option[String]
)
def rawDataParser(unparsedRawData: Seq[RawData]): Seq[SplitData] = {
  unparsedRawData.map { rawData =>
    val split = rawData.rawData.split(", ")
    SplitData(
      rawId = rawData.rawId,
      rawData = rawData.rawData,
      split1 = Some(split(0).toInt),
      split2 = Some(split(1)),
      split3 = Some(split(2)),
      split4 = Some(split(3))
    )
  }
}
val rawDataDF = Seq[(String, String)](
  ("001", "Split1, Split2, Split3, Split4"),
  ("002", "Split1, Split2, Split3, Split4")
).toDF("rawId", "rawData")
val rawDataDS: Dataset[RawData] = rawDataDF.as[RawData]
I need to use the rawDataParser function to parse my rawData. However, the parameter to the function is of type Seq. I am not sure how I should convert rawDataDS into an input to the function to parse the raw data. Some guidance on how to solve this would be appreciated.
Each Dataset is divided into partitions. You can use mapPartitions with a mapping Iterator[T] => Iterator[U] to convert a Dataset[T] into a Dataset[U].
So you can just use your parser (called addressParser in the snippet below) as the argument to mapPartitions.
val rawAddressDataDS =
spark.read
.option("header", "true")
.csv(csvFilePath)
.as[AddressRawData]
val addressDataDS =
rawAddressDataDS
.map { rad =>
AddressData(
addressId = rad.addressId,
address = rad.address,
number = None,
road = None,
city = None,
country = None
)
}
.mapPartitions { unparsedAddresses =>
addressParser(unparsedAddresses.toSeq).toIterator
}
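The same pattern applied to the question's own types would look roughly like this (a sketch, assuming the corrected rawDataParser above that returns Seq[SplitData] and that spark.implicits._ is in scope):
// mapPartitions hands the parser one partition at a time as an Iterator,
// so the Seq-based rawDataParser can be reused without collecting the whole Dataset.
val splitDataDS: Dataset[SplitData] =
  rawDataDS.mapPartitions(partition => rawDataParser(partition.toSeq).iterator)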
I have a DataSet[Metric] and transform it to a KeyValueGroupedDataset (grouping by metricId) in order to then perform reduceGroups.
The problem I've faced is that when there is just one record with some metricId, like metric3 in the example below, it is returned as-is and the processTime field is not updated. However, when there is more than one record with the same metricId, they are reduced and the processTime field is updated correctly.
I guess this happens because reduceGroups needs at least two records in a group and otherwise just returns the single record unchanged.
But I can't figure out how to update the processTime field when there is only a single record in a group.
case class Metric(
  metricId: String,
  rank: Int,
  features: List[Feature],
  processTime: Timestamp
)
case class Feature (
featureId: String,
name: String,
value: String
)
val f1 = Feature("1", "f1", "v1")
val f2 = Feature("1", "f2", "v2")
val f3 = Feature("2", "f3", "v3")
val metric1 = Metric("1", 1, List(f1, f2, f3), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric2 = Metric("1", 2, List(f3, f2), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric3 = Metric("2", 1, List(f1, f2), Timestamp.valueOf("2019-07-21 00:00:00"))
val metricsList = List(metric1, metric2, metric3)
val groupedMetrics: KeyValueGroupedDataset[String, Metric] = metricsList.toDS().groupByKey(x => x.metricId)
val aggregatedMetrics: Dataset[(String, Metric)] = groupedMetrics.reduceGroups {
  (m1: Metric, m2: Metric) =>
    val theMetric: Metric = if (m2.rank >= m1.rank) m2 else m1
    Metric(
      theMetric.metricId,
      theMetric.rank,
      m2.features ++ m1.features,
      Timestamp.valueOf(LocalDateTime.now())
    )
}
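Not from the original post, but one way to handle the single-record case is mapGroups, which is invoked for every group (including singletons), so processTime can be set unconditionally. A minimal sketch, assuming the same case classes and that spark.implicits._ is in scope:
val aggregated: Dataset[Metric] = groupedMetrics.mapGroups { (_, metrics) =>
  // Reduce the group as before: keep the higher-ranked metric and merge the features.
  val reduced = metrics.reduce { (m1, m2) =>
    val best = if (m2.rank >= m1.rank) m2 else m1
    best.copy(features = m1.features ++ m2.features)
  }
  // processTime is refreshed even when the group contained a single record.
  reduced.copy(processTime = Timestamp.valueOf(LocalDateTime.now()))
}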
I am trying to define a member function in a class to be used as a UDF while parsing data from a JSON file. I am using a trait to define a set of methods and a class to override those methods.
trait geouastr {
def getGeoLocation(ipAddress: String): Map[String, String]
def uaParser(ua: String): Map[String, String]
}
class GeoUAData(appName: String, sc: SparkContext, conf: SparkConf, combinedCSV: String) extends geouastr with Serializable {
val spark = SparkSession.builder.config(conf).getOrCreate()
val GEOIP_FILE_COMBINED = combinedCSV;
val logger = LogFactory.getLog(this.getClass)
val allDF = spark.
read.
option("header","true").
option("inferSchema", "true").
csv(GEOIP_FILE_COMBINED).cache
val emptyMap = Map(
"country" -> "",
"state" -> "",
"city" -> "",
"zipCode" -> "",
"latitude" -> 0.0.toString(),
"longitude" -> 0.0.toString())
override def getGeoLocation(ipAddress: String): Map[String, String] = {
val ipLong = ipToLong(ipAddress)
try {
logger.error("Entering UDF " + ipAddress + " allDF " + allDF.count())
val resultDF = allDF.
filter(allDF("network").cast("long") <= ipLong.get).
filter(allDF("broadcast") >= ipLong.get).
select(allDF("country_name"), allDF("subdivision_1_name"),allDF("city_name"),
allDF("postal_code"),allDF("latitude"),allDF("longitude"))
val matchingDF = resultDF.take(1)
val matchRow = matchingDF(0)
logger.error("Lookup for " + ipAddress + " Map " + matchRow.toString())
// Return this Map as the value of the try block
Map(
"country" -> nullCheck(matchRow.getAs[String](0)),
"state" -> nullCheck(matchRow.getAs[String](1)),
"city" -> nullCheck(matchRow.getAs[String](2)),
"zipCode" -> nullCheck(matchRow.getAs[String](3)),
"latitude" -> matchRow.getAs[Double](4).toString(),
"longitude" -> matchRow.getAs[Double](5).toString())
} catch {
case (nse: NoSuchElementException) => {
logger.error("No such element", nse)
emptyMap
}
case (npe: NullPointerException) => {
logger.error("NPE for " + ipAddress + " allDF " + allDF.count(),npe)
emptyMap
}
case (ex: Exception) => {
logger.error("Generic exception " + ipAddress,ex)
emptyMap
}
}
}
def nullCheck(input: String): String = {
if(input != null) input
else ""
}
override def uaParser(ua: String): Map[String, String] = {
val client = Parser.get.parse(ua)
return Map(
"os"->client.os.family,
"device"->client.device.family,
"browser"->client.userAgent.family)
}
def ipToLong(ip: String): Option[Long] = {
Try(ip.split('.').ensuring(_.length == 4)
.map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
.zip(Array(256L * 256L * 256L, 256L * 256L, 256L, 1L))
.map { case (x, y) => x * y }
.sum).toOption
}
}
I notice that uaParser works fine, while getGeoLocation returns emptyMap (it runs into an NPE). Here is a snippet that shows how I am using this in the main method.
val appName = "SampleApp"
val conf: SparkConf = new SparkConf().setAppName(appName)
val sc: SparkContext = new SparkContext(conf)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val geouad = new GeoUAData(appName, sc, conf, args(1))
val uaParser = Sparkudf(geouad.uaParser(_: String))
val geolocation = Sparkudf(geouad.getGeoLocation(_: String))
val sampleRdd = sc.textFile(args(0))
val json = sampleRdd.filter(_.nonEmpty)
import spark.implicits._
val sampleDF = spark.read.json(json)
val columns = sampleDF.select($"user-agent", $"source_ip")
.withColumn("sourceIp", $"source_ip")
.withColumn("geolocation", geolocation($"source_ip"))
.withColumn("uaParsed", uaParser($"user-agent"))
.withColumn("device", ($"uaParsed") ("device"))
.withColumn("os", ($"uaParsed") ("os"))
.withColumn("browser", ($"uaParsed") ("browser"))
.withColumn("country" , ($"geolocation")("country"))
.withColumn("state" , ($"geolocation")("state"))
.withColumn("city" , ($"geolocation")("city"))
.withColumn("zipCode" , ($"geolocation")("zipCode"))
.withColumn("latitude" , ($"geolocation")("latitude"))
.withColumn("longitude" , ($"geolocation")("longitude"))
.drop("geolocation")
.drop("uaParsed")
Questions:
1. Should we switch from a class to an object for defining UDFs? (I can keep it as a singleton.)
2. Can a class member function be used as a UDF?
3. When such a UDF is invoked, will a class member like allDF remain initialized?
4. A val declared as a member variable - will it get initialized at the time geouad is constructed?
I am new to Scala; thanks in advance for any guidance/suggestions.
1. No, switching from a class to an object is not necessary for defining a UDF; it only changes how the UDF is called.
2. Yes, you can use a class member function as a UDF, but you first need to register it as a UDF:
spark.sqlContext.udf.register("registeredName", instanceName.methodName _)
3. No; the other members are initialized again when the UDF is called.
4. Yes, vals declared as class members are initialized when geouad is constructed and the corresponding actions are performed.
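As a minimal sketch (not from the original answer), assuming the geouad instance and sampleDF from the question and that spark.implicits._ is in scope, a member function can be wrapped or registered like this:
import org.apache.spark.sql.functions.{expr, udf}
// Eta-expand the member function into a Scala function and wrap it as a UDF.
val uaParserUdf = udf(geouad.uaParser(_: String))
// Or register it by name so it can be used inside SQL expressions.
spark.udf.register("uaParser", geouad.uaParser _)
val parsed = sampleDF
  .withColumn("uaParsed", uaParserUdf($"user-agent"))
  .withColumn("uaParsedSql", expr("uaParser(`user-agent`)"))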
Hi everyone!
I have 3 classes:
case class Foo(id: String, bars: Option[List[Bar]])
case class Bar(id: String, buzzes: Option[List[Buz]])
case class Buz(id: String, name: String)
And the collection:
val col: Option[List[Foo]] = ???
I need to get:
val search: String = "find me"
val (x: Option[Foo], y: Option[Bar], z: Option[Buz]) = where buz.name == search ???
Help please :)
Update:
I have this JSON:
{
"foos": [{
"id": "...",
"bars": [{
"id": "...",
"buzzes": [{
"id": "...",
"name": "find me"
}]
}]
}]
}
and the name will be unique in the current context.
My first thought was to transform the collection into a list of tuples, like this:
{
(foo)(bar)(buz),
(foo)(bar)(buz),
(foo)(bar)(buz)
}
and then filter by the buz whose name == search.
But I don't know how to do that :)
A big problem here is that after digging down far enough to find what you're looking for, the result type is going to reflect that morass of types: Option[List[Option[List...etc.
It's going to be easier (not necessarily better) to save off the find as a side effect.
val bz1 = Buz("bz1", "blah")
val bz2 = Buz("bz2", "target")
val bz3 = Buz("bz3", "bliss")
val br1 = Bar("br1", Option(List(bz1,bz3)))
val br2 = Bar("br2", None)
val br3 = Bar("br3", Option(List(bz1,bz2)))
val fo1 = Foo("fo1", Option(List(br1,br2)))
val fo2 = Foo("fo2", None)
val fo3 = Foo("fo3", Option(List(br2,br3)))
val col: Option[List[Foo]] = Option(List(fo1,fo2,fo3))
import collection.mutable.MutableList
val res:MutableList[(String,String,String)] = MutableList()
col.foreach(_.foreach(f =>
f.bars.foreach(_.foreach(br =>
br.buzzes.foreach(_.collect{
case bz if bz.name == "target" => res.+=((f.id,br.id,bz.id))})))))
res // res: MutableList[(String, String, String)] = MutableList((fo3,br3,bz2))
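Not part of the original answer, but for comparison, a side-effect-free sketch using a for-comprehension over the flattened Options, assuming the same sample data:
val matches: List[(String, String, String)] = for {
  foos   <- col.toList          // Option[List[Foo]] -> List[List[Foo]]
  foo    <- foos
  bars   <- foo.bars.toList
  bar    <- bars
  buzzes <- bar.buzzes.toList
  buz    <- buzzes
  if buz.name == "target"
} yield (foo.id, bar.id, buz.id)
matches // List((fo3,br3,bz2))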
I wrote code like this:
val hashingTF = new HashingTF()
val tfv: RDD[Vector] = sparkContext.parallelize(articlesList.map { t => hashingTF.transform(t.words) })
tfv.cache()
val idf = new IDF().fit(tfv)
val rate: RDD[Vector] = idf.transform(tfv)
How to get top 5 keywords from the "rate" RDD for each articlesList item?
Update:
articlesList contains objects of:
case class ArticleInfo (val url: String, val author: String, val date: String, val keyWords: List[String], val words: List[String])
words contains all the words from the article.
I do not understand the structure of rate; the documentation says:
@return an RDD of TF-IDF vectors
My solution is:
(articlesList, rate.collect()).zipped.foreach { (art,tfidf) =>
val keywords = new mutable.TreeSet[(String, Double)]
art.words.foreach { word =>
val wordHash = hashingTF.indexOf(word)
val wordTFIDF = tfidf.apply(wordHash)
if (keywords.size == KEYWORD_COUNT) {
val minimum = keywords.minBy(_._2)
if (minimum._2 < wordTFIDF) {
keywords.remove(minimum)
keywords.add((word,wordTFIDF))
}
} else {
keywords.add((word,wordTFIDF))
}
}
art.keyWords = keywords.toList.map(_._1)
}
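For comparison, a more compact sketch of the same idea (not from the original post), assuming the articlesList, hashingTF and rate defined above and that KEYWORD_COUNT is 5:
// For each article, score every distinct word by its TF-IDF value and keep the 5 highest.
val topKeywords =
  (articlesList, rate.collect()).zipped.map { (art, tfidf) =>
    art.words.distinct
      .map(word => word -> tfidf(hashingTF.indexOf(word)))
      .sortBy(-_._2)
      .take(5)
      .map(_._1)
  }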