I am trying to define a member function in a class that will be used as a UDF while parsing data from a JSON file. I am using a trait to define a set of methods and a class to override those methods.
trait geouastr {
def getGeoLocation(ipAddress: String): Map[String, String]
def uaParser(ua: String): Map[String, String]
}
class GeoUAData(appName: String, sc: SparkContext, conf: SparkConf, combinedCSV: String) extends geouastr with Serializable {
val spark = SparkSession.builder.config(conf).getOrCreate()
val GEOIP_FILE_COMBINED = combinedCSV;
val logger = LogFactory.getLog(this.getClass)
val allDF = spark.
read.
option("header","true").
option("inferSchema", "true").
csv(GEOIP_FILE_COMBINED).cache
val emptyMap = Map(
"country" -> "",
"state" -> "",
"city" -> "",
"zipCode" -> "",
"latitude" -> 0.0.toString(),
"longitude" -> 0.0.toString())
override def getGeoLocation(ipAddress: String): Map[String, String] = {
val ipLong = ipToLong(ipAddress)
try {
logger.error("Entering UDF " + ipAddress + " allDF " + allDF.count())
val resultDF = allDF.
filter(allDF("network").cast("long") <= ipLong.get).
filter(allDF("broadcast") >= ipLong.get).
select(allDF("country_name"), allDF("subdivision_1_name"),allDF("city_name"),
allDF("postal_code"),allDF("latitude"),allDF("longitude"))
val matchingDF = resultDF.take(1)
val matchRow = matchingDF(0)
logger.error("Lookup for " + ipAddress + " Map " + matchRow.toString())
Map(
"country" -> nullCheck(matchRow.getAs[String](0)),
"state" -> nullCheck(matchRow.getAs[String](1)),
"city" -> nullCheck(matchRow.getAs[String](2)),
"zipCode" -> nullCheck(matchRow.getAs[String](3)),
"latitude" -> matchRow.getAs[Double](4).toString(),
"longitude" -> matchRow.getAs[Double](5).toString())
} catch {
case (nse: NoSuchElementException) => {
logger.error("No such element", nse)
emptyMap
}
case (npe: NullPointerException) => {
logger.error("NPE for " + ipAddress + " allDF " + allDF.count(),npe)
emptyMap
}
case (ex: Exception) => {
logger.error("Generic exception " + ipAddress,ex)
emptyMap
}
}
}
def nullCheck(input: String): String = {
if(input != null) input
else ""
}
override def uaParser(ua: String): Map[String, String] = {
val client = Parser.get.parse(ua)
return Map(
"os"->client.os.family,
"device"->client.device.family,
"browser"->client.userAgent.family)
}
def ipToLong(ip: String): Option[Long] = {
Try(ip.split('.').ensuring(_.length == 4)
.map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
.zip(Array(256L * 256L * 256L, 256L * 256L, 256L, 1L))
.map { case (x, y) => x * y }
.sum).toOption
}
}
I notice that uaParser works fine, while getGeoLocation returns emptyMap (it runs into an NPE). Here is a snippet that shows how I am using these in the main method.
val appName = "SampleApp"
val conf: SparkConf = new SparkConf().setAppName(appName)
val sc: SparkContext = new SparkContext(conf)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val geouad = new GeoUAData(appName, sc, conf, args(1))
val uaParser = Sparkudf(geouad.uaParser(_: String))
val geolocation = Sparkudf(geouad.getGeoLocation(_: String))
val sampleRdd = sc.textFile(args(0))
val json = sampleRdd.filter(_.nonEmpty)
import spark.implicits._
val sampleDF = spark.read.json(json)
val columns = sampleDF.select($"user-agent", $"source_ip")
.withColumn("sourceIp", $"source_ip")
.withColumn("geolocation", geolocation($"source_ip"))
.withColumn("uaParsed", uaParser($"user-agent"))
.withColumn("device", ($"uaParsed") ("device"))
.withColumn("os", ($"uaParsed") ("os"))
.withColumn("browser", ($"uaParsed") ("browser"))
.withColumn("country" , ($"geolocation")("country"))
.withColumn("state" , ($"geolocation")("state"))
.withColumn("city" , ($"geolocation")("city"))
.withColumn("zipCode" , ($"geolocation")("zipCode"))
.withColumn("latitude" , ($"geolocation")("latitude"))
.withColumn("longitude" , ($"geolocation")("longitude"))
.drop("geolocation")
.drop("uaParsed")
Questions:
1. Should we switch from a class to an object for defining UDFs? (I can keep it as a singleton.)
2. Can a class member function be used as a UDF?
3. When such a UDF is invoked, will a class member like allDF remain initialized?
4. Will a val declared as a member variable be initialized at the time geouad is constructed?
I am new to Scala. Thanks in advance for guidance/suggestions.
No, switching from a class to an object is not necessary for defining a UDF; it only changes how the UDF is called.
Yes, you can use a class member function as a UDF, but first you need to register the function as a UDF:
spark.sqlContext.udf.register("registeredName", classInstance.methodName _)
No; class members like allDF will not remain initialized when the UDF is invoked.
Yes, a val declared as a member variable will be initialized at the time geouad is constructed.
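For illustration, here is a minimal sketch of wrapping and registering the member methods with the standard org.apache.spark.sql.functions.udf helper (the val names are assumed; this mirrors the question's code rather than prescribing it):

import org.apache.spark.sql.functions.udf

// wrap the member methods for use in DataFrame expressions
val uaParserUdf = udf(geouad.uaParser(_: String))
val geolocationUdf = udf(geouad.getGeoLocation(_: String))

// or register one by name so it can also be called from Spark SQL
spark.udf.register("uaParser", geouad.uaParser(_: String))

Because the UDF closure is shipped to the executors, the enclosing instance must be serializable, which is why GeoUAData extends Serializable above.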
I started learning Scala and Apache Spark. I have an input file as below, without a header.
0,name1,33,385 - first record
1,name2,26,221 - second record
The columns are: unique-id, name, age, friends.
1) When trying to filter for ages that are not 26, the code below is not working.
def parseLine(x : String) =
{
val line = x.split(",").filter(x => x._2 != "26")
}
I also tried the following; in both cases it prints all the values, including 26.
val friends = line(2).filter(x => x != "26")
2) When trying with index x._3, it says the index is out of bounds.
val line = x.split(",").filter(x => x._3 != "221")
Why does index 3 have an issue here?
Please find below the complete sample code.
package learning
import org.apache.spark._
import org.apache.log4j._
object Test1 {
def main(args : Array[String]): Unit =
{
val sc = new SparkContext("local[*]", "Test1")
val lines = sc.textFile("D:\\SparkScala\\abcd.csv")
Logger.getLogger("org").setLevel(Level.ERROR)
val testres = lines.map(parseLine)
testres.take(10).foreach(println)
}
def parseLine(x : String) =
{
val line = x.split(",").filter(x => x._2 != "33")
//val line = x.split(",").filter(x => x._3 != "307")
val age = line(1)
val friends = line(3).filter(x => x != "307")
(age,friends)
}
}
How can I filter by age or friends in a simple way here?
Why is index 3 not working here?
The issue is that you are trying to filter on the array representing a single line and not on the RDD that contains all the lines.
A possible version could be the following (I also created a case class to hold the data coming from the CSV):
package learning
import org.apache.spark._
import org.apache.log4j._
object Test2 {
// A structured representation of a CSV line
case class Person(id: String, name: String, age: Int, friends: Int)
def main(args : Array[String]): Unit = {
val sc = new SparkContext("local[*]", "Test1")
Logger.getLogger("org").setLevel(Level.ERROR)
sc.textFile("D:\\SparkScala\\abcd.csv") // RDD[String]
.map(line => parse(line)) // RDD[Person]
.filter(person => person.age != 26) // filter out people of 26 years old
.take(10) // collect 10 people from the RDD
.foreach(println)
}
def parse(x : String): Person = {
// Split the CSV string by comma into an array of strings
val line = x.split(",")
// After extracting the fields from the CSV string, create an instance of Person
Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt)
}
}
Another possibility would be to use flatMap() and Option values instead. In this case you can operate on a single line directly, for instance:
package learning
import org.apache.spark._
import org.apache.log4j._
object Test3 {
// A structured representation of a CSV line
case class Person(id: String, name: String, age: Int, friends: Int)
def main(args : Array[String]): Unit = {
val sc = new SparkContext("local[*]", "Test1")
Logger.getLogger("org").setLevel(Level.ERROR)
sc.textFile("D:\\SparkScala\\abcd.csv") // RDD[String]
.flatMap(line => parse(line)) // RDD[Person] -- you don't need to filter anymore, the flatMap does it for you now
.take(10) // collect 10 people from the RDD
.foreach(println)
}
def parse(x : String): Option[Person] = {
// Split the CSV string by comma into an array of strings
val line = x.split(",")
// After extracting the fields from the CSV string, create an instance of Person only if it's not 26
line(2) match {
case "26" => None
case _ => Some(Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt))
}
}
}
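Once a line has been parsed into a Person, filtering on age or on friends is just a predicate on the case-class fields. A small sketch reusing the parse function from the Test2 version above (the val names are illustrative):

val people = sc.textFile("D:\\SparkScala\\abcd.csv").map(parse)  // RDD[Person]

// keep everyone except those aged 26
val notAge26 = people.filter(_.age != 26)
// keep everyone except those with 221 friends
val notFriends221 = people.filter(_.friends != 221)

As for the index question: x.split(",") returns an Array[String], whose elements are accessed with line(0) … line(3); the accessors _1, _2, _3 exist only on tuples, so they do not apply here.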
I am new to Spark/Scala and trying to do the following, but I am stuck and cannot figure out how to achieve this requirement. I would be really thankful if someone could help with this.
We have to invoke different rules on different columns of a given table. The list of column names and rules is passed as an argument to the program.
The result of the first rule should go as input to the next rule.
Question: How can I execute the exec() function in a cascading manner, dynamically filling in the arguments for as many rules as are specified in the arguments?
I have developed the following code.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
// How can I call this exec() function with output cascading and arguments for a variable number of rules?
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3");
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}
I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Now you can apply them using folding:
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)
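Since the column/rule pairs come in as program arguments, the same sequence can be built dynamically. A sketch, assuming the rules should be applied in the order given on the command line (a Seq of pairs preserves that order; a plain Map may not):

val rules: Seq[DataFrame => DataFrame] = businessRuleMappings.toSeq.map {
  case (col, "RULE1") => RULE1(col, _: DataFrame)
  case (col, "RULE2") => RULE2(col, _: DataFrame)
  case (col, "RULE3") => RULE3(col, _: DataFrame)
  case _              => (df: DataFrame) => df  // unknown rule name: leave the DataFrame unchanged
}

val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)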
I was trying to code some utilities to bulk load data through HFiles from Spark RDDs.
I was following the pattern of CSVBulkLoadTool from Phoenix. I managed to generate some HFiles and load them into HBase, but I can't see the rows using sqlline (it is possible using the hbase shell, for example). I would be more than grateful for any suggestions.
BulkPhoenixLoader.scala:
class BulkPhoenixLoader[A <: ImmutableBytesWritable : ClassTag, T <: KeyValue : ClassTag](rdd: RDD[(A, T)]) {
def createConf(tableName: String, inConf: Option[Configuration] = None): Configuration = {
val conf = inConf.map(HBaseConfiguration.create).getOrElse(HBaseConfiguration.create())
val job: Job = Job.getInstance(conf, "Phoenix bulk load")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
// initialize credentials to possibly run in a secure env
TableMapReduceUtil.initCredentials(job)
val htable: HTable = new HTable(conf, tableName)
// Auto configure partitioner and reducer according to the Main Data table
HFileOutputFormat2.configureIncrementalLoad(job, htable)
conf
}
def bulkSave(tableName: String, outputPath: String, conf: Option[Configuration]) = {
val configuration: Configuration = createConf(tableName, conf)
rdd.saveAsNewAPIHadoopFile(
outputPath,
classOf[ImmutableBytesWritable],
classOf[Put],
classOf[HFileOutputFormat2],
configuration)
}
}
ExtendedProductRDDFunctions.scala:
class ExtendedProductRDDFunctions[A <: scala.Product](data: org.apache.spark.rdd.RDD[A]) extends
ProductRDDFunctions[A](data) with Serializable {
def toHFile(tableName: String,
columns: Seq[String],
conf: Configuration = new Configuration,
zkUrl: Option[String] = None): RDD[(ImmutableBytesWritable, KeyValue)] = {
val config = ConfigurationUtil.getOutputConfiguration(tableName, columns, zkUrl, Some(conf))
val tableBytes = Bytes.toBytes(tableName)
val encodedColumns = ConfigurationUtil.encodeColumns(config)
val jdbcUrl = zkUrl.map(getJdbcUrl).getOrElse(getJdbcUrl(config))
val conn = DriverManager.getConnection(jdbcUrl)
val query = QueryUtil.constructUpsertStatement(tableName,
columns.toList.asJava,
null)
data.flatMap(x => mapRow(x, jdbcUrl, encodedColumns, tableBytes, query))
}
def mapRow(product: Product,
jdbcUrl: String,
encodedColumns: String,
tableBytes: Array[Byte],
query: String): List[(ImmutableBytesWritable, KeyValue)] = {
val conn = DriverManager.getConnection(jdbcUrl)
val preparedStatement = conn.prepareStatement(query)
val columnsInfo = ConfigurationUtil.decodeColumns(encodedColumns)
columnsInfo.zip(product.productIterator.toList).zipWithIndex.foreach(setInStatement(preparedStatement))
preparedStatement.execute()
val uncommittedDataIterator = PhoenixRuntime.getUncommittedDataIterator(conn, true)
val hRows = uncommittedDataIterator.asScala.filter(kvPair =>
Bytes.compareTo(tableBytes, kvPair.getFirst) == 0
).flatMap(kvPair => kvPair.getSecond.asScala.map(
kv => {
val byteArray = kv.getRowArray.slice(kv.getRowOffset, kv.getRowOffset + kv.getRowLength - 1) :+ 1.toByte
(new ImmutableBytesWritable(byteArray, 0, kv.getRowLength), kv)
}))
conn.rollback()
conn.close()
hRows.toList
}
def setInStatement(statement: PreparedStatement): (((ColumnInfo, Any), Int)) => Unit = {
case ((c, v), i) =>
if (v != null) {
// Both Java and Joda dates used to work in 4.2.3, but now they must be java.sql.Date
val (finalObj, finalType) = v match {
case dt: DateTime => (new Date(dt.getMillis), PDate.INSTANCE.getSqlType)
case d: util.Date => (new Date(d.getTime), PDate.INSTANCE.getSqlType)
case _ => (v, c.getSqlType)
}
statement.setObject(i + 1, finalObj, finalType)
} else {
statement.setNull(i + 1, c.getSqlType)
}
}
private def getIndexTables(conn: Connection, qualifiedTableName: String) : List[(String, String)]
= {
val table: PTable = PhoenixRuntime.getTable(conn, qualifiedTableName)
val tables = table.getIndexes.asScala.map(x => x.getIndexType match {
case IndexType.LOCAL => (x.getTableName.getString, MetaDataUtil.getLocalIndexTableName(qualifiedTableName))
case _ => (x.getTableName.getString, x.getTableName.getString)
}).toList
tables
}
}
I load the generated HFiles with the utility tool from HBase as follows:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles path/to/hfile tableName
You could just convert your CSV file to an RDD of Product and use the .saveToPhoenix method. This is generally how I load CSV data into Phoenix.
Please see: https://phoenix.apache.org/phoenix_spark.html
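A rough sketch of that approach, based on the phoenix-spark documentation linked above (the file path, table name, columns and ZooKeeper URL are placeholders):

import org.apache.phoenix.spark._

// parse each CSV line into a tuple (a Product) matching the target column types
val csvRdd = sc.textFile("path/to/data.csv")
  .map(_.split(","))
  .map(cols => (cols(0).toLong, cols(1), cols(2).toInt))

csvRdd.saveToPhoenix(
  "OUTPUT_TABLE",
  Seq("ID", "NAME", "AGE"),
  zkUrl = Some("zookeeper-host:2181"))

This sidesteps building HFiles by hand entirely.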
How does one access the parameters used to construct a Module from inside the Tester that is testing it?
In the test below I am passing the parameters explicitly both to the Module and to the Tester. I would prefer not to have to pass them to the Tester but instead extract them from the module that was also passed in.
Also, I am new to Scala/Chisel, so any tips on bad techniques I'm using would be appreciated :).
import Chisel._
import math.pow
class TestA(dataWidth: Int, arrayLength: Int) extends Module {
val dataType = Bits(INPUT, width = dataWidth)
val arrayType = Vec(gen = dataType, n = arrayLength)
val io = new Bundle {
val i_valid = Bool(INPUT)
val i_data = dataType
val i_array = arrayType
val o_valid = Bool(OUTPUT)
val o_data = dataType.flip
val o_array = arrayType.flip
}
io.o_valid := io.i_valid
io.o_data := io.i_data
io.o_array := io.i_array
}
class TestATests(c: TestA, dataWidth: Int, arrayLength: Int) extends Tester(c) {
val maxData = pow(2, dataWidth).toInt
for (t <- 0 until 16) {
val i_valid = rnd.nextInt(2)
val i_data = rnd.nextInt(maxData)
val i_array = List.fill(arrayLength)(rnd.nextInt(maxData))
poke(c.io.i_valid, i_valid)
poke(c.io.i_data, i_data)
(c.io.i_array, i_array).zipped foreach {
(element,value) => poke(element, value)
}
expect(c.io.o_valid, i_valid)
expect(c.io.o_data, i_data)
(c.io.o_array, i_array).zipped foreach {
(element,value) => expect(element, value)
}
step(1)
}
}
object TestAObject {
def main(args: Array[String]): Unit = {
val tutArgs = args.slice(0, args.length)
val dataWidth = 5
val arrayLength = 6
chiselMainTest(tutArgs, () => Module(
new TestA(dataWidth=dataWidth, arrayLength=arrayLength))){
c => new TestATests(c, dataWidth=dataWidth, arrayLength=arrayLength)
}
}
}
If you make the arguments dataWidth and arrayLength members of TestA you can just reference them. In Scala this can be accomplished by inserting val into the argument list:
class TestA(val dataWidth: Int, val arrayLength: Int) extends Module ...
Then you can reference them from the test as members, e.g. c.dataWidth or c.arrayLength.
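For instance, the Tester from the question can then drop its extra constructor parameters and read everything from c (a sketch of the same test, with the outputs checked via expect):

import Chisel._
import math.pow

class TestATests(c: TestA) extends Tester(c) {
  val maxData = pow(2, c.dataWidth).toInt  // parameters read from the module instance
  for (t <- 0 until 16) {
    val i_valid = rnd.nextInt(2)
    val i_data = rnd.nextInt(maxData)
    val i_array = List.fill(c.arrayLength)(rnd.nextInt(maxData))
    poke(c.io.i_valid, i_valid)
    poke(c.io.i_data, i_data)
    (c.io.i_array, i_array).zipped foreach { (element, value) => poke(element, value) }
    expect(c.io.o_valid, i_valid)
    expect(c.io.o_data, i_data)
    (c.io.o_array, i_array).zipped foreach { (element, value) => expect(element, value) }
    step(1)
  }
}

chiselMainTest then only needs c => new TestATests(c).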
The Scala class below parses a file using jsoup and populates values from the file into a Scala immutable Map. Using the + operator on the Map does not seem to have any effect, as the Map size is always zero.
import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap
class JdkElementDetail() {
var fileLocation: String = _
def this(fileLocation: String) = {
this()
this.fileLocation = fileLocation;
}
def parseFile : Map[String , String] = {
val jdkElementsMap: Map[String, String] = new TreeMap[String , String];
val input: File = new File(fileLocation);
val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/");
val e: Elements = doc.getElementsByAttribute("href");
val href: java.util.Iterator[Element] = e.iterator();
while (href.hasNext()) {
var objectName = href.next();
var hrefValue = objectName.attr("href");
var name = objectName.text();
jdkElementsMap + name -> hrefValue
println("size is "+jdkElementsMap.size)
}
jdkElementsMap
}
}
println("size is "+jdkElementsMap.size) always prints "size is 0"
Why is the size always zero, am I not adding to the Map correctly?
Is the only fix for this to convert jdkElementsMap to a var and then use the following?
jdkElementsMap += name -> hrefValue
Removing the while loop, here is my updated class:
package com.parse
import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap
import scala.collection.JavaConverters._
class JdkElementDetail() {
var fileLocation: String = _
def this(fileLocation: String) = {
this()
this.fileLocation = fileLocation;
}
def parseFile : Map[String , String] = {
var jdkElementsMap: Map[String, String] = new TreeMap[String , String];
val input: File = new File(fileLocation);
val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/");
val elements: Elements = doc.getElementsByAttribute("href");
val elementsScalaIterator = elements.iterator().asScala
elementsScalaIterator.foreach {
keyVal => {
var hrefValue = keyVal.attr("href");
var name = keyVal.text();
println("size is "+jdkElementsMap.size)
jdkElementsMap += name -> hrefValue
}
}
jdkElementsMap
}
}
Immutable data structures -- be they lists or maps -- are just that: immutable. You don't ever change them, you create new data structures based on changes to the old ones.
If you do val x = jdkElementsMap + (name -> hrefValue), then you'll get the new map on x, while jdkElementsMap continues to be the same.
If you change jdkElementsMap into a var, then you could do jdkElementsMap = jdkElementsMap + (name -> hrefValue), or just jdkElementsMap += (name -> hrefValue). The latter will also work for mutable maps.
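A tiny self-contained illustration of the difference:

val m1 = Map("a" -> 1)
val m2 = m1 + ("b" -> 2)  // builds a new map; m1 is untouched
println(m1.size)          // 1
println(m2.size)          // 2

var m3 = Map("a" -> 1)
m3 += ("b" -> 2)          // shorthand for m3 = m3 + ("b" -> 2)
println(m3.size)          // 2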
Is that the only way? No, but you have to let go of while loops to achieve the same thing. You could replace these lines:
val href: java.util.Iterator[Element] = e.iterator();
while (href.hasNext()) {
var objectName = href.next();
var hrefValue = objectName.attr("href");
var name = objectName.text();
jdkElementsMap + name -> hrefValue
println("size is "+jdkElementsMap.size)
}
jdkElementsMap
With a fold, such as in:
import scala.collection.JavaConverters.asScalaIteratorConverter
e.iterator().asScala.foldLeft(jdkElementsMap) {
case (accumulator, href) => // href here is not an iterator
val objectName = href
val hrefValue = objectName.attr("href")
val name = objectName.text()
val newAccumulator = accumulator + (name -> hrefValue)
println("size is "+newAccumulator.size)
newAccumulator
}
Or with recursion:
def createMap(hrefIterator: java.util.Iterator[Element],
jdkElementsMap: Map[String, String]): Map[String, String] = {
if (hrefIterator.hasNext()) {
val objectName = hrefIterator.next()
val hrefValue = objectName.attr("href")
val name = objectName.text()
val newMap = jdkElementsMap + (name -> hrefValue)
println("size is "+newMap.size)
createMap(hrefIterator, newMap)
} else {
jdkElementsMap
}
}
createMap(e.iterator(), new TreeMap[String, String])
Performance-wise, the fold will be rather slower, and the recursion should be very slightly faster.
Mind you, Scala does provide mutable maps, and not just to be able to say it has them: if they fit your problem better, then go ahead and use them! If you want to learn how to use the immutable ones, then the two approaches above are the ones you should learn.
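If the mutable route fits better, the parsing loop can keep its imperative shape. A sketch, assuming the surrounding class stays as in the updated version above (elements and the JavaConverters import come from there):

import scala.collection.mutable

val jdkElementsMutable = mutable.Map.empty[String, String]
elements.iterator().asScala.foreach { element =>
  jdkElementsMutable += (element.text() -> element.attr("href"))
}
// convert at the end if the caller expects an immutable Map
jdkElementsMutable.toMap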
The map is immutable, so any modifications will return the modified map. jdkElementsMap + (name -> hrefValue) returns a new map containing the new pair, but you are discarding the modified map after it is created.
EDIT: It looks like you can convert Java iterables to Scala iterables, so you can then fold over the resulting sequence and accumulate a map:
import scala.collection.JavaConverters._
val e: Elements = doc.getElementsByAttribute("href");
val jdkElementsMap = e.asScala
.foldLeft(new TreeMap[String, String])((map, href) => map + (href.text() -> href.attr("href")))
If you don't care what kind of map you create, you can use toMap:
val jdkElementsMap = e.asScala
.map(href => (href.text(), href.attr("href")))
.toMap
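One caveat on that choice (not from the original answer): toMap produces an unordered map, whereas the TreeMap used earlier keeps its keys sorted. If sorted keys matter, the same pairs can be fed into a TreeMap instead:

import scala.collection.immutable.TreeMap

val jdkElementsMap = TreeMap(
  e.asScala.map(href => href.text() -> href.attr("href")).toSeq: _*)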