How to efficiently extract a value from HiveContext Query - scala

I am running a query through my HiveContext
Query:
val hiveQuery = s"SELECT post_domain, post_country, post_geo_city, post_geo_region
FROM $database.$table
WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"
val hiveQueryObj:DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
  // expected Map
  val ex = expected.get(column).get
  val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
  assert(actual.equals(ex))
})
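One way to cut the round trips is to run the query once, collect the single row, and compare all columns from it, instead of issuing one collect per column. A minimal sketch, assuming (as the original .get(0) already does) that the query returns exactly one row:
// Select all columns of interest in one pass and collect the single row once
val row = hiveQueryObj.select(arr.head, arr.tail: _*).head()
arr.zipWithIndex.foreach { case (column, i) =>
  val ex = expected.get(column).get
  val actual = row.get(i).toString
  assert(actual.equals(ex), s"Mismatch for $column: expected $ex, got $actual")
}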

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" covers a large number of records, so the BQ call is invoked again and again; keeping the filter outside of the BQ read would help. The idea is to first read the "mainTable" from BQ, store it in a Spark view, and then apply the "wherePart" filter to this view in Spark.
["subDate" is a function that subtracts one date from another and returns the number of days in between]
val Df = getFb(config, mainTable, ds)

def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect
  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1)))
    .map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')")
    .mkString(" OR ")
  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so, the right syntax is:
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)
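If the intent is to keep the filter strictly on the Spark side after the read, a variant of the same connector call (just a sketch, reusing the wherePart string as a Spark SQL predicate; the filtered name is illustrative) would be:
// Load the table unfiltered into a temp view, then filter it in Spark SQL
spark.read.format("bigquery")
  .load(mainTable)
  .createOrReplaceTempView(tempView)
val filtered = spark.sql(s"SELECT * FROM $tempView WHERE $wherePart")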

How to improve Kudu reads with Spark?

I have a process that, given a new input, retrieves related information from our Kudu database and then does some computation.
The problem lies in the data retrieval: we have 1,201,524,092 rows, and for any computation it takes forever to start processing the needed ones, because the reader has to hand the whole table over to Spark.
To read from Kudu we do:
def read(tableName: String): Try[DataFrame] = {
  val kuduOptions: Map[String, String] = Map(
    "kudu.table" -> tableName,
    "kudu.master" -> kuduContext.kuduMaster)
  Try(sqlContext.read.options(kuduOptions).format("kudu").load)
}
And then:
val newInputs = ??? // DataFrame with the new inputs
val currentInputs = read("inputsTable").get // This takes too much time!!!!
val relatedCurrent = currentInputs.join(newInputs.select("commonId"), Seq("commonId"), "inner")
doThings(newInputs, relatedCurrent)
For example, when we only want to introduce a single new input, it still has to scan the full table to find the currentInputs, which produces a shuffle write of 81.6 GB / 1,201,524,092 rows.
How can I improve this?
Thanks,
You can collect the new input IDs and then use them in a where clause.
This way you can easily hit an OOM, but it can make your query very fast because it benefits from predicate pushdown:
val collectedIds = newInputs.select("commonId").collect.map(_.get(0))
val filteredCurrentInputs = currentInputs.where($"commonId".isin(collectedIds: _*))
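If the number of new IDs is large enough to make the driver-side collect risky, an alternative (a sketch; it avoids the OOM but gives up the Kudu-side predicate pushdown) is a broadcast join:
import org.apache.spark.sql.functions.broadcast
// Broadcast the small set of new IDs; the join then avoids shuffling the
// big table, though the Kudu scan itself is still unfiltered.
val relatedCurrent = currentInputs.join(broadcast(newInputs.select("commonId")), Seq("commonId"), "inner")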

How to iterate Big Query TableResult correctly?

I have a complex join query in BigQuery and need to run it in a Spark job. This is the current code:
import scala.collection.JavaConverters._

val bigquery = BigQueryOptions.newBuilder().setProjectId(bigQueryConfig.bigQueryProjectId)
  .setCredentials(credentials)
  .build().getService
val query =
  //some complex query
val queryConfig: QueryJobConfiguration =
  QueryJobConfiguration.newBuilder(query)
    .setUseLegacySql(false)
    .setPriority(QueryJobConfiguration.Priority.BATCH) //(tried with and without)
    .build()
val jobId: JobId = JobId.newBuilder().setRandomJob().build()
val queryJob: Job = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build).waitFor()
val result = queryJob.getQueryResults()
val output = result.iterateAll().iterator().asScala.to[Seq].map { row: FieldValueList =>
  // create case class from the row
}
It keeps running into this error:
Exceeded rate limits: Your project: XXX exceeded quota for tabledata.list bytes per second per project.
Is there a way to iterate through the results more efficiently? I have tried setPriority(QueryJobConfiguration.Priority.BATCH) on the query job configuration, but it doesn't improve things. I also tried reducing the number of Spark executors to 1, to no avail.
Instead of reading the query results directly, you can use the spark-bigquery-connector to read them into a DataFrame:
val queryConfig: QueryJobConfiguration =
  QueryJobConfiguration.newBuilder(query)
    .setUseLegacySql(false)
    .setPriority(QueryJobConfiguration.Priority.BATCH) //(tried with and without)
    .setDestinationTable(TableId.of(destinationDataset, destinationTable))
    .build()
val jobId: JobId = JobId.newBuilder().setRandomJob().build()
val queryJob: Job = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build).waitFor()
val result = queryJob.getQueryResults()
// read into DataFrame
val data = spark.read.format("bigquery")
  .option("dataset", destinationDataset)
  .option("table", destinationTable)
  .load()
We resolved the situation by providing a custom page size on the TableResult.
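For reference, a sketch of what that might look like with the Java client (assuming the same queryJob as above; the page size value is illustrative):
// Request smaller pages so each underlying tabledata.list call stays
// under the bytes-per-second quota
val result = queryJob.getQueryResults(BigQuery.QueryResultsOption.pageSize(1000L))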

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path)
val studentdetailsRDD = sc.textFile(file path)
val studentB = sc.broadcast(studentdetailsRDD.collect)

//Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

//Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, the main issue in your code is the lookup functions: getName and getCourse effectively do nothing, because their return type is Unit. They are declared with procedure syntax (no =), so they return Unit no matter what the body produces; and even with =, an if without an else yields no value for some inputs, which widens the result type.
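A quick illustration of both pitfalls (hypothetical helpers, just to show the inferred types):
// Procedure syntax (no '='): the result is always Unit, whatever the body is
def getNameProc(studentId: String) { "some name" } // returns Unit
// Even with '=', an if without an else widens the type, because the missing
// else branch evaluates to ()
def getNameIf(ok: Boolean) = if (ok) "some name" // inferred type: Any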
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array; it makes the lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and for writing files: the join takes one line of code, and the write takes one more.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")

val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created; it contains a single file holding the final result.

Processing Apache Spark GraphX multiple subgraphs

I have a parent Graph that I want to filter into multiple subgraphs, so I can apply a function to each subgraph and extract some data. My code looks like this:
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = ...
val myEdges = ...
val myGraph = Graph(myVertices, myEdges)
val myResults : RDD[(<Tuple>)] = myTerms.map { x => mySubgraphFunction(myGraph, x) }
Where mySubgraphFunction is a function that creates a subgraph, performs a calculation, and returns a tuple of result data.
When I run this, I get a Java NullPointerException at the point where mySubgraphFunction calls Graph.subgraph. If I call collect on the RDD of terms, I can get this to work (I also added persist on the RDDs for performance):
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myEdges = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myGraph = Graph(myVertices, myEdges)
val myResults: Array[(<Tuple>)] = myTerms.collect().map { x =>
  mySubgraphFunction(myGraph, x) }
Is there a way to get this to work where I don't have to call collect() (i.e. make this a distributed operation)? I'm creating ~1k subgraphs and the performance is slow.
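One note that may help frame this: the NullPointerException is expected, because myTerms.map { x => mySubgraphFunction(myGraph, x) } references the graph's RDDs inside another RDD's closure, and Spark does not support nested RDD operations, so the collect-based version is the workable shape. A sketch of how to overlap the ~1k per-term jobs without distributing the outer loop (assuming mySubgraphFunction is safe to call from multiple driver threads, which submitting concurrent Spark jobs is):
// Launch the per-term subgraph jobs concurrently from the driver with a
// parallel collection; each mySubgraphFunction call still runs as a
// distributed Spark job on the cluster.
val myResults = myTerms.collect().par
  .map(x => mySubgraphFunction(myGraph, x))
  .seq.toArray
Setting spark.scheduler.mode to FAIR can also help these concurrent jobs share executors instead of queuing FIFO.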