Spark Scala: build a new column using a function based on another dataframe

Here's my issue: I have a first dataframe which is basically a list of cities, and the country they reside in. I have a second dataframe, with a list of users, and the cities they reside in.
I'd like to add a "country" column to the second dataframe, where its value would be based on the "city" column of course, but the city names can be typed differently (for example Washington and washington would both have to give me USA).
I thought the best way to do that would be to create a function foo(city: String): String which would return the country by looking it up in the first dataframe, but I can't find a way to use this function while creating my new column.

First lower-case the city column of both dataframes, since you are going to join on the city key, and afterwards capitalize the first letter again. This code should do what you are looking for:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap, lower}

object Main {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
    import sparkSession.implicits._

    // normalize the join key to lower case in both dataframes
    val citiesDF = Seq(
      ("London", "England"), ("Washington", "USA")
    )
      .toDF("city", "country")
      .withColumn("city", lower(col("city")))

    val usersDF = Seq(
      ("Andy", "London"), ("Mark", "Washington"), ("Bob", "washington")
    )
      .toDF("name", "city")
      .withColumn("city", lower(col("city")))

    // join on the normalized city and restore the capitalized form
    val resultDF = citiesDF.join(usersDF, Seq("city"))
      .withColumn("city", initcap(col("city")))

    resultDF.show()
  }
}
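If you prefer the function-style approach described in the question, a hedged alternative sketch (assuming the city list is small enough to collect to the driver, and reusing citiesDF, usersDF and sparkSession from the code above) is to broadcast a lower-cased city-to-country map and wrap the lookup in a UDF:

// Sketch only: assumes citiesDF fits comfortably in driver memory.
import org.apache.spark.sql.functions.{col, udf}

val cityToCountry: Map[String, String] = citiesDF
  .collect()
  .map(r => r.getString(0).toLowerCase -> r.getString(1))
  .toMap
val cityToCountryB = sparkSession.sparkContext.broadcast(cityToCountry)

// the UDF normalizes the city name before looking it up
val countryOf = udf((city: String) => cityToCountryB.value.get(city.toLowerCase).orNull)

usersDF.withColumn("country", countryOf(col("city"))).show()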

Related

Scala RDD get earliest date by group

I have an RDD of a case class in Scala and need to find the earliest date for each group (patientID).
Here is the input:
patientID date
000000047-01 2008-03-21T21:00:00Z
000000047-01 2007-10-24T19:45:00Z
000000485-01 2011-06-17T21:00:00Z
000000485-01 2006-02-22T18:45:00Z
The expected output should be:
patientID date
000000047-01 2007-10-24T19:45:00Z
000000485-01 2006-02-22T18:45:00Z
I tried something like the following, but it didn't work:
val out = medication.groupBy(x => x.patientID).sortBy(x => x.date).take(1)
Okay!
So if I understand your question correctly, you want the earliest record from every group. If that's the case, here is the solution I created.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val dataDF = Seq(
  ("000000047-01", "2008-03-21T21:00:00Z"),
  ("000000047-01", "2007-10-24T19:45:00Z"),
  ("000000485-01", "2011-06-17T21:00:00Z"),
  ("000000485-01", "2006-02-22T18:45:00Z"))

import spark.implicits._
val dfWithSchema = dataDF.toDF("patientId", "date")

// rank the dates within each patient, earliest first
val winSpec = Window.partitionBy("patientId").orderBy("date")
val rank_df = dfWithSchema.withColumn("rank", rank().over(winSpec)).orderBy(col("patientId"))

// keep only the earliest date per patient
val result = rank_df.select(col("patientId"), col("date")).where(col("rank") === 1)
result.show()
Please ignore the steps for creating the DataFrame with the schema if you already have a schema defined for your data.
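Since the original attempt works on an RDD of a case class, a hedged RDD-only alternative (a sketch assuming a case class with patientID and date fields, as in the question; the Medication name is only illustrative) is to skip grouping entirely and reduce by key:

// Sketch only: Medication is a stand-in for the question's case class,
// and medication is the RDD[Medication] from the question.
case class Medication(patientID: String, date: String)

// ISO-8601 timestamps compare correctly as strings, so a string min is enough here
val earliest = medication
  .map(m => (m.patientID, m.date))
  .reduceByKey((a, b) => if (a <= b) a else b)

earliest.collect().foreach(println)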

How can I map a function on dataFrame column values which returns a dataFrame?

I have a spark DataFrame, df1, which contains several columns, one of them is with IDs of patients. I want to take this column and perform a function that sends http request for information regarding every ID, say medical test. This information is then parsed from json and returned by the function as DataFrame of multiple tests. I want to do this for all the IDs so that I have a second DataFrame, df2, with all medical tests information for the IDs in df1.
I tried the following code, which I think is not optimal, especially for a large number of patients. My problem is that I cannot handle the results in the form of Array[org.apache.spark.sql.DataFrame]. Note that this is sample code; in real life I might have 100 tests for one ID and only 3 for another.
import scala.util.Random._

val df1 = Seq(
  ("9031x", 32),
  ("1102z", 12),
  ("3048o", 54)
).toDF("ID", "age")

// a function that takes the string and returns a DataFrame
def getPatientInfo(ID: String): org.apache.spark.sql.DataFrame = {
  val r = scala.util.Random
  val df2 = Seq(
    ("test1", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(4)),
    ("test2", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(3)),
    ("test3", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(5))
  ).toDF("testName", "year", "result", "Notes")
  df2
}

// convert the ID to Array[String]
val ID = df1.collect().map(row => row.getString(0))

// apply the function foreach ID
val medicalRecords = for (i <- ID) yield { getPatientInfo(i) }
Are there any other optimal approaches?
TL;DR;
It is not possible. DataFrame.map (or an equivalent method) cannot use SparkSession or distributed data structures.
If you want to make it work, use your favorite JSON parser instead and redefine getPatientInfo as either:
def getPatientInfo(ID: String): Seq[Row]
or
def getPatientInfo(ID: String): T
where T is a case class, and then replace the loop over the collected IDs with:
df1.flatMap(row => getPatientInfo(row.getString(0)))
(adding Encoder if necessary).
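As a hedged sketch of the second option (PatientTest and fetchTests below are hypothetical stand-ins for the real parsed-JSON record and the HTTP call):

// Hypothetical case class describing one parsed medical test
case class PatientTest(id: String, testName: String, year: Int, result: Int, notes: String)

// Hypothetical helper: runs the HTTP request and JSON parsing on the executors,
// without touching SparkSession or any other DataFrame.
def fetchTests(id: String): Seq[PatientTest] = {
  // ... call the service, parse the JSON ...
  Seq.empty
}

import spark.implicits._   // assuming your SparkSession is called spark
val df2 = df1
  .select("ID").as[String]          // Dataset[String] of patient IDs
  .flatMap(id => fetchTests(id))    // Dataset[PatientTest]; the encoder is derived from the case class
  .toDF()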

Spark: HBase Bulk Load using Scala

We have text files of 100K records each, and we need to read each file line by line and insert its values into HBase.
The files are '|' delimited.
Sample textFile example:
SLNO|Name|City|Pincode
1|ABC|Pune|400104
2|BMN|Delhi|100065
Each column will have a different column family.
We are trying to implement this in Spark-Scala using HBase Bulk load.
We came across this link suggesting bulk load :
http://www.openkb.info/2015/01/how-to-use-scala-on-spark-to-load-data.html
It shows the below syntax for inserting into a single column family:
// conf, table and sc are assumed to be defined earlier, as in the linked article
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad(job, table)

// Generate 10 sample data:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x => {
  val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(),
    "c1".getBytes(), "value_xxx".getBytes())
  (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})

// Directly bulk load to Hbase/MapRDB tables.
rdd.saveAsNewAPIHadoopFile("/tmp/xxxx19", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
Can anyone advise on bulk load insertion for multiple column families?
Do have a look at rdd.saveAsNewAPIHadoopDataset to insert the data into the HBase table.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
  import spark.implicits._

  val config = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", "ip's")
  config.set("hbase.zookeeper.property.clientPort", "2181")
  config.set(TableInputFormat.INPUT_TABLE, "tableName")

  val newAPIJobConfiguration1 = Job.getInstance(config)
  newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")

  // one Put per row, writing each value into its own column family
  val hbasePuts = df.rdd.map((row: Row) => {
    val put = new Put(Bytes.toBytes(row.getString(0)))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
    put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
    (new ImmutableBytesWritable(), put)
  })

  hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
Ref : https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
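If you specifically want to stay on the HFile bulk-load path from the question with multiple column families, the general idea is to emit one KeyValue per column (each carrying its own family) and make sure the pairs are totally sorted by row key, family and qualifier, because HFileOutputFormat requires sorted input. A rough, hedged sketch reusing the conf/job/table setup from the question's snippet (the column family and qualifier names are illustrative):

// Sketch only: lines is an RDD[String] of '|'-delimited records (SLNO|Name|City|Pincode),
// and job/table are configured as in the question's snippet.
val cells = lines
  .filter(line => !line.startsWith("SLNO"))   // drop the header line
  .flatMap { line =>
    val Array(slno, name, city, pin) = line.split("\\|")
    // one KeyValue per column, each in its own (illustrative) column family
    Seq(("cf_name", "name", name), ("cf_city", "city", city), ("cf_pin", "pincode", pin))
      .map { case (cf, qualifier, value) =>
        (slno, cf, qualifier) -> new KeyValue(Bytes.toBytes(slno),
          Bytes.toBytes(cf), Bytes.toBytes(qualifier), Bytes.toBytes(value))
      }
  }

// HFileOutputFormat expects its input ordered by row key / family / qualifier
val sorted = cells
  .sortByKey()
  .map { case ((slno, _, _), kv) => (new ImmutableBytesWritable(Bytes.toBytes(slno)), kv) }

sorted.saveAsNewAPIHadoopFile("/tmp/hfiles_multi_cf", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())

After the HFiles are written they still need to be handed to HBase (for example with the LoadIncrementalHFiles / completebulkload tool), exactly as in the single-family case.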

Insert Spark dataframe into hbase

I have a dataframe and I want to insert it into HBase. I follow this documentation.
This is what my dataframe looks like:
+----+-------+---------+
| id | name  | address |
+----+-------+---------+
| 23 | marry | france  |
| 87 | zied  | italie  |
+----+-------+---------+
I create an HBase table using this code:
// admin is assumed to be an HBase admin instance created from conf
val tableName = "two"
val conf = HBaseConfiguration.create()
if (!admin.isTableAvailable(tableName)) {
  print("-----------------------------------------------------------------------------------------------------------")
  val tableDesc = new HTableDescriptor(tableName)
  tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
  admin.createTable(tableDesc)
} else {
  print("Table already exists!!--------------------------------------------------------------------------------------")
}
And now, how can I insert this dataframe into HBase?
In another example I succeeded in inserting into HBase using this code:
val myTable = new HTable(conf, tableName)
for (i <- 0 to 1000) {
  var p = new Put(Bytes.toBytes("" + i))
  p.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes("" + (i * 5)))
  p.add("z1".getBytes(), "age".getBytes(), Bytes.toBytes("2017-04-20"))
  p.add("z2".getBytes(), "job".getBytes(), Bytes.toBytes("" + i))
  p.add("z2".getBytes(), "salary".getBytes(), Bytes.toBytes("" + i))
  myTable.put(p)
}
myTable.flushCommits()
But now I am stuck on how to insert each record of my dataframe into my HBase table.
Thank you for your time and attention.
An alternative is to look at rdd.saveAsNewAPIHadoopDataset to insert the data into the HBase table.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
  import spark.implicits._

  val config = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", "ip's")
  config.set("hbase.zookeeper.property.clientPort", "2181")
  config.set(TableInputFormat.INPUT_TABLE, "tableName")

  val newAPIJobConfiguration1 = Job.getInstance(config)
  newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")

  // one Put per row, writing each value into its own column family
  val hbasePuts = df.rdd.map((row: Row) => {
    val put = new Put(Bytes.toBytes(row.getString(0)))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
    put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
    (new ImmutableBytesWritable(), put)
  })

  hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
Ref : https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
Below is a full example using the Spark HBase connector from Hortonworks, available in Maven.
This example shows:
how to check whether an HBase table exists
how to create the HBase table if it does not exist
how to insert a DataFrame into the HBase table
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object Main extends App {

  case class Employee(key: String, fName: String, lName: String, mName: String,
                      addressLine: String, city: String, state: String, zipCode: String)

  // as pre-requisites the table 'employee' with column families 'person' and 'address' should exist
  val tableNameString = "default:employee"
  val colFamilyPString = "person"
  val colFamilyAString = "address"
  val tableName = TableName.valueOf(tableNameString)
  val colFamilyP = colFamilyPString.getBytes
  val colFamilyA = colFamilyAString.getBytes

  val hBaseConf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(hBaseConf)
  val admin = connection.getAdmin()

  println("Check if table 'employee' exists:")
  val tableExistsCheck: Boolean = admin.tableExists(tableName)
  println("Table " + tableName.toString + " exists? " + tableExistsCheck)

  if (tableExistsCheck == false) {
    println("Create Table employee with column families 'person' and 'address'")
    val colFamilyBuild1 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyP).build()
    val colFamilyBuild2 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyA).build()
    val tableDescriptorBuild = TableDescriptorBuilder.newBuilder(tableName)
      .setColumnFamily(colFamilyBuild1)
      .setColumnFamily(colFamilyBuild2)
      .build()
    admin.createTable(tableDescriptorBuild)
  }

  // define schema for the dataframe that should be loaded into HBase
  def catalog =
    s"""{
       |"table":{"namespace":"default","name":"employee"},
       |"rowkey":"key",
       |"columns":{
       |"key":{"cf":"rowkey","col":"key","type":"string"},
       |"fName":{"cf":"person","col":"firstName","type":"string"},
       |"lName":{"cf":"person","col":"lastName","type":"string"},
       |"mName":{"cf":"person","col":"middleName","type":"string"},
       |"addressLine":{"cf":"address","col":"addressLine","type":"string"},
       |"city":{"cf":"address","col":"city","type":"string"},
       |"state":{"cf":"address","col":"state","type":"string"},
       |"zipCode":{"cf":"address","col":"zipCode","type":"string"}
       |}
       |}""".stripMargin

  // define some test data
  val data = Seq(
    Employee("1", "Horst", "Hans", "A", "12main", "NYC", "NY", "123"),
    Employee("2", "Joe", "Bill", "B", "1337ave", "LA", "CA", "456"),
    Employee("3", "Mohammed", "Mohammed", "C", "1Apple", "SanFran", "CA", "678")
  )

  // create SparkSession
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("HBaseConnector")
    .getOrCreate()

  // serialize data
  import spark.implicits._
  val df = spark.sparkContext.parallelize(data).toDF

  // write dataframe into HBase
  df.write.options(
    Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "3")) // create 3 regions
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .save()
}
This worked for me as long as I had the relevant site XMLs ("core-site.xml", "hbase-site.xml", "hdfs-site.xml") available in my resources.
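For completeness, reading the data back with the same connector should work with the same catalog (a hedged sketch reusing the catalog and spark defined above):

// Sketch only: reads the HBase table back into a DataFrame via the same catalog.
val readDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

readDF.show()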
using answer for code formatting purposes
The doc says:
sc.parallelize(data).toDF.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.hadoop.hbase.spark ")
  .save()
where sc.parallelize(data).toDF is your DataFrame. The doc example turns a Scala collection into a dataframe using sc.parallelize(data).toDF.
You already have your DataFrame, so just try calling:
yourDataFrame.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.hadoop.hbase.spark ")
  .save()
And it should work. The doc is pretty clear...
UPD
Given a DataFrame with specified schema, above will create an HBase table with 5 regions and save the DataFrame inside. Note that if HBaseTableCatalog.newTable is not specified, the table has to be pre-created.
It's about data partitioning. Each HBase table can have 1...X regions. You should carefully pick the number of regions: too few regions is bad, and too many regions is also bad.
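If you go the pre-created-table route instead, the region count is determined by the split keys you pass at creation time. A hedged sketch using the HBase Admin API (the split key values are purely illustrative):

// Sketch only: pre-creates a table with explicit split keys, giving splitKeys.length + 1 regions.
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection()
val hbaseAdmin = conn.getAdmin

val desc = TableDescriptorBuilder.newBuilder(TableName.valueOf("employee"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("person"))
  .build()

// 4 split points -> 5 regions, matching HBaseTableCatalog.newTable -> "5" above
val splitKeys: Array[Array[Byte]] = Array("1", "2", "3", "4").map(s => Bytes.toBytes(s))
if (!hbaseAdmin.tableExists(TableName.valueOf("employee"))) {
  hbaseAdmin.createTable(desc, splitKeys)
}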

Spark: Sort records in groups?

I have a set of records which I need to:
1) Group by 'date', 'city' and 'kind'
2) Sort every group by 'prize'
In my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  val recs = Array(
    Record("n1", "d1", "k1", "c1", 10),
    Record("n1", "d1", "k1", "c1", 9),
    Record("n1", "d1", "k1", "c1", 8),
    Record("n2", "d2", "k2", "c2", 1),
    Record("n2", "d2", "k2", "c2", 2),
    Record("n2", "d2", "k2", "c2", 3)
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val rs = sc.parallelize(recs)
    val rsGrp = rs.groupBy(r => (r.day, r.kind, r.city)).map(_._2)
    val x = rsGrp.map { r =>
      val lst = r.toList
      lst.map { e => (e.prize, e) }
    }
    x.sortByKey()
  }
}
When I try to sort the groups I get an error:
value sortByKey is not a member of org.apache.spark.rdd.RDD[List[(Int,
Sort.Record)]]
What is wrong? How do I sort?
You need to define a key and then use mapValues to sort the values.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  // Define your data here (the recs array from the question)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .setMaster("local")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)

    // Generate the pair RDD necessary to call groupByKey, and group it
    val key: RDD[((String, String, String), Iterable[Record])] =
      rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey

    // Once grouped you need to sort the values of each key
    val values: RDD[((String, String, String), List[Record])] =
      key.mapValues(iter => iter.toList.sortBy(_.prize))

    // Print the result
    values.collect.foreach(println)
  }
}
groupByKey is expensive; it has two implications:
The majority of the data gets shuffled into the remaining N-1 partitions on average.
All of the records with the same key are loaded into memory on a single executor, potentially causing memory errors.
Depending on your use case, you have different, better options:
If you don't care about the ordering, use reduceByKey or aggregateByKey (see the sketch after this list).
If you want to just group and sort without any transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions; also sketched below), but be very careful about which partitioner you specify and test it, because you are now relying on side effects that may change behaviour in a different environment. See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
If you are applying either a transformation or a non-reducible aggregation (fold or scan) to the iterable of sorted records, then check out the spark-sorted library (https://github.com/tresata/spark-sorted). It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
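Hedged sketches of the first two options, written against the Record type from the question (rs is the RDD[Record] from the question; the partition count and the GroupPartitioner helper are illustrative, not part of any library):

import org.apache.spark.Partitioner

// Option 1 sketch: a reducible aggregation (here, the record with the smallest prize
// per (day, kind, city) group) never materializes whole groups in memory.
val minPrizePerGroup = rs
  .keyBy(r => (r.day, r.kind, r.city))
  .reduceByKey((a, b) => if (a.prize <= b.prize) a else b)

// Option 2 sketch: secondary sort via repartitionAndSortWithinPartitions.
// Key by (group, prize), partition only on the group part, and let Spark sort
// each partition by the full composite key, so the records of a group come out
// contiguous and ordered by prize.
class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val group = key.asInstanceOf[((String, String, String), Int)]._1
    (group.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

val keyed = rs.map(r => (((r.day, r.kind, r.city), r.prize), r))
val sortedWithinGroups = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(4))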
Replace map with flatMap:
val x = rsGrp.flatMap { r =>
  val lst = r.toList
  lst.map { e => (e.prize, e) }
}
This will give you an
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
As an alternative to #gasparms' solution, I think one can try a filter followed by an rdd.sortBy operation. You filter out the records that meet the key criteria. A prerequisite is that you need to keep track of all your keys (filter combinations); you can also build that set as you traverse the records.
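A hedged sketch of that idea against the Record type from the question (rs is the RDD[Record] from the question; note this launches one Spark job per distinct key, so it only makes sense for a small number of key combinations):

// Sketch only: one filter + sortBy pass per distinct (day, kind, city) key.
val keys = rs.map(r => (r.day, r.kind, r.city)).distinct().collect()

val sortedGroups = keys.map { k =>
  k -> rs.filter(r => (r.day, r.kind, r.city) == k)
         .sortBy(_.prize)
         .collect()
         .toList
}

sortedGroups.foreach { case (k, group) => println(s"$k -> $group") }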