Converting a Spark DataFrame to an RDD in Scala

I am checking for a better approach to convert a DataFrame to an RDD. Right now I am collecting the DataFrame to a local collection and looping over it to prepare the RDD, but we know looping like this is not good practice.
val randomProduct = scala.collection.mutable.MutableList[Product]()
val results = hiveContext.sql("select id, value from details")
val collection = results.collect()
var i = 0
collection.foreach(t => {
  val product = new Product(collection(i)(0).asInstanceOf[Long], collection(i)(1).asInstanceOf[String])
  i = i + 1
  randomProduct += product
})
randomProduct
// intended result: RDD[Product]
Please suggest a standard, stable approach that works for a huge amount of data.

val results = hiveContext.sql("select id, value from details")
results.rdd.map(row => new Product(row.getLong(0), row.getString(1))) // RDD[Product]
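If Product is (or can be defined as) a case class whose fields match the selected columns, you can also avoid the manual Row handling with a typed Dataset. A minimal sketch, assuming a hypothetical case class Product(id: Long, value: String) and Spark 1.6+:

// Sketch only: assumes Product is a case class matching the query's columns.
case class Product(id: Long, value: String)

import hiveContext.implicits._

val results = hiveContext.sql("select id, value from details")

// Map each Row directly on the cluster; nothing is collected to the driver.
val productsRdd = results.rdd.map(row => Product(row.getLong(0), row.getString(1)))

// Alternatively, keep a typed Dataset and take .rdd only where an RDD is required.
val productsRdd2 = results.as[Product].rdd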

Related

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will be dropping the ith entry inside folds and storing it separately. Then I will combine all the others into another DataFrame, as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0 / 5))
for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
  for (j <- ts.indices) {
    trainSet = trainSet.union(ts(j))
  }
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function on the remaining entries inside ts to create another DataFrame, using something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0 / 5))
for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = ts.flatten
}
But at the initialization of trainSet I get an error: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what your intention is here, but instead of using mutable variables and flattening you can do recursive iteration like this:
val folds = df.randomSplit(Array.fill(5)(1.0 / 5)) // Array[Dataset[Row]]
val testSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df.schema)
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)

go(folds, Array.empty)

def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
  case arr @ Array(_, _*) =>
    val res = arr.map { t =>
      trainSet.union(t)
    }
    go(arr.tail, result ++ res)
  case Array() => result
}
As far as I can see, testSet is not used anywhere in the method body.
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))
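Putting it together, a minimal sketch of the full cross-validation split using reduce instead of the inner loop (assuming df is loaded as in the question):

val folds = df.randomSplit(Array.fill(5)(1.0 / 5))
val trainTestPairs = folds.indices.map { i =>
  val testSet = folds(i)
  // Union every fold except the i-th one into the training set.
  val trainSet = folds.zipWithIndex
    .collect { case (fold, j) if j != i => fold }
    .reduce(_ union _)
  (trainSet, testSet)
}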

Create Spark DataFrame from a list of row keys

I have a list of HBase row keys in the form of an Array[Row] and want to create a Spark DataFrame out of the rows that are fetched from HBase using these row keys.
I am thinking of something like:
def getDataFrameFromList(spark: SparkSession, rList: Array[Row]): DataFrame = {
  val conf = HBaseConfiguration.create()
  val mlRows = scala.collection.mutable.ListBuffer[RDD[String]]()
  conf.set("hbase.zookeeper.quorum", "dev.server")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("zookeeper.znode.parent", "/hbase-unsecure")
  conf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")
  rList.foreach(r => {
    val rStr = r.toString()
    conf.set(TableInputFormat.SCAN_ROW_START, rStr)
    conf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
    // read one row
    val recsRdd = readHBaseRdd(spark, conf)
    mlRows.append(recsRdd)
  })
  // This works, but it is only one row
  // val resourcesDf = spark.read.json(recsRdd)
  var resourcesDf = <Code here to convert List[RDD[String]] to DataFrame>
  // resourcesDf
  spark.emptyDataFrame
}
I could do recsRdd.collect() inside the loop, convert each result to a string, and append that JSON to an ArrayList[String], but I am not sure if it is efficient to call collect() in a loop like this.
readHBaseRdd is using newAPIHadoopRDD to get data from HBase
def readHBaseRdd(spark: SparkSession, conf: Configuration) = {
  val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
  hBaseRDD.map {
    case (_: ImmutableBytesWritable, value: Result) =>
      Bytes.toString(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("jsonCol")))
  }
}
Use spark.sparkContext.union(Seq(mainRdd, recsRdd)) instead of a list of RDDs (mlRows).
And why read only one row from HBase per scan? Try to use the largest interval possible.
Always avoid calling collect(); do it only for debugging/tests.
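For illustration, a sketch of what the union-based version could look like. It reuses readHBaseRdd and the configuration values from the question; the per-key copy of the Configuration is an assumption to keep each scan independent:

def getDataFrameFromList(spark: SparkSession, rList: Array[Row]): DataFrame = {
  val baseConf = HBaseConfiguration.create()
  baseConf.set("hbase.zookeeper.quorum", "dev.server")
  baseConf.set("hbase.zookeeper.property.clientPort", "2181")
  baseConf.set("zookeeper.znode.parent", "/hbase-unsecure")
  baseConf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")

  // One RDD per row key; widening SCAN_ROW_START/STOP would let a single scan cover many keys.
  val perKeyRdds = rList.map { r =>
    val keyConf = new Configuration(baseConf)
    val rStr = r.toString()
    keyConf.set(TableInputFormat.SCAN_ROW_START, rStr)
    keyConf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
    readHBaseRdd(spark, keyConf)
  }

  // Union all the RDD[String] of JSON documents without collecting, then let Spark infer the schema.
  val allRows = spark.sparkContext.union(perKeyRdds)
  spark.read.json(allRows)
}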

Spark issues in creating HFiles - Added a key not lexically larger than previous cell

I am trying to create HFiles to do a bulk load into HBase, and it keeps throwing a row-key error even though everything looks fine.
I am using the following code:
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

import sqlContext.implicits._

val DF2 = df.filter($"company".isNotNull)
  .dropDuplicates(Array("company"))
  .sortWithinPartitions("company").sort("company")

val rdd = DF2.flatMap(x => {
  val rowKey = Bytes.toBytes(x(0).toString)
  for (i <- 0 to cols.length - 1) yield {
    val index = x.fieldIndex(new String(cols(i)))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    (new ImmutableBytesWritable(rowKey), new KeyValue(rowKey, COLUMN_FAMILY, cols(i), value))
  }
})

rdd.saveAsNewAPIHadoopFile("HDFS LOcation", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat2], fconf)
and I am using the following data
company,date,open,high,low,close,volume
ABG,01-Jan-2010,11.53,11.53,11.53,11.53,0
ABM,01-Jan-2010,20.66,20.66,20.66,20.66,0
ABR,01-Jan-2010,1.99,1.99,1.99,1.99,0
ABT,01-Jan-2010,53.99,53.99,53.99,53.99,0
ABX,01-Jan-2010,39.38,39.38,39.38,39.38,0
ACC,01-Jan-2010,28.1,28.1,28.1,28.1,0
ACE,01-Jan-2010,50.4,50.4,50.4,50.4,0
ACG,01-Jan-2010,8.25,8.25,8.25,8.25,0
ADC,01-Jan-2010,27.25,27.25,27.25,27.25,0
It throws the following error:
java.io.IOException: Added a key not lexically larger than previous. Current cell = ADC/data:high/1505862570671/Put/vlen=5/seqid=0, lastCell = ADC/data:open/1505862570671/Put/vlen=5/seqid=0
at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:265)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:992)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:199)
I even tried sorting the data but still the error is thrown.
After spending a couple of hours I found the solution: the root cause is that the columns are not sorted.
HFiles need KeyValues in lexicographically sorted order, and in your case, while writing, HFileOutputFormat2 -> AbstractHFileWriter found "Added a key not lexically larger than previous. Current cell". You have already applied sorting at the row level; once you also sort the columns, it will work.
There is a question here with a good explanation: why-hbase-keyvaluesortreducer-need-to-sort-all-keyvalue.
Solution:
// sort columns
val cols = companyDs.columns.sorted

// Rest of the code is the same
val output = companyDs.rdd.flatMap(x => {
  val rowKey = Bytes.toBytes(x(0).toString)
  val hkey = new ImmutableBytesWritable(rowKey)
  for (i <- 0 to cols.length - 1) yield {
    val index = x.fieldIndex(cols(i))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    val kv = new KeyValue(rowKey, COLUMN_FAMILY, cols(i).getBytes(), System.currentTimeMillis() + i, value)
    (hkey, kv)
  }
})

output.saveAsNewAPIHadoopFile("<path>",
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], config)
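As a quick sanity check on the ordering requirement (the column list comes from the sample CSV above): within a single row key, the KeyValues must be appended in the lexicographic order of their qualifiers.

val cols = Array("company", "date", "open", "high", "low", "close", "volume").sorted
// Array(close, company, date, high, low, open, volume)
// "high" sorts before "open", so emitting "open" before "high" for the same
// row key is exactly what triggers the exception shown above.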

Spark scala remove columns containing only null values

Is there a way to remove the columns of a Spark DataFrame that contain only null values?
(I am using scala and Spark 1.6.2)
At the moment I am doing this:
var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}
to build the list of columns containing at least two distinct values, and then use it in a select().
Thank you!
I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.
for (String column : df.columns()) {
  long count = df.select(column).distinct().count();
  if (count == 1 && df.select(column).first().isNullAt(0)) {
    df = df.drop(column);
  }
}
I'm dropping all columns containing exactly one distinct value whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
Here's a Scala example to remove null columns that only queries the data once (faster):
def removeNullColumns(df: DataFrame): DataFrame = {
  var dfNoNulls = df
  val exprs = df.columns.map(_ -> "count").toMap
  val cnts = df.agg(exprs).first
  for (c <- df.columns) {
    val uses = cnts.getAs[Long]("count(" + c + ")")
    if (uses == 0) {
      dfNoNulls = dfNoNulls.drop(c)
    }
  }
  dfNoNulls
}
A more idiomatic version of @swdev's answer:
private def removeNullColumns(df: DataFrame): DataFrame = {
  val exprs = df.columns.map(_ -> "count").toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count(" + c + ")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
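A quick usage sketch of the function above (the sample DataFrame and column names are made up for illustration; spark.implicits._ is assumed to be in scope):

import org.apache.spark.sql.functions.lit

val sample = Seq((1, "a"), (2, "b")).toDF("id", "name")
  .withColumn("only_nulls", lit(null).cast("string"))

removeNullColumns(sample).columns // Array(id, name)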
If the DataFrame is of reasonable size, I write it as JSON and then reload it. Schema inference will ignore all-null columns and you'll have a lighter DataFrame.
Scala snippet:
originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
Here's @timo-strotmann's solution in PySpark syntax:
for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)