How to convert RDD[Result] to RDD[Row] in spark? - scala

In our application, we connect Spark with HBase using the following code:
val hBaseRDD: RDD[(ImmutableBytesWritable, Result)] =
  sparkSession.sparkContext.newAPIHadoopRDD(
    conf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result]
  )
val resultRDD: RDD[Result] = hBaseRDD.map(tuple => tuple._2)
But this gives us an RDD of type Result.
We need an RDD of type Row to create a DataFrame out of this RDD.
How can we do the same?
Thanks
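One way to get there is to map each Result to a Row by extracting the cells you need and then supply a matching schema. The sketch below is illustrative only: the column family "cf" and the qualifier "name" are hypothetical placeholders for whatever your table actually contains.
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical column family and qualifier; replace with your table's layout.
val family = Bytes.toBytes("cf")
val nameQualifier = Bytes.toBytes("name")

val rowRDD: RDD[Row] = resultRDD.map { result =>
  Row(
    Bytes.toString(result.getRow),                         // the row key
    Bytes.toString(result.getValue(family, nameQualifier)) // one column value
  )
}

val schema = StructType(Seq(
  StructField("rowkey", StringType),
  StructField("name", StringType)
))

val df = sparkSession.createDataFrame(rowRDD, schema)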

Related

Why can't I use combineByKey in Spark?

I wrote this code in Spark 2.4.5:
df_join is a dataframe.
var comByKeyResult: Dataset[((String, String), (Double, Int))] = df_join
.map(x => ((x(1).toString, x(3).toString), (x(9).toString.toDouble, x(1).toString.toInt)))
When I try to write comByKeyResult.combineByKey, the method combineByKey is not available. Why?
I import the following library: import org.apache.spark.rdd._. Do I have to add other libraries or packages?
combineByKey is a transformation available on pair RDDs (RDD[(K, V)]).
You need to convert your DataFrame/Dataset to an RDD and then map it to a pair RDD.
In your case, just a small change:
val yourPairRdd = df_join
.rdd
.map(x => ((x(1).toString, x(3).toString), (x(9).toString.toDouble, x(1).toString.toInt)))
//yourPairRdd.combineByKey
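For completeness, here is a minimal sketch of what the combineByKey call could look like on that pair RDD, assuming the goal is a per-key sum and count of the Double values (the names are illustrative):
val sumAndCount = yourPairRdd.combineByKey(
  (v: (Double, Int)) => (v._1, 1),                                       // createCombiner
  (acc: (Double, Int), v: (Double, Int)) => (acc._1 + v._1, acc._2 + 1), // mergeValue
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)     // mergeCombiners
)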

How to get a DataFrame from a list - spark scala

I have a list of RDDs. I iterate over the list and for each RDD I apply some parsing logic. Finally I get:
val mRdd = nRdd.map { ele =>
  // parsing logic; I end up with the fields below
  colum  = Array[String]        // example: ['id', 'name', 'dept']
  c_type = Array[String]        // example: ['Int', 'String', 'String']
  value  = ArrayBuffer[String]  // [1,lucy,it][2,denis,cs]
}
How can I get a list of DataFrames from mRdd?
I tried to create a DataFrame, but for that I need an RDD first, and I can't create an RDD inside another RDD.
I am new to Spark. I am using Spark 1.6.3.
Please help me.
In order to convert an RDD into a DataFrame, you would need to do the following:
Approach 1 - Use the createDataFrame function:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here; must produce an RDD[Row]
  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("dept", StringType)
  ))
  sqlContext.createDataFrame(parsedRDD, schema)
}
Read more about this approach here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Approach 2 - Use the toDF implicit function:
import sqlContext.implicits._

val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here; must be an RDD of tuples or case classes
  val columns = Seq("id", "name", "dept")
  parsedRDD.toDF(columns: _*)
}
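For either approach, the parse logic has to produce something Spark SQL understands. Here is a minimal sketch of what that might look like for Approach 2, assuming each element is an RDD[String] of comma-separated values like "1,lucy,it":
val parsedRDD = ele.map { line =>
  val Array(id, name, dept) = line.split(",")
  (id.toInt, name, dept) // an RDD of tuples works directly with toDF
}
parsedRDD.toDF("id", "name", "dept")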

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming Dataset, reading from Kafka, and I am trying to write it to CSV.
case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
val eventDataset: Dataset[Event] = spark
.readStream
.format("kafka")
.load()
.select("value")
.as[Array[Byte]]
.map(decodeEvent)
Event holds a Map[String,String] inside, and to write to CSV I'll need a schema.
Let's say all the fields are of type String, so I tried the example from the Spark repo:
val columns = List("year", "month", "date", "topic", "field1", "field2")
// Prepare the schema programmatically (StructType.add returns a new StructType)
val schema = columns.foldLeft(new StructType()) { (s, field) => s.add(field, "string") }

val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives an error at runtime at eventDataset.rdd:
Caused by: org.apache.spark.sql.AnalysisException: Queries with
streaming sources must be executed with writeStream.start();;
The following doesn't work either, because .map yields a List[String] rather than a tuple:
eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
  .toDF(columns: _*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?
I'd use a much simpler approach:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

eventDataset.select(columns.map(
  c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
but if you want something closer to your current solution, skip the RDD conversion and provide a RowEncoder explicitly:
import org.apache.spark.sql.catalyst.encoders.RowEncoder

eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)
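One caveat, assuming a file sink as above: structured streaming file sinks also need a checkpoint location, so in practice the start call would look something like this (both paths are placeholders):
eventDataset.select(columns.map(c => coalesce($"map".getItem(c), lit("")).alias(c)): _*)
  .writeStream
  .format("csv")
  .option("checkpointLocation", "/tmp/checkpoints/events") // hypothetical path
  .start("/tmp/output/events")                             // hypothetical path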

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from MySQL:
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to an RDD with
val sourceRdd = source.rdd
but it is an RDD[Row] and I need an RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc.
Thank you
You can use the Row.mkString(sep: String): String method in a map call, like this:
val sourceRdd = source.rdd.map(_.mkString(","))
You can change the "," parameter to whatever you want.
Hope this helps you. Best regards.
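As a usage sketch, the keyed transformations from the question then apply directly to that RDD[String] (assuming the first comma-separated field parses as an Int; otherRdd is a placeholder for the RDD you want to subtract):
val keyed = sourceRdd.map(rec => (rec.split(",")(0).toInt, rec))
val remaining = keyed.subtractByKey(otherRdd) // otherRdd: RDD[(Int, _)], hypothetical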
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6 - spark is a SparkSession, a new class in Spark 2.0 and the new entry point to SQL functionality; it should be used instead of SQLContext in Spark 2.x.
You can also create your own case classes.
You can also map the rows - here source is of type DataFrame and we use a partial function inside map:
val sourceRdd = source.rdd.map { case x: Row => x(0).asInstanceOf[String] }.map(s => s.split(","))

Convert HadoopRDD to DataFrame

In EMR Spark, I have a HadoopRDD
org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)] = HadoopRDD[0] at hadoopRDD
I want to convert this to a DataFrame (org.apache.spark.sql.DataFrame).
Does anyone know how to do this?
First convert it to simple types. Let's say your DynamoDBItemWritable has just one string column:
val simple: RDD[(String, String)] = rdd.map {
  // adapt this extraction to your item's actual layout
  case (text, dbwritable) => (text.toString, dbwritable.getString(0))
}
Then you can use toDF to get a DataFrame:
import sqlContext.implicits._
val df: DataFrame = simple.toDF()
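If you need to pull named attributes out of the DynamoDB item rather than a single placeholder column, a sketch like the following may help. It assumes the emr-dynamodb-hadoop connector, where DynamoDBItemWritable.getItem returns a java.util.Map of attribute names to AttributeValue; the attribute names "id" and "name" are hypothetical stand-ins for your table's attributes.
import scala.collection.JavaConverters._
import org.apache.spark.rdd.RDD

val simple: RDD[(String, String)] = rdd.map { case (_, item) =>
  val attrs = item.getItem.asScala
  (
    attrs.get("id").map(_.getS).getOrElse(""),   // hypothetical attribute
    attrs.get("name").map(_.getS).getOrElse("")  // hypothetical attribute
  )
}

import sqlContext.implicits._
val df = simple.toDF("id", "name")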