Convert HadoopRDD to DataFrame - scala

In EMR Spark, I have a HadoopRDD
org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)] = HadoopRDD[0] at hadoopRDD
I want to convert this to DataFrame org.apache.spark.sql.DataFrame.
Does anyone know how to do this?

First convert it to simple types. Let's say your DynamoDBItemWritable has just one string column:
val simple: RDD[(String, String)] = rdd.map {
case (text, dbwritable) => (text.toString, dbwritable.getString(0))
}
Then you can use toDF to get a DataFrame:
import sqlContext.implicits._
val df: DataFrame = simple.toDF()

Related

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset, and then using mapPartitions, trying to access each of its elements and store those in variables. Using below piece of code for the same:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions{iterator => { val myList = iterator.toList
myList.map(x=> { val value1 = x.getString(0)
val value2 = x.getString(1)
val value3 = x.getString(2)}).iterator}} (encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I am targeting to store the row elements in variables, and perform some operation with these. Not sure what am I missing here. Any help towards this would be highly appreciated!
Actually, there are several problems with your code:
Your map-statement has no return value, therefore Unit
If you return a tuple of String from mapPartitions, you don't need a RowEncoder (because you don't return a Row, but a Tuple3 which does not need a encoder because its a Product)
You can write your code like this:
df
.mapPartitions{itr => itr.map(x=> (x.getString(0),x.getString(1),x.getString(2)))}
.toDF("col1","col2","col3") // Convert Dataset to Dataframe, get desired field names
But you could just use a simple select statement in DataFrame API, no need for mapPartitions here
df
.select($"col1",$"col2",$"col3")

How to convert RDD[Result] to RDD[Row] in spark?

In our application, we are connecting spark with HBase, using the following code:
val hBaseRDD: RDD[(ImmutableBytesWritable, Result)] =
sparkSession.sparkContext.newAPIHadoopRDD(
conf,
classOf[TableInputFormat],
classOf[ImmutableBytesWritable],
classOf[Result]
)
val resultRDD: RDD[Result] = hBaseRDD.map(tuple => tuple._2)
But this provides us with an RDD of type Result.
We need RDD of type 'Row' to create DataFrame out of this RDD.
How can we do the same?
Thanks

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from mysql
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to rdd by
val sourceRdd = source.rdd
but its RDD[Row] I need RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc..
Thank you
You can use Row. mkString(sep: String): String method in a map call like this :
val sourceRdd = source.rdd.map(_.mkString(","))
You can change the "," parameter by whatever you want.
Hope this help you, Best Regards.
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6 - spark is a SparkSession, which is a new class in Spark 2.0 and is a new entry point to SQL functionality. It should be used instead of SQLContext in Spark 2.x
You can also create own case classes.
Also you can map rows - here source is of type DataFrame, we use partial function in map function:
val sourceRdd = source.rdd.map { case x : Row => x(0).asInstanceOf[String] }.map(s => s.split(","))

Spark, Scala - column type determine

I can load data from database, and I do some process with this data.
The problem is some table has date column as 'String', but some others trait it as 'timestamp'.
I cannot know what type of date column is until loading data.
> x.getAs[String]("date") // could be error when date column is timestamp type
> x.getAs[Timestamp]("date") // could be error when date column is string type
This is how I load data from spark.
spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", table)
.option("user", user)
.option("password", password)
.load()
Is there any way to trait them together? or convert it as string always?
You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")
// If column is String, converts it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
df.schema("date").dataType match {
case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
case _ => df
}
}
// after "normalizing", you can assume date has Timestamp type -
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
Here are a few things you can try:
(1) Start utilizing the inferSchema function during load if you have a version that supports it. This will have spark figure the data type of columns, this doesn't work in all scenarios. Also look at the input data, if you have quotes I advise adding an extra argument to account for them during the load.
val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
(2) To identify the data type of a column you can use the below code, it will place all of the column name and data types into their own Arrays of Strings.
val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
It has a easy way to address this which is get(i: Int): Any. And it will be map between Spark SQL types and return types automatically. e.g.
val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
def parseLocationColumn(df: DataFrame): DataFrame = {
df.schema("location").dataType match {
case StringType => df.withColumn("locationTemp", $"location")
.withColumn("countryTemp", lit("Unknown"))
.withColumn("regionTemp", lit("Unknown"))
.withColumn("zoneTemp", lit("Unknown"))
case _ => df.withColumn("locationTemp", $"location.location")
.withColumn("countryTemp", $"location.country")
.withColumn("regionTemp", $"location.region")
.withColumn("zoneTemp", $"location.zone")
}
}

Get Cluster_ID and the rest of table using Spark MLlib KMeans

I have this dataset (I'm putting some a few rows):
11.97,1355,401
3.49,25579,12908
9.29,129186,10882
28.73,10153,22356
3.69,22872,9798
13.49,160371,2911
24.36,106764,867
3.99,163670,16397
19.64,132547,401
And I'm trying to assign all this rows to 4 clusters using K-Means. For that I'm using the code that I see in this post: Spark MLLib Kmeans from dataframe, and back again
val data = sc.textFile("/user/cloudera/TESTE1")
val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 4, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
val idCluster = idClusterRDD.toDF("purchase","id","product","cluster")
I'm getting this outputs:
scala> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> val data = sc.textFile("/user/cloudera/TESTE")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/TESTE MapPartitionsRDD[7] at textFile at <console>:29
scala> val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
idPointRDD: org.apache.spark.rdd.RDD[(Char, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[8] at map at <console>:31
But when I run it I'm getting the following error:
java.lang.UnsupportedOperationException: Schema for type Char is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
How can I solve this problem?
Many thanks!
Here is the thing. You are actually reading a CSV of values into an RDD of String and not converting it properly to numeric values. Instead since a string is a collection of character when you call upon s(0) per example this actually works converts the Char value to an integer or a double but it's not what you are actually looking for.
You need to split your val data : RDD[String]
val data : RDD[String] = ???
val idPointRDD = data.map {
s =>
s.split(",") match {
case Array(x,y,z) => Vectors.dense(x.toDouble, Integer.parseInt(y).toDouble,Integer.parseInt(z).toDouble)
}
}.cache()
This should work for you !