Using Scala, create DataFrame or RDD from Java ResultSet - scala

I don't want to create a DataFrame or RDD directly with the spark.read method. I want to build a DataFrame or RDD from a Java ResultSet (it has 5,000,00 records). I would appreciate a thorough solution.

First, create the rows using RowFactory. Second, convert all the rows into a DataFrame using the SQLContext.createDataFrame method. I hope this helps.
import java.sql.{Connection, ResultSet}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, RowFactory, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import scala.collection.mutable.ListBuffer
// Schema that defines the DataFrame columns.
def getSchema(): StructType = StructType(
  StructField("COUNT", LongType, false) ::
  StructField("TABLE_NAME", StringType, false) :: Nil)
val rowList = new ListBuffer[Row]
// The ResultSet is created with traditional Java JDBC (DbConnection is an open java.sql.Connection).
val resultSet: ResultSet = DbConnection.createStatement().executeQuery("Sql")
// Loop over the ResultSet.
while (resultSet.next()) {
  // Put the two columns into a Row, using typed getters so the values match the schema,
  // and add the Row to the list.
  rowList += RowFactory.create(Long.box(resultSet.getLong(1)), resultSet.getString(2))
}
val sConf = new SparkConf()
sConf.setAppName("")
sConf.setMaster("local[*]")
val sContext = new SparkContext(sConf)
val sqlContext = new SQLContext(sContext)
// Create the DataFrame from the collected rows.
val DF: DataFrame = sqlContext.createDataFrame(sContext.parallelize(rowList.toList, 2), getSchema())
DF.show() // Show the DataFrame.

Related

UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row

I am trying to create a DataFrame. It seems that Spark is unable to create a DataFrame from a scala.Tuple2 whose first element is a Row. How can I do it? I am new to Scala and Spark.
Below is a part of the error trace from the code run
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:666)
..........
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:299)
at SparkMapReduce$.runMapReduce(SparkMapReduce.scala:46)
at Entrance$.queryLoader(Entrance.scala:64)
at Entrance$.paramsParser(Entrance.scala:43)
at Entrance$.main(Entrance.scala:30)
at Entrance.main(Entrance.scala)
Below is the relevant part of the program. The problem occurs on the line just above the comment with the exclamation marks.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
object SparkMapReduce {
Logger.getLogger("org.spark_project").setLevel(Level.WARN)
Logger.getLogger("org.apache").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("com").setLevel(Level.WARN)
def runMapReduce(spark: SparkSession, pointPath: String, rectanglePath: String): DataFrame =
{
var pointDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(pointPath);
pointDf = pointDf.toDF()
pointDf.createOrReplaceTempView("points")
pointDf = spark.sql("select ST_Point(cast(points._c0 as Decimal(24,20)),cast(points._c1 as Decimal(24,20))) as point from points")
pointDf.createOrReplaceTempView("pointsDf")
// pointDf.show()
var rectangleDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(rectanglePath);
rectangleDf = rectangleDf.toDF()
rectangleDf.createOrReplaceTempView("rectangles")
rectangleDf = spark.sql("select ST_PolygonFromEnvelope(cast(rectangles._c0 as Decimal(24,20)),cast(rectangles._c1 as Decimal(24,20)), cast(rectangles._c2 as Decimal(24,20)), cast(rectangles._c3 as Decimal(24,20))) as rectangle from rectangles")
rectangleDf.createOrReplaceTempView("rectanglesDf")
// rectangleDf.show()
val joinDf = spark.sql("select rectanglesDf.rectangle as rectangle, pointsDf.point as point from rectanglesDf, pointsDf where ST_Contains(rectanglesDf.rectangle, pointsDf.point)")
joinDf.createOrReplaceTempView("joinDf")
// joinDf.show()
import spark.implicits._
val joinRdd = joinDf.rdd
val resmap = joinRdd.map(x=>(x, 1))
val reduced = resmap.reduceByKey(_+_)
val final_datablock = reduced.collect()
val trying : List[Float] = List()
print(final_datablock)
// .toDF("rectangles", "count")
// val dataframe_final1 = spark.createDataFrame(reduced)
val dataframe_final2 = spark.createDataFrame(reduced).toDF("rectangles", "count")
// ^ !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Line above creates problem
// You need to complete this part
var result = spark.emptyDataFrame
return result // You need to change this part
}
}
The first column of reduced has type Row, and you did not specify a schema for it when converting the RDD to a DataFrame. A DataFrame must have a schema, so you need to define the right schema for your RDD and use the following method to convert it to a DataFrame:
createDataFrame(RDD<Row> rowRDD, StructType schema)
For example (the Row-typed key becomes a nested struct; the field names here are placeholders):
import org.apache.spark.sql.types._
val schema = new StructType()
.add("_1", new StructType()
.add("._1a", IntegerType)
.add("._1b", ArrayType(StringType)))
.add(StructField("count", IntegerType, true))
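Since createDataFrame(rowRDD, schema) expects an RDD of Row rather than an RDD of (Row, Int) pairs, each pair has to be wrapped in an outer Row first. A minimal sketch of that step, assuming a local SparkSession and a small in-memory stand-in for reduced (the string fields and the column names rectangle, point, rectangles and count are illustrative, not taken from the asker's geometry types):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[1]").appName("row-key-count").getOrCreate()

// Hypothetical stand-in for `reduced`: an RDD of (Row, Int) pairs.
val reduced = spark.sparkContext.parallelize(Seq(
  (Row("r1", "p1"), 2),
  (Row("r2", "p2"), 1)))

// The Row-typed key is described as a nested struct, the count as an int.
val schema = new StructType()
  .add("rectangles", new StructType()
    .add("rectangle", StringType)
    .add("point", StringType))
  .add("count", IntegerType, nullable = false)

// Wrap each (key, count) pair in an outer Row so the RDD becomes RDD[Row].
val rowRdd = reduced.map { case (key, n) => Row(key, n) }
val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()
```

With real geometry columns, the nested struct would have to use the matching user-defined types instead of StringType.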

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it asked for an org.apache.spark.sql.Encoder. The data type I am using for the Dataset is the case class Tp(s1: String, s2: String, s3: String).
All you need is to import implicit encoders from SparkSession instance before you create empty Dataset: import spark.implicits._
See the full example below (EmptyDataFrame):
package com.examples.sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object EmptyDataFrame {
def main(args: Array[String]){
//Create Spark Conf
val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")
//Create Spark Context - sc
val sc = new SparkContext(sparkConf)
//Create Sql Context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Import Sql Implicit conversions
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
//Create Schema RDD
val schema_string = "name,id,dept"
val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)) )
//Create Empty DataFrame
val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
//Some Operations on Empty Data Frame
empty_df.show()
println(empty_df.count())
//You can register a Table on Empty DataFrame, it's empty table though
empty_df.registerTempTable("empty_table")
//let's check it ;)
val res = sqlContext.sql("select * from empty_table")
res.show
}
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()
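Applied to the case class from the question, the implicit-encoder approach might look like the sketch below. It assumes Spark 2.x or later with a local SparkSession; the session settings are illustrative. The case class should be defined at the top level (not inside a method) so that an encoder can be derived for it.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// The case class from the question.
case class Tp(s1: String, s2: String, s3: String)

val spark = SparkSession.builder().master("local[1]").appName("empty-dataset").getOrCreate()

// Brings the implicit Encoder[Tp] into scope.
import spark.implicits._

val empty: Dataset[Tp] = spark.emptyDataset[Tp]
empty.printSchema()
```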

Convert csv file to dataframe in Spark 1.5.2 without databricks

I am trying to convert a CSV file to a DataFrame in Spark 1.5.2 with Scala, without the Databricks library, which is not available because this is a community project. My approach was the following:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt).map(i => (data.take(i).drop(i-1)(0)(0), data.take(i).drop(i-1)(0)(1), data.take(i).drop(i-1)(0)(2), data.take(i).drop(i-1)(0)(3), data.take(i).drop(i-1)(0)(4))).toDF(header(0), header(1), header(2), header(3), header(4))
This code, even though it is quite a mess, runs without returning any error messages. The problem comes when I try to display the data inside df in order to verify the correctness of this method and later run some queries on df. The error I get after executing df.show() refers to SPARK-5063. My questions are:
1) Why is it not possible to print the content of df?
2) Is there a more straightforward method to convert a CSV file to a DataFrame in Spark 1.5.2 without using the Databricks library?
For Spark 1.5.x, the code snippet below can be used to convert the input into a DataFrame:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface. DataClass here has 5 fields.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)
// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")
val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")
peopleDataFrame.show()
With Spark 2.0 or later (SparkSession is not available in Spark 1.5), you can create it like this in Java:
SparkSession spark = SparkSession
.builder()
.appName("RDDtoDF_Updated")
.master("local[2]")
.config("spark.some.config.option", "some-value")
.getOrCreate();
StructType schema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("eid", DataTypes.IntegerType, false),
DataTypes.createStructField("eName", DataTypes.StringType, false),
DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
DataTypes.createStructField("eGen", DataTypes.StringType,true)});
String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
.textFile(filepath)
.javaRDD()
.map(line -> line.split("\\,"))
.map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(),Integer.parseInt(r[2]),
Integer.parseInt(r[3]),Integer.parseInt(r[4]),r[5].trim() ));
Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("edept").max("esal").show();
Using Spark with Scala:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
val hiveCtx = new HiveContext(sc)
val inputPath = "input.csv"
val text = sc.textFile(inputPath)
val rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
val header = rows.first()
// Build an all-string schema from the header row.
val schema = StructType(header.toSeq.map(fieldName => StructField(fieldName.asInstanceOf[String], StringType, true)))
// Drop the header row before creating the DataFrame.
val data = rows.filter(_ != header)
val df = hiveCtx.createDataFrame(data, schema)
This should work.
However, for creating DataFrames from CSV I would recommend using spark-csv.
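On the first question: SPARK-5063 is the error Spark reports when an RDD transformation or action is invoked inside another RDD's transformation. The original snippet calls data.take(i) inside a map over a different RDD, which is not allowed. A minimal sketch of a flat alternative follows. It assumes a local SparkSession (Spark 2.x or later) and an in-memory stand-in for the file, but the same RDD steps apply with sc.textFile and an SQLContext.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("csv-from-rdd").getOrCreate()
val sc = spark.sparkContext

// Hypothetical in-memory stand-in for sc.textFile("input.csv").
val text = sc.parallelize(Seq("id,name", "1,ann", "2,bob"))

// first() is an action that runs on the driver, outside any transformation.
val header = text.first()
// A plain local array, so it is safe to use when building the schema.
val columns = header.split(",").map(_.trim)

// Each transformation only touches its own element; no RDD is referenced inside another.
val rows = text
  .filter(_ != header)
  .map(_.split(",", -1).map(_.trim))
  .map(fields => Row.fromSeq(fields.toSeq))

val schema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))
val df = spark.createDataFrame(rows, schema)
df.show()
```

Every column is read as a string here; a simple split on commas also does not handle quoted fields that contain commas.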

Not able to create parquet files in hdfs using spark shell

I want to create a Parquet file in HDFS and then read it through Hive as an external table. I'm stuck with stage failures in spark-shell while writing Parquet files.
Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7
Input file:(employee.txt)
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23
In Spark-Shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
When I type the last command I get,
ERROR
SPARK APPLICATION MANAGER
I even tried increasing the executor memory, but it still fails.
Importantly, finalDF.show() also produces the same error.
So I believe I have made a logical error here.
Thanks for your support.
The issue is that you create a schema in which every field defaults to StringType, but the values you pass for id and age are converted to Int by your code. This mismatch throws a scala.MatchError at run time.
The data types of the columns in the schema must match the data types of the values being passed in. Try the code below.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
This code should run fine.
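The type-matching rule can be checked without HDFS using a small in-memory version of the sample data. This is a sketch only, assuming a local SparkSession (Spark 2.x or later) rather than the Spark 1.5.2 shell from the question; the session settings are illustrative.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("schema-match").getOrCreate()

// Two lines from the sample employee.txt, held in memory.
val employee = spark.sparkContext.parallelize(Seq("1201,satish,25", "1202,krishna,28"))

// id and age are declared as integers, matching the toInt conversions below.
val schema = StructType(
  StructField("id", IntegerType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val rowRDD = employee.map(_.split(",")).map(e => Row(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
val employeeDF = spark.createDataFrame(rowRDD, schema)
employeeDF.show()
```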

Creating a empty table using SchemaRDD in Scala [duplicate]

I want to create a DataFrame with a specified schema in Scala. I have tried reading JSON (that is, reading an empty file), but I don't think that's the best practice.
Let's assume you want a data frame with the following schema:
root
|-- k: string (nullable = true)
|-- v: integer (nullable = false)
You simply define the schema for the data frame and use an empty RDD[Row]:
import org.apache.spark.sql.types.{
StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
val schema = StructType(
StructField("k", StringType, true) ::
StructField("v", IntegerType, false) :: Nil)
// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
spark.createDataFrame(sc.emptyRDD[Row], schema)
PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])
# or df = sc.parallelize([]).toDF(schema)
# Spark < 2.0
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._
Seq.empty[(String, Int)].toDF("k", "v")
or case class:
case class KV(k: String, v: Int)
Seq.empty[KV].toDF
or
spark.emptyDataset[KV].toDF
As of Spark 2.0.0, you can do the following.
Case Class
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import the implicit Encoders from the SparkSession instance (spark):
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
Schema DSL
You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)
scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> emptyDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Java version to create empty DataSet:
public Dataset<Row> emptyDataSet(){
SparkSession spark = SparkSession.builder().appName("Simple Application")
.config("spark.master", "local").getOrCreate();
Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());
return emptyDataSet;
}
public StructType getSchema() {
String schemaString = "column1 column2 column3 column4 column5";
List<StructField> fields = new ArrayList<>();
StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
fields.add(indexField);
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
return schema;
}
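// Generic helper: derive the schema from a case class via reflection (uses Spark's internal ScalaReflection).
import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
def createEmptyDataFrame[T: ru.TypeTag] =
hiveContext.createDataFrame(sc.emptyRDD[Row],
ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
)
case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]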
Here you can create the schema using StructType in Scala and pass an empty RDD, which lets you create an empty table.
The following code does this.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.BooleanType
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.types.StringType
//import org.apache.hadoop.hive.serde2.objectinspector.StructField
object EmptyTable extends App {
val conf = new SparkConf;
val sc = new SparkContext(conf)
//create sparksession object
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
//Created schema for three columns
val schema = StructType(
StructField("Emp_ID", LongType, true) ::
StructField("Emp_Name", StringType, false) ::
StructField("Emp_Salary", LongType, false) :: Nil)
//Created Empty RDD
var dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)
newDFSchema.createOrReplaceTempView("tempSchema")
sparkSession.sql("create table Finaltable AS select * from tempSchema")
}
This is helpful for testing purposes.
Seq.empty[String].toDF()
Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
spark.createDataFrame(sc.emptyRDD(), schema)
I had a special requirement wherein I already had a dataframe but given a certain condition I had to return an empty dataframe so I returned df.limit(0) instead.
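A minimal sketch of the limit(0) approach, assuming a local SparkSession and a small illustrative DataFrame (the column names k and v are placeholders). The result keeps the schema of the original DataFrame but contains no rows.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("limit-zero").getOrCreate()
import spark.implicits._

// Illustrative non-empty DataFrame.
val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// Same columns and types as df, zero rows.
val empty = df.limit(0)
empty.printSchema()
```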
I'd like to add the following syntax which was not yet mentioned:
Seq[(String, Integer)]().toDF("k", "v")
It makes it clear that the () part is for values. It's empty, so the dataframe is empty.
This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
As of Spark 2.4.3
val df = SparkSession.builder().getOrCreate().emptyDataFrame