Converting textFile to DataFrame dynamically - Scala

I am trying to convert input from a text file to a DataFrame using a schema file that is read at runtime.
My input text file looks like this:
John,23
Charles,34
The schema file looks like this:
name:string
age:integer
This is what I tried:
import scala.io.Source

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DynamicSchema {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val schemaFile = args(1)

    // Read the schema file into a Map of column name -> type name, e.g. "age" -> "integer"
    val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines()
      .map(_.split(":"))
      .map(l => l(0) -> l(1))
      .toMap

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Dynamic Schema")
      .getOrCreate()
    import spark.implicits._

    val input = spark.sparkContext.textFile(args(0))
    val schema = spark.sparkContext.broadcast(schemaLines)

    // Map Spark type names ("integer", "string") to the corresponding data types
    val nameToType = Seq(IntegerType, StringType)
      .map(t => t.typeName -> t)
      .toMap
    println(nameToType)

    val fields = schema.value
      .map(field => StructField(field._1, nameToType(field._2), nullable = true))
      .toSeq
    val schemaStruct = StructType(fields)

    val rowRDD = input
      .map(_.split(","))
      .map(attributes => Row.fromSeq(attributes))

    val peopleDF = spark.createDataFrame(rowRDD, schemaStruct)
    peopleDF.printSchema()

    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")

    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")
    results.show()
  }
}
Though printSchema gives the desired result, results.show errors out. I think the age field actually needs to be converted using toInt. Is there a way to achieve this when the schema is only available at runtime?

Replace
val input = spark.sparkContext.textFile(args(0))
with
val input = spark.read.schema(schemaStruct).csv(args(0))
and move it after the schema definition.
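For example, the end of main could then look like this (keeping the schemaLines, nameToType and fields definitions from the question unchanged):
val schemaStruct = StructType(fields)

// The csv reader parses each column according to schemaStruct,
// so "23" is read as an integer for the age column without a manual toInt.
val input = spark.read.schema(schemaStruct).csv(args(0))

// input is already a DataFrame, so rowRDD and createDataFrame are no longer needed.
input.printSchema()
input.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()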

Related

Output is not showing, Spark Scala

The output shows the schema, but the output of the SQL query is not visible. I don't understand what I am doing wrong.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ex_1 {

  def parseLine(line: String): (String, String, Int, Int) = {
    val fields = line.split(" ")
    val project_code = fields(0)
    val project_title = fields(1)
    val page_hits = fields(2).toInt
    val page_size = fields(3).toInt
    (project_code, project_title, page_hits, page_size)
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val sc = new SparkContext("local[*]", "Weblogs")
    val lines = sc.textFile("F:/Downloads_F/pagecounts.out")
    val parsedLines = lines.map(parseLine)
    println("hello")

    val spark = SparkSession
      .builder
      .master("local")
      .getOrCreate
    import spark.implicits._

    val RDD1 = parsedLines.toDF("project", "page", "pagehits", "pagesize")
    RDD1.printSchema()
    RDD1.createOrReplaceTempView("logs")

    val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
    val results = min1.collect()
    results.foreach(println)

    println("bye")
    spark.stop()
  }
}
As confirmed in the comments, using the show method displays the result of spark.sql(..).
Since spark.sql returns a DataFrame, calling show is the ideal way to display the data. Calling collect, as you were doing previously, is not advised:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
..
..
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
// where `false` prevents the output from being truncated.
min1.show(false)
println("bye")
spark.stop()
Even if your DataFrame is empty, you will still see a table output including the column names (i.e. the schema), whereas .collect() and println would print nothing in this scenario.

Scala/Spark - Create Dataset with one column from another Dataset

I am trying to create a Dataset with only one column from a case class.
Below is the code:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

case class vectorData(value: Array[String], vectors: Vector)

def main(args: Array[String]) {
  val spark = SparkSession.builder
    .appName("Hello world!")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  //blah blah and read data etc.
  val word2vec = new Word2Vec()
    .setInputCol("value").setOutputCol("vectors")
    .setVectorSize(5).setMinCount(0).setWindowSize(5)

  val dataset = spark.createDataset(data)
  val model = word2vec.fit(dataset)

  val encoder = org.apache.spark.sql.Encoders.product[vectorData]
  val result = model.transform(dataset).as(encoder)

  //val output: Dataset[Vector] = ???
}
As shown in the last line of the code, I want the output to contain only the second column, which is of Vector type and holds the vectors data.
I tried with:
val output = result.map(o => o.vectors)
But this line is highlighted with the error: No implicit arguments of type: Encoder[Vector]
How do I resolve this?
I think adding the line:
implicit val vectorEncoder: Encoder[Vector] = org.apache.spark.sql.Encoders.product[Vector]
should make
val output = result.map(o => o.vectors)
correct.
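A minimal sketch along these lines; since ml.linalg.Vector is not a case class (not a Product), this sketch assumes an ExpressionEncoder rather than Encoders.product:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Provide an implicit Encoder[Vector] so that map can return a Dataset[Vector].
implicit val vectorEncoder: Encoder[Vector] = ExpressionEncoder()

// Keep only the vectors column as a Dataset[Vector].
val output: Dataset[Vector] = result.map(o => o.vectors)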

Spark Scala | Create DataFrame Dynamically

I would like to create dataframe names dynamically from a collection.
Please see below:
val set1 = Set("category1","category2","category3")
The following is a function which takes a string x from the set as input and generates the DataFrame accordingly:
def catDfgen(x: String): DataFrame = {
  spark.sql(s"select * from table where col1 = '$x'")
}
Now I need help here: I want not only to create the DataFrame but also to generate the DataFrame name dynamically, in order to achieve
val category1DF = catDfgen($x)
val category2DF = catDfgen($x)
...etc. Would it be possible to do it using the code below?
set1.map( x => val $x+"DF" = catDfgen($x))
If not, please suggest an effective method.
Suman, I believe the below might help your use case:
import org.apache.spark.sql.{DataFrame, SparkSession}

object Test extends App {

  val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()

  val set1 = Set("category1", "category2", "category3")

  val dfs: Map[String, DataFrame] = set1.map(x =>
    (s"${x}DF", spark.sql(s"select * from table where col1 = '$x'").alias(s"${x}DF").toDF())
  ).toMap

  dfs("category1DF").show()

  spark.stop()
}
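Since dynamically named vals are not possible in compiled Scala, keeping the DataFrames in a Map keyed by the generated name is the usual workaround; a short usage sketch against the dfs map above:
// Look up a single DataFrame by its generated name.
val category2DF: DataFrame = dfs("category2DF")

// Or act on all of them at once.
dfs.foreach { case (name, df) =>
  println(s"$name has ${df.count()} rows")
}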

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark versions < 2.0:
The easiest way is to use spark-csv - include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (at the cost of an extra scan of the data).
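For example, a minimal sketch with spark-csv, assuming an existing sqlContext and the com.databricks:spark-csv package on the classpath:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")      // no header row in file.txt
  .option("delimiter", ";")       // custom delimiter
  .option("inferSchema", "true")  // extra pass over the data to infer column types
  .load("file.txt")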
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
I have given different ways to create a DataFrame from a text file.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(","))
  .map { case Array(a, b, c) => (a, b.toInt, c) }
  .toDF("name", "age", "city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder()
  .appName("SparkSessionZipsExample")
  .config(conf)
  .getOrCreate()

val df = sparkSess.read
  .option("header", "false")
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val dfWithSchema = sparkSess.read
  .option("header", "false")
  .schema(schema)
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
  .map(_.split(","))
  .map(x => org.apache.spark.sql.Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id: String, field2: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
You will not be able to convert it into a DataFrame until you use the implicit conversions.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
Only after this can you convert it to a DataFrame:
case class Test(id: String, field2: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc(amount: Int, types: String, id: Int) // columns and data types
// split each comma-separated line and map it to the case class (needs import spark.implicits._)
val df2 = df.map { line => val rec = line.split(","); Abc(rec(0).toInt, rec(1), rec(2).toInt) }
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A text file with PIPE (|) delimited fields can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show
You can read a file into an RDD and then assign a schema to it. Two common ways of creating the schema are using either a case class or a Schema object [my preferred one]. Quick snippets of code that you may use follow.
Case Class approach
case class Test(id: String, name: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since case classes have a limit of 22 fields (in Scala 2.10 and earlier), and this will be a problem if your file has more than 22 fields!

Spark: Save Dataframe in ORC format

In previous versions we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save the data in a DataFrame in ORC file format?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def main(args: Array[String]) {
  println("Creating Orc File!")
  val sparkConf = new SparkConf().setAppName("orcfile")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val people = sc.textFile("/apps/testdata/people.txt")
  val schemaString = "name age"
  val schema = StructType(schemaString.split(" ").map(fieldName =>
    if (fieldName == "name") StructField(fieldName, StringType, true)
    else StructField(fieldName, IntegerType, true)))
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), new Integer(p(1).trim)))

  // Infer table schema from RDD
  val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)

  // Create a table from the schema
  peopleSchemaRDD.registerTempTable("people")
  val results = hiveContext.sql("SELECT * FROM people")
  results.map(t => "Name: " + t.toString).collect().foreach(println)

  // Now I want to save this DataFrame (peopleSchemaRDD) in ORC format. How do I do that?
}
Since Spark 1.4 you can simply use DataFrameWriter and set format to orc:
peopleSchemaRDD.write.format("orc").save("people")
or
peopleSchemaRDD.write.orc("people")
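Reading the files back in is the mirror image; a quick sketch using the same hiveContext:
val peopleOrc = hiveContext.read.format("orc").load("people")
peopleOrc.show()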