Spark - Scala - How to convert a DataFrame to a custom object?

Here is a block of code. In this snippet I am reading a multi-line JSON file and converting each row into an Emp object.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}

def main(args: Array[String]): Unit = {
  val filePath = Configuration.folderPath + "emp_unformatted.json"
  val sparkConfig = new SparkConf().setMaster("local[2]").setAppName("findEmp")
  val sparkContext = new SparkContext(sparkConfig)
  val sqlContext = new SQLContext(sparkContext)
  val formattedJsonData = sqlContext.read.option("multiline", "true").json(filePath)
  val res = formattedJsonData.rdd.map(empParser)
  for (e <- res.take(2)) println(e.name + " " + e.company + " " + e.about)
}

case class Emp(name: String, company: String, email: String, address: String, about: String)

def empParser(row: Row): Emp = {
  Emp(row.getAs("name"), row.getAs("company"), row.getAs("email"), row.getAs("address"), row.getAs("about"))
}
My question is: is the line "formattedJsonData.rdd.map(empParser)" the correct approach? I am converting to an RDD of Emp objects.
1. Is that the right approach?
2. Suppose I have 1 lakh (100K) or 1M records; is there any performance issue in that case?
3. Is there a better option to convert to a collection of Emp?

If you are using Spark 2, you can use a Dataset, which is type-safe and also provides the performance benefits of DataFrames.
val df = sqlSession.read.option("multiline", "true").json(filePath)
import sqlSession.implicits._
val ds: Dataset[Emp] = df.as[Emp]
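For reference, here is a fuller sketch of that Dataset approach, assuming Spark 2.x with a SparkSession (the object name, app name and file path below are placeholders, not from the original post):
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the Emp case class from the question
case class Emp(name: String, company: String, email: String, address: String, about: String)

object FindEmpDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("findEmp")
      .getOrCreate()
    import spark.implicits._   // brings the Encoder[Emp] needed by .as[Emp] into scope

    // "multiLine" is the JSON reader option for multi-line records (Spark 2.2+)
    val df = spark.read.option("multiLine", "true").json("emp_unformatted.json")
    val ds: Dataset[Emp] = df.as[Emp]

    ds.take(2).foreach(e => println(e.name + " " + e.company + " " + e.about))
    spark.stop()
  }
}
Since the Dataset keeps the Emp type, per-row field access is checked at compile time while Spark can still apply its Catalyst/Tungsten optimizations, which is the performance benefit the answer above refers to.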

Related

Store Schema of Read File Into csv file in spark scala

I am reading a CSV file with the inferSchema option enabled into a DataFrame using the command below.
val df2 = spark.read.options(Map("inferSchema"->"true","header"->"true")).csv("s3://Bucket-Name/Fun/Map/file.csv")
df2.printSchema()
Output:
root
|-- CC|Fun|Head|Country|SendType: string (nullable = true)
Now I would like to store only the above output in a CSV file containing just these column names and the data type of each column, like below.
column_name,datatype
CC,string
Fun,string
Head,string
Country,string
SendType,string
I tried writing this to a CSV using the option below, but it writes the file with the entire data.
df2.coalesce(1).write.format("csv").mode("append").save("schema.csv")
Use df.schema.fields to get the fields and their data types.
Check the code below.
scala> val schema = df.schema.fields.map(field => (field.name,field.dataType.typeName)).toList.toDF("column_name","datatype")
schema: org.apache.spark.sql.DataFrame = [column_name: string, datatype: string]
scala> schema.show(false)
+---------------+--------+
|column_name |datatype|
+---------------+--------+
|applicationName|string |
|id |string |
|requestId |string |
|version |long |
+---------------+--------+
scala> schema.write.format("csv").save("/tmp/schema")
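Not part of the original answer, but if you also want the column_name,datatype header in the output and a single part file, a minimal variant of that last line would be:
scala> schema.coalesce(1).write.option("header", "true").csv("/tmp/schema_with_header")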
Alternatively, try something like the code below, which writes the header and schema rows with a plain FileWriter (use coalesce(1) and .option("header","true") if you want the DataFrame writer to produce a single CSV with a header instead).
import java.io.FileWriter

object SparkSchema {

  def main(args: Array[String]): Unit = {
    val fw = new FileWriter("src/main/resources/csv.schema", true)
    fw.write("column_name,datatype\n")

    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = List(("", "", "", 1L)).toDF("applicationName", "id", "requestId", "version")

    val columnList: List[(String, String)] = df.schema.fields
      .map(field => (field.name, field.dataType.typeName))
      .toList

    try {
      val outString = columnList.map(col => col._1 + "," + col._2).mkString("\n")
      fw.write(outString)
    } finally fw.close()

    val newColumnList: List[(String, String)] = List(("newColumn", "integer"))
    val finalColList = columnList ++ newColumnList
    writeToS3("s3://bucket/newFileName.csv", finalColList)
  }

  def writeToS3(s3FileNameWithpath: String, finalColList: List[(String, String)]): Unit = {
    val outString = finalColList.map(col => col._1 + "," + col._2).mkString("\n")

    import org.apache.hadoop.fs._
    import org.apache.hadoop.conf.Configuration
    val conf = new Configuration()
    conf.set("fs.s3a.access.key", "YOUR ACCESS KEY")
    conf.set("fs.s3a.secret.key", "YOUR SECRET KEY")

    val dest = new Path(s3FileNameWithpath)
    val fs = dest.getFileSystem(conf)
    val out = fs.create(dest, true)
    out.write(outString.getBytes)
    out.close()
  }
}
An alternative to #QuickSilver's and #Srinivas' solutions, both of which should work, is to use the DDL representation of the schema. With df.schema.toDDL you get:
CC STRING, fun STRING, Head STRING, Country STRING, SendType STRING
which is the string representation of the schema. You can then split and replace as shown next:
import java.io.PrintWriter

val schema = df.schema.toDDL.split(",")
// Array[String] = Array(`CC` STRING, `fun` STRING, `Head` STRING, `Country` STRING, `SendType` STRING)

val writer = new PrintWriter("/tmp/schema.csv")
writer.write("column_name,datatype\n")
schema.foreach{ r => writer.write(r.trim.replace(" ", ",") + "\n") }
writer.close()
To write to S3 you can use the Hadoop API as QuickSilver already did, or a third-party library such as MinIO:
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
minioClient.putObject("YOUR_BUCKET","schema.csv", "/tmp/schema.csv", null)
Or, even better, generate the string, store it in a buffer, and then send it to S3 via an InputStream:
import java.io.ByteArrayInputStream
import io.minio.{MinioClient, PutObjectOptions}

val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
val schema = df.schema.toDDL.split(",")

val schemaBuffer = new StringBuilder
schemaBuffer ++= "column_name,datatype\n"
schema.foreach{ r => schemaBuffer ++= r.trim.replace(" ", ",") + "\n" }

val inputStream = new ByteArrayInputStream(schemaBuffer.toString.getBytes("UTF-8"))
minioClient.putObject("YOUR_BUCKET", "schema.csv", inputStream, new PutObjectOptions(inputStream.available(), -1))
inputStream.close()
#PySpark
df_schema = spark.createDataFrame([(i.name, str(i.dataType)) for i in df.schema.fields], ['column_name', 'datatype'])
df_schema.show()
This will create a new DataFrame holding the schema of the existing DataFrame.
Use case:
Useful when you want to create a table with the schema of the DataFrame and you cannot use the code below, because the PySpark user may not be authorized to execute DDL commands on the database.
df.createOrReplaceTempView("tmp_output_table")
spark.sql("""drop table if exists schema.output_table""")
spark.sql("""create table schema.output_table as select * from tmp_output_table""")
In PySpark you can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes. Follow this link for more details: pyspark.sql.DataFrame.dtypes.
Having said that, try the code below:
data = df.dtypes
cols = ["col_name", "datatype"]
df = spark.createDataFrame(data=data,schema=cols)
df.show()

issue when using scala filter function in rdd

I started learning Scala and Apache Spark. I have an input file as below, without a header.
0,name1,33,385 - first record
1,name2,26,221 - second record
unique-id, name, age, friends
1) When trying to filter on age not being 26, the code below is not working.
def parseLine(x : String) = {
  val line = x.split(",").filter(x => x._2 != "26")
}
I also tried the following; in both cases it prints all the values, including 26.
val friends = line(2).filter(x => x != "26")
2) When trying with index x._3, it says index out of bounds.
val line = x.split(",").filter(x => x._3 != "221")
Why does index 3 have an issue here?
Please find the complete sample code below.
package learning

import org.apache.spark._
import org.apache.log4j._

object Test1 {
  def main(args : Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Test1")
    val lines = sc.textFile("D:\\SparkScala\\abcd.csv")
    Logger.getLogger("org").setLevel(Level.ERROR)
    val testres = lines.map(parseLine)
    testres.take(10).foreach(println)
  }

  def parseLine(x : String) = {
    val line = x.split(",").filter(x => x._2 != "33")
    //val line = x.split(",").filter(x => x._3 != "307")
    val age = line(1)
    val friends = line(3).filter(x => x != "307")
    (age, friends)
  }
}
How can I filter by age or friends in a simple way here?
Why is index 3 not working here?
The issue is that you are trying to filter on the array representing a single line and not on the RDD that contains all the lines.
A possible version could be the following (I also created a case class to hold the data coming from the CSV):
package learning

import org.apache.spark._
import org.apache.log4j._

object Test2 {

  // A structured representation of a CSV line
  case class Person(id: String, name: String, age: Int, friends: Int)

  def main(args : Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Test1")
    Logger.getLogger("org").setLevel(Level.ERROR)

    sc.textFile("D:\\SparkScala\\abcd.csv")   // RDD[String]
      .map(line => parse(line))               // RDD[Person]
      .filter(person => person.age != 26)     // filter out people of 26 years old
      .take(10)                               // collect 10 people from the RDD
      .foreach(println)
  }

  def parse(x : String): Person = {
    // Split the CSV string by comma into an array of strings
    val line = x.split(",")
    // After extracting the fields from the CSV string, create an instance of Person
    Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt)
  }
}
Another possibility would be to use flatMap() and Option values instead. In this case you can operate on a single line directly, for instance:
package learning

import org.apache.spark._
import org.apache.log4j._

object Test3 {

  // A structured representation of a CSV line
  case class Person(id: String, name: String, age: Int, friends: Int)

  def main(args : Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Test1")
    Logger.getLogger("org").setLevel(Level.ERROR)

    sc.textFile("D:\\SparkScala\\abcd.csv")   // RDD[String]
      .flatMap(line => parse(line))           // RDD[Person] -- you don't need to filter anymore, the flatMap does it for you now
      .take(10)                               // collect 10 people from the RDD
      .foreach(println)
  }

  def parse(x : String): Option[Person] = {
    // Split the CSV string by comma into an array of strings
    val line = x.split(",")
    // After extracting the fields from the CSV string, create an instance of Person only if the age is not 26
    line(2) match {
      case "26" => None
      case _    => Some(Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt))
    }
  }
}

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update my myDF based on the values of some records of another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext : SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I inspect hiveContext in the test function, I can access its SparkContext and load my DataFrame without any problem.
Nevertheless, if I inspect my hiveContext object just before the NullPointerException, its sparkContext is null. I suppose this is because SparkContext is not serializable (and since I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know what exactly is wrong with my code, and how I should alter it to get my investmentDF without any NullPointerException.
Thanks!
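No answer is recorded here for this question, but the usual way around this kind of NullPointerException is to avoid calling the Hive/SQL context inside map() at all and to express the lookup as a join built on the driver instead. The following is only a rough sketch of that idea under my own assumptions (it reuses the table and column names from the question, joins on both cod_a and cod_p, and picks the top-sales row with a window function, which requires Spark 1.6+):
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Sketch: enrich each customer row with the cod_t/nom_t of its best-selling investment,
// instead of re-reading myDB.Investment once per row inside map()
def enrichCustomers(hiveContext: SQLContext): DataFrame = {
  import hiveContext.implicits._

  val customers  = hiveContext.read.table("myDB.Customers")
  val investment = hiveContext.read.table("myDB.Investment")

  // Keep only the best-selling investment per (cod_a, cod_p)
  val bySales = Window.partitionBy($"cod_a", $"cod_p").orderBy($"sales".desc)
  val topInvestment = investment
    .withColumn("rn", row_number().over(bySales))
    .filter($"rn" === 1)
    .select($"cod_a", $"cod_p", $"cod_t", $"nom_t")

  // One distributed join replaces the per-row collect() calls, so no context is needed inside map()
  customers.join(topInvestment, Seq("cod_a", "cod_p"), "left_outer")
}
This is not the only option (broadcasting a small Investment table would also work), but it keeps all Spark calls on the driver, which is what the serialization behaviour described above requires.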

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare it with the set "123,200,300"; if a match is found, give the matching data:
200,300 (from input line 1)
300 (from input line 2)
123 (from input line 4)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sparkApp {

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String) : RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map{r => parseLine(r)}
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer, why are we making it so complex? In this case we don't require a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063) or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on Stack Overflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
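Not part of the original answers, but for completeness, the usual pattern that avoids the nested RDD entirely is to keep the small reference set on the driver, broadcast it, and use it inside the transformation. A minimal sketch, reusing the path and set from the question:
import org.apache.spark.{SparkConf, SparkContext}

object MatchWithoutNestedRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("CountingSheep"))

    // Broadcast the small reference set instead of wrapping it in an RDD for every input line
    val matchSet = sc.broadcast("123,200,300".split(",").toSet)

    sc.textFile("hdfs://xxx/tmp/input.csv")
      .map(line => line.split(",").filter(matchSet.value.contains).mkString(","))
      .collect()
      .foreach(println)

    sc.stop()
  }
}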

sortByKey in Spark

I am new to Spark and Scala and trying to sort a word-counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
No implicit view means there is no Scala function like the following defined:
implicit def SerializableToOrdered(x :java.io.Serializable) = new Ordered[java.io.Serializable](x) //note this function doesn't work
The reason this error comes up is that your function returns two different types whose common supertype is java.io.Serializable (one is a String, the other an Array[String]). sortByKey requires the key type to have an Ordering (i.e. to be viewable as Ordered), which java.io.Serializable does not. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function just returns Strings instead of two different types, so the key type is String and an implicit Ordering for it is available.
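To make the Ordering requirement concrete, here is a small self-contained sketch (the sample data is made up, not from the original post): sortByKey compiles as soon as the key type has an implicit Ordering, which String does.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SortByKeyDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SortByKeyDemo").setMaster("local[*]"))

    // The key type is String, so an implicit Ordering[String] is in scope and sortByKey compiles
    val counts = sc.parallelize(Seq("b", "a", "c", "a"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortByKey()

    println(counts.collect().mkString("\n"))   // (a,2), (b,1), (c,1)
    sc.stop()
  }
}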