Converting a Spark DataFrame column to List[String] in Scala

I am working on the MovieLens data set. In one of the csv files, the data is structured as:
movieId movieTitle genres
and genres is a list of |-separated values; the field is nullable.
I am trying to get a unique list of all the genres so that I can rearrange the data as following:
movieId movieTitle genre1 genre2 ... genreN
and a row whose genres value is genre1|genre2 will look like:
1 Title1 1 1 0 ... 0
So far, I have been able to read the csv file using the following code:
val conf = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context = new SparkContext(conf)
val sparkSession = SparkSession.builder()
  .appName(App.name)
  .config("header", "true")
  .config(conf = conf)
  .getOrCreate()
val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)
If I try something like:
movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()
Then I get the following exception:
java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1
Then, in addition, I tried providing the schema explicitly using the following code:
val moviesSchema: StructType = StructType(Array(StructField("movieId", StringType, nullable = true),
StructField("title", StringType, nullable = true),
StructField("genres", StringType, nullable = true)))
and tried:
val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)
and then I got the same exception.
Is there any way in which I can get the set of genres as a List or a Set so I can further massage the data into the desired format? Any help will be appreciated.

Here is how I got the set of genres:
val genreList: Array[String] = for (row <- movieFrame.select("genres").collect) yield row.getString(0)
val genres: Array[String] = for {
  g <- genreList
  genre <- g.split("\\|")
} yield genre
val genreSet: Set[String] = genres.toSet
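To go from that set to the requested movieId movieTitle genre1 ... genreN layout, here is a minimal sketch (assuming Spark 2.x column functions and that the column is actually named genres, as in moviesSchema above): fold over the genre set and add one 0/1 column per genre.
import org.apache.spark.sql.functions._
val oneHot = genreSet.foldLeft(movieFrame) { (df, genre) =>
  // 1 if the |-separated genres string contains this genre, else 0
  df.withColumn(genre, array_contains(split(col("genres"), "\\|"), genre).cast("int"))
}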

This worked to give an Array[Array[String]]:
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").distinct).collect()
To get Array[String]
val genres = genreLst.flatten
or
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").distinct).collect().flatten
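With the DataFrame API the same distinct set can be obtained without dropping to the RDD; a sketch, assuming Spark 2.x and a column named genres:
import org.apache.spark.sql.functions._
val genreSet: Set[String] = movieFrame
  .select(explode(split(col("genres"), "\\|")).as("genre")) // one row per genre
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet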

Related

Type Mismatch Spark Scala

I am trying to create an empty dataframe and use it in a function, but I keep getting the following error:
Required: DataFrame
Found: Dataset[DataFrame]
This is how I am doing it:
//Create empty DataFrame
val schema = StructType(
  StructField("g", StringType, true) ::
  StructField("tg", StringType, true) :: Nil)
var df1 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
//or
var df1 = spark.emptyDataFrame
Then I try to use it by calling a function, as you can see here:
df1 = kvrdd1_toDF.map(x => function1(x, df1))
And this is the function:
def function1(input: org.apache.spark.sql.Row, df: DataFrame): DataFrame = {
  val v1 = spark.sparkContext.parallelize(Seq("g", "tg"))
  var df3 = v1.toDF("g", "tg")
  if (df.take(1).isEmpty) {
    df3 = Seq((input.get(2), "nn")).toDF("g", "tg")
  } else {
    df3 = df3.union(df)
  }
  df3
}
What am I doing wrong?
You have a DataFrame, which is an alias for Dataset[Row]. You map each Row to a DataFrame, so you end up with a Dataset[DataFrame]. I don't know what you are trying to do, but it will never work: the function (and all its dependencies) you use to map the contents of a Dataset is serialized and distributed over your Spark cluster, and you can't use another DataFrame, a SparkSession, or a SparkContext inside such a function.
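If the goal is simply to grow a small two-column frame from the rows of another one, a rough sketch of a driver-side alternative (all names hypothetical, assuming spark.implicits._ is in scope and the value of interest sits at index 2, as in function1):
import spark.implicits._
var acc = spark.emptyDataFrame
for (row <- kvrdd1_toDF.collect()) { // bring the rows to the driver
  val next = Seq((row.get(2).toString, "nn")).toDF("g", "tg")
  acc = if (acc.take(1).isEmpty) next else next.union(acc)
}
Collecting to the driver only makes sense for small data, but it avoids referencing a DataFrame inside a distributed map.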

Reading ambiguous column name in Spark sql Dataframe using scala

I have duplicate columns in a text file. When I try to load that text file using Spark Scala code, it loads successfully into a data frame and I can see the first 20 rows with df.show().
Full code:
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.show()
Up to this point my code works fine. But as soon as I try the code below, it gives me an error.
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
So I have 2 columns with the same name in the text file, i.e. sequence.
How can I access those two columns? I tried sequence#2 but it is still not working.
Spark version: 1.6.0
Scala version: 2.10.5
Result of df.printSchema():
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)
I second #smart_coder's approach, though I have a slightly different one. Please find it below.
You need unique column names to query via hivesql.sql.
You can rename the columns dynamically using the code below.
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point, we need to remove the ambiguity; below is the solution:
var list = df.schema.map(_.name).toList
for (i <- 0 to list.size - 1) {
  val cont = list.count(_ == list(i))
  val col = list(i)
  if (cont != 1) {
    list = list.take(i) ++ List(col + i) ++ list.drop(i + 1)
  }
}
val df1 = df.toDF(list: _*)
// you would get the output below
Result of df1.printSchema():
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we get all the column names as a list, then check whether any column repeats more than once; if a column repeats, we append its index to its name, and then we create a new dataframe df1 from the list of renamed column names.
I have tested this in Spark 2.4, but it should work in 1.6 as well.
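A quick usage sketch with the renamed frame (the column names are the ones produced by the loop above):
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id, sequence1, sequence from Sample_File")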
The below code might help you to resolve your problem. I have tested this in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")

Error in creating dataframe: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

I have created a schema with the following code:
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created an RDD from:
val data = spark.sparkContext.textFile("cities.txt")
Converted it to an RDD of Row to apply the schema:
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me the error:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
You don't need a schema to create a Row; you need the schema when you create the DataFrame. You also need some logic to convert your split line (which produces 3 strings) into integers.
Here is a minimal solution without exception handling:
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
  val Array(city, female, male) = line.split(";")
  Row(
    city,
    female.toInt,
    male.toInt
  )
})
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case-classes to create a dataframe, because spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()

converting textFile to dataFrame dynamically

I am trying to convert input from a text file to dataframe using a schema file which is read at run time.
My input text file looks like this:
John,23
Charles,34
The schema file looks like this:
name:string
age:integer
This is what I tried:
object DynamicSchema {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val schemaFile = args(1)
    val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines().map(_.split(":")).map(l => l(0) -> l(1)).toMap
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Dynamic Schema")
      .getOrCreate()
    import spark.implicits._
    val input = spark.sparkContext.textFile(args(0))
    val schema = spark.sparkContext.broadcast(schemaLines)
    val nameToType = {
      Seq(IntegerType, StringType)
        .map(t => t.typeName -> t).toMap
    }
    println(nameToType)
    val fields = schema.value
      .map(field => StructField(field._1, nameToType(field._2), nullable = true)).toSeq
    val schemaStruct = StructType(fields)
    val rowRDD = input
      .map(_.split(","))
      .map(attributes => Row.fromSeq(attributes))
    val peopleDF = spark.createDataFrame(rowRDD, schemaStruct)
    peopleDF.printSchema()
    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")
    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")
    results.show()
  }
}
Though printSchema gives the desired result, results.show errors out. I think the age field actually needs to be converted using toInt. Is there a way to achieve this when the schema is only available at runtime?
Replace
val input = spark.sparkContext.textFile(args(0))
with
val input = spark.read.schema(schemaStruct).csv(args(0))
and move it after the schema definition.
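Putting it together, a minimal sketch of the rearranged part of main (reusing the fields built from the schema file, and relying on csv's default comma delimiter):
val schemaStruct = StructType(fields)
val peopleDF = spark.read.schema(schemaStruct).csv(inputFile)
peopleDF.printSchema()
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()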

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark versions < 2.0:
The easiest way is to use spark-csv: include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
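For example, a sketch with spark-csv (assuming the com.databricks:spark-csv package is on the classpath and a SQLContext named sqlContext):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")      // custom delimiter
  .option("header", "false")
  .option("inferSchema", "true") // extra pass over the data
  .load("file.txt")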
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
Here are different ways to create a DataFrame from a text file.
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
Raw text file:
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) => (a, b.toInt, c) }.toDF("name", "age", "city")
fileToDf.foreach(println(_))
Spark session without schema:
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder().appName("SparkSessionZipsExample")
  .config(conf).getOrCreate()
val df = sparkSess.read.option("header", "false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
Spark session with schema:
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header", "false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
Using SQL context:
import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map { x => org.apache.spark.sql.Row(x: _*) }
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not be able to convert it into a data frame until you import the implicit conversions.
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
Only after this can you convert it to a data frame:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc(amount: Int, types: String, id: Int) // columns and data types
import spark.implicits._
val df2 = df.map { rec =>
  val fields = rec.split(",") // assuming comma-delimited lines
  Abc(fields(0).toInt, fields(1), fields(2).toInt)
}
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A txt file with PIPE (|) delimited fields can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show()
You can read a file into an RDD and then assign a schema to it. Two common ways of creating the schema are using a case class or a Schema object [my preferred one]. Below are quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since a case class has a limitation of max 22 fields and this will be a problem if your file has more than 22 fields!