Reading ambiguous column names in a Spark SQL DataFrame using Scala

I have duplicate columns in a text file. When I load that text file with the Spark Scala code below, it loads successfully into a DataFrame and I can see the first 20 rows with df.show().
Full code:
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.show()
Up to this point my code works fine.
But as soon as I try the code below, it gives me an error:
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
So I have two columns with the same name in the text file, i.e. sequence.
How can I access those two columns? I tried sequence#2 but it still doesn't work.
Spark version: 1.6.0
Scala version: 2.10.5
Result of df.printSchema():
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)

I second #smart_coder's approach, though I have a slightly different approach myself. Please find it below.
You need unique column names to query through hivesql.sql.
You can rename the columns dynamically using the code below:
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point we need to remove the ambiguity; below is the solution:
var list = df.schema.map(_.name).toList
for (i <- 0 to list.size - 1) {
  val cont = list.count(_ == list(i))
  val col = list(i)
  if (cont != 1) {
    list = list.take(i) ++ List(col + i) ++ list.drop(i + 1)
  }
}
val df1 = df.toDF(list: _*)
You would get the output below as the result of df1.printSchema():
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we get all the column names as a list, then check whether any column name occurs more than once; if it does, we append the index to that column name. We then create a new DataFrame df1 from the list of renamed column names.
I have tested this in Spark 2.4, but it should work in 1.6 as well.
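To finish the round trip, here is a small usage sketch (not in the original answer) that re-registers the renamed frame under the same temp table name and runs the query that previously failed; the column names follow the df1 schema shown above:
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id, sequence1, sequence from Sample_File")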

The code below might help you resolve your problem. I have tested it in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")
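Note the trade-off between the two answers: this one hard-codes the new column names in a Seq, which is simple when you know the file layout up front, while the approach above derives the deduplicated names from df.schema, so it keeps working if the number or order of columns changes.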

Related

Scala explode followed by UDF on a dataframe fails

I have a Scala DataFrame with the following schema:
root
|-- time: string (nullable = true)
|-- itemId: string (nullable = true)
|-- itemFeatures: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to explode the itemFeatures column and then send my DataFrame to a UDF. But as soon as I include the explode, calling the UDF results in this error:
org.apache.spark.SparkException: Task not serializable
I can't figure out why.
Environment: Scala 2.11.12, Spark 2.4.4
Full example:
val dataList = List(
("time1", "id1", "map1"),
("time2", "id2", "map2"))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode("itemFeatures"))
val doNextThingUDF: UserDefinedFunction = udf(doNextThing _)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time"))
where my UDF looks like this:
val doNextThing(time: String): String = {
time+"blah"
}
If I remove the explode, everything works fine, and if I don't call the UDF after the explode, everything also works fine. I could imagine Spark being somehow unable to send each row to the UDF if it is dynamically executing the explode and doesn't know how many rows will exist, but even when I add, for example, dfExploded.cache() and dfExploded.count(), I still get the error. Is this a known issue? What am I missing?
I think the issue comes from how you define your doNextThing function. There are also a couple of typos in your full example.
In particular, the itemFeatures column is a String in your example; I understand it should be a Map.
But here is a working example:
val dataList = List(
  ("time1", "id1", Map("map1" -> 1)),
  ("time2", "id2", Map("map2" -> 2)))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode($"itemFeatures"))
val doNextThing = (time: String) => { time + "blah" }
val doNextThingUDF = udf(doNextThing)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time")))
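For what it's worth, a common cause of Task not serializable in this pattern, assuming doNextThing was originally a def on a class that also holds non-serializable state such as the SparkContext, is that eta-expanding the method with doNextThing _ closes over this, so Spark tries to serialize the whole enclosing instance. A standalone function value, as in the example above, carries no such outer reference. A purely hypothetical illustration (the Pipeline class is not from the original post):
class Pipeline(val sc: org.apache.spark.SparkContext) { // holds a non-serializable member
  def doNextThingDef(time: String): String = time + "blah" // udf(doNextThingDef _) drags `this` (and sc) into the closure
  val doNextThingFn: String => String = _ + "blah"          // udf(doNextThingFn) serializes only the lambda
}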

Update DataFrame col names from a list avoiding using var

I have a list of defined columns as:
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
  ExcelColumn("Products Selled", "text", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value")
)
And a file (a CSV with header columns Products Selled, Total Value) which is read as a DataFrame:
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)

// the csv file's header row matches the colName values
var finalDf = df
  .withColumn("row_id", monotonically_increasing_id)
  .select(cols.map(_.colName.trim).map(col): _*)

// convert df column names to colCodes (for Kudu table columns)
cols.foreach(c => finalDf = finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
In the last line, I rename the DataFrame columns from Products Selled to products_selled. Because of this, finalDf is a var.
I want to know whether there is a solution that lets me declare finalDf as a val instead of a var.
I tried something like the code below, but withColumnRenamed returns a new DataFrame, and I cannot capture that result outside cols.foreach:
cols.foreach(c => finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
You can rename columns using select. Renaming columns inside select is faster than foldLeft; check the referenced post for a comparison.
Try the code below.
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
  ExcelColumn("Products Selled", "string", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value")
)
val colExpr = cols.map(c => trim(col(c.colName)).as(c.colCode.trim))
If you are storing a valid Spark SQL data type in the ExcelColumn case class (e.g. "string", "int" as above), you can also cast to that type:
val colExpr = cols.map(c => trim(col(c.colName).cast(c.colType)).as(c.colCode.trim))
val finalDf = df.select(colExpr: _*)
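This answers the val-vs-var part of the question directly: the whole rename happens in a single projection, so there is nothing to mutate. A quick sanity check, as a sketch:
finalDf.printSchema()  // the columns are now named products_selled and total_value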
The better way is to use foldLeft with withColumnRenamed
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
  ExcelColumn("Products Selled", "text", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value")
)
val resultDF = cols.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c.colName.trim, c.colCode.trim)
}
Original Schema:
root
|-- Products Selled: integer (nullable = false)
|-- Total Value: string (nullable = true)
|-- value: integer (nullable = false)
New Schema:
root
|-- products_selled: integer (nullable = false)
|-- total_value: string (nullable = true)
|-- value: integer (nullable = false)

Processing Array[Byte] in spark dataframes

I have a DataFrame df1 with the schema below:
scala> df1.printSchema
root
|-- filecontent: binary (nullable = true)
|-- filename: string (nullable = true)
The DF has a filename and its content. The content is gzipped. I could use something like the code below to unzip the data in filecontent and save it to HDFS.
def decompressor(origRow: Row) = {
  val filename = origRow.getString(1)
  val filecontent = origRow.getAs[Array[Byte]](0)
  val unzippedData = new GZIPInputStream(new ByteArrayInputStream(filecontent))
  val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)
  val filenamePath = new Path(filename)
  val fos = hadoop_fs.create(filenamePath)
  org.apache.hadoop.io.IOUtils.copyBytes(unzippedData, fos, sc.hadoopConfiguration)
  fos.close()
}
My objective:
Since the filecontent column data in df1 is binary, i.e. Array[Byte], I shouldn't split the data up; I should keep it together and pass it to the function so that it can decompress it and save it to a file.
My Question:
How do I not distribute the data (column data)?
How do I make sure the processing happens one row at a time?
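There is no answer in this thread, but as a minimal sketch of one way to do it (not from the original post): pull the rows back to the driver with rdd.toLocalIterator, which fetches one partition at a time, and run the decompression there, so each file's Array[Byte] stays in one piece. Column access and the GZIP/HDFS plumbing mirror the question's own decompressor, and sc is assumed to be the SparkContext.
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.hadoop.fs.{FileSystem, Path}

df1.rdd.toLocalIterator.foreach { row =>
  val filename = row.getAs[String]("filename")
  val filecontent = row.getAs[Array[Byte]]("filecontent")
  val unzipped = new GZIPInputStream(new ByteArrayInputStream(filecontent))
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val out = fs.create(new Path(filename))
  org.apache.hadoop.io.IOUtils.copyBytes(unzipped, out, sc.hadoopConfiguration)
  out.close()
}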

Converting a Spark's DataFrame column to List[String] in Scala

I am working on the MovieLens data set. In one of the CSV files, the data is structured as:
movieId movieTitle genres
and genres is in turn a list of |-separated values; the field is nullable.
I am trying to get a unique list of all the genres so that I can rearrange the data as follows:
movieId movieTitle genre1 genre2 ... genreN
and a row whose genre is genre1|genre2 will look like:
1 Title1 1 1 0 ... 0
So far, I have been able to read the csv file using the following code:
val conf = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context = new SparkContext(conf)
val sparkSession = SparkSession.builder()
.appName(App.name)
.config("header", "true")
.config(conf = conf)
.getOrCreate()
val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)
If I try something like:
movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()
Then I get the following exception:
java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1
Then, in addition, I tried providing the schema explicitly using the following code:
val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))
and tried:
val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)
and then I got the same exception.
Is there any way in which I can get the set of genres as a List or a Set so I can further massage the data into the desired format? Any help will be appreciated.
Here is how I got the set of genres:
val genreList: Array[String] = for (row <- movieFrame.select("genres").collect) yield row.getString(0)
val genres: Array[String] = for {
  g <- genreList
  genre <- g.split("\\|")
} yield genre
val genreSet: Set[String] = genres.toSet
This worked to give an Array[Array[String]]
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect()
To get Array[String]
val genres = genreLst.flatten
or
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect().flatten
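The answers stop at collecting the distinct genres, but the question's goal is one indicator column per genre. As a minimal sketch building on genreSet from the first answer (not from the original answers; it uses a simple substring match on the raw genres field, so genres that are substrings of other genres would need a stricter check):
import org.apache.spark.sql.functions.{col, when}

// one 0/1 column per genre, appended to the original frame
val withGenreFlags = genreSet.foldLeft(movieFrame) { (df, genre) =>
  df.withColumn(genre, when(col("genres").contains(genre), 1).otherwise(0))
}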

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update: as of Spark 2.0 you can simply use the built-in CSV data source:
val spark: SparkSession = ... // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
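Since the file in the question is ';'-separated, you would likely also set the delimiter; as a small sketch (sep is the CSV reader's delimiter option in Spark 2.x):
val df = spark.read.option("sep", ";").option("header", "false").csv("file.txt")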
For Spark versions earlier than 2.0:
The easiest way is to use spark-csv: include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
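One caveat, also pointed out in another answer further down: toDF on an RDD only resolves with the SQL implicits in scope, e.g.:
import sqlContext.implicits._ // Spark 1.x; with a SparkSession in 2.x use: import spark.implicits._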
Here are different ways to create a DataFrame from a text file.
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
Raw text file:
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) => (a, b.toInt, c) }.toDF("name", "age", "city")
fileToDf.foreach(println(_))
Spark session without schema:
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder().appName("SparkSessionZipsExample").config(conf).getOrCreate()
val df = sparkSess.read.option("header", "false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
Spark session with schema:
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header", "false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
Using SQLContext:
import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map(x => org.apache.spark.sql.Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id: String, field2: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
You will not be able to convert it into a DataFrame until you import the implicit conversions.
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
Only after this can you convert it to a DataFrame:
case class Test(id: String, field2: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
val ds = spark.read.textFile("abc.txt") // Dataset[String]
case class Abc(amount: Int, types: String, id: Int) // columns and data types
val df2 = ds.map { line => val a = line.split(","); Abc(a(0).toInt, a(1), a(2).toInt) }.toDF()
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A text file with PIPE (|) delimited fields can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show
You can read a file into an RDD and then assign a schema to it. Two common ways of creating the schema are using a case class or a Schema object (my preferred one). Below are quick snippets of code that you may use.
Case Class approach
case class Test(id: String, name: String)
val myFile = sc.textFile("file.txt")
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since case classes were limited to a maximum of 22 fields in Scala 2.10, and this will be a problem if your file has more than 22 fields!