Spark DataFrame not respecting schema and considering everything as String

Spark DataFrame not respecting schema and considering everything as String - scala

I am facing a problem which I have failed to get over for ages now.
I am on Spark 1.4 and Scala 2.10. I cannot upgrade at this moment (big distributed infrastructure)
I have a file with few hundred columns, only 2 of which are string and rest all are Long. I want to convert this data into a Label/Features dataframe.
I have been able to get it into LibSVM format.
I just cannot get it into a Label/Features format.
The reason being
I am not being able to use the toDF() as shown here
https://spark.apache.org/docs/1.5.1/ml-ensembles.html
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
it it not supported in 1.4
So I first converted the txtFile into a DataFrame where I used something like this
def getColumnDType(columnName:String):StructField = {
if((columnName== "strcol1") || (columnName== "strcol2"))
return StructField(columnName, StringType, false)
else
return StructField(columnName, LongType, false)
}
def getDataFrameFromTxtFile(sc: SparkContext,staticfeatures_filepath: String,schemaConf: String) : DataFrame = {
val sfRDD = sc.textFile(staticfeatures_filepath)//
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// reads a space delimited string from application.properties file
val schemaString = readConf(Array(schemaConf)).get(schemaConf).getOrElse("")
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))
val data = sfRDD
.map(line => line.split(","))
.map(p => Row.fromSeq(p.toSeq))
var df = sqlContext.createDataFrame(data, schema)
//schemaString.split(" ").drop(4)
//.map(s => df = convertColumn(df, s, "int"))
return df
}
When I do a df.na.drop() df.printSchema() with this returned dataframe I get perfect Schema Like this
root
|-- rand_entry: long (nullable = false)
|-- strcol1: string (nullable = false)
|-- label: long (nullable = false)
|-- strcol2: string (nullable = false)
|-- f1: long (nullable = false)
|-- f2: long (nullable = false)
|-- f3: long (nullable = false)
and so on till around f300
But - the sad part is whatever I try to do (see below) with the df, I am always getting a ClassCastException (java.lang.String cannot be cast to java.lang.Long)
val featureColumns = Array("f1","f2",....."f300")
assertEquals(-99,df.select("f1").head().getLong(0))
assertEquals(-99,df.first().get(4))
val transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(df)
So - the bad is - even though the schema says Long - the df is still internally considering everything as String.
Edit
Adding a simple example
Say I have a file like this
1,A,20,P,-99,1,0,0,8,1,1,1,1,131153
1,B,23,P,-99,0,1,0,7,1,1,0,1,65543
1,C,24,P,-99,0,1,0,9,1,1,1,1,262149
1,D,7,P,-99,0,0,0,8,1,1,1,1,458759
and
sf-schema=f0 strCol1 f1 strCol2 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11
(column names really do not matter so you can disregard this details)
All I am trying to do is create a Label/Features kind of dataframe where my 3rd column becomes a label and the 5th to 11th columns become a feature [Vector] column. Such that I can follow the rest of the steps in https://spark.apache.org/docs/latest/ml-classification-regression.html#tree-ensembles.
I have cast the columns too like suggested by zero323
val types = Map("strCol1" -> "string", "strCol2" -> "string")
.withDefault(_ => "bigint")
df = df.select(df.columns.map(c => df.col(c).cast(types(c)).alias(c)): _*)
df = df.drop("f0")
df = df.drop("strCol1")
df = df.drop("strCol2")
But the assert and VectorAssembler still fails.
featureColumns = Array("f2","f3",....."f11")
This is whole sequence I want to do after I have my df
var transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(df)
transformeddf.show(2)
transformeddf = new StringIndexer()
.setInputCol("f1")
.setOutputCol("indexedF1")
.fit(transformeddf)
.transform(transformeddf)
transformeddf.show(2)
transformeddf = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(5)
.fit(transformeddf)
.transform(transformeddf)
The exception trace from ScalaIDE - just when it hits the VectorAssembler is as below
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at scala.math.Numeric$LongIsIntegral$.toDouble(Numeric.scala:117)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:436)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:75)
at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:56)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:72)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

You get ClassCastException because this is exactly what should happen. Schema argument is not used for automatic casting (some DataSources may use schema this way, but not methods like createDataFrame). It only declares what are the types of the values which are stored in the rows. It is you responsibility to pass data which matches the schema, not the other way around.
While DataFrame shows schema you've declared it is validated only on runtime, hence the runtime exception.If you want to transform data to specific you have cast data explicitly.
First read all columns as StringType:
val rows = sc.textFile(staticfeatures_filepath)
.map(line => Row.fromSeq(line.split(",")))
val schema = StructType(
schemaString.split(" ").map(
columnName => StructField(columnName, StringType, false)
)
)
val df = sqlContext.createDataFrame(rows, schema)
Next cast selected columns to desired type:
import org.apache.spark.sql.types.{LongType, StringType}
val types = Map("strcol1" -> StringType, "strcol2" -> StringType)
.withDefault(_ => LongType)
val casted = df.select(df.columns.map(c => col(c).cast(types(c)).alias(c)): _*)
Use assembler:
val transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(casted)
You can simply steps 1 and 2 using spark-csv:
// As originally
val schema = StructType(
schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))
val df = sqlContext
.read.schema(schema)
.format("com.databricks.spark.csv")
.option("header", "false")
.load(staticfeatures_filepath)
See also Correctly reading the types from file in PySpark

Related

Spark cast all LongType columns to IntegerType dynamically

I dont have schema information because, i have different tables and the dataframe will be created with their data. So i want to detect any LongType column and cast it to IntegerType.
My approach is the create new dataFrame with the new schema which LongType fields converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map(f => if(f.dataType.isInstanceOf[LongType]) StructField(name = f.name, dataType = IntegerType, nullable = f.nullable) else f)
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But i got this error, when i writing the data.
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there any solution to fix, or any another approach to handle this problem?
Spark version: 3.2.0
Scala version: 2.12

The easiest way to do this is to simply cast columns when necessary:
// cast long columns to integer
val columns = df.schema.map {
case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
case f => col(f.name)
}
// write modified columns
df.select(columns: _*).write.format("avro").save(destinationPath)

import org.apache.spark.sql.types.{IntegerType, LongType}
val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
if (tmpDF.schema(colName).dataType == LongType)
tmpDF.withColumn(colName, col(colName).cast(IntegerType))
else tmpDF
}

Reading ambiguous column name in Spark sql Dataframe using scala

I have duplicate columns in text file and when I try to load that text file using spark scala code, it gets loaded successfully into data frame and I can see the first 20 rows by df.Show()
Full Code:-
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.Show()
Till this point my code works fine.
But as soon as I try below code then it gives me error.
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
so I have 2 columns with same name in text file i.e sequence
How can I access that two columns.. I tried with sequence#2 but still not working
Spark Version:-1.6.0
Scala Version:- 2.10.5
result of df.printschema()
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)

I second #smart_coder's approach, I have a slightly different approach though. Please find it below.
You need to have unique column names to do query from hivesql.sql.
you can rename the column names dynamically by using below code:
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point, we need to remove ambiguity, below is the solution:
var list = df.schema.map(_.name).toList
for(i <- 0 to list.size -1){
val cont = list.count(_ == list(i))
val col = list(i)
if(cont != 1){
list = list.take(i) ++ List(col+i) ++ list.drop(i+1)
}
}
val df1 = df.toDF(list: _*)
// you would get the output as below:
result of df1.printschema()
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we are getting all the column names as a list, then checking if any column is repeating more than once,
if a column is repeating, we are appending the column name with the index, then we create a new dataframe d1 with the new list with renamed column names.
I have tested this in Spark 2.4, but it should work in 1.6 as well.

The below code might help you to resolve your problem. I have tested this in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")

DecimalType issue while creating Dataframe

While I am trying to create a dataframe using a decimal type it is throwing me the below error.
I am performing the following steps:
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.DataTypes._;
//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)
//Created a schema
val sch = StructType(StructField("COL1",StringType,true)::StructField("COL2",**DecimalType**,true)::Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)
df1 is getting created without any errors.But, when I issue as df1.collect() action, it is giving me the below error:
scala.MatchError: 0 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
test_file.txt content:
test1,0
test2,0.67
test3,10.65
test4,-10.1234567890
Is there any issue with the way that I am creating DecimalType?

You should have an instance of BigDecimal to convert to DecimalType.
val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()
The result looks like this:
[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
|-- COL1: string (nullable = true)
|-- COL2: decimal(15,10) (nullable = true)

When you read a file as sc.textFile it reads all the values as string, So error is due to applying the schema while creating dataframe
For this you can convert the second value to Decimal before applying schema
val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))
Or if you reading a cav file then you can use spark-csv to read csv file and provide the schema while reading the file.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For Spark > 2.0
spark.read
.option("header", true)
.schema(sch)
.csv(file)
Hope this helps!

A simpler way to solve your problem would be to load the csv file directly as a dataframe. You can do that like this:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false") // no header
.option("inferSchema", "true")
.load("/file/path/")
Or for Spark > 2.0:
val spark = SparkSession.builder.getOrCreate()
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "false") // no headers
.load("/file/path")
Output:
df.show()
+-----+--------------+
| _c0| _c1|
+-----+--------------+
|test1| 0|
|test2| 0.67|
|test3| 10.65|
|test4|-10.1234567890|
+-----+--------------+

Spark : ClassCastException while converting string into date

I am trying to read parquet file in dataframe with following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of date format and DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+----------------------+
| id| text| created_date|
+--------------------+--------------------+----------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date into DateType, do I need to change anything else?

The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", TimestampType, false)
))
val data = spark
.read
.schema(schema)
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
.parquet("s3a://thisisnotabucket")
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Date(fmt.parseDateTime(d).getMillis)
}
def cvtTs(d: String): java.sql.Timestamp = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}
sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))
sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")
val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this is an application, I recommend extracting the formatter into an object.
object DtFmt {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T‌'HH:mm:ss'Z'")
}

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issues since the elements in myFile1 RDD are now array type.
How can I solve this issue?

Update - as of Spark 1.6, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark version < 1.6:
The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"

I have given different ways to create DataFrame from text file
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map{case Array(a,b,c) =>
(a,b.toInt,c)}.toDF("name","age","city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess =
SparkSession.builder().appName("SparkSessionZipsExample")
.config(conf).getOrCreate()
val df = sparkSess.read.option("header",
"false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header",
"false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val fileRdd =
sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map{x
=> org.apache.spark.sql.Row(x:_*)}
val sqlDf = sqlCtx.createDataFrame(fileRdd,schema)
sqlDf.show()

If you want to use the toDF method, you have to convert your RDD of Array[String] into a RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()

You will not able to convert it into data frame until you use implicit conversion.
val sqlContext = new SqlContext(new SparkContext())
import sqlContext.implicits._
After this only you can convert this to data frame
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()

val df = spark.read.textFile("abc.txt")
case class Abc (amount:Int, types: String, id:Int) //columns and data types
val df2 = df.map(rec=>Amount(rec(0).toInt, rec(1), rec(2).toInt))
rdd2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)

A txt File with PIPE (|) delimited file can be read as :
df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")

I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines=lines.split(",")).map(arrays=>(ararys(0),arrays(1))).toDF("id","name").show

You can read a file to have an RDD and then assign schema to it. Two common ways to creating schema are either using a case class or a Schema object [my preferred one]. Follows the quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach since case class has a limitation of max 22 fields and this will be a problem if your file has more than 22 fields!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark DataFrame not respecting schema and considering everything as String - scala

Related

Spark cast all LongType columns to IntegerType dynamically

Reading ambiguous column name in Spark sql Dataframe using scala

DecimalType issue while creating Dataframe

Spark : ClassCastException while converting string into date

How to create a DataFrame from a text file in Spark

Categories

Resources