I am trying to read a CSV file in Scala using a Dataset, and then perform some operations on it, but my code is throwing an error.
Below is my code:
final case class AadharData(date:String,
registrar:String,
agency:String,
state:String,
district:String,
subDistrict:String,
pinCode:Int,
gender:String,
age:Int,
aadharGenerated:Int,
rejected:Int,
mobileNo:Double,
email:String)
val spark = SparkSession.builder().appName("GDP").master("local").getOrCreate()
import spark.implicits._
val a = spark.read.option("header", false).csv("D:\\BGH\\Spark\\aadhaar_data.csv").as[AadharData]
val b = a.map(rec => (rec.registrar, 1))
  .groupByKey(f => f._1)
  .count()
  .collect()
And I am getting the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`date`' given input columns: [_c0, _c2, _c1, _c3, _c5, _c8, _c9, _c7, _c6, _c11, _c12, _c10, _c4];
Any help is appreciated.
Thanks in advance.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [_c0, _c2, _c1, _c3, _c5, _c8, _c9, _c7, _c6, _c11, _c12, _c10, _c4];
The above error occurs because you set the header option to false (.option("header", false)), so Spark generates column names like _c0, _c1, and so on. When you then cast the generated DataFrame with a case class, the field names in the case class don't match the generated column names, hence the error.
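You can confirm the generated names by printing the schema of the raw read (a quick check; without inferSchema every column also comes back as a string):

val raw = spark.read.option("header", false).csv("D:\\BGH\\Spark\\aadhaar_data.csv")
raw.printSchema() // columns appear as _c0, _c1, ..., all of type string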
Solution
You should tell Spark SQL to use the names from the case class and to infer the schema as well:
val columnNames = classOf[AadharData].getDeclaredFields.map(x => x.getName)
val a = spark.read.option("header", false).option("inferSchema", true)
  .csv("D:\\BGH\\Spark\\aadhaar_data.csv").toDF(columnNames: _*).as[AadharData]
The above error should go away
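Alternatively, here is a sketch that derives the schema directly from the case class, so typed fields like pinCode: Int are enforced without relying on inferSchema:

import org.apache.spark.sql.Encoders

// Build a StructType whose names and types come from AadharData's fields
val schema = Encoders.product[AadharData].schema
val a = spark.read
  .option("header", false)
  .schema(schema)
  .csv("D:\\BGH\\Spark\\aadhaar_data.csv")
  .as[AadharData]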
Related
I am trying to convert a Row in a DataFrame to a case class and am getting the following error:
2019-08-19 20:13:08 Executor task launch worker for task 1 ERROR
Executor:91 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
be cast to Models.City
Sample Log = {"Id": "1","City": {"name": "A","state": "B"}}
Below is the code that reads a text file containing data in JSON format and throws the above error:
case class City(name: String, state: String)

import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val file = new File("src/test/resources/log.txt")
val logs = spark.read
  .text(file.getAbsolutePath)
  .select(col("value").as("body"))

import spark.implicits._
var logDF: DataFrame = spark.read.json(logs.as[String])
logDF.map(row => row.getAs[City]("City").state).show()
Basically, I cannot perform any operation on the DataFrame itself due to some restrictions.
So, given a Row, how can we cast it into a case class? (I cannot use pattern matching here, since the case class can have a lot of fields and nested case classes.)
Thanks in advance.
Any help is greatly appreciated!!
I had the same issue (Spark SQL 3.1.3). The solution was to use Spark to convert the DataFrame to a Dataset and then access the fields:
import spark.implicits._
var logDF: DataFrame = spark.read.json(logs.as[String])
logDF.select("City").as[City].map(city => city.state).show()
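If you need more than one field from each row, another option is to model the whole record as a nested case class and convert the entire DataFrame; a sketch based on the sample log above (Log is just an illustrative name):

import spark.implicits._

// Hypothetical wrapper matching the top-level structure of the sample log
case class Log(Id: String, City: City)

logDF.as[Log].map(log => log.City.state).show()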
I've got two Parquet files: one contains an integer field myField and the other contains a double field myField. When attempting to read both files at once
val basePath = "/path/to/file/"
val fileWithInt = basePath + "intFile.snappy.parquet"
val fileWithDouble = basePath + "doubleFile.snappy.parquet"
val result = spark.sqlContext.read.option("mergeSchema", true).option("basePath", basePath).parquet(Seq(fileWithInt, fileWithDouble): _*).select("myField")
I get the following error
Caused by: org.apache.spark.SparkException: Failed to merge fields 'myField' and 'myField'. Failed to merge incompatible data types IntegerType and DoubleType
When passing an explicit schema
val schema = StructType(Seq(new StructField("myField", IntegerType)))
val result = spark.sqlContext.read.schema(schema).option("mergeSchema", true).option("basePath", basePath).parquet(Seq(fileWithInt, fileWithDouble): _*).select("myField")
It fails with the following
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
When casting up to a double
val schema = StructType(Seq(new StructField("myField", DoubleType)))
I get
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
Does anyone know of any way around this problem other than reprocessing the source data?
Depending on the number of files you are going to read, you can use one of these two approaches.
This one works best for a smaller number of Parquet files:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.DoubleType

def merge(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  paths.par.map { path =>
    // Cast the conflicting column to Double in every file before unioning
    spark.read.parquet(path).withColumn("myField", $"myField".cast(DoubleType))
  }.reduce(_.union(_))
}
This approach is better for processing a large number of files, since it keeps the lineage short:
def merge2(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  spark.sparkContext.union(paths.par.map { path =>
    // Read each file, cast to Double, and drop down to an RDD so the union happens at the RDD level
    spark.read.parquet(path).withColumn("myField", $"myField".cast(DoubleType)).as[Double].rdd
  }.toList).toDF
}
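For the two files from the question, usage would then look roughly like this (note that merge2 returns its single column with the default name value, so rename it if you need myField back):

val merged = merge(spark, Seq(fileWithInt, fileWithDouble))
merged.select("myField").show()

val merged2 = merge2(spark, Seq(fileWithInt, fileWithDouble)).toDF("myField")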
I am a newbie in Scala, so please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""
// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
def inputKmeans(df: DataFrame,spark: SparkSession): DataFrame = {
try {
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
a
}
catch {
case e: java.lang.ClassCastException => spark.emptyDataFrame
}
}
val t = inputKmeans(df, spark).filter( _ != null )
t.foreach(r =>
if (r.get(0) != null)
println(r.get(0)))
For the moment, I want to ignore my conversion errors, but somehow I still get them.
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage
4.0 (TID 6) java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I don't think there is any point in giving a snapshot of the CSV. At this point, I just want to ignore conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is that the values are not of type Double.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either read the values with their correct data type, i.e. use getLong for the Long columns (you can also provide the schema explicitly using a case class and apply it to the DataFrame); a sketch of this is shown below.
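A minimal sketch of the first option, assuming start_ts, duration and ip_dist were inferred as Long (consistent with the java.lang.Long cannot be cast to java.lang.Double error) and id as Int, as in the original snippet:

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val a = df.select("id", "start_ts", "duration", "ip_dist")
  .map(r => (r.getInt(0), Vectors.dense(r.getLong(1).toDouble, r.getLong(2).toDouble, r.getLong(3).toDouble)))
  .toDF("id", "features")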
Or use the VectorAssembler to convert the columns into features. This is the easier and recommended approach:
import org.apache.spark.ml.feature.VectorAssembler
def inputKmeans(df: DataFrame,spark: SparkSession): DataFrame = {
val assembler = new VectorAssembler().setInputCols(Array("start_ts", "duration", "ip_dist")).setOutputCol("features")
val output = assembler.transform(df).select("id", "features")
output
}
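The resulting DataFrame can then go straight into KMeans, for example (a usage sketch; k = 3 is an arbitrary choice here):

import org.apache.spark.ml.clustering.KMeans

val features = inputKmeans(df, spark)
val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val model = kmeans.fit(features)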
I think I discovered the problem: the try/catch is placed at the level of the DataFrame creation, not at the level of the conversion. Consequently, it catches problems related to DataFrame creation, not conversion issues.
I am new to Spark and am using Scala to create a basic classifier. I am reading from a text file as a dataset and splitting it into training and test data sets. Then I'm trying to tokenize the training data, but it fails with the following error:
Caused by: java.lang.IllegalArgumentException: requirement failed: Input type must be string type but got ArrayType(StringType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.RegexTokenizer.validateInputType(Tokenizer.scala:149)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:110)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at com.classifier.classifier_app.App$.<init>(App.scala:91)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
... 1 more
The code is as follows:
val input_path = "path//to//file.txt"
case class Sentence(value: String)
val sentencesDS = spark.read.textFile(input_path).as[Sentence]
val Array(trainingData, testData) = sentencesDS.randomSplit(Array(0.7, 0.3))
val tokenizer = new Tokenizer()
.setInputCol("value")
.setOutputCol("words")
val pipeline = new Pipeline().setStages(Array(tokenizer, regexTokenizer, remover, hashingTF, ovr))
val model = pipeline.fit(trainingData)
How do I solve this? Any help is appreciated.
I have defined all the stages in the pipeline but haven't put them here in the code snippet.
The error was resolved when the order of execution in the pipeline was changed.
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF))
val model = pipeline.fit(trainingData)
The tokenizer was replaced with regexTokenizer.
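The underlying requirement is that each stage's input column type must match what the stage expects: a (Regex)Tokenizer needs the raw string column, and downstream stages consume its array output. A minimal sketch of consistent wiring (the label indexer and classifier stages from the question are omitted, and the intermediate column names are just placeholders):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF}

// Tokenize the raw string column produced by the Sentence case class
val regexTokenizer = new RegexTokenizer()
  .setInputCol("value")
  .setOutputCol("words")

// Remove stop words from the token array
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

// Hash the filtered tokens into a feature vector
val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(regexTokenizer, remover, hashingTF))
val model = pipeline.fit(trainingData)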
First I convert a CSV file to a Spark DataFrame using
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/usr/people.csv")
After that, when I type df and press return, I can see
res30: org.apache.spark.sql.DataFrame = [name: string, age: string, gender: string, deptID: string, salary: string]
Then I use df.registerTempTable("people") to convert df to a Spark SQL table.
But after that, when I type people, instead of getting the table I got
<console>:33: error: not found: value people
Is it because people is a temporary table?
Thanks
When you register a temp table using the registerTempTable command, it is available inside your SQLContext.
This means that the following is incorrect and will give you the error you are getting:
scala> people.show
<console>:33: error: not found: value people
To use the temp table, you'll need to call it via your sqlContext. Example:
scala> sqlContext.sql("select * from people")
Note: df.registerTempTable("df") will register a temporary table named df corresponding to the DataFrame df you apply the method on.
So persisting df won't persist the table but the DataFrame, even though the SQLContext will be using that DataFrame.
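You can also get the registered table back as a DataFrame by name, for example:

val peopleDF = sqlContext.table("people")
peopleDF.show()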
The above answer applies to Zeppelin too. If you want to use println to see the data, you have to collect it back to the driver first:
val querystrings = sqlContext.sql("""
  select visitorDMA, visitorIpAddress, visitorState, allRequestKV
  from {redacted}
  limit 1000""")

querystrings.collect.foreach(entry =>
  println(entry.getString(3))
)