This is the code I'm running:
scala> val telecom = sqlContext.read.format("csv").load("hdfs:///CDR.csv")
but I am getting an error:
<console>:23: error: not found: value sqlContext
val telecom = sqlContext.read.format("csv").load("hdfs:///CDR.csv")
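This is most likely because newer spark-shells (Spark 2.x and later) no longer predefine sqlContext; the predefined entry point is the SparkSession named spark. A minimal sketch of the equivalent read in such a shell (the header option is an assumption about the CDR file):
val telecom = spark.read
  .format("csv")
  .option("header", "true") // assumption: the CDR file has a header row
  .load("hdfs:///CDR.csv")
// If an SQLContext is still needed, it can be obtained from the session:
val sqlContext = spark.sqlContext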
I tried to install Spark on a Mac using Homebrew. I followed all the steps described in https://sparkbyexamples.com/spark/install-apache-spark-on-mac/ . However, when I try to validate the Spark installation from the shell, I get the following output. How do I fix this? I have already reinstalled, but nothing changed. Thank you.
scala> import spark.implicits._
import spark.implicits._
scala> val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
data: Seq[(String, String)] = List((Java,20000), (Python,100000), (Scala,3000))
scala> val df = data.toDF()
java.lang.NoSuchMethodError: 'boolean org.apache.spark.util.Utils$.isInRunningSparkTask()'
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:201)
at org.apache.spark.sql.types.DataType.sameType(DataType.scala:99)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.$anonfun$haveSameType$1(TypeCoercion.scala:157)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:157)
at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
at scala.collection.immutable.List.forall(List.scala:91)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.haveSameType(TypeCoercion.scala:157)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1124)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1119)
at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1130)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1129)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1134)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1134)
at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:306)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:316)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:245)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:300)
at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:261)
at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:261)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32)
... 49 elided
scala> df.show()
<console>:26: error: not found: value df
df.show()
^
That is the output I get from df.show().
I have already-built code for logistic regression using Apache Spark and Scala. Now I am going to create a jar file from it using IntelliJ IDEA, but I am getting some errors.
First I imported the data from a CSV file. Then I fitted a logistic regression model. After that I evaluated the model. Finally, I need to save the model evaluation results to a text file. I am getting an error when I try to write the model evaluation results to a file.
Here is my code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object class1 {
  def main(args: Array[String]): Unit = {
    val sc: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    // Read the input CSV (path passed as the first program argument)
    val df2 = sc.read.options(Map("inferSchema" -> "true", "sep" -> ",", "header" -> "true")).csv(args(0))

    // Hash the raw feature columns into a single vector column
    val hasher = new FeatureHasher().setInputCols(Array("x1", "x2")).setOutputCol("features")
    val transformed = hasher.transform(df2)

    // Fit a regularized logistic regression model on a 90/10 train/test split
    val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.1)
      .setElasticNetParam(0.6).setFeaturesCol("features").setLabelCol("automatic")
    val Array(train, test) = transformed.randomSplit(Array(0.9, 0.1))
    val lrModel = lr.fit(train)
    val result = lrModel.transform(test)

    // Evaluate accuracy on the test predictions
    val evaluator = new MulticlassClassificationEvaluator()
    evaluator.setLabelCol("automatic")
    evaluator.setMetricName("accuracy")
    val accuracy = evaluator.evaluate(result)

    accuracy.saveAsFiles(args(1)) // this is the line the compiler rejects
  }
}
My error is as follows:
[error] C:\Spark\src\main\scala\WordCount.scala:39:14: value saveAsFiles is not a member of Double
[error] accuracy.saveAsFiles(args(1))
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
This error implies that I cannot use saveAsFiles on a Double value.
Can someone help me understand how to fix this?
Thank you.
accuracy is no longer a DataFrame. It's just a plain Double. You can use regular Scala code to save it to a file, e.g.
Files.write(..., accuracy.toString)
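A minimal sketch of that approach using java.nio, assuming the output path is the one passed in as args(1) and that a local path is acceptable:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write the accuracy value as plain text to the path given by args(1)
Files.write(Paths.get(args(1)), accuracy.toString.getBytes(StandardCharsets.UTF_8))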
I started using AWS Glue to read data via the Data Catalog and GlueContext and to transform it as per my requirements.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line is getting an error
While executing the above code I got the following error:
Error: Syntax Error: error: value toDf is not a member of com.amazonaws.services.glue.DynamicFrame
val revisedDF = stGBDyf.toDf()
one error found
I followed this example to convert the DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with a capital F in toDF:
val revisedDF = stGBDyf.toDF()
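After that fix the conversion should compile, and the resulting DataFrame can be used like any other; for example (the format is just illustrative, reusing outputDir from the question):
val revisedDF = stGBDyf.toDF()
revisedDF.printSchema()
// e.g. persist the converted data to the S3 output directory defined earlier
revisedDF.write.mode("overwrite").parquet(outputDir)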
This feels sort of silly, but I am migrating from Spark 1.6.1 to Spark 2.0.2. I was using the Databricks CSV library, and am now attempting to use the built-in CSV DataFrameWriter.
The following code:
// Get an SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
var sTS = lTimestampToSummarize.toString()
val sS3InputPath = "s3://measurements/" + sTS + "/*"
// Read all measurements - note that all subsequent ETLs will reuse dfRaw
val dfRaw = sqlContext.read.json(sS3InputPath)
// Filter just the user/segment timespent records
val dfSegments = dfRaw.filter("segment_ts <> 0").withColumn("views", lit(1))
// Aggregate views and timespent per user/segment tuples
val dfUserSegments : DataFrame = dfSegments.groupBy("company_id", "division_id", "department_id", "course_id", "user_id", "segment_id")
.agg(sum("segment_ts").alias("segment_ts_sum"), sum("segment_est").alias("segment_est_sum"), sum("views").alias("segment_views"))
// The following will write CSV files to the S3 bucket
val sS3Output = "s3://output/" + sTS + "/usersegment/"
dfUserSegments.write.csv(sS3Output)
Returns this error:
[error] /home/Spark/src/main/scala/Example.scala:75: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.sql.DataFrame
[error] (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error] dfUserSegments.write.csv(sS3Output)
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 2 s, completed Jun 5, 2017 5:00:12 PM
I know I must be interpreting the error wrong, because I explicitly declared dfUserSegments to be a DataFrame, and yet the compiler is telling me it is of type Unit (i.e., no value).
Any help is appreciated.
You don't show the whole method, but my guess is that the method's declared return type is DataFrame, while the last statement in the method is dfUserSegments.write.csv(sS3Output), and csv's return type is Unit.
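A minimal sketch of one way to satisfy the compiler under that assumption (the method name and signature here are hypothetical): keep the write as a side effect and make the DataFrame the last expression.
import org.apache.spark.sql.DataFrame

def summarizeUserSegments(dfUserSegments: DataFrame, sS3Output: String): DataFrame = {
  // csv(...) returns Unit, so it cannot be the last expression of a method
  // whose declared return type is DataFrame
  dfUserSegments.write.csv(sS3Output)
  dfUserSegments // returning the DataFrame itself satisfies the declared return type
}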
I am using Scala, Spark, IntelliJ, and Maven.
I have used the below code:
val joinCondition = when($"exp.fnal_expr_dt" >= $"exp.nonfnal_expr_dt",
$"exp.manr_cd"===$"score.MANR_CD")
val score = exprDF.as("exp").join(scoreDF.as("score"),joinCondition,"inner")
and
val score= list.withColumn("scr", lit(0))
But when I try to build using Maven, I get the below errors:
error: not found: value when
and
error: not found: value lit
For $ and === I have used import sqlContext.implicits.StringToColumn and it is working fine; no error occurred at the time of the Maven build. But for lit(0) and when, what do I need to import, or is there another way to resolve the issue?
Let's consider the following context :
val spark: SparkSession = ??? // or val sqlContext: SQLContext = new SQLContext(sc) for 1.x
val list: DataFrame = ???
To use when and lit, you'll need to import the proper functions :
import org.apache.spark.sql.functions.{col, lit, when}
Now you can use them as follows:
list.select(when(col("column_name").isNotNull, lit(1)))
Now you can also use lit in your code:
val score = list.withColumn("scr", lit(0))
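The same import also covers the when(...) join condition from the question, for example (exprDF and scoreDF are the DataFrames from the question; spark.implicits._ provides the $"..." column syntax):
import org.apache.spark.sql.functions.when
import spark.implicits._

val joinCondition = when($"exp.fnal_expr_dt" >= $"exp.nonfnal_expr_dt",
  $"exp.manr_cd" === $"score.MANR_CD")
val score = exprDF.as("exp").join(scoreDF.as("score"), joinCondition, "inner")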