Issue with headers on scala from a csv file - scala

I'm trying to upload a csv file using Scala and Apache Spark but, once I specify the schema with a Spark Structype I have this issue trying to indicate the headers of the csv file-
scala> import org.apache.spark
import org.apache.spark
scala> import org.apache.spark.sql
import org.apache.spark.sql
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.sql.types
import org.apache.spark.sql.types
scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions
scala> import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.clustering.KMeans
scala> import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.evaluation.ClusteringEvaluator
scala> import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler
scala> val sqlContext = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext#f24a84
scala> import sqlContext.implicits
import sqlContext.implicits
scala> import sqlContext
| val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
<console>:2: error: '.' expected but ';' found.
val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))

There is small typo error in your code. If you see your code carefully you will find below mistake
scala> import sqlContext
| val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
Everywhere you are typing new line of code only after scala> but in above code you are typing after |
So just type you code like below
scala> import sqlContext._
scala> val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))

Related

Cannot convert an RDD to Dataframe

I've converted a dataframe to an RDD:
val rows: RDD[Row] = df.orderBy($"Date").rdd
And now I'm trying to convert it back:
val df2 = spark.createDataFrame(rows)
But I'm getting an error:
Edit:
rows.toDF()
Also produces an error:
Cannot resolve symbol toDF
Even though I included this line earlier:
import spark.implicits._
Full code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util._
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd._
object Playground {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("Playground")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
val df = spark.read.csv("D:/playground/mre.csv")
df.show()
val rows: RDD[Row] = df.orderBy($"Date").rdd
val df2 = spark.createDataFrame(rows)
rows.toDF()
}
}
Your IDE is right, SparkSession.createDataFrame needs a second parameter: either a bean class or a schema.
This will fix your problem:
val df2 = spark.createDataFrame(rows, df.schema)

Packaging scala class on databricks (error: not found: value dbutils)

Trying to make a package with a class
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import spark.implicits._
class Log{
...
}
Everything runs fine on same notebook, but once I try to create package that I could use in other notebooks I get errors:
<notebook>:11: error: not found: object spark
import spark.implicits._
^
<notebook>:21: error: not found: value dbutils
val notebookPath = dbutils.notebook.getContext().notebookPath.get
^
<notebook>:22: error: not found: value dbutils
val userName = dbutils.notebook.getContext.tags("user")
^
<notebook>:23: error: not found: value dbutils
val userId = dbutils.notebook.getContext.tags("userId")
^
<notebook>:41: error: not found: value spark
var rawMeta = spark.read.format("json").option("multiLine", true).load("/FileStore/tables/xxx.json")
^
<notebook>:42: error: value $ is not a member of StringContext
.filter($"Name".isin(readSources))
Anyone knows how to package this class with these libs?
Assuming you are running Spark 2.x, the statement import spark.implicits._ only works when you have SparkSession object in the scope. The object Implicits is defined inside the SparkSession object. This object extends the SQLImplicits from previous verisons of spark Link to SparkSession code on Github. You can check the link to verify
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
class Log{
val spark: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
import spark.implicits._
...[rest of the code below]
}

spark streaming not able to use spark sql

I am facing an issue during spark streaming. I am getting empty records after it gets streamed and passes to the "parse" method.
My code:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders
import org.apache.spark.streaming._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType,
IntegerType}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType,
IntegerType}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import java.util.regex.Pattern
import java.util.regex.Matcher
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql._
val conf = new SparkConf().setAppName("streamHive").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val sc=ssc.sparkContext
val lines = ssc.textFileStream("file:///home/sadr/testHive")
case class Prices(name: String, age: String,sex: String, location: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
def parse (rdd : org.apache.spark.rdd.RDD[String] ) =
{
var l = rdd.map(_.split(","))
val prices = l.map(p => Prices(p(0),p(1),p(2),p(3)))
val pricesDf = sqlContext.createDataFrame(prices)
pricesDf.registerTempTable("prices")
pricesDf.show()
var x = sqlContext.sql("select count(*) from prices")
x.show()}
lines.foreachRDD { rdd => parse(rdd)}
lines.print()
ssc.start()
My input file:
cat test1.csv
Riaz,32,M,uk
tony,23,M,india
manu,33,M,china
imart,34,F,AUS
I am getting this output:
lines.foreachRDD { rdd => parse(rdd)}
lines.print()
ssc.start()
scala> +----+---+---+--------+
|name|age|sex|location|
+----+---+---+--------+
+----+---+---+--------+
I am using Spark version 2.3....I AM GETTING FOLLOWING ERROR AFTER ADDING X.SHOW()
Not sure if you are actually able to read the streams.
textFileStream reads only the new files added to the directory after the program starts and not the existing ones. Was the file already there?
If yes, remove it from the directory, start the program and copy the file again?

value na is not a member of?

hello i just started to learn scala.
and just follow the tutorial in udemy.
i was followed the same code but give me an error.
i have no idea about that error.
and this my code
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import org.apache.log4j._
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().getOrCreate()
val data = spark.read.option("header","true").
option("inferSchema","true").
option("delimiter","\t").
format("csv").
load("dataset.tsv").
withColumn("subject", split($"subject", " "))
val logRegDataAll = (data.select(data("label")).as("label"),$"subject")
val logRegData = logRegDataAll.na.drop()
and give me error like this
scala> :load LogisticRegression.scala
Loading LogisticRegression.scala...
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import org.apache.log4j._
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#1efcba00
data: org.apache.spark.sql.DataFrame = [label: string, subject: array<string>]
logRegDataAll: (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName) = ([label: string],subject)
<console>:43: error: value na is not a member of (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName)
val logRegData = logRegDataAll.na.drop()
^
thanks for helping
You can see clearly
val logRegDataAll = (data.select(data("label")).as("label"),$"subject")
This returns
(org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName)
So there is an extra parantheses ) data("label")) which should be data.select(data("label").as("label"),$"subject") in actual.

value lookup is not a member of org.apache.spark.rdd.RDD[(String, String)]

I have got a problem when I tired to compile my scala program with SBT.
I have import the class I need .Here is part of my code.
import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import java.io.IOException
import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.rdd.PairRDDFunctions
import scala.util.Random
......
val data=sc.textFile(path)
val kv=data.map{s=>
val a=s.split(",")
(a(0),a(1))
}.cache()
kv.first()
val start=System.currentTimeMillis()
for(tg<-target){
kv.lookup(tg.toString)
}
The error detail is :
value lookup is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] kv.lookup(tg.toString)
What confused me is I have import import org.apache.spark.rdd.PairRDDFunctions,
but it doesn't work . And when I run this in Spark-shell ,it runs well.
import org.apache.spark.SparkContext._
to have access to the implicits that let you use PairRDDFunctions on a RDD of type (K,V).
There's no need to directly import PairRDDFunctions