Error when parsing a line from the data into the class - Spark MLlib (Scala)

I have implemented this code:
scala> import org.apache.spark._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.util.IntParam
import org.apache.spark.util.IntParam
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.graphx.util.GraphGenerators
scala> case class Transactions(ID:Long,Chain:Int,Dept:Int,Category:Int,Company:Long,Brand:Long,Date:String,ProductSize:Int,ProductMeasure:String,PurchaseQuantity:Int,PurchaseAmount:Double)
defined class Transactions
When I try to run this:
def parseTransactions(str:String): Transactions = {
| val line = str.split(",")
| Transactions(line(0),line(1),line(2),line(3),line(4),line(5),line(6),line(7),line(8),line(9),line(10))
| }
I am getting this error:
<console>:38: error: type mismatch;
 found   : String
 required: Long
Does anyone know why I'm getting this error? I am doing a social network analysis over the schema above.
Many thanks!

str.split(",") returns an Array[String], so every element is a String. Convert each element to the appropriate type before passing it to the case class constructor, for example:
val line = str.split(",")
line(0).toLong
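A minimal sketch of the parser with those conversions applied to the Transactions schema above (Date and ProductMeasure stay as Strings):
def parseTransactions(str: String): Transactions = {
  // split the CSV line; every element of `line` is still a String here
  val line = str.split(",")
  Transactions(line(0).toLong, line(1).toInt, line(2).toInt, line(3).toInt,
    line(4).toLong, line(5).toLong, line(6), line(7).toInt,
    line(8), line(9).toInt, line(10).toDouble)
}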

Related

Packaging a Scala class on Databricks (error: not found: value dbutils)

I'm trying to make a package with a class:
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import spark.implicits._
class Log{
...
}
Everything runs fine in the same notebook, but once I try to create a package that I could use in other notebooks I get errors:
<notebook>:11: error: not found: object spark
import spark.implicits._
^
<notebook>:21: error: not found: value dbutils
val notebookPath = dbutils.notebook.getContext().notebookPath.get
^
<notebook>:22: error: not found: value dbutils
val userName = dbutils.notebook.getContext.tags("user")
^
<notebook>:23: error: not found: value dbutils
val userId = dbutils.notebook.getContext.tags("userId")
^
<notebook>:41: error: not found: value spark
var rawMeta = spark.read.format("json").option("multiLine", true).load("/FileStore/tables/xxx.json")
^
<notebook>:42: error: value $ is not a member of StringContext
.filter($"Name".isin(readSources))
Does anyone know how to package this class with these libraries?
Assuming you are running Spark 2.x, the statement import spark.implicits._ only works when you have a SparkSession object in scope. The implicits object is defined inside the SparkSession class and extends SQLImplicits from earlier versions of Spark; you can check the SparkSession source on GitHub to verify this.
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
class Log{
val spark: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
import spark.implicits._
...[rest of the code below]
}
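For completeness, a hypothetical sketch of using the packaged class from another notebook once the compiled jar is attached to the cluster (the fully qualified name follows the package declaration above):
// in another notebook, with the packaged jar attached to the cluster
import x.y.Log.Log

val log = new Log()  // the class obtains its own SparkSession via getOrCreate()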

Issue with headers in Scala from a CSV file

I'm trying to load a CSV file using Scala and Apache Spark, but once I specify the schema with a Spark StructType I have this issue when trying to indicate the headers of the CSV file:
scala> import org.apache.spark
import org.apache.spark
scala> import org.apache.spark.sql
import org.apache.spark.sql
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.sql.types
import org.apache.spark.sql.types
scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions
scala> import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.clustering.KMeans
scala> import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.evaluation.ClusteringEvaluator
scala> import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler
scala> val sqlContext = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@f24a84
scala> import sqlContext.implicits
import sqlContext.implicits
scala> import sqlContext
| val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
<console>:2: error: '.' expected but ';' found.
val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
There is a small typo in your code. If you look at your code carefully you will find the mistake below:
scala> import sqlContext
| val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
Everywhere else you type a new line of code only after the scala> prompt, but in the code above you typed it after the | continuation prompt.
So just type your code like below:
scala> import sqlContext._
scala> val schema = StructType(Array(StructField("ID_CALLE",IntegerType,true),StructField("TIPO", IntegerType, true),StructField("CALLE",IntegerType,true),StructField("NUMERO",IntegerType,true), StructField("LONGITUD",DoubleType,true),StructField("LATITUD",DoubleType,true),StructField("TITULO",IntegerType,true)))
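Note also that StructType, StructField, IntegerType and DoubleType live in org.apache.spark.sql.types, so that import is needed before the schema line. With the schema defined, a minimal sketch of reading the CSV with a header row (the file path is hypothetical):
scala> import org.apache.spark.sql.types._   // StructType, StructField, IntegerType, DoubleType
scala> val df = sqlContext.read.format("csv").option("header", "true").schema(schema).load("calles.csv")
scala> df.printSchema()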

value na is not a member of?

Hello, I just started to learn Scala and am following a tutorial on Udemy.
I followed the same code, but it gives me an error and I have no idea what it means.
Here is my code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import org.apache.log4j._
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().getOrCreate()
val data = spark.read.option("header","true").
option("inferSchema","true").
option("delimiter","\t").
format("csv").
load("dataset.tsv").
withColumn("subject", split($"subject", " "))
val logRegDataAll = (data.select(data("label")).as("label"),$"subject")
val logRegData = logRegDataAll.na.drop()
and it gives me an error like this:
scala> :load LogisticRegression.scala
Loading LogisticRegression.scala...
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import org.apache.log4j._
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1efcba00
data: org.apache.spark.sql.DataFrame = [label: string, subject: array<string>]
logRegDataAll: (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName) = ([label: string],subject)
<console>:43: error: value na is not a member of (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName)
val logRegData = logRegDataAll.na.drop()
^
Thanks for helping.
You can see clearly that
val logRegDataAll = (data.select(data("label")).as("label"),$"subject")
This returns
(org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.ColumnName)
So there is an extra parenthesis: data.select(data("label")).as("label") closes the select too early; it should actually be data.select(data("label").as("label"), $"subject").
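With the fix applied, a minimal sketch of the corrected lines (otherwise the same as the original code):
// .as("label") now applies to the column, and both columns sit inside one select,
// so logRegDataAll is a DataFrame and .na.drop() resolves
val logRegDataAll = data.select(data("label").as("label"), $"subject")
val logRegData = logRegDataAll.na.drop()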

How to do grouped full imports in Scala?

So these are all valid ways to import in Scala.
scala> import scala.util.matching.Regex
import scala.util.matching.Regex
scala> import scala.util.matching._
import scala.util.matching._
scala> import scala.util.matching.{Regex, UnanchoredRegex}
import scala.util.matching.{Regex, UnanchoredRegex}
But how do you do a valid grouped full import?
scala> import scala.util.{control._, matching._}
<console>:1: error: '}' expected but '.' found.
import scala.util.{control._, matching._}
^
You can't use an import sub-expression as an import selector. According to the specification on Import Clauses:
The most general form of an import expression is a list of import selectors
{ x1 => y1,…,xn => yn, _ }
Regarding your question, the closest one-liner is:
scala> import scala.util._, control._, matching._
import scala.util._
import control._
import matching._

value lookup is not a member of org.apache.spark.rdd.RDD[(String, String)]

I have got a problem when I tried to compile my Scala program with SBT.
I have imported the class I need. Here is part of my code:
import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import java.io.IOException
import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.rdd.PairRDDFunctions
import scala.util.Random
......
val data=sc.textFile(path)
val kv=data.map{s=>
val a=s.split(",")
(a(0),a(1))
}.cache()
kv.first()
val start=System.currentTimeMillis()
for(tg<-target){
kv.lookup(tg.toString)
}
The error detail is:
value lookup is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] kv.lookup(tg.toString)
What confuses me is that I have imported org.apache.spark.rdd.PairRDDFunctions,
but it doesn't work. And when I run this in the Spark shell, it runs fine.
Add
import org.apache.spark.SparkContext._
to have access to the implicits that let you use PairRDDFunctions on an RDD of type (K, V). There's no need to import PairRDDFunctions directly.
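A minimal sketch of the relevant part of the SBT program with that import added (everything else mirrors the question's code; path and target are defined elsewhere as in the original):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // implicit conversion from RDD[(K, V)] to PairRDDFunctions

val data = sc.textFile(path)
val kv = data.map { s =>
  val a = s.split(",")
  (a(0), a(1))
}.cache()

for (tg <- target) {
  kv.lookup(tg.toString)   // compiles now that the implicits are in scope
}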