Cannot resolve reference StructField with such signature - scala

I've copied a working example and changed it a little, but the core is the same, and I always get this error at the StructField call:
cannot resolve reference StructField with such signature
It also gives me this one, inside the signature:
Type mismatch, expected: DataType, actual: StringType
Here is the part of my code where I have problems:
import org.apache.avro.generic.GenericData.StringType
import org.apache.spark
import org.apache.spark.sql.types.StructField
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types._

object Test {
  def main(args: Array[String]): Unit = {
    val file = "/home/ubuntu/spark/MyFile"
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)

    val read = sc.textFile(file)
    val header = read.first().toString

    // generate schema from first csv row
    val fields = header.split(";").map(fieldName => StructField(fieldName.trim, StringType, true))
    val schema = StructType(fields)
  }
}
I cannot understand where I'm going wrong.
I'm using Spark version 2.0.0.
Thanks

It looks like GenericData.StringType is an issue. Use an alias:
import org.apache.avro.generic.GenericData.{StringType => AvroStringType}
or remove this import (you don't use it).
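For reference, here is a minimal sketch of the corrected snippet with the Avro import removed, so that StringType resolves to org.apache.spark.sql.types.StringType and the StructField call compiles:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)

    val read = sc.textFile("/home/ubuntu/spark/MyFile")
    val header = read.first()

    // StringType now refers to the Spark SQL DataType, so this signature matches
    val fields = header.split(";").map(fieldName => StructField(fieldName.trim, StringType, nullable = true))
    val schema = StructType(fields)
  }
}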

Related

Spark Scala getting null pointer exception

I'm trying to get mass elevation data from a TIFF image, and I have a CSV file that contains latitude, longitude and other attributes. I loop through the CSV file, get the latitude and longitude, and call the elevation method; the code is given below. Reference: RasterFrames extracting location information problem
package main.scala.sample

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col

object SparkSQLExample {
  def main(args: Array[String]) {
    implicit val spark = SparkSession.builder()
      .master("local[*]").appName("RasterFrames")
      .withKryoSerialization.getOrCreate().withRasterFrames
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
    val rf = spark.read.raster.from(example).load()

    val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
      val extent = extentEnc.to[Extent]
      Raster(tile, extent).getDoubleValueAtPoint(point)
    })

    val spark_file: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark_file.sparkContext.setLogLevel("ERROR")

    println("spark read csv files from a directory into RDD")
    val rddFromFile = spark_file.sparkContext.textFile("point.csv")
    println(rddFromFile.getClass)

    def customF(str: String): String = {
      val lat = str.split('|')(2).toDouble
      val long = str.split('|')(3).toDouble
      val point = st_makePoint(long, lat)
      val test = rf.where(st_intersects(rf_geometry(col("proj_raster")), point))
        .select(rf_value_at_point(rf_extent(col("proj_raster")), rf_tile(col("proj_raster")), point) as "value")
      return test.toString()
    }

    val rdd2 = rddFromFile.map(f => customF(f))
    rdd2.foreach(t => println(t))

    spark.stop()
  }
}
When I run it I get a null pointer exception; any help is appreciated.
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:64)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3416)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1490)
at org.apache.spark.sql.Dataset.where(Dataset.scala:1518)
at main.scala.sample.SparkSQLExample$.main$scala$sample$SparkSQLExample$$customF$1(SparkSQLExample.scala:49)
The function that is being mapped over the RDD (customF) is not null safe. Try calling customF(null) and see what happens. If it throws an exception, then you will have to make sure that rddFromFile doesn't contain any null/missing values.
It is a little hard to tell if that is exactly where the issue is. I think the stack trace of the exception is less helpful than usual because the function is being run in Spark tasks on the workers.
If that is the issue, you could rewrite customF to handle the case where str is null, or change the parameter type to Option[String] and tweak the logic accordingly (see the sketch after this list).
By the way, the same thing applies to UDFs. They need to either
- accept Option types as input,
- handle the case where each argument is null, or
- only be applied to data with no missing values.
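As a minimal, self-contained sketch of the Option[String] variant (only the parsing part; the helper name is illustrative and the pipe-delimited layout with latitude and longitude in fields 2 and 3 is assumed from the question):

import scala.util.Try

// Null-safe parsing sketch: null lines become None, malformed rows are skipped
def parseLatLong(str: Option[String]): Option[(Double, Double)] =
  str.flatMap { s =>
    val parts = s.split('|')
    for {
      lat  <- Try(parts(2).trim.toDouble).toOption
      long <- Try(parts(3).trim.toDouble).toOption
    } yield (lat, long)
  }

// Usage: wrap each line in Option(...) so a null line becomes None
// val coords = rddFromFile.map(line => parseLatLong(Option(line)))
//                         .collect { case Some((lat, long)) => (lat, long) }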

how to fix Scala error with "Not found type"

I'm a newbie in Scala, just trying to learn it with Spark. I'm writing a Scala app to load a CSV file from Hadoop into a DataFrame, and then I want to add a new column to that DataFrame. There is a function to populate the content of that new column; for testing, the function just uppercases the column from the CSV file. The CSV file only contains one column, emp_id, and it's a string. The function is defined in the object TestService. My IDE is Eclipse. Now I have the error: not found: type TestService
I'd appreciate it if anyone can help me.
This is the main:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import com.poc.spark.service.TestService

object SparkIntTest {
  def main(args: Array[String]) {
    sys.props.+=(("hadoop.home.dir", "C:\\OpenSource\\Hadoop"))

    val sparkConf = new SparkConf().setMaster("local").setAppName("employee").set("spark.testing.memory", "2147480000")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder().appName("employee").getOrCreate()

    val df = spark.read.option("header", "true").csv(".\\src\\main\\resources\\employee.csv")
    df.show()
    println(df.schema)

    val df_Applied = df.withColumn("award_rule", runAllRulesUDF(df("emp_id")))
    df_Applied.show()
    println(df_Applied.schema)
  }

  def runAllRulesUDF = udf(new TestService().runAllRulesForUDF(_: String))
}
Here is the object TestService:
package com.poc.spark.service

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._

object TestService {
  def runAllRulesForUDF(empid: String): String = {
    empid.toUpperCase()
  }
}
TestService is an object, which means that it is a statically created singleton. So instead of
new TestService()
you can just write
TestService
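For example, the UDF definition in SparkIntTest would then reference the singleton directly; a minimal sketch of that one-line change:

// Call the method on the TestService singleton instead of instantiating it
def runAllRulesUDF = udf(TestService.runAllRulesForUDF(_: String))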

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it was asking for an org.apache.spark.sql.Encoder. The data type I am using for the Dataset is an object of the case class Tp(s1: String, s2: String, s3: String).
All you need is to import the implicit encoders from the SparkSession instance before you create the empty Dataset: import spark.implicits._
See the full example here:
EmptyDataFrame
package com.examples.sparksql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object EmptyDataFrame {
  def main(args: Array[String]) {
    // Create Spark Conf
    val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")

    // Create Spark Context - sc
    val sc = new SparkContext(sparkConf)

    // Create Sql Context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Import Sql Implicit conversions
    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    // Create Schema RDD
    val schema_string = "name,id,dept"
    val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)))

    // Create Empty DataFrame
    val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)

    // Some Operations on Empty Data Frame
    empty_df.show()
    println(empty_df.count())

    // You can register a Table on an Empty DataFrame, it's an empty table though
    empty_df.registerTempTable("empty_table")

    // let's check it ;)
    val res = sqlContext.sql("select * from empty_table")
    res.show
  }
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()
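And if you want emptyDataset itself, as in the question, the same import provides the implicit Encoder it asks for. A minimal sketch, assuming the Tp case class from the question:

import org.apache.spark.sql.{Dataset, SparkSession}

case class Tp(s1: String, s2: String, s3: String)

object EmptyDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("EmptyDataset").getOrCreate()
    import spark.implicits._  // brings the implicit Encoder[Tp] into scope

    val emptyDs: Dataset[Tp] = spark.emptyDataset[Tp]
    emptyDs.show()  // empty table with columns s1, s2, s3

    spark.stop()
  }
}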

Converting error with RDD operation in Scala

I am new to Scala and I ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame; the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1, val2)
    }).toDS()
    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
Although I tried to fix the problem by myself, I failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {
    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20)))

    // the SparkSession has a method to convert an RDD to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
- use SparkSession instead of SparkContext and SQLContext
- move the Person case class out of the App object to the top level (a case class defined inside a method prevents Spark from deriving an encoder for it)
- use createDataset for the conversion
However, it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method, as sketched below.
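A minimal sketch of that approach, assuming a CSV file named people.csv with header columns name and age (the file name and options are illustrative):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Long)

object ReadIntoDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ReadIntoDataset").getOrCreate()
    import spark.implicits._

    // Read the CSV and map it straight onto the case class with as[Person];
    // the inferred integer age column is upcast to Long by the encoder.
    val people: Dataset[Person] = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("people.csv")
      .as[Person]

    people.show()
    spark.stop()
  }
}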

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I join two DataFrames in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column in the DataFrame does not contain a period. Please do help me out.
You can find the code that I am using below.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }

  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might solve the problem on its own.
That said, you can simply use withColumnRenamed to rename the column to something that does not contain a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3 = dfTmp.join(df2, dfTmp.col("JOIN_COL").equalTo(df2.col("col3")), "outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the join expression without .show and then call df3.show separately.
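If you do upgrade, newer Spark versions also let you escape a column name containing a period with backticks when referencing it, which avoids the rename altogether (a hedged sketch; I have not verified this on 1.5):

// Assumption: backtick-quoting makes Spark treat "col.1" as a single column name
val df3 = df1.join(df2, df1.col("`col.1`") === df2.col("col3"), "outer")
df3.show()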