I am writing the following code to load a file into Spark using the newAPIHadoopFile API.
val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
But I am getting the following error:
scala> val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
<console>:34: error: inferred type arguments [org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat] do not conform to method newAPIHadoopFile's type parameter bounds [K,V,F <: org.apache.hadoop.mapreduce.InputFormat[K,V]]
val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
^
<console>:34: error: type mismatch;
found : Class[org.apache.hadoop.mapred.TextInputFormat](classOf[org.apache.hadoop.mapred.TextInputFormat])
required: Class[F]
val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
^
<console>:34: error: type mismatch;
found : Class[org.apache.hadoop.io.Text](classOf[org.apache.hadoop.io.Text])
required: Class[K]
val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
^
<console>:34: error: type mismatch;
found : Class[org.apache.hadoop.io.Text](classOf[org.apache.hadoop.io.Text])
required: Class[V]
val lines = sc.newAPIHadoopFile("new_actress.list",classOf[TextInputFormat],classOf[Text],classOf[Text])
^
What am I doing wrong in the code?
TextInputFormat produces <LongWritable, Text> key/value pairs.
Note: pay attention to the extends clause of both *InputFormat classes:
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextInputFormat
extends FileInputFormat<LongWritable,Text>
That means you cannot set both the key and value types to Text for this input format. If you want to use TextInputFormat, the key must be LongWritable and the value Text:
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
val lines = sc.newAPIHadoopFile("test.csv", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
But if you still want both the key and the value to be Text, you can use KeyValueTextInputFormat, which is defined as:
@InterfaceAudience.Public
@InterfaceStability.Stable
public class KeyValueTextInputFormat
extends FileInputFormat<Text,Text>
You can try:
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
import org.apache.hadoop.io.Text
val lines = sc.newAPIHadoopFile("test.csv", classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
I set up an SBT console like...
import org.json4s._
import org.json4s.native.JsonMethods._
import org.json4s.JsonDSL._
case class TagOptionOrNull(tag: String, optionUuid: Option[java.util.UUID], uuid: java.util.UUID)
val t1 = new TagOptionOrNull("t1", Some(java.util.UUID.randomUUID), java.util.UUID.randomUUID)
val t2 = new TagOptionOrNull("t2", None, null)
I'm trying to see json4s's behavior around null vs Option[UUID]. But I can't figure out the invocation to get it to make my case class a String of JSON.
scala> implicit val formats = DefaultFormats
formats: org.json4s.DefaultFormats.type = org.json4s.DefaultFormats$@614275d5
scala> compact(render(t1))
<console>:23: error: type mismatch;
found : TagOptionOrNull
required: org.json4s.JValue
(which expands to) org.json4s.JsonAST.JValue
compact(render(t1))
What am I missing?
Serialization.write should be able to serialise a case class like so:
import org.json4s.native.Serialization.write
import org.json4s.ext.JavaTypesSerializers

implicit val formats = DefaultFormats ++ JavaTypesSerializers.all
println(write(t1))
which should output
{"tag":"t1","optionUuid":"95645021-f60c-4708-8bf3-9d5609559fdb","uuid":"19cc4979-5836-4edf-aedd-dcb3e96f17d6"}
Note: to serialise the UUID fields we need the JavaTypesSerializers formats from the json4s-ext module:
libraryDependencies += "org.json4s" %% "json4s-ext" % version
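Since the original question is about null vs Option[UUID], it is also worth serialising t2 the same way and comparing how the None field and the null UUID are rendered:
// t2 has optionUuid = None and uuid = null
println(write(t2))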
I'm trying to create a function to check whether a string is a date, but the following function produces an error.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.sql._
import scala.util.{Success, Try}
def validateDate(date: String): Boolean = {
val df = new java.text.SimpleDateFormat("yyyyMMdd")
val test = Try[Date](df.parse(date))
test match {
case Success(_) => true
case _ => false
}
}
Error:
[error] C:\Users\user1\IdeaProjects\sqlServer\src\main\scala\main.scala:14: type mismatch;
[error] found : java.util.Date
[error] required: java.sql.Date
[error] val test = Try[Date](df.parse(date))
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 2 s, completed May 17, 2017 1:19:33 PM
Is there a simpler way to validate whether a string is a date, without creating a function?
The function is used to validate the command line argument.
if (args.length != 2 || validateDate(args(0))) { .... }
In Try[Date](df.parse(date)) you are not interested in the type, because you ignore it, so simply omit the type parameter: Try(df.parse(date)).
Your function could also be shorter: use Try(df.parse(date)).isSuccess instead of pattern matching.
If your environment has Java 8, prefer the java.time package:
import scala.util.Try
import java.time.LocalDate
import java.time.format.DateTimeFormatter
// Move the formatter out of the function so it is not re-created on every call.
val df = DateTimeFormatter.ofPattern("yyyyMMdd")
def datebleStr(s: String): Boolean = Try(LocalDate.parse(s, df)).isSuccess
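For example, with the yyyyMMdd pattern above:
datebleStr("20170517")   // true: matches yyyyMMdd
datebleStr("2017-05-17") // false: does not match the pattern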
Alternatively, just import java.util.Date instead of java.sql._: SimpleDateFormat.parse returns a java.util.Date, so the original function compiles once Date refers to that type.
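A minimal sketch of the original function with just that import changed:
import java.util.Date
import scala.util.Try

def validateDate(date: String): Boolean = {
  val df = new java.text.SimpleDateFormat("yyyyMMdd")
  // df.parse returns java.util.Date, which now matches the type parameter
  Try[Date](df.parse(date)).isSuccess
}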
I'm starting to train a Multiple Linear Regression algorithm in Flink.
I'm following the awesome official documentation and quickstart. I am using Zeppelin to develop this code.
If I load the data from a CSV file:
//Read the file:
val data = benv.readCsvFile[(Int, Double, Double, Double)]("/.../quake.csv")
val mapped = data.map {x => new org.apache.flink.ml.common.LabeledVector (x._4, org.apache.flink.ml.math.DenseVector(x._1,x._2,x._3)) }
//Data created:
mapped: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
LabeledVector(6.7, DenseVector(33.0, -52.26, 28.3))
LabeledVector(5.8, DenseVector(36.0, 45.53, 150.93))
LabeledVector(5.8, DenseVector(57.0, 41.85, 142.78))
//Predict with the model created:
val predictions: DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
If I load the data from LIBSVM file:
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
But I got this ERROR:
->CSV:
res13: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
<console>:89: error: type mismatch;
found : org.apache.flink.api.scala.DataSet[Any]
required: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector]
Note: Any >: org.apache.flink.ml.common.LabeledVector, but class DataSet is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
Error occurred in an application involving default arguments.
val predictions:DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
->LIBSVM:
<console>:111: error: type Vector takes type parameters
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
Ok, so I wrote:
New Code:
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
New Error:
<console>:111: error: type mismatch;
found : org.apache.flink.ml.math.Vector
required: scala.collection.immutable.Vector[org.apache.flink.ml.math.Vector]
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
I would really appreciate your help! :)
You should not import and use the Scala Vector class here; Flink ML ships with its own Vector type. This should work:
val testingDS: DataSet[(org.apache.flink.ml.math.Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
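If you prefer a shorter type annotation, one option is to rename the Flink vector on import so it cannot clash with scala.collection.immutable.Vector:
import org.apache.flink.ml.math.{Vector => FlinkVector}

val testingDS: DataSet[(FlinkVector, Double)] =
  MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))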
I am trying to construct a JSON object from a list where the key is "products" and the value is a List[Product], where Product is a case class. But I am getting an error that says "type mismatch; found: (String, List[com.mycompnay.ws.client.Product]) required: net.liftweb.json.JObject (which expands to) net.liftweb.json.JsonAST.JObject".
What I have done so far is as below:
val resultJson:JObject = "products" -> resultList
println(compact(render(resultJson)))
You're looking for Extraction.decompose.
I tested the following code and it worked fine:
import net.liftweb.json._
import net.liftweb.json.JsonDSL._
import net.liftweb.json.Extraction._
implicit val formats = net.liftweb.json.DefaultFormats
case class Product(foo: String)
val resultList: List[Product] = List(Product("bar"), Product("baz"))
val resultJson: JObject = ("products" -> decompose(resultList))
println(compact(render(resultJson)))
Result:
{"products":[{"foo":"bar"},{"foo":"baz"}]}
EDIT: Answer: It was a JAR file that created a conflict!
The related post is: "Must include log4J, but it is causing errors in Apache Spark shell. How to avoid errors?"
Doing the following:
val numOfProcessors:Int = 2
val filePath:java.lang.String = "s3n://somefile.csv"
var rdd:org.apache.spark.rdd.RDD[java.lang.String] = sc.textFile(filePath, numOfProcessors)
I get
error: type mismatch;
found : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
var rdd:org.apache.spark.rdd.RDD[java.lang.String] = sc.textFile(filePath, numOfProcessors)
EDIT: Second case
val numOfProcessors = 2
val filePath = "s3n://somefile.csv"
var rdd = sc.textFile(filePath, numOfProcessors) //OK!
def doStuff(rdd: RDD[String]): RDD[String] = {rdd}
doStuff(rdd)
I get:
error: type mismatch;
found : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
doStuff(rdd)
^
No comment...
Any ideas why I get this error?
The problem was a JAR file that created a conflict.
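When the "found" and "required" types in such an error print as the same name, it usually means the class is present on the classpath twice, loaded from different JARs. One way to see which JAR a class actually came from (a diagnostic sketch, run in the same shell):
// prints the location (JAR or directory) that provided the RDD class at runtime;
// getCodeSource can be null for JDK bootstrap classes, but not for Spark's own JARs
println(classOf[org.apache.spark.rdd.RDD[_]].getProtectionDomain.getCodeSource.getLocation)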