illegal start of simple expression in if-statement - scala

I have some issues with scala.
I got this error when I run scalastyle
illegal start of simple expression
This question is linked to other questions about this scalastyle "illegal" error, but I tried all the solutions proposed and nothing has fixed it.
Here is the piece of my code. The error is located in the if-statement in the nexter function.
case class EXT(ev: Seq[String])
val schema = StructType(Seq(
  StructField("label", DoubleType)
))
def hasVis(ev: Seq[String]): Boolean = ev.toSet.exists(videoV.contains)
def hasCom(ev: Seq[String]): Boolean = ev.toSet.exists(videoC.contains)
def nexter: EXT => Seq[Row] = (extension: EXT) => Seq(new GenericRowWithSchema(
  Array(if (hasVis(extension.ev) && hasCom(extension.ev)) 1.0 else 0.0), schema))
Thank you for your help
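One workaround that is sometimes suggested for this kind of scalastyle parse error, sketched below only as an illustration (it assumes videoV and videoC are defined elsewhere and is not a confirmed fix), is to lift the if-expression out of the argument list into a named value:
// Sketch of a possible workaround: move the conditional out of Array(...)
// so the style checker sees a plain identifier in the argument list.
def nexter: EXT => Seq[Row] = (extension: EXT) => {
  val label = if (hasVis(extension.ev) && hasCom(extension.ev)) 1.0 else 0.0
  Seq(new GenericRowWithSchema(Array[Any](label), schema))
}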

Related

Transforming RDD[String] to RDD[myclass]

I am trying to transform an RDD[String] into an RDD[Picture] but could not do it. If I could manage to convert the RDD to RDD[Picture], I would use def hasValidCountry to check whether the latitude and longitude values of the picture metadata are valid, and after that check whether the user tags are valid with def hasTags in the Picture class. The problems I encounter:
Implicit conversion found: row ⇒ augmentString(row): scala.collection.immutable.StringOps
type mismatch; found : String required: Array[String]
value InterestingPics is not a member of Array[Nothing] possible cause: maybe a semicolon is missing before `value InterestingPics'?
My intention is to select the lines that have a valid country and tags and transform all of those lines into the new RDD[Picture] class.
ScalaFile1 (I have updated the ScalaFile):
object Part2 {
  def main(args: Array[String]): Unit = {
    var spark: SparkSession = null
    try {
      spark = SparkSession.builder().appName("Flickr using dataframes").config("spark.master", "local[*]").getOrCreate()
      val originalFlickrMeta: RDD[String] = spark.sparkContext.textFile("flickrSample.txt")
      val InterestingPics = originalFlickrMeta.map(row => row.split('\t')).map(field => Picture(field(0).toString())
      InterestingPics.collect
      InterestingPics.take(5).foreach(println)
This works, as an example:
case class case_for_rdd(c1: Int, c2: String, c3: String)
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv01-4.txt")
val rdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2)))
rdd.collect
A more complicated example: reading into an RDD from a file containing an array column. The array needs a delimiter.
1,10,100,aa|bb|cc
2,20,200,xxxxxx|yyyyyyyy|z|aaa
Some sample code, but using List, as otherwise you get to see array addresses (that's what those strange strings are), courtesy of smarter people here:
case class case_for_rdd(c1: Int, c2: String, c3: String, a4: List[String])
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv03.txt")
val myCaseRdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2), (field(3).split("\\|").toList)))
myCaseRdd.collect
My advice is to use a DataFrame; the splitting is then easier. Also, if you manipulate the RDD via transformations, the case class is lost, whereas an array column with the DataFrame API has no such issue.
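As a rough sketch of that DataFrame route (not the answerer's code; the column names are invented for illustration and the file layout is assumed to match the sample above):
// Read the same comma-separated file with the DataFrame API and split the
// fourth column into an array column; no case class is needed and the array
// column survives further DataFrame transformations.
import org.apache.spark.sql.functions.{col, split}

val df = spark.read
  .csv("/FileStore/tables/csv03.txt")        // columns arrive as strings _c0 .. _c3
  .toDF("c1", "c2", "c3", "a4")
  .withColumn("a4", split(col("a4"), "\\|"))

df.show(false)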
I have a solution to my question, with the help of #thebluephantom. Thank you very much.
val InterestingPics = originalFlickrMeta.map(line => (new Picture(line.split("\t")))).filter(f => f.c != null && f.userTags.length > 0)
InterestingPics.collect().foreach(println)

spark - method Error: The argument types of an anonymous function must be fully known

I know there have been quite a few questions on this, but I've created a simple example that I thought should work. It still does not, and I'm not sure I understand why.
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStream")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  val kafkaDStream = KafkaUtils.createStream(streamingContext, "hubble1:2181", "aaa", Map("video" -> 3))
  val wordDStream = kafkaDStream.flatMap(t => t._2.split(" "))
  val mapDStream = wordDStream.map((_, 1))
  val wordToSumDStream = mapDStream.updateStateByKey {
    case (seq, buffer) => {
      val sum = buffer.getOrElse(0) + seq.sum
      Option(sum)
    }
  }
  wordToSumDStream.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}
Error:(41, 41) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: (Seq[Int], Option[?]) => Option[?]
val result = flat.updateStateByKey{
Can someone explain why the mapDStream.updateStateByKey method statement does not compile?
Put your logic inside a named function, like below.
def update(seq:Seq[Int],buffer: Option[Int]) = Some(buffer.getOrElse(0) + seq.sum)
val wordToSumDStream = mapDStream.updateStateByKey(update)
Check Example
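An alternative sketch (not from the answer above) is to keep the anonymous function but annotate its parameter types explicitly, so the compiler no longer has to infer them:
// Same logic as the update function, but as a lambda with explicit types,
// which satisfies "argument types of an anonymous function must be fully known".
val wordToSumDStream = mapDStream.updateStateByKey(
  (seq: Seq[Int], buffer: Option[Int]) => Option(buffer.getOrElse(0) + seq.sum)
)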

Saving data to sequence file

I'm trying to do some sort of filtering on a sequence file and save it back to another sequence file. Example:
val subset = ???
val hc = sc.hadoopConfiguration
val serializers = List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(",")
hc.set("io.serializations", serializers)
subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
After compilation I receive the following error:
 found   : Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
 required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
    classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
To my knowledge, SequenceFileOutputFormat extends FileOutputFormat, which extends OutputFormat, but I am missing something.
Can you please help?
I raised an issue with the Spark team at https://issues.apache.org/jira/browse/SPARK-25405
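One detail worth noting from the error text (an observation only, not a confirmed resolution): the class being passed comes from the old org.apache.hadoop.mapred package, while saveAsNewAPIHadoopFile requires an org.apache.hadoop.mapreduce OutputFormat. A sketch of the same call using the new-API class, assuming the rest of the snippet stays unchanged:
// Import the new-API SequenceFileOutputFormat instead of the
// org.apache.hadoop.mapred one, so the classOf[...] argument satisfies the
// Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]] bound.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)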

Some(null) to Stringtype nullable scala.matcherror

I have an RDD[(Seq[String], Seq[String])] with some null values in the data.
The RDD converted to dataframe looks like this
+----------+----------+
| col1| col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+
Following is the sample code:
val rdd = sc.parallelize(Seq((Seq("111", "aaa"), Seq("xx", null))))
val df = rdd.toDF("col1", "col2")
val keys = Array("col1", "col2")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))
val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))
val transposedDf = sqlContext.createDataFrame(transposed, schema)
transposedDf.show()
It runs fine up to the point where I create transposedDf; however, as soon as I call show it throws the following error:
scala.MatchError: null
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
If there are no null values in the RDD, the code works fine. I do not understand why it fails when I have null values, because I am specifying a schema of StringType with nullable set to true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.
Pattern matching is performed linearly, in the order the cases appear in the source, so this line:
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
which doesn't have any restriction on the values of t1 and t2, will also match when they contain null.
Effectively, put the null check first and it should work.
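A minimal sketch of the reordering this answer describes (shown only as an illustration of the suggestion; whether it covers a null buried inside the Seq depends on the data):
val values = df.flatMap {
  case Row(_, null) => None   // null check first, before the general case
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}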
The issue is that whether you find null or not the first pattern matches. After all, t2: Seq[String] could theoretically be null. While it's true that you can solve this immediately by simply making the null pattern appear first, I feel it is imperative to use the facilities in the Scala language to get rid of null altogether and avoid more bad runtime surprises.
So you could do something like this:
def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
//or you could do fancy things with filter/filterNot
df.map {
  case (first, second) => (foo(first), foo(second))
}
This will provide you the Some/None tuples you seem to want, but I would see about flattening out those Nones as well.
I think you will need to encode the null values as a blank or special String before performing these operations. Also keep in mind that Spark executes lazily, so from the line "val values = df.flatMap" onward everything is executed only when show() is called.
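A small sketch of that null-encoding idea (illustration only, reusing the rdd from the question): replace nulls with a placeholder before converting, so nothing downstream ever sees a null String.
// Map each element of both sequences, substituting "" for null.
val cleaned = rdd.map { case (c1, c2) =>
  (c1.map(s => Option(s).getOrElse("")), c2.map(s => Option(s).getOrElse("")))
}
val dfClean = cleaned.toDF("col1", "col2")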

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare it with the set "123,200,300"; if a match is found, output the matching data:
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map { r => parseLine(r) }
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer: why are we making it so complex? In this case we don't require a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
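As a hedged illustration of avoiding the nested-RDD pattern from the question (a sketch only, not a drop-in replacement for the original program): keep the match set as a plain Scala Set, or broadcast it, and do the intersection inside the map with ordinary collections, so no SparkContext is touched inside a transformation.
// Only the driver uses sc; the work inside map uses plain Scala sets.
val matchSet = Set("123", "200", "300")
val matchSetBc = sc.broadcast(matchSet)

val rawData = sc.textFile("hdfs://xxx/tmp/input.csv")
val matched = rawData.map { line =>
  line.split(',').toSet.intersect(matchSetBc.value).mkString(",")
}
matched.collect().foreach(println)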