Error: Could not find or load main class com.sundogsoftware.spark.RatingsCounte - scala

I dont know what is wrong here. When I run I keep getting "Error: Could not find or load main class com.sundogsoftware.spark.RatingsCounter" in my scala IDE.
this is my scala code
package com.sundogsoftware.spark
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
/** Count up how many of each star rating exists in the MovieLens 100K
data set. */
object RatingsCounter {
/** Our main function where the action happens */
def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Create a SparkContext using every core of the local machine, named RatingsCounter
val sc = new SparkContext("local[*]", "RatingsCounter")
// Load up each line of the ratings data into an RDD
val lines = sc.textFile("../ml-100k/u.data")
// Convert each line to a string, split it out by tabs, and extract the third field.
// (The file format is userID, movieID, rating, timestamp)
val ratings = lines.map(x => x.toString().split("\t")(2))
// Count up how many times each value (rating) occurs
val results = ratings.countByValue()
// Sort the resulting map of (rating, count) tuples
val sortedResults = results.toSeq.sortBy(_._1)
// Print each result on its own line.
sortedResults.foreach(println)
}
}
here is my project structure
Here is my run configuration
Here is my scala compiler option selected.
Trying to debug this for a few hours now, nothing seems to be working.
Any pointers will help.

Check out https://wiki.eclipse.org/Eclipse.ini I had to change the vm arg in my eclipse.ini file, and for my JRE options I selected 'Use default JRE (currently 'Java SE 8 [1.8.0_172]')' when I created the scala project. That fixed this error for me.
I am using OS X, so I had to add
-vm
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/bin/java
above -vmargs

Related

Flink: RowRowConverter seems to fail for nested DataTypes

I am trying to load a complex JSON file (multiple different data types, nested objects/arrays etc) from my local, read them in as a source using the Table API File System Connector, convert them into DataStream, and then do some action afterwards (not shown here for brevity).
The conversion gives me a DataStream of type DataStream[Row], which I need to convert to DataStream[RowData] (for sink purposes, won't go into details here). Thankfully, there's a RowRowConverter utility that helps to do this mapping. It works when I tried a completely flat JSON, but when I introduced Arrays and Maps within the JSON, it no longer works.
Here is the exception that was thrown - a null pointer exception:
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.allocateWriter(ArrayObjectArrayConverter.java:140)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toBinaryArrayData(ArrayObjectArrayConverter.java:114)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:93)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:40)
at org.apache.flink.table.data.conversion.DataStructureConverter.toInternalOrNull(DataStructureConverter.java:61)
at org.apache.flink.table.data.conversion.RowRowConverter.toInternal(RowRowConverter.java:75)
at flink.ReadJsonNestedData$.$anonfun$main$2(ReadJsonNestedData.scala:48)
Interestingly, when I setup my breakpoints and debugger this is what I discovered: RowRowConverter::toInternal, the first time it was called works, will go all the way down to ArrayObjectArrayConverter::allocateWriter()
However, for some strange reason, RowRowConverter::toInternal runs twice, and if I continue stepping through eventually it will come back here, which is where the null pointer exception happens.
Example of the JSON (simplified with only a single nested for brevity). I placed it in my /src/main/resources folder
{"discount":[670237.997082,634079.372133,303534.821218]}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.table.types.logical.RowType
import scala.collection.JavaConverters._
object ReadJsonNestedData {
def main(args: Array[String]): Unit = {
// setup
val jsonResource = getClass.getResource("/NESTED.json")
val jsonFilePath = jsonResource.getPath
val tableName = "orders"
val readJSONTable =
s"""
| CREATE TABLE $tableName (
| `discount` ARRAY<DECIMAL(12, 6)>
| )WITH (
| 'connector' = 'filesystem',
| 'path' = '$jsonFilePath',
| 'format' = 'json'
|)""".stripMargin
val colFields = Array(
"discount"
)
val defaultDataTypes = Array(
DataTypes.ARRAY(DataTypes.DECIMAL(12, 6))
)
val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
val defaultDataTypesAsList = defaultDataTypes.toList.asJava
val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
val rowConverter = RowRowConverter.create(dataType)
// Job
val env = StreamExecutionEnvironment.getExecutionEnvironment()
val tableEnv = StreamTableEnvironment.create(env)
tableEnv.executeSql(readJSONTable)
val ordersTable = tableEnv.from(tableName)
val dataStream = tableEnv
.toDataStream(ordersTable)
.map(row => rowConverter.toInternal(row))
dataStream.print()
env.execute()
}
}
I would hence like to know:
Why RowRowConverter is not working and how I can remedy it
Why RowRowConverter::toInternal is running twice for the same Row .. which may be the cause of that NullPointerException
If my method of instantiating and using the RowRowConverter is correct based on my code above.
Thank you!
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json
The first call of RowRowConverter::toInternal is an internal implementation for making a deep copy of the StreamRecord emitted by table source, which is independent from the converter in your map function. The reason of the NPE is that the RowRowConverter in the map function is not initialized by calling RowRowConverter::open. You can use RichMapFunction instead to invoke the RowRowConverter::open in RichMapFunction::open.
Thank you to #renqs for the answer.
Here is the code, if anyone is interested.
class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
extends RichMapFunction[Row, RowData] {
private final val rowRowConverter = RowRowConverter.create(fieldsDataType)
override def open(parameters: Configuration): Unit = {
super.open(parameters)
rowRowConverter.open(this.getClass.getClassLoader)
}
override def map(row: Row): RowData =
this.rowRowConverter.toInternal(row)
}
// at main function
// ... continue from previous
val dataStream = tableEnv
.toDataStream(personsTable)
.map(new ConvertRowToRowDataMapFunction(dataType))

Save Dataset elements to files with specified file path

I have a dataset of event case class that I would like to save the json string element inside it into a file on s3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz
So for example, the events case class looks like this:
case class Event(
hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
json: String // The json line that represents this particular event.
... // Other properties used in earlier transformations.
)
Is there a way to save on the dataset where we write the events that belong to a particular hour into a file on s3?
Calling partitionBy on the DataframeWriter is the closest I can get, but the file path isn't exactly what I want.
You can iterate each item and write it to a file in S3. It's efficient to do it with Spark because it will be executed in parallel.
This code is working for me:
val tempDS = eventsDS.rdd.collect.map(x => saveJSONtoS3(x.hourPath,x.json))
def saveJSONtoS3(path: String, jsonString: String) : Unit = {
val bucketName = path.substring(0,path.indexOf('/'));
val file = path.substring(bucketName.length()+1);
val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val meta = new ObjectMetadata();
amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.model.ObjectMetadata
You need to include aws-java-sdk library.

Scala - Remove header from Pair RDD

I am new in Scala and want to remove header from data. I have below data
recordid,income
1,50000000
2,50070000
3,50450000
5,50920000
and I am using below code to read
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object PAN {
def main(args: Array[String]) {
case class income(recordid : Int, income : Int)
val sc = new SparkContext(new SparkConf().setAppName("income").setMaster("local[2]"))
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt").map(_.split(","))
val income_recs = income_data.map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))
}
}
I want to remove header from pair RDD but not getting how.
Thanks.
===============================Edit=========================================
I was playing with below code
val header = income_data.first()
val a = income_data.filter(row => row != header)
a.foreach { println }
but it return below output
[Ljava.lang.String;#1657737
[Ljava.lang.String;#75c5d3
[Ljava.lang.String;#ed63f
[Ljava.lang.String;#13f04a
[Ljava.lang.String;#1048c5d
You technique to remove the header by filtering it out will work fine. The problem is how you are trying to print the array.
Arrays in Scala do not override toString so when you try to print one it uses the default string representation which is just the name and hashcode and usually not very useful.
If you want to print an array, turn it into a string first using the mkString method on string, or use foreach(println)
a.foreach {array => println(array.mkString("[",", ","]")}
or
a.foreach {array => array.foreach{println}}
Will both print out the elements of your array so you can see what they contain.
Keep in mind that when working with Spark, printing inside transformation and actions only works in local mode. Once you move to the cluster, the work will be done on remote executors so you won't be able to see and console output from them.
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt")
income_data.collect().drop(1)
When you create an RDD it will return RDD[String] , then when you collect() on top of it it will return Array[String], drop(number of elements) is a function on top of Array to remove those many rows from RDD.

How to wait SparkContext to finish all process?

How do I see if SparkContext has contents executing and when everything finish I stop it? Because currently I am waiting 30 seconds before to call SparkContext.stop, otherwise my app throws an error.
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkContext
object RatingsCounter extends App {
// set the log level to print only errors
Logger.getLogger("org").setLevel(Level.ERROR)
// create a SparkContext using every core of the local machine, named RatingsCounter
val sc = new SparkContext("local[*]", "RatingsCounter")
// load up each line of the ratings data into an RDD (Resilient Distributed Dataset)
val lines = sc.textFile("src/main/resource/u.data", 0)
// convert each line to s string, split it out by tabs and extract the third field.
// The file format is userID, movieID, rating, timestamp
val ratings = lines.map(x => x.toString().split("\t")(2))
// count up how many times each value occurs
val results = ratings.countByValue()
// sort the resulting map of (rating, count) tuples
val sortedResults = results.toSeq.sortBy(_._1)
// print each result on its own line.
sortedResults.foreach { case (key, value) => println("movie ID: " + key + " - rating times: " + value) }
Thread.sleep(30000)
sc.stop()
}
Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
And since you are extending App, you are getting an unexpected behavior.
You can read more about it in the official documentation about Self Contained Applications.
App uses DelayedInit and can cause initialization issues. With the main method you know what's going on. Excerpt from reddit.
object HelloWorld extends App {
var a = 1
a + 1
override def main(args: Array[String]) {
println(a) // guess what's the value of a ?
}
}

Reading Basic File Attributes in Scala?

I'm trying to get basic file attributes using Scala, and my reference is this Java question:
Determine file creation date in Java
and this piece of code I'm trying to rewrite in Scala:
static void getAttributes(String pathStr) throws IOException {
Path p = Paths.get(pathStr);
BasicFileAttributes view
= Files.getFileAttributeView(p, BasicFileAttributeView.class)
.readAttributes();
System.out.println(view.creationTime()+" is the same as "+view.lastModifiedTime());
}
The thing I just can't figure out is this line of code..I don't understand how to pass a class in this way using scala... or why Java is insisting upon this in the first place instead of using an actual constructed object as the parameter. Can someone please help me write this line of code to function properly? I must be using the wrong syntax
val attr = Files.readAttributes(f,Class[BasicFileAttributeView])
Try this:
def attrs(pathStr:String) =
Files.getFileAttributeView(
Paths.get(pathStr),
classOf[BasicFileAttributes] //corrected
).readAttributes
Get file creation date in Scala, from Basic Files Attributes:
// option 1,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributes
val pathStr = "/tmp/test.sql"
Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime
res3: java.nio.file.attribute.FileTime = 2018-03-06T00:25:52Z
// option 2,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributeView
val pathStr = "/tmp/test.sql"
{
Files
.getFileAttributeView(Paths.get(pathStr), classOf[BasicFileAttributeView])
.readAttributes.creationTime
}
res20: java.nio.file.attribute.FileTime = 2018-03-07T19:00:19Z