Apache Flink ALS with ids in Long instead of Int - scala

I am trying out the ALS code in Flink version 1.1.3, using a project generated with:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala \
-DarchetypeVersion=1.1.3 \
-DgroupId=org.apache.flink.quickstart \
-DartifactId=flink-scala-project \
-Dversion=0.1 \
-Dpackage=org.apache.flink.quickstart \
-DinteractiveMode=false
I am following the example code at https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/als.html and changed Int to Long in the DataSet:
val env = ExecutionEnvironment.getExecutionEnvironment
val csvInput: DataSet[(Long, Long, Double)] = env.readCsvFile[(Long, Long, Double)]("tmp-contactos.csv")

// Setup the ALS learner
val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(100)

// Set the other parameters via a parameter map
val parameters = ParameterMap()
  .add(ALS.Lambda, 0.9)
  .add(ALS.Seed, 42L)

// Calculate the factorization
als.fit(csvInput, parameters)
But it throws at runtime:
Exception in thread "main" java.lang.RuntimeException: There is no FitOperation defined for org.apache.flink.ml.recommendation.ALS which trains on a DataSet[(Long, Int, Double)]
at org.apache.flink.ml.pipeline.Estimator$$anon$4.fit(Estimator.scala:85)
at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
at org.apache.flink.quickstart.BatchJob$.main(BatchJob.scala:119)
at org.apache.flink.quickstart.BatchJob.main(BatchJob.scala)
Is it possible to use Longs instead of Ints?
I searched and found this for the 0.9 version but nothing for 1.1.3:
https://issues.apache.org/jira/browse/FLINK-2211

So far it is not officially supported, but I've created a branch where I've fixed this limitation. You can try out this branch. I'll contribute it to Flink so that it should become part of master soon.
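In the meantime, a possible workaround (a minimal sketch, assuming your Long ids actually fit into the Int range and that the quickstart's org.apache.flink.api.scala._ import is in scope) is to narrow the ids before fitting, since the released ALS FitOperation expects a DataSet[(Int, Int, Double)]:

// Hypothetical workaround: convert the Long ids to Int before calling fit.
// This only works if every id is <= Int.MaxValue.
val intInput: DataSet[(Int, Int, Double)] = csvInput.map {
  case (user, item, rating) => (user.toInt, item.toInt, rating)
}

als.fit(intInput, parameters)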

Related

Dataproc job Java API method setLoggingConfig has no effect

I'm using a Groovy script with the dependency com.google.cloud:google-cloud-dataproc:2.3.2 and trying to set the logging config using code like this:
import com.google.cloud.dataproc.v1.*
...
final def LOGGING_LEVELS = ['com.example.Myclass': LoggingConfig.Level.DEBUG,'org.apache.spark': LoggingConfig.Level.WARN]
final def args = 'programArg1 programArg2'
def sparkJob = SparkJob
.newBuilder()
.addJarFileUris(jarLocation)
.setMainClass(className)
.addAllArgs(args.split(" ") as Iterable<String>)
.setLoggingConfig(LoggingConfig.newBuilder().putAllDriverLogLevels(LOGGING_LEVELS).build())
.build()
def job = Job.newBuilder().setPlacement(jobPlacement).setSparkJob(sparkJob).build()
...
and it doesn't have any effect.
However, when I submit the job via gcloud utility, it works fine:
gcloud dataproc jobs submit spark \
--driver-log-levels com.example.Myclass=DEBUG,org.apache.spark=WARN \
--project=my-project \
--cluster=my-cluster \
--region=us-central1 \
--class=Myclass \
--jars=gs://mybucket/Myclass-0.1-SNAPSHOT.jar \
-- programArg1 programArg2

Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD

Below is the sample code snippet used to fetch data from HBase. This worked fine with Spark 3.1.2. However, after upgrading to Spark 3.2.1 it no longer works, i.e. the returned RDD doesn't contain any values. It also doesn't throw any exception.
import java.util.Base64

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{NewHadoopRDD, RDD}

// SparkHBaseContext and SparkLoggerParams are application-specific helper classes.
def getInfo(sc: SparkContext, startDate: String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[String] = {
  val scan = new Scan
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
  val output: RDD[String] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}

def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
                                  scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
Also, if we fetch using Scan directly, without NewAPIHadoopRDD, it works.
Software versions:
Spark: 3.2.1 prebuilt with user provided Apache Hadoop
Scala: 2.12.10
HBase: 2.4.9
Hadoop: 2.10.1
I found the solution to this one. See the upgrade guide from Spark 3.1.x to Spark 3.2.x:
https://spark.apache.org/docs/latest/core-migration-guide.html
Since Spark 3.2, spark.hadoopRDD.ignoreEmptySplits is set to true by default which means Spark will not create empty partitions for empty input splits. To restore the behavior before Spark 3.2, you can set spark.hadoopRDD.ignoreEmptySplits to false.
It can be set like this on spark-submit:
./spark-submit \
--class org.apache.hadoop.hbase.spark.example.hbasecontext.HBaseDistributedScanExample \
--master spark://localhost:7077 \
--conf "spark.hadoopRDD.ignoreEmptySplits=false" \
--jars ... \
/tmp/hbase-spark-1.0.1-SNAPSHOT.jar YourHBaseTable
Alternatively, you can set it globally in $SPARK_HOME/conf/spark-defaults.conf so that it applies to every Spark application:
spark.hadoopRDD.ignoreEmptySplits false
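If you construct the SparkContext yourself, the same setting can also be applied programmatically. A minimal sketch (the app name is a placeholder; the master would normally come from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Restore the pre-3.2 behaviour of creating partitions for empty input splits.
val conf = new SparkConf()
  .setAppName("hbase-fetch") // placeholder app name
  .set("spark.hadoopRDD.ignoreEmptySplits", "false")

val sc = new SparkContext(conf)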

Is it possible to read a modified json file without rebuilding the project in Scala?

After I change the configuration in the app.json file, the new changes are not applied when the project starts, and I have to rebuild the project (build a new jar file). Is it possible to make this code read the modified app.json without rebuilding the whole project?
Below is the code that reads data from the app.json file:
import net.liftweb.json._
import scala.io._

case class KafkaConfiguration(bootstrap_servers: String,
                              topic_extractor: String,
                              topic_vertica: String,
                              message_max_bytes: String,
                              group_id: String)
case class HbaseConfiguration(table_name: String = "test", batch_size: Int)
case class SparkConfiguration(id: String, frequency: Int)
case class VerticaConfiguration(vertica_delimiter: String, vertica_qv: String)

case class Configuration(kafka: KafkaConfiguration,
                         spark: SparkConfiguration,
                         hbase: HbaseConfiguration,
                         vertica: VerticaConfiguration)

object Configuration {
  def getConfiguration(filePath: String = "app.json"): Configuration = {
    implicit val formats = DefaultFormats
    // Reads the file as a classpath resource, i.e. from inside the packaged jar.
    val json = Source.fromURL(getClass.getClassLoader.getResource(filePath), "utf-8").mkString
    val jValue = parse(json)
    val kafkaConf = jValue.\("kafka").extract[KafkaConfiguration]
    val sparkConfig = jValue.\("spark").extract[SparkConfiguration]
    val hbaseConfig = jValue.\("hbase").extract[HbaseConfiguration]
    val verticaConfig = jValue.\("vertica").extract[VerticaConfiguration]
    Configuration(kafkaConf, sparkConfig, hbaseConfig, verticaConfig)
  }
}
Script to run my application:
. /etc/spark2/conf/spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_201-amd64
export PATH=$PATH:$HOME/bin
export SPARK_KAFKA_VERSION=0.10
CONFIG_FILE="app.json"
COMMAND="spark2-submit \
--master yarn \
--num-executors 4 \
--driver-memory 1g \
--executor-memory 2g \
--deploy-mode cluster \
--name "app1" \
--class DataPointStreaming ./app1.jar $CONFIG_FILE"
echo $COMMAND
exec $COMMAND
Since you're using Spark, taking advantage of HOCON's environment variable substitution or even JVM properties might not be workable, as the JAR will be shipped across a cluster.
It is possible to replace a file in a JAR (or, more formally, to build a new JAR based on a given JAR with a file added). Something along these lines in bash can work:
#!/bin/sh
# $1 = original jar, $2 = json config file to patch into it
base=$( basename -- "$1" )
filename="${base%.*}"
config=$( basename -- "$2" )
confignoext="${config%.*}"

# Copy the jar and the config into a temp dir, renaming the config to app.json
tmpdir=$( mktemp -d )
tmpdest="$tmpdir/$base"
confdest="$tmpdir/app.json"
cp "$1" "$tmpdest"
cp "$2" "$confdest"

# Add/replace app.json inside the jar (a jar is just a zip archive)
zip -jur "$tmpdest" "$confdest"

# Keep the patched jar under build/configured, named after jar + config
mkdir -p build/configured
dest=build/configured/$filename-$confignoext.jar
cp "$tmpdest" "$dest"
echo "$dest"
Then you don't have to recompile everything to build a jar; you just keep a set of json files to patch into the jar for deployment (e.g. app-dev.json, app-prod.json, etc.).
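Another option, not part of the answer above but worth mentioning: read app.json from the local filesystem and only fall back to the bundled resource. A minimal sketch, assuming you ship the file next to the jar (for --deploy-mode cluster it would have to be distributed, e.g. with spark-submit --files) and pass its path as the argument:

import scala.io.Source

// Hypothetical variant of getConfiguration's file loading: prefer an external
// file on the local filesystem if it exists, otherwise fall back to the
// app.json packaged inside the jar.
def readJson(filePath: String = "app.json"): String = {
  val external = new java.io.File(filePath)
  if (external.exists())
    Source.fromFile(external, "utf-8").mkString
  else
    Source.fromURL(getClass.getClassLoader.getResource(filePath), "utf-8").mkString
}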

Kafka + Spark ERROR MicroBatchExecution: Query

I'm trying to run the program specified in this IBM Developer code pattern. For now, I am only doing the local deployment: https://github.com/IBM/kafka-streaming-click-analysis?cm_sp=Developer-_-determine-trending-topics-with-clickstream-analysis-_-Get-the-Code
Since it's a little old, my versions of Kafka and Spark aren't exactly what the code pattern calls for. The versions I am using are:
Spark: 2.4.6
Kafka: 0.10.2.1
At the last step, I get the following error:
ERROR MicroBatchExecution: Query [id = f4dfe12f-1c99-427e-9f75-91a77f6e51a7,
runId = c9744709-2484-4ea1-9bab-28e7d0f6b511] terminated with error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
Along with the execution tree
The steps I am following are as follows:
1. Start Zookeeper
2. Start Kafka
3. cd kafka_2.10-0.10.2.1
4. tail -200 data/2017_01_en_clickstream.tsv | bin/kafka-console-producer.sh --broker-list ip:port --topic clicks --producer.config=config/producer.properties
I have downloaded the dataset and stored it in a directory called data inside the kafka_2.10-0.10.2.1 directory.
cd $SPARK_DIR
bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6
Since SPARK_DIR wasn't set during the Spark installation, I am navigating to the directory containing Spark to run this command.
scala> import scala.util.Try
scala> case class Click(prev: String, curr: String, link: String, n: Long)
scala> def parseVal(x: Array[Byte]): Option[Click] = {
val split: Array[String] = new Predef.String(x).split("\\t")
if (split.length == 4) {
Try(Click(split(0), split(1), split(2), split(3).toLong)).toOption
} else
None
}
scala> val records = spark.readStream.format("kafka")
.option("subscribe", "clicks")
.option("failOnDataLoss", "false")
.option("kafka.bootstrap.servers", "localhost:9092").load()
scala>
val messages = records.select("value").as[Array[Byte]]
.flatMap(x => parseVal(x))
.groupBy("curr")
.agg(Map("n" -> "sum"))
.sort($"sum(n)".desc)
val query = messages.writeStream
.outputMode("complete")
.option("truncate", "false")
.format("console")
.start()
The last statement, query = ..., is giving the error mentioned above. Any help would be greatly appreciated. Thanks in advance!
A required library or dependency for interacting with Apache Kafka is likely missing or incompatible, so you may need to install the missing library or update to a version that matches your Spark and Scala build.
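As a quick way to check whether the Kafka connector was actually resolved onto the spark-shell classpath, you can try loading its source provider class directly (a rough sketch; the class name below is the provider I believe ships in spark-sql-kafka-0-10):

// In spark-shell: throws ClassNotFoundException if the Kafka connector
// (spark-sql-kafka-0-10) is not on the classpath or was not resolved by --packages.
Class.forName("org.apache.spark.sql.kafka010.KafkaSourceProvider")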

Executing Spark scala program after compilation

I have compiled a Spark Scala program on the command line, but now I want to execute it. I don't want to use Maven or sbt.
I have used the following command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf

/** Create a RDD of lines from a text file, and keep count of
  * how often each word appears.
  */
object wordcount1 {

  def main(args: Array[String]) {
    // Set up a SparkContext named WordCount that runs locally using
    // all available cores.
    println("before conf")
    val conf = new SparkConf().setAppName("WordCount")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("after the textfile")

    // Create a RDD of lines of text in our book
    val input = sc.textFile("book.txt")
    println("after the textfile")

    // Use flatMap to convert this into an rdd of each word in each line
    val words = input.flatMap(line => line.split(' '))
    // Convert these words to lowercase
    val lowerCaseWords = words.map(word => word.toLowerCase())

    // Count up the occurrence of each unique word
    println("before text file")
    val wordCounts = lowerCaseWords.countByValue()

    // Print the first 20 results
    val sample = wordCounts.take(20)
    for ((word, count) <- sample) {
      println(word + " " + count)
    }

    sc.stop()
  }
}
It is showing that the error is at this line:
val conf = new SparkConf().setAppName("WordCount")
Any help?
Starting from Spark 2.0 the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.appName("App Name")
.getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
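For instance, the wordcount program above could look roughly like this with SparkSession (a sketch only; the app name and book.txt file name are taken from the question, and the master is expected to come from spark-submit):

import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("WordCount")
      .getOrCreate()

    // Read the book, split it into lowercase words and count each one.
    val wordCounts = spark.sparkContext
      .textFile("book.txt")
      .flatMap(line => line.split(' '))
      .map(_.toLowerCase)
      .countByValue()

    // Print the first 20 results.
    wordCounts.take(20).foreach { case (word, count) => println(word + " " + count) }

    spark.stop()
  }
}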
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar