Scala: trouble getting data in a single JSON line

I am using Scala Spark Streaming, and I need to push my results to a Kafka topic. I am using .selectExpr("to_json(struct(Column1,Column2,Column3,Column4)) as value").
And I get the result:
{"Column1":"Value_Column1","Column4":"Value_Column4"}
{"Column1":"Value_Column1","Column2":"Value_Column2"}
{"Column1":"Value_Column1","Column3":"Value_COlumn3"}
How should I change the .selectExpr, or what steps do I need to take, to get an output like this:
{"Column1":"Value_Column1","Column2":"Value_Column2","Column3":"Value_Column3","Column4":"Value_Column4"}
Thank you all in advance!
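For reference, here is a minimal standalone sketch (not a verified fix) showing that to_json(struct(...)) emits one JSON object per row containing every column listed in the struct. Note that to_json drops fields that are null in a given row (at least with default options), so rows with null columns produce smaller JSON objects, which may be what is happening in the output above. The DataFrame below is a hypothetical stand-in for the streaming data:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ToJsonSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the streaming data in the question.
val df = Seq(("a1", "b1", "c1", "d1")).toDF("Column1", "Column2", "Column3", "Column4")

df.selectExpr("to_json(struct(Column1, Column2, Column3, Column4)) as value").show(false)
// value: {"Column1":"a1","Column2":"b1","Column3":"c1","Column4":"d1"}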

Related

How to remove header by using filter function in spark?

I want to remove the header from a file. But since the file will be split into partitions, I can't just drop the first item. So I am using a filter function to work it out, and below is the code I am using:
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
and the error I am getting says "error: not found: value line". What could be the issue here with this code?
I don't think anybody answered the obvious point, which is that line.contains is also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")
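For completeness, here is a self-contained sketch of the filter approach above; the file path is a placeholder and the header marker follows the question's example:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("DropHeader").setMaster("local[*]")
val sc = new SparkContext(conf)

// Placeholder path; "REPORTDATETIME" is the header marker from the question.
val baseRDD = sc.textFile("/path/to/report.csv")
val noHeaderRDD = baseRDD.filter(line => !line.contains("REPORTDATETIME"))
noHeaderRDD.take(5).foreach(println)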

Flink: How to write DataSet to a variable instead of to a file

I have a flink batch program written in scala using the DataSet API which results in a final dataset I am interested in. I would like to get that dataset as a variable or value (e.g. a list or sequence of String) within my program, without having to write it to any file. Is it possible?
I have seen that flink allows for collection data sinks in order to debug (the only example in their doc is in Java). However, this is only allowed in local execution, and anyway I don't know its equivalent in Scala. What I would like is to write the final resulting dataset after the whole flink parallel execution is done to a program value or variable.
First, try this for the Scala version of the collection data sink:
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.LocalCollectionOutputFormat
// ...
val env = ExecutionEnvironment.getExecutionEnvironment
// Create a DataSet from a list of elements
val words = env.fromElements("w1", "w2", "w3")
// LocalCollectionOutputFormat writes the DataSet into a local Java collection
val outData: java.util.List[String] = new java.util.ArrayList[String]()
words.output(new LocalCollectionOutputFormat(outData))
// execute program
env.execute("Flink Batch Scala")
println(outData)
Second, if your dataset fits in the memory of a single machine, why do you need a distributed processing framework? I think you should think more about your use case and try to use the right transformations on your dataset.
I used Flink 1.7.2 with Scala 2.12. This is a streaming prediction using an SVM that I wrapped up in a Model class. I think the most correct answer is to use collect(), which returns a Seq. I got this answer after searching for hours; I got the idea from Flink Git - Line 95.
val temp_jaringan: DataSet[(Vector, Double)] = model.predict_jaringan(value)
temp_jaringan.print()
val temp_produk: DataSet[(Vector, Double)] = model.predict_produk(value)
temp_produk.print()

// collect() returns the DataSet contents to the driver as a Seq
val result_jaringan: Seq[(Vector, Double)] = temp_jaringan.collect()
val result_produk: Seq[(Vector, Double)] = temp_produk.collect()

if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == 1.0) {
  println("Keduanya")        // "Both"
} else if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == -1.0) {
  println("Jaringan")        // "Network"
} else if (result_jaringan(0)._2 == -1.0 && result_produk(0)._2 == 1.0) {
  println("Produk")          // "Product"
} else {
  println("Bukan Keduanya")  // "Neither"
}
It may vary with other versions, because after using and searching Flink material like a mad dog for weeks, even months, for my final project as a graduation requirement, I know that Flink development projects need more documentation and tutorials, especially for beginners like me.
Anyway, correct me if I'm wrong. Thanks!
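For reference, a minimal self-contained sketch of the collect() approach with the Flink Scala DataSet API:
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val words: DataSet[String] = env.fromElements("w1", "w2", "w3")

// collect() executes this part of the program and returns the result to the
// driver as a local Seq, with no file sink involved.
val result: Seq[String] = words.collect()
println(result)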

Using Custom Hadoop input format for processing binary file in Spark

I have developed a Hadoop-based solution that processes a binary file, using the classic Hadoop MR technique. The binary file is about 10 GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a CustomInputFormat and CustomRecordReader in Hadoop that return a key (IntWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now, I would like to port this code to Spark. I am a beginner in Spark and could run simple examples (word count, the pi example). However, I could not find a straightforward example for processing binary files in Spark. I see two possible solutions for this use case. In the first, avoid using the custom input format and record reader: find a method (approach) in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic. If this is not possible, I would like to re-use the custom input format and custom reader using methods such as HadoopAPI, HadoopRDD etc. My problem: I do not know whether the first approach is possible or not. If it is, can anyone please provide some pointers to examples? I was trying the second approach but have been highly unsuccessful. Here is the code snippet I used:
package org {

  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {

    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      // RandomAccessInputFormat is the custom input format ported from the Hadoop job
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat",
        classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me where I am going wrong here? I think I am not using the API the right way, but I failed to find documentation/usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
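To illustrate the difference, here is a short sketch on a plain RDD of integers (assuming an existing SparkContext named sc, not the custom input format from the question):
// map + collect: the transformation runs on the executors and collect()
// brings the results back to the driver, where they can be printed locally.
val doubled = sc.parallelize(1 to 10).map(x => x * 2).collect()
doubled.foreach(println)

// foreach: the function runs on the executors, so any println output ends up
// in the executor logs rather than on the driver's console.
sc.parallelize(1 to 10).foreach(x => println(x))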
I have made some progress on this issue. I am now using the function below, which does the job:
val hRDD = new NewHadoopRDD(sc,
  classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration())

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) =>
  myfuncPart(split, iter)
}.collect()
However, I ended up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
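For context, the function passed to mapPartitionsWithInputSplit above receives the input split and an iterator of (key, value) pairs and must return an iterator; a hypothetical stand-in for the myfuncPart used above might look like this:
import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.InputSplit

// Hypothetical stand-in: count the records in one split.
def myfuncPart(split: InputSplit,
               iter: Iterator[(IntWritable, BytesWritable)]): Iterator[Int] =
  Iterator(iter.size)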

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02, and Fun03 all have transformations and output operations (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves "messages" is not null or empty.
On the Spark application UI, I found Fun02's output stage under "Spark stages", which proves it executed.
The first line of Fun02 is a map function, and I added logging in it. I also added logging for every step in Fun02; they all show there is no data.
Does somebody know the possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have done tests on Spark 1.1 and Spark 1.2, which are supported by my company's Spark cluster.
It seems that this is a bug in Spark 1.1 and Spark 1.2, fixed in Spark 1.3.
I posted my test results here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow calls are used consecutively, depending on the window and slide values, "data lost" may appear.
I cannot find the bug in Spark's issue list, so I cannot get the patch.
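For reference, here is a sketch of the pattern described above (two reduceByKeyAndWindow calls chained on the same stream); the source, window, and slide values below are placeholders, not the original job's settings:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("TwoWindows").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

// First window: 30s window sliding every 10s.
val firstWindow = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
// Second window on top of the first: 60s window sliding every 20s.
val secondWindow = firstWindow.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20))

secondWindow.foreachRDD(rdd => rdd.foreach(println))
ssc.start()
ssc.awaitTermination()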

Apache Spark streaming mapping object and printing attribute

I'm reading from a text file, parsing each line to JSON and am attempting to print one of the attributes:
val msgData = ssc.textFileStream(dataDir)
val msgs = msgData.map(MessageParser.parse)
msgs.foreach(msg => println(msg.my_attribute))
However, I get the following error on compilation:
value my_attribute is not a member of org.apache.spark.rdd.RDD[com.imgzine.analytics.messages.Message]
What am I missing?
Thanks
Spark Streaming discretizes a stream of data into micro-batch containers. These are called DStreams, and each contains a collection of RDDs.
Translated to your case, you need to operate on the content of the RDD, not the DStream:
msgs.foreach(rdd => rdd.foreach(elem => println(elem.my_attribute)))
DStreams offer a helper method to print the first elements (10, I think) of each RDD:
dstream.print()
Of course, that will just invoke .toString on the objects contained in the RDD and print the result, which may not be what you want for my_attribute as stated in the question.
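Putting it together, here is a self-contained sketch using foreachRDD (the non-deprecated equivalent of the DStream foreach call above), with a simple case class standing in for the question's Message type and a stub parser in place of MessageParser:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-ins for the question's Message type and MessageParser.
case class Message(my_attribute: String)
def parse(line: String): Message = Message(line)

val conf = new SparkConf().setAppName("PrintAttribute").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder directory in place of the question's dataDir.
val msgs = ssc.textFileStream("/path/to/dataDir").map(parse)

// Operate on each RDD inside the DStream, then on each element of the RDD.
msgs.foreachRDD(rdd => rdd.foreach(msg => println(msg.my_attribute)))

ssc.start()
ssc.awaitTermination()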