Using Custom Hadoop input format for processing binary file in Spark - scala

I have developed a hadoop based solution that process a binary file. This uses classic hadoop MR technique. The binary file is about 10GB and divided into 73 HDFS blocks, and the business logic written as map process operates on each of these 73 blocks. We have developed a customInputFormat and CustomRecordReader in Hadoop that returns key (intWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of a HDFS block(bianry data). The business logic knows how to read this data.
Now, I would like to port this code in spark. I am a starter in spark and could run simple examples (wordcount, pi example) in spark. However, could not straightforward example to process binaryFiles in spark. I see there are two solutions for this use case. In the first, avoid using custom input format and record reader. Find a method (approach) in spark the creates a RDD for those HDFS blocks, use a map like method that feeds HDFS block content to the business logic. If this is not possible, I would like to re-use the custom input format and custom reader using some methods such as HadoopAPI, HadoopRDD etc. My problem:- I do not know whether the first approach is possible or not. If possible, can anyone please provide some pointers that contains examples? I was trying second approach but highly unsuccessful. Here is the code snippet I used
package org {
object Driver {
def myFunc(key : IntWritable, content : BytesWritable):Int = {
println(key.get())
println(content.getSize())
return 1
}
def main(args: Array[String]) {
// create a spark context
val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
val sc = new SparkContext(conf)
println(sc)
val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
val count = rd.map (x => myFunc(x._1, x._2)).reduce(_+_)
println("The count is *****************************"+count)
}
}
}
Please note that the print statement in the main method prints 73 which is the number of blocks whereas the print statements inside the map function prints 0.
Can someone tell where I am doing wrong here? I think I am not using API the right way but failed to find some documentation/usage examples.

A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).

I have made some progress in this issue. I am now using the below function which does the job
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
classOf[IntWritable],
classOf[BytesWritable],
job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, landed up with another error the details of which i have posted here
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

Related

How to run an if else statement in Scala in Databricks streaming

I am new both to Scala and to Databricks streaming. I am reading streamed events into a dataframe and I want to use an if-else statement to trigger a different notebook based on whether the dataframe is empty or not. The simple code below (and variations of it)
if(finalDF.isEmpty){
print("0")
}
else{
print("1")
}
persistently results in the following error
AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
eventhubs
How can I incorporate writeStream.start() into the above code? Or, how can I evaluate the dataframe content and based on that, take one or another action, given that the dataframe is populated by streaming events into it?
The streaming DF couldn't be empty or not by design - stream is infinite, and if you don't have data there right now, then you can get something new in next second. So your code won't work.
You can use foreachBatch to process "current" snapshot of data with which you can work as with "normal", non-streaming dataframe, but then you may not be able to trigger the notebook from inside it, so code for both conditions should be inside the same function, not in different notebooks.
I tested this code and it works as a way to introduce an if-else and decide actions based on event contents.
df.writeStream.foreachBatch((df: org.apache.spark.sql.DataFrame, batchID: Long) => myfunc(df)).start()
def myfunc(df: org.apache.spark.sql.DataFrame){
val test1= df.filter($"col" === "test1")
val test2= df.filter($"col" === "test2")
if(test1.count()>0){
dbutils.notebook.run("some_notebook", 60)
}
if(test2.count()>0){
dbutils.notebook.run("another_notebook", 60)
}
}

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goals is to extract data from Cloud SQL, transform and write into Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
// produce one global window with one pane per ~500 records
.withGlobalWindow(WindowOptions(
trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
.withShardNameTemplate("-P-S")
.withWindowedWrites() // gets us one file per window & pane
pipe.saveAsCustomOutput("writer",out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
try (PreparedStatement statement =
connection.prepareStatement(
query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
statement.setFetchSize(fetchSize);
parameterSetter.setParameters(context.element(), statement);
try (ResultSet resultSet = statement.executeQuery()) {
while (resultSet.next()) {
context.output(rowMapper.mapRow(resultSet));
}
}
}
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver itself loads the entire result within a single call to executeStatement. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
elementCountAtLeast is a lower bound so making only one pane is a valid option for a runner to do.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
pipe.saveAsCustomOutput("writer",out)
This is typically the fastest option when the TextIO has a GroupByKey or a source that supports splitting that precedes it. To my knowledge JDBC doesn't support splitting so your best option is to add a Reshuffle after the jdbcSelect which will enable parallelization of processing after reading the data from the database.
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
.apply(GroupIntoBatches.ofSize(500))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
pipe.saveAsCustomOutput("writer",out)
In general, this will be slower then option #1 but it does allow you to choose how many records are written per file.
There are a few other ways to do this with their pros and cons but the above two are likely the closest to what you want. If you add more details to your question, I may revise this question further.

Flink: How to write DataSet to a variable instead of to a file

I have a flink batch program written in scala using the DataSet API which results in a final dataset I am interested in. I would like to get that dataset as a variable or value (e.g. a list or sequence of String) within my program, without having to write it to any file. Is it possible?
I have seen that flink allows for collection data sinks in order to debug (the only example in their doc is in Java). However, this is only allowed in local execution, and anyway I don't know its equivalent in Scala. What I would like is to write the final resulting dataset after the whole flink parallel execution is done to a program value or variable.
First, try this for the scala version of collection data sink:
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.LocalCollectionOutputFormat;
.
.
val env = ExecutionEnvironment.getExecutionEnvironment
// Create a DataSet from a list of elements
val words = env.fromElements("w1","w2", "w3")
var outData:java.util.List[String]= new java.util.ArrayList[String]()
words.output(new LocalCollectionOutputFormat(outData))
// execute program
env.execute("Flink Batch Scala")
println(outData)
Second, if your dataset fits in memory of single machine why do you need to use a distributed processing framework? I think you should think more about your use case! and try to use the right transformations on your dataset.
I used flink 1.72 with scala 2.12. And this is a streaming prediction using SVM that i wrapped up in Model class. I think the most correct answer is using collect(). It'll return Seq. i got this answer after searching for hours. i got the idea from Flink Git - Line 95
var temp_jaringan : DataSet[(Vector,Double)] = model.predict_jaringan(value)
temp_jaringan.print()
var temp_produk : DataSet[(Vector,Double)] = model.predict_produk(value)
temp_produk.print()
var result_jaringan : Seq[(Vector,Double)] = temp_jaringan.collect()
var result_produk : Seq[(Vector,Double)] = temp_produk.collect()
if(result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == 1.0 ){
println("Keduanya")
}else if(result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == -1.0){
println("Jaringan")
}else if(result_jaringan(0)._2 == -1.0 && result_produk(0)._2 == 1.0){
println("Produk")
}else{
println("Bukan Keduanya")
}
It may vary based on other version. cause after using and searching flink material like a mad dog for weeks even months for my final project as graduation requirement, i know that this flink develepment projects need more documentation and tutorial, especially for beginners like me.
anyway, correct me if im wrong. Thanks!

Fast file writing in scala?

So I have a scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with graphx.
The biggest slow down is actually creating this text file, were talking maybe million records it writes to this text file. Is there a way I can somehow parallel this task or making faster in any way by somehow storing it in memory or anything?
More info:
I am using a hadoop cluster to iterate through a graph and here is my code snippet for my text file creation im doing now to write to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
val current = edges.next()
os.write(current.inVertex().id().toString.getBytes())
os.write(" ".getBytes())
os.write(current.outVertex().id().toString.getBytes())
os.write("\n".toString.getBytes())
}
fs.close()
Writing files to HDFS is never fast. Your tags seem to suggest that you are already using spark anyway, so you could as well, take advantage of it.
sparkContext
.makeRDD(20, edges.toStream)
.map(e => e.inVertex.id -> e.outVertex.id)
.toDF
.write
.delimiter(" ")
.csv(path)
This splits your input into 20 partitions (you can control that number with the numeric parameter to makeRDD above), and writes them in parallel to 20 different chunks in hdfs, that represent your resulting file.

Apache Spark streaming mapping object and printing attribute

I'm reading from a text file, parsing each line to JSON and am attempting to print one of the attributes:
val msgData = ssc.textFileStream(dataDir)
val msgs = msgData.map(MessageParser.parse)
msgs.foreach(msg => println(msg.my_attribute))
However, I get the following error on compilation:
value my_attribute is not a member of org.apache.spark.rdd.RDD[com.imgzine.analytics.messages.Message]
What am I missing?
Thanks
Spark Streaming discretizes a stream of data by creating micro-batch containers. Those are called 'DStreams' and contain a collection of RDD's.
Translated to your case, you need to operate on the content of the RDD, not the DStream:
msgs.foreach(rdd => rdd.foreach(elem => println(elem.my_attribute))
DStreams offer a help method to print the first elements (10 I think) of each RDD:
dstream.print()
Of course, that will just invoke .toString on the objects contained in the RDD and print the result. Maybe not what you want with my_attribute as stated in the question.