How to get the file name from DStream of Spark StreamingContext? - scala

Even after a lot of trying and googling, I could not get the file name when using the streaming context. I can use wholeTextFiles of SparkContext, but then I would have to re-implement the streaming context's functionality.
Note: FileName (error events as json file) is the input to the system, so retaining the name in the output is extremely important so that any event can be traced during audit.
Note: FileName is of the format below. The SerialNumber part can be extracted from the event JSON, but the time is stored as milliseconds and is difficult to convert to the format below reliably, and there is no way to recover the counter.
...
Each file contains just one line, a complex JSON string. Using the streaming context I am able to create an RDD[String], where each string is the JSON string from a single file. Does anyone have a solution or workaround for associating each string with its file name?
val sc = new SparkContext("local[*]", "test")
val ssc = new StreamingContext(sc, Seconds(4))
val dStream = ssc.textFileStream(pathOfDirToStream)
dStream.foreachRDD { eventsRdd => /* How to get the file name */ }

You could do this using fileStream and creating your own FileInputFormat, similar to TextInputFormat which uses the InputSplit to provide the filename as a Key. Then you can use fileStream to get a DStream with filename and line.
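A minimal sketch of that approach, assuming the Hadoop "new" (mapreduce) API and one text line per record. The names FileNameRecordReader, FileNameTextInputFormat and FileNameStream.withFileNames are made up for illustration and are not part of Spark or Hadoop; the reader simply delegates to Hadoop's LineRecordReader and replaces the key with the path taken from the FileSplit.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical reader: emits (file path, line) instead of (byte offset, line).
class FileNameRecordReader extends RecordReader[Text, Text] {
  private val lines = new LineRecordReader()
  private val fileName = new Text()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    // Assumption: splits produced by a FileInputFormat are FileSplits.
    fileName.set(split.asInstanceOf[FileSplit].getPath.toString)
    lines.initialize(split, context)
  }
  override def nextKeyValue(): Boolean = lines.nextKeyValue()
  override def getCurrentKey(): Text = fileName
  override def getCurrentValue(): Text = lines.getCurrentValue()
  override def getProgress(): Float = lines.getProgress()
  override def close(): Unit = lines.close()
}

// Hypothetical input format that plugs the reader above into fileStream.
class FileNameTextInputFormat extends FileInputFormat[Text, Text] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, Text] =
    new FileNameRecordReader
}

object FileNameStream {
  // Returns a DStream of (file path, line) pairs for new files in dir.
  def withFileNames(ssc: StreamingContext, dir: String): DStream[(String, String)] =
    ssc.fileStream[Text, Text, FileNameTextInputFormat](dir)
      // Hadoop reuses Text instances, so copy them into Strings right away.
      .map { case (file, line) => (file.toString, line.toString) }
}
```

With this in place, FileNameStream.withFileNames(ssc, pathOfDirToStream) would take the place of ssc.textFileStream(pathOfDirToStream) in the question, and each element carries the path of the file it came from.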

To get file names from a DStream, I created a Java function that fetches the file path through the Java Spark API, and then called that function from my Spark Streaming job (which is written in Scala).
Here is a Java code sample:
import java.io.Serializable;
import org.apache.spark.Partition;
import org.apache.spark.rdd.NewHadoopPartition;
import org.apache.spark.rdd.UnionPartition;

public class GetFileNameFromStream implements Serializable {
    // Each partition of a file-stream batch RDD is a UnionPartition that
    // wraps the NewHadoopPartition of one input file.
    public String getFileName(Partition partition) {
        UnionPartition upp = (UnionPartition) partition;
        NewHadoopPartition npp = (NewHadoopPartition) upp.parentPartition();
        // For a file split, toString() yields the file path followed by offset and length.
        String filePath = npp.serializableHadoopSplit().value().toString();
        return filePath;
    }
}
In the Spark Streaming job, I call the Java function above.
Here is a code sample:
val obj = new GetFileNameFromStream
dstream.transform { rdd =>
  // rdd.partitions is evaluated on the driver, once per batch
  for (part <- rdd.partitions) {
    val filePath = obj.getFileName(part)
    // use filePath here, for example log it
  }
  rdd // transform must return an RDD
}

Related

Load XML file from HDFS in Scala

I want to load an XML file from HDFS using the Scala XML API. I am trying as follows, but it does not recognize the path. Could anyone let me know how to load a file from HDFS using Scala?
import scala.xml.{NodeSeq, XML}
val xml_load = XML.loadFile("hdfs:////user/np.user/raw/xmlfile.xml")
I assume you're using Scala 2.12.x; I also assume the four slashes in hdfs:////user... are a typo.
You're using method XML.loadFile(name: String); it internally uses FileInputStream. It's not possible to open an HDFS file with a plain FileInputStream. You need an input stream which supports HDFS. You can find it in org.apache.hadoop:hadoop-hdfs library.
The code then looks like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import scala.xml.XML
// configure properly so the code knows which Hadoop cluster to connect to
// https://hadoop.apache.org/docs/r3.2.0/api/org/apache/hadoop/conf/Configuration.html
val conf = new Configuration()
// obtain input stream instance
// hdfs:/// (empty authority) takes the NameNode address from fs.defaultFS;
// with only two slashes, "user" would be parsed as the host name
val hdfsPath: Path = new Path("hdfs:///user/np.user/raw/xmlfile.xml")
val fs: FileSystem = hdfsPath.getFileSystem(conf)
val inputStream: FSDataInputStream = fs.open(hdfsPath)
// load XML; the try expression makes the result visible outside the block
val xml_load =
  try {
    XML.load(inputStream)
  } finally {
    // close resources; an exception thrown by close() here would mask one thrown by XML.load
    inputStream.close()
    fs.close()
  }

Saving RDD as textfile gives FileAlreadyExists Exception. How to create new file every time program loads and delete old one using FileUtils

Code:
val badData:RDD[ListBuffer[String]] = rdd.filter(line => line(1).equals("XX") || line(5).equals("XX"))
badData.coalesce(1).saveAsTextFile(propForFile.getString("badDataFilePath"))
The first time, the program runs fine. On running it again, it throws a FileAlreadyExistsException.
I want to resolve this using FileUtils java functionalities and save rdd as a text file.
Before you write the file to the specified path, delete the existing output path:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("bad/data/file/path"), true) // true = delete recursively
Then perform your usual write. This should resolve the problem.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
// spark is assumed to be a SparkSession
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val outputPath = new Path("path/to/the/files")
if (fs.exists(outputPath))
  fs.delete(outputPath, true)
Pass the output path as a String to Path; if a file or directory exists at that path, it is deleted. Run this piece of code before writing to the output path.
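Why not use DataFrames? Turn the RDD[ListBuffer[String]] into an RDD[Row] and write it with SaveMode.Overwrite, something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Assumes a SparkSession named spark and that every record has numCols string fields.
val badRows: RDD[Row] = rdd
  .filter(line => line(1).equals("XX") || line(5).equals("XX"))
  .map(line => Row.fromSeq(line.toList))
val schema = StructType((0 until numCols).map(i => StructField(s"c$i", StringType)))
spark.createDataFrame(badRows, schema)
  .write.mode(SaveMode.Overwrite) // replaces the output directory if it already exists
  .csv(propForFile.getString("badDataFilePath"))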

spark-scala: download a list of URLs from a particular column

I have a CSV file that contains details of all the candidates who have applied for a particular position.
Sample data (notice that the resume URLs point to different file types: pdf, docx, doc):
Name age Resume_file
A1 20 http://resumeOfcandidateA1.pdf
A2 20 http://resumeOfcandidateA2.docx
I wish to download the contents of resume URL given in 3rd Column into my table.
I tried using the “wget” and “pdftotext” commands to download the list of resumes, but that did not help: each URL produced a separate file in my cluster (outside the table), and without a unique key there was no way to link it back to the rest of the table.
I even tried using scala.io.Source, but this required mentioning the link explicitly each time to download the contents and this too was outside the table.
You can implement a Scala function that downloads the content of a URL. One library you can use for this is scalaj-http (https://github.com/scalaj/scalaj-http).
import scalaj.http._
def downloadURLContent(url: String): Array[Byte] = {
  val request = Http(url)
  val response = request.asBytes
  response.body
}
Then you can use this function with RDD or Dataset to download content for each URL using map transformation:
ds.map(r => downloadURLContent(r.Resume_file))
If you prefer using a DataFrame, create a UDF from the downloadURLContent function and apply it with the withColumn transformation:
import org.apache.spark.sql.functions.udf
val downloadURLContentUDF = udf((url: String) => downloadURLContent(url))
df.withColumn("content", downloadURLContentUDF(df("Resume_file")))
Partial answer: this downloads each file to a given location with the proper extension, using the User_id as the file name.
Pending part: extracting the text of all the files and then joining that text with the original CSV file, using User_id as the key.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import sys.process._
import java.net.URL
import java.io.File
object wikipedia {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wiki").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val input = sc.textFile("E:/new_data/resume.txt")
    // download the URL and write the response body to the given file
    def fileDownloader(url: String, filename: String): String = {
      (new URL(url) #> new File(filename)).!!
    }
    input.foreach(x => {
      // user_id is the first field of each line
      // the URL is the second field of each line
      val fields = x.split(",")
      // only handle lines whose URL is longer than 12 characters
      if (fields(1).length > 12) {
        // get the extension of the document
        val ex = x.substring(x.lastIndexOf('.'))
        // replace spaces in the URL with "%20" and store the file
        // at the target location, named after the user_id
        fileDownloader(fields(1).replace(" ", "%20"), "E:/new_data/resume_list/" + fields(0) + ex)
      }
    })
  }
}
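The per-line handling in that loop (split off the user_id, take the extension, encode spaces) can also be pulled into a small pure function, which is easier to check without touching the network. This is only a sketch: ResumeLine.plan and the outDir parameter are hypothetical names, and it assumes each line has the form user_id,url.

```scala
object ResumeLine {
  /** Splits a "user_id,url" line into (download URL, target file name).
    * Returns None when the line has no comma or the URL has no file extension. */
  def plan(line: String, outDir: String): Option[(String, String)] =
    line.split(",", 2) match {
      case Array(userId, rawUrl) =>
        val url = rawUrl.trim
        val dot = url.lastIndexOf('.')
        // the extension must come after the last path separator
        if (dot > url.lastIndexOf('/')) {
          val ext = url.substring(dot) // e.g. ".pdf"
          Some((url.replace(" ", "%20"), s"$outDir/${userId.trim}$ext"))
        } else None
      case _ => None
    }
}
```

Each Some((url, target)) could then be handed to a downloader such as fileDownloader(url, target), while None marks lines to skip.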

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark. I am able to do it, but I am finding it hard to read specific columns. I cannot find any good resource on Google, as almost all the posts are about reading Parquet files using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader
object parquetToJson {
  def main(args: Array[String]): Unit = {
    //case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state:String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build()//.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read specific columns.
If you just want to read specific columns, you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection.)
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
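A hedged sketch of a projected read with parquet-avro. Note that it swaps in AvroReadSupport.setRequestedProjection, the Avro-specific helper for registering a projection, rather than setting ReadSupport.PARQUET_READ_SCHEMA by hand. The field names come from the commented-out Customer case class; the record name, the choice of key and name as the projected columns, and the assumption that both are stored as required fields are illustrative only. If the file stores them as nullable, optionalInt and optionalString would be the matching builders.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

object ParquetProjection {
  // Assumption: only "key" and "name" are wanted, and both are required in the file.
  val projection: Schema = SchemaBuilder.record("Customer").fields()
    .requiredInt("key")
    .requiredString("name")
    .endRecord()

  // Reads one Parquet file and returns records carrying only the projected columns.
  def readProjected(file: Path): List[GenericRecord] = {
    val conf = new Configuration()
    AvroReadSupport.setRequestedProjection(conf, projection)
    val reader = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
    try Iterator.continually(reader.read()).takeWhile(_ != null).toList
    finally reader.close()
  }
}
```

Because the projection is pushed into the reader, the columns left out (sellAmount, profit, state) should not need to be decoded at all.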

Will a standalone Scala program take advantage of distributed/parallel processing, or does Spark require separate Scala code?

First of all, sorry for asking such a basic question here, but an explanation of the following would be appreciated.
I am very new to Scala and Spark. If I write a standalone Scala program and execute it on Spark (1 master, 3 workers), will the program take advantage of distributed/parallel processing, or do I need to write a separate program to benefit from distributed processing?
For example, we have Scala code that converts files in a particular format into a comma-separated file. It takes a directory as input, parses all the files, and writes the output to a single file (each input file is usually 100-200 MB). Here is the code:
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf
object Parser {
  def main(args:Array[String]) {
    //val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
    //val sc = new SparkContext(conf)
    var inp = new File(args(0))
    var ext: String = ""
    if(args.length == 1)
    { ext = "log" } else { ext = args(1) }
    var files: List[String] = List("")
    if (inp.exists && inp.isDirectory) {
      files = getListOfFiles(inp,ext)
    }
    else if(inp.exists ) {
      files = List(inp.toString)
    }
    else
    {
      println("Enter the correct Directory/File name");
      System.exit(0);
    }
    if(files.length <=0 )
    {
      println(s"No file found with extention '.$ext'")
    }
    else{
      var out_file_name = "output_"+Calendar.getInstance().getTime.toString.replace(" ","-").replace(":","-")+".log"
      var data = getHeader(files(0))
      var writer=new PrintWriter(new File(out_file_name))
      var record_count = 0
      //var allrecords = data.mkString(",")+("\n")
      //writer.write(allrecords)
      for(eachFile <- files)
      {
        record_count += parseFile(writer,data,eachFile)
      }
      writer.close()
      println(record_count +s" processed into $out_file_name")
    }
  }
  //all func are defined here.
}
Files from the given directory are read using scala.io:
Source.fromFile(file).getLines
So my question is: can the above code (a standalone program) be executed on a distributed Spark system, and will I get the advantage of parallel processing?
OK, how about using sc to read the files? Will it then use distributed processing?
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if I edit the code at the top to make use of sc, will it then run on the distributed Spark system?
No it won't. To make use of distributed computing using Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit you will not be using the Spark cluster at all. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on Youtube for getting to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
It is very important to understand this before using Spark.
"advantage of distributed processing"
Using Spark lets you distribute processing across a cluster of multiple servers. So if you are going to move your application to a cluster later, it makes sense to develop it using the Spark model and the corresponding API.
You can also run a Spark application locally on your own machine, but in that case you won't get all the advantages Spark can provide.
Anyway, as was said before, Spark is a dedicated framework with its own libraries for development. So you have to rewrite your application using the Spark context and the Spark API, i.e. special objects like RDDs or DataFrames and their corresponding methods.
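As a rough illustration of what such a rewrite might look like for the directory-parsing job in the question, here is a minimal sketch. It is not the asker's parser: toCsv is a hypothetical stand-in for the per-line logic inside parseFile, and the input and output locations are assumed to arrive as the first two program arguments.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkParser {
  // Hypothetical stand-in for the per-line logic of the original parseFile:
  // it only turns whitespace-separated fields into comma-separated ones.
  def toCsv(line: String): String = line.trim.split("\\s+").mkString(",")

  def main(args: Array[String]): Unit = {
    // No setMaster here: the master comes from spark-submit --master,
    // so the same jar can run locally or on the cluster.
    val conf = new SparkConf().setAppName("fileParsing")
    val sc = new SparkContext(conf)
    try {
      // textFile accepts a directory or glob; the files are split into
      // partitions that the workers process in parallel.
      sc.textFile(args(0))
        .map(toCsv)
        .saveAsTextFile(args(1)) // writes one part file per partition
    } finally {
      sc.stop()
    }
  }
}
```

The parallelism comes from the RDD operations (textFile, map, saveAsTextFile), not from the surrounding Scala code. For the workers to share the load, the input path has to be readable from every node, for example on HDFS, and the output is a directory of part files rather than a single file.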