spark-scala: download a list of URLs from a particular column - scala

I have a CSV file which contains details of all the candidates who have applied for a particular position.
Sample data (notice that the resume URLs are of different file types: pdf, docx, doc):
Name  age  Resume_file
A1    20   http://resumeOfcandidateA1.pdf
A2    20   http://resumeOfcandidateA2.docx
I wish to download the contents of the resume URLs given in the 3rd column into my table.
I tried using "wget" + "pdftotext" to download the list of resumes, but that did not help: for each URL it created a separate file in my cluster (outside the table), and linking those files back to the rest of the table was not possible for lack of a unique key.
I even tried using scala.io.Source, but this required mentioning each link explicitly to download its contents, and this too was outside the table.

You can implement a Scala function responsible for downloading the content of a URL. One library you can use for this is scalaj-http (https://github.com/scalaj/scalaj-http):
import scalaj.http._

def downloadURLContent(url: String): Array[Byte] = {
  val request = Http(url)
  val response = request.asBytes
  response.body
}
Then you can use this function with an RDD or Dataset to download the content for each URL using a map transformation:
ds.map(r => downloadURLContent(r.Resume_file))
If you prefer using a DataFrame, you just need to create a udf based on downloadURLContent and use the withColumn transformation:
val downloadURLContentUDF = udf((url:String) => downloadURLContent(url))
df.withColumn("content", downloadURLContentUDF(df("Resume_file")))

Partial answer: downloaded the files to a particular location with the proper extension, naming each file after its User_id.
Pending part: extracting the text of all the files and then joining these text files with the original CSV file using User_id as the key.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import sys.process._
import java.net.URL
import java.io.File

object wikipedia {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wiki").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.textFile("E:/new_data/resume.txt")

    def fileDownloader(url: String, filename: String) = {
      new URL(url) #> new File(filename) !!
    }

    input.foreach(x => {
      // user_id is the first part of the line, the URL is the second part
      // crude sanity check: the URL must be at least 13 characters long
      if (x.split(",")(1).isDefinedAt(12)) {
        // extension of the document, taken from the end of the URL
        val ex = x.substring(x.lastIndexOf('.'))
        // replace spaces in the URL with "%20" and store the file in the
        // target directory after naming it with the user_id
        fileDownloader(x.split(",")(1).replace(" ", "%20"), "E:/new_data/resume_list/" + x.split(",")(0) + ex)
      }
    })
  }
}
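For the pending part, here is a hedged sketch, assuming the downloaded resumes have already been converted to plain-text files named <User_id>.txt under E:/new_data/resume_text/ (the directory and extension are assumptions) and reusing the sc and input values from the code above:

// (User_id, resume text) pairs, keyed by deriving the User_id from the file name
val resumeText = sc.wholeTextFiles("E:/new_data/resume_text/")
  .map { case (path, content) =>
    val fileName = path.substring(path.lastIndexOf('/') + 1)      // e.g. "A1.txt"
    val userId = fileName.substring(0, fileName.lastIndexOf('.')) // e.g. "A1"
    (userId, content)
  }

// (User_id, URL) pairs from the original CSV lines
val original = input.map { line =>
  val parts = line.split(",")
  (parts(0), parts(1))
}

// join the extracted text back to the original rows using User_id as the key
val joined = original.join(resumeText) // RDD[(User_id, (url, resumeText))]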

Related

Error: Could not find or load main class with Spark in Eclipse

I'm facing an issue while running a Spark application from Eclipse (Scala). I'm able to run plain Scala from Eclipse without any issue; the problem appears only with the Spark app.
Error: Could not find or load main class com.sidSparkScala.RatingsCounter
package com.sundogsoftware.spark

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._

/**
 * Count up how many of each star rating exists in the MovieLens
 * 100K data set.
 */
object RatingsCounter {

  /** Our main function where the action happens */
  def main(args: Array[String]) {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine, named RatingsCounter
    val sc = new SparkContext("local[*]", "RatingsCounter")

    // Load up each line of the ratings data into an RDD
    val lines = sc.textFile("../ml-100k/u.data")

    // Convert each line to a string, split it out by tabs, and extract the third field.
    // (The file format is userID, movieID, rating, timestamp)
    val ratings = lines.map(x => x.toString().split("\t")(2))

    // Count up how many times each value (rating) occurs
    val results = ratings.countByValue()

    // Sort the resulting map of (rating, count) tuples
    val sortedResults = results.toSeq.sortBy(_._1)

    // Print each result on its own line.
    sortedResults.foreach(println)
  }
}
It seems the class your run configuration is looking for is com.sidSparkScala.RatingsCounter, while the package declared at the top of the script is com.sundogsoftware.spark. All you have to do is replace package com.sundogsoftware.spark with package com.sidSparkScala so that the fully qualified class name becomes com.sidSparkScala.RatingsCounter.
Your class is not compiled properly, otherwise Eclipse could find it. Please check whether there is a compilation error.
You must change the package name to yours, not the one provided by the course.
I guess you imported this Scala file from an external source; based on the code above, the package is com.sundogsoftware.spark, so just change it to yours and it should be fine.
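To make the fix concrete, this is a sketch of what the first line of RatingsCounter.scala would become, assuming your Eclipse run configuration expects com.sidSparkScala.RatingsCounter:

// first line of RatingsCounter.scala: the package must match the run configuration
package com.sidSparkScala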

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the Parquet file using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson {
  def main(args: Array[String]): Unit = {
    // case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build() //.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read specific columns.
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection).
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
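A minimal sketch of that suggestion, using the Customer fields from the commented-out case class (the selected columns and path are illustrative; depending on the parquet-avro version, AvroReadSupport.setRequestedProjection(conf, avroSchema) is an Avro-specific alternative for setting the same projection):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.api.ReadSupport

object parquetProjection {
  def main(args: Array[String]): Unit = {
    // Avro schema (as a String) containing only the columns we want to read back
    val projection =
      """{"type":"record","name":"Customer","fields":[
        |  {"name":"key","type":"int"},
        |  {"name":"name","type":"string"}
        |]}""".stripMargin

    val conf = new Configuration()
    conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection)

    val reader = AvroParquetReader.builder[GenericRecord](new Path("data/parquet/Customer/"))
      .withConf(conf)
      .build()

    Iterator.continually(reader.read).takeWhile(_ != null).foreach(println)
    reader.close()
  }
}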

Spark: How to write org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]

I have an RDD that has the signature
org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]
In this RDD, each row has its own partition.
This ByteArrayOutputStream is zip output. I am applying some processing on the data in each partition, and I want to export the processed data from each partition as a single zip file. What is the best way to export each row of the final RDD as one file per row on HDFS?
In case you are interested in how I ended up with such an RDD:
val npyData = transformedTopData.select("tokenIDF", "topLevelId").rdd.repartition(2).mapPartitions(x => {
  val vectors = for {
    row <- x
  } yield {
    row.getAs[Vector](0)
  }
  Seq(ml2npyCSR(vectors.toSeq).zipOut)
}.iterator)
EDIT: Count works perfectly fine
scala> npyData.count()
res9: Long = 2
Spark has very little support for file system operations. You'll need to use the Hadoop FileSystem API to create individual files:
// This method is needed as the Hadoop conf object is not serializable
def createFileStream(pathStr: String) = {
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.FileSystem
  import org.apache.hadoop.fs.Path
  val hadoopconf = new Configuration()
  val fs = FileSystem.get(hadoopconf)
  val outFileStream = fs.create(new Path(pathStr))
  outFileStream
}

// Method writes each element to its own file.
// Needs a unique id along with the object for output file naming.
def writeToFile(x: (java.io.ByteArrayOutputStream, Long)): Unit = {
  val (dataStream, id) = x
  val output_dir = "/tmp/del_a/"
  val outFileStream = createFileStream(output_dir + id)
  dataStream.writeTo(outFileStream)
  outFileStream.close()
}

// zipWithIndex used for creating a unique id for each item in the RDD
npyData.zipWithIndex().foreach(writeToFile)
Reference:
Hadoop FileSystem example
ByteArrayOutputStream.writeTo(java.io.OutputStream)
I figured out that I should represent my data as a PairRDD and implement a custom FileOutputFormat. I looked into the implementation of SequenceFileOutputFormat for inspiration and managed to write my own version based on that.
My custom FileOutputFormat is available here.
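For readers without access to the linked code, here is a hedged sketch of what a PairRDD plus custom FileOutputFormat approach can look like (this is not the author's implementation; the class name, output path, and key-to-file-name convention are all assumptions):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class OneZipPerKeyOutputFormat extends FileOutputFormat[Text, BytesWritable] {
  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[Text, BytesWritable] =
    new RecordWriter[Text, BytesWritable] {
      override def write(key: Text, value: BytesWritable): Unit = {
        // one output file per key, named <key>.zip under the job's output dir
        val path = new Path(FileOutputFormat.getOutputPath(context), key.toString + ".zip")
        val out = path.getFileSystem(context.getConfiguration).create(path)
        out.write(value.getBytes, 0, value.getLength)
        out.close()
      }
      override def close(context: TaskAttemptContext): Unit = ()
    }
}

// usage sketch: key each zip by a unique id, then save with the custom format
// npyData.zipWithIndex()
//   .map { case (baos, id) => (new Text(id.toString), new BytesWritable(baos.toByteArray)) }
//   .saveAsNewAPIHadoopFile("/tmp/zips", classOf[Text], classOf[BytesWritable],
//     classOf[OneZipPerKeyOutputFormat])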

How to get file names with Spark sc.textFile?

I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all rows of all files in the directory (right?).
I now want to add a column to each row with the source file's name. How can I do that?
The other option I tried is using wholeTextFiles, but I keep getting out-of-memory exceptions.
5 servers 24 cores 24 GB (executor-core 5 executor-memory 5G)
any ideas?
You can use this code. I have tested it with Spark 1.4 and 1.5.
It gets the file name from the InputSplit and attaches it to each line via the iterator, using mapPartitionsWithInputSplit of the NewHadoopRDD:
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

val sc = new SparkContext(new SparkConf().setMaster("local"))

val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path: String = "file:///home/user/test"

val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })

linesWithFileNames.foreach(println)
I think it's pretty late to answer this question but I found an easy way to do what you were looking for:
Step 0: from pyspark.sql import functions as F
Step 1: createDataFrame using the RDD as usual. Let's say df
Step 2: Use input_file_name()
df.withColumn("INPUT_FILE", F.input_file_name())
This will add a column to your DataFrame with the source file name.
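Since the rest of this thread is in Scala, here is the same idea with the Scala API, as a minimal sketch assuming Spark 2.x with SparkSession (input_file_name is part of org.apache.spark.sql.functions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("fileNames").master("local[*]").getOrCreate()

// one row per line of every file under the directory; the text column is "value"
val df = spark.read.text("/mySource/dir1/*")

// add the fully qualified source file name to every row
val withFile = df.withColumn("INPUT_FILE", input_file_name())
withFile.show(false)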

How to get the file name from DStream of Spark StreamingContext?

Even after lots of trying and googling, I could not get the file name when using the streaming context. I can use wholeTextFiles of SparkContext, but then I would have to re-implement the streaming context's functionality.
Note: the file name (error events as a JSON file) is the input to the system, so retaining the name in the output is extremely important so that any event can be traced during an audit.
Note: the file name is of the format below. The SerialNumber part can be extracted from the event JSON, but the time is stored as milliseconds and is difficult to convert reliably to the format below, and there is no way to find the counter.
...
Each file contains just one line, a complex JSON string. Using the streaming context I am able to create an RDD[String], where each string is the JSON from a single file. Does anyone have a solution/workaround for associating the strings with their respective file names?
val sc = new SparkContext("local[*]", "test")
val ssc = new StreamingContext(sc, Seconds(4))
val dStream = ssc.textFileStream(pathOfDirToStream)
dStream.foreachRDD { eventsRdd => /* How to get the file name */ }
You could do this using fileStream and creating your own FileInputFormat, similar to TextInputFormat, which uses the InputSplit to provide the file name as a key. Then you can use fileStream to get a DStream of (filename, line) pairs.
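A hedged sketch of that approach, reusing ssc and pathOfDirToStream from the question (the class name and usage are illustrative; it wraps Hadoop's LineRecordReader and emits the file path as the key of every record):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}

class FileNameInputFormat extends FileInputFormat[Text, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[Text, Text] =
    new RecordReader[Text, Text] {
      private val lineReader = new LineRecordReader()
      private val fileName = new Text(split.asInstanceOf[FileSplit].getPath.toString)

      override def initialize(s: InputSplit, ctx: TaskAttemptContext): Unit = lineReader.initialize(s, ctx)
      override def nextKeyValue(): Boolean = lineReader.nextKeyValue()
      override def getCurrentKey: Text = fileName                     // file path as the key
      override def getCurrentValue: Text = lineReader.getCurrentValue // the line itself
      override def getProgress: Float = lineReader.getProgress
      override def close(): Unit = lineReader.close()
    }
}

// usage: each record is (file name, JSON line)
val fileAndLine = ssc.fileStream[Text, Text, FileNameInputFormat](pathOfDirToStream)
  .map { case (file, line) => (file.toString, line.toString) }
fileAndLine.foreachRDD(rdd => rdd.foreach(println))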
To get file names from a DStream, I created a Java function which fetches the file path using the Java Spark API, and then called that function from Spark Streaming (which is written in Scala).
Here is a Java code sample:
import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.NewHadoopPartition;
import org.apache.spark.rdd.UnionPartition;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.Partition;

public class GetFileNameFromStream implements Serializable {

    public String getFileName(Partition partition) {
        UnionPartition upp = (UnionPartition) partition;
        NewHadoopPartition npp = (NewHadoopPartition) upp.parentPartition();
        String filePath = npp.serializableHadoopSplit().value().toString();
        return filePath;
    }
}
In Spark Streaming (Scala), I then called the above Java function.
Here is a code sample:
val obj = new GetFileNameFromStream

dstream.transform(rdd => {
  val lenPartition = rdd.partitions.length
  val listPartitions = rdd.partitions
  for (part <- listPartitions) {
    val filePath = obj.getFileName(part)
    // use filePath here, e.g. log it or pair it with the partition's data
  }
  rdd
})