spark scala dataframes - create objects from the attributes in a json file

I have a json file of the format
{"x1": 2, "y1": 6, "x2":3, "y2":7}
I have a Scala class
class Test(x: Int, y: Int)
Using Spark, I am trying to read this file and create two Test objects for each line of the JSON file. For example,
{"x1": 2, "y1": 6, "x2": 3, "y2": 7} should create
test1 = new Test(2, 6) and
test2 = new Test(3, 7)
Then, for each line of the JSON file, I want to call a function that takes two Test objects as parameters, for example callFunction(test1, test2).
How do I do this with Spark? I see methods that convert the rows of a JSON file into a list of objects, but no way to create multiple objects from the attributes of a single JSON row.
val conf = new SparkConf()
.setAppName("Example")
.setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val coordinates = sqlContext.read.json("c:/testfile.json")
//NOT SURE HOW TO DO THE FOLLOWING
//test1 = new Test(attr1 of json file, attr2 of json file)
//test2 = new Test(attr3 of json file, attr4 of json file)
//callFunction(test1,test2)
//collect the result of callFunction
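One way this could be approached (a minimal sketch, assuming Test is defined as a case class and callFunction is already in scope; Spark infers the JSON integers as Long, hence the toInt conversions):
case class Test(x: Int, y: Int)

val results = coordinates
  .select("x1", "y1", "x2", "y2")
  .map { row =>
    // build both objects from the four attributes of a single JSON row
    val test1 = Test(row.getLong(0).toInt, row.getLong(1).toInt)
    val test2 = Test(row.getLong(2).toInt, row.getLong(3).toInt)
    callFunction(test1, test2)
  }
  .collect()
In Spark 1.x (SQLContext as above), map on a DataFrame yields an RDD, so collect() returns the results of callFunction for every line of the file.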

Related

How to store values into a dataframe from a List using Scala for handling nested JSON data

I have the below code, where I am pulling data from an API and storing it into a JSON file; further, I will be loading it into an Oracle table. Also, the data value of the ID column is the column name under VelocityEntries. I am able to print the data in completedEntries.values, but I need help to put it into one DataFrame and add it to the embedded_df.
import java.io.{File, FileWriter}
import org.apache.spark.sql.functions.{col, explode}

// inputStream is already a String, so no extra .mkString is needed when writing it
val inputStream = scala.io.Source.fromInputStream(connection.getInputStream).mkString
val fileWriter1 = new FileWriter(new File(filename))
fileWriter1.write(inputStream)
fileWriter1.close()

val json_df = spark.read.option("multiLine", true).json(filename)
val embedded_df = json_df.select(explode(col("sprints")) as "x").select("x.*")
val list_df = json_df.select("velocityStatEntries.*").columns.toList
for (i <- list_df) {
  val completed_df = json_df.select(s"velocityStatEntries.$i.completed.value")
  completed_df.show()
}
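One hedged way to gather these into a single DataFrame instead of just showing each one (a sketch; the id and completed_value column names are illustrative, and it assumes the value fields share a common type so the frames can be unioned):
import org.apache.spark.sql.functions.{col, lit}

// one small DataFrame per entry, tagged with the entry id
val completedDfs = list_df.map { id =>
  json_df.select(
    lit(id).as("id"),
    col(s"velocityStatEntries.$id.completed.value").as("completed_value")
  )
}

// union the per-id frames into a single DataFrame, which can then be
// combined with embedded_df as needed
val completed_df = completedDfs.reduce(_ union _)
completed_df.show()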

How to convert spark response to JSON object

val conf = new SparkConf().setAppName("test").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
input.select("email", "first_name").where("email=='donny54#yahoo.com'").show()
I am getting the following response.
How can I get the response as a JSON object?
You can write it to a JSON file: https://www.tutorialkart.com/apache-spark/spark-write-dataset-to-json-file-example/
Or, if you prefer to show it as a Dataset of JSON strings, use the toJSON function:
input
.select("email", "first_name")
.where("email=='donny54#yahoo.com'")
.toJSON
.show()
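For the first option, a minimal sketch of writing the filtered result out as JSON (the output path is an assumption):
input
  .select("email", "first_name")
  .where("email=='donny54#yahoo.com'")
  .write
  .json("output/donny54")
Or, to bring the JSON strings back to the driver as an Array[String]:
val jsonStrings = input
  .select("email", "first_name")
  .where("email=='donny54#yahoo.com'")
  .toJSON
  .collect()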

spark-scala: download a list of URLs from a particular column

I have a CSV file which contains details of all the candidates who have applied for a particular position.
Sample data (notice that the resume URLs are of different file types: pdf, docx, doc):
Name age Resume_file
A1 20 http://resumeOfcandidateA1.pdf
A2 20 http://resumeOfcandidateA2.docx
I wish to download the contents of the resume URLs given in the 3rd column into my table.
I tried using "wget" + "pdftotext" commands to download the list of resumes, but that did not help, as for each URL it would create a different file in my cluster (outside the table), and linking it to the rest of the table was not possible due to the lack of a unique criterion.
I even tried using scala.io.Source, but this required mentioning the link explicitly each time to download the contents, and this too was outside the table.
You can implement a Scala function responsible for downloading the content of a URL. An example library that you can use for this is scalaj (https://github.com/scalaj/scalaj-http).
import scalaj.http._
def downloadURLContent(url: String): Array[Byte] = {
val request = Http(url)
val response = request.asBytes
response.body
}
Then you can use this function with an RDD or Dataset to download the content for each URL using the map transformation:
ds.map(r => downloadURLContent(r.Resume_file))
If you prefer using a DataFrame, you just need to create a UDF based on the downloadURLContent function and use the withColumn transformation:
val downloadURLContentUDF = udf((url:String) => downloadURLContent(url))
df.withColumn("content", downloadURLContentUDF(df("Resume_file")))
Partial answer: downloaded the files to a particular location with the proper extension, using the user_id as the file name.
Pending part: extracting the text of all the files and then joining these text files with the original CSV file using user_id as the key (see the sketch after the code below).
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import sys.process._
import java.net.URL
import java.io.File

object wikipedia {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wiki").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.textFile("E:/new_data/resume.txt")

    def fileDownloader(url: String, filename: String) = {
      new URL(url) #> new File(filename) !!
    }

    input.foreach(x => {
      // user_id is the first part of the line; the URL is the second part
      if (x.split(",")(1).isDefinedAt(12)) {
        // get the extension of the document
        val ex = x.substring(x.lastIndexOf('.'))
        // remove spaces from the URL (replace with "%20") and store the file at a
        // particular location, using the user_id as the file name
        fileDownloader(x.split(",")(1).replace(" ", "%20"), "E:/new_data/resume_list/" + x.split(",")(0) + ex)
      }
    })
  }
}
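A sketch of the pending join step (assumptions: the downloaded resumes have already been converted to plain-text files named <user_id>.txt under a hypothetical E:/new_data/resume_text/ folder, and each line of resume.txt is user_id,resume_url as in the code above):
// read every extracted text file together with its path
val resumeTexts = sc.wholeTextFiles("E:/new_data/resume_text/*.txt")
  .map { case (path, text) =>
    // derive the user_id from the file name, e.g. ".../A1.txt" -> "A1"
    val userId = path.substring(path.lastIndexOf('/') + 1).stripSuffix(".txt")
    (userId, text)
  }

// (user_id, resume_url) pairs from the original CSV
val candidates = sc.textFile("E:/new_data/resume.txt")
  .map { line =>
    val parts = line.split(",")
    (parts(0), parts(1))
  }

// join the extracted text back to the original rows on user_id
val joined = candidates.join(resumeTexts) // RDD[(user_id, (resume_url, resume_text))]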

Writing to a file in Apache Spark

I am writing Scala code that requires me to write to a file in HDFS.
When I use FileWriter.write locally, it works. The same thing does not work on HDFS.
Upon checking, I found that there are the following options for writing in Apache Spark:
RDD.saveAsTextFile and DataFrame.write.format.
My question is: what if I just want to write an int or string to a file in Apache Spark?
Follow up:
I need to write to an output file a header, DataFrame contents and then append some string.
Does sc.parallelize(Seq(<String>)) help?
Create an RDD with your data (int/string) using Seq (see parallelized collections for details):
sc.parallelize(Seq(5)) //for writing int (5)
sc.parallelize(Seq("Test String")) // for writing string
val conf = new SparkConf().setAppName("Writing Int to File").setMaster("local")
val sc = new SparkContext(conf)
val intRdd= sc.parallelize(Seq(5))
intRdd.saveAsTextFile("out\\int\\test")
val conf = new SparkConf().setAppName("Writing string to File").setMaster("local")
val sc = new SparkContext(conf)
val stringRdd = sc.parallelize(Seq("Test String"))
stringRdd.saveAsTextFile("out\\string\\test")
Follow-up example (tested as below):
val conf = new SparkConf().setAppName("Total Countries having Icon").setMaster("local")
val sc = new SparkContext(conf)
val headerRDD= sc.parallelize(Seq("HEADER"))
//Replace BODY part with your DF
val bodyRDD= sc.parallelize(Seq("BODY"))
val footerRDD = sc.parallelize(Seq("FOOTER"))
//combine all rdds to final
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
//finalRDD.foreach(line => println(line))
//output to one file
finalRDD.coalesce(1, true).saveAsTextFile("test")
output:
HEADER
BODY
FOOTER
More examples here.
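Regarding the "Replace BODY part with your DF" comment above, one hedged way to do that (assuming df is the DataFrame whose rows should appear between the header and the footer) is to turn the DataFrame into an RDD of strings first:
// one line of text per row; adjust the separator/format as needed
val bodyRDD = df.rdd.map(row => row.mkString(","))

val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
finalRDD.coalesce(1, true).saveAsTextFile("test")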

Scala/Spark save texts in program without save to files

My code saves the value s into result.txt
and then reads the file again.
I want to know whether there is a method so my code can run directly, without saving to another file and reading it back.
I used val textFile = sc.parallelize(s),
but the next part then fails with the error: value contains is not a member of Char.
import java.io._
val s = (R.capture("lines"))
val resultPath = "/home/user"
val pw = new PrintWriter(new File(f"$resultPath%s/result.txt"))
pw.write(s)
pw.close
//val textFile = sc.textFile(f"$resultPath%s/result.txt")  // old method: save into a file and read it back
val textFile = sc.parallelize(s)
val rows = textFile.map { line =>
  !(line contains "[, 1]")
  val fields = line.split("[^\\d.]+")
  ((fields(0), fields(1).toDouble))
}
I would have to say the problem you are having is that the variable s is a String, and you are calling parallelize on a String instead of a collection. So when you run the map function, it iterates over each character in the String.
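A minimal sketch of the fix implied above (assumptions: s holds multi-line text captured from R, and the discarded !(line contains "[, 1]") expression was meant to be a filter):
// split the captured text into lines and parallelize the resulting collection
val textFile = sc.parallelize(s.split("\n").toSeq)

val rows = textFile
  .filter(line => !(line contains "[, 1]"))
  .map { line =>
    val fields = line.split("[^\\d.]+")
    (fields(0), fields(1).toDouble)
  }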