Read JSON files from a list of HTTPS locations using Spark - Scala

I'm trying to read JSON files over HTTP using Spark. This isn't straightforward, since the source is not HDFS or any other location Spark can read from directly. The URLs are HTTPS and require a token and a bunch of headers to retrieve the response successfully. Is there a way to achieve this? The response looks like the following, which can easily be converted to rows in a DataFrame:
{
  "code": "403010",
  "message": "message 1"
}
{
  "code": "403010",
  "message": "message 1"
}
{
  "code": "403010",
  "message": "message 1"
}
The response looks odd because there are multiple root-level JSON objects, but that is the actual response from the API (effectively line-delimited JSON).

An answer is provided at this URL (for others): Get Results From URL using Scala Spark
import org.apache.spark.sql.{DataFrame, SparkSession}

// assumes an ambient SparkSession named spark (e.g., in spark-shell)
def GetUrlContentJson(url: String): DataFrame = {
  val result = scala.io.Source.fromURL(url).mkString
  // only single-line inputs are accepted, so drop the trailing line end
  // (tested with a complex JSON document and it worked)
  val jsonResponseOneLine = result.stripLineEnd
  // spark.read.json needs an RDD (or Dataset) of strings, not a plain String
  val jsonRdd = spark.sparkContext.parallelize(jsonResponseOneLine :: Nil)
  spark.read.json(jsonRdd)
}
val response = GetUrlContentJson(url)
response.show
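The code above does not cover the token and extra headers from the question, and it only handles single-line JSON. Below is a minimal sketch of both pieces; the endpoint, the token source, and the Authorization/Accept header names are assumptions, and the naive split assumes "}{" never occurs inside a string value:

import java.net.{HttpURLConnection, URL}
import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("httpJson").getOrCreate()
import spark.implicits._

val url = "https://api.example.com/data"       // placeholder endpoint
val token = sys.env.getOrElse("API_TOKEN", "") // placeholder token source

// Fetch the raw body, sending the token and whatever headers the API needs.
// The header names here are assumptions; substitute the real ones.
def getUrlContentWithHeaders(url: String, token: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  conn.setRequestProperty("Authorization", s"Bearer $token")
  conn.setRequestProperty("Accept", "application/json")
  try Source.fromInputStream(conn.getInputStream).mkString
  finally conn.disconnect()
}

// The response is several pretty-printed objects back to back, so collapse
// the whitespace and put each object on its own line (JSON Lines), which
// spark.read.json reads one record per line.
val raw = getUrlContentWithHeaders(url, token)
val jsonLines = raw.replaceAll("\\s*\\n\\s*", "").replace("}{", "}\n{")
val df = spark.read.json(jsonLines.split("\n").toSeq.toDS())
df.show()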

Related

Simple Spark Scala Post to External Rest API Example

I'm new to Spark and Scala. I just want to read a JSON file and post its content to an external REST API server. Can anyone provide a simple example or some guidelines?
You probably do not want to use Spark for this. Spark is an analytical engine for processing large amounts of data; unless you're reading massive amounts of JSON from HDFS, this task is better suited to plain Scala. Look up how to read a JSON file in Scala and how to send that content to a server in Scala.
Here are some great places to get started:
Scala Read JSON file
https://alvinalexander.com/scala/how-to-send-json-post-data-to-restful-url-in-scala
The following is from the above URL:
import java.io._
import org.apache.commons._
import org.apache.http._
import org.apache.http.client._
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient
import java.util.ArrayList
import org.apache.http.message.BasicNameValuePair
import org.apache.http.client.entity.UrlEncodedFormEntity
import com.google.gson.Gson

case class Person(firstName: String, lastName: String, age: Int)

object HttpJsonPostTest extends App {

  // create our object as a json string
  val spock = new Person("Leonard", "Nimoy", 82)
  val spockAsJson = new Gson().toJson(spock)

  // add name value pairs to a post object
  val post = new HttpPost("http://localhost:8080/posttest")
  val nameValuePairs = new ArrayList[NameValuePair]()
  nameValuePairs.add(new BasicNameValuePair("JSON", spockAsJson))
  post.setEntity(new UrlEncodedFormEntity(nameValuePairs))

  // send the post request
  val client = new DefaultHttpClient
  val response = client.execute(post)
  println("--- HEADERS ---")
  response.getAllHeaders.foreach(arg => println(arg))
}
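The question, though, was about reading a JSON file and posting its content. Here is a minimal sketch of that variant with the same Apache HttpClient library, sending the file as a raw JSON body instead of a form field; the file name and endpoint are placeholders:

import scala.io.Source
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.{ContentType, StringEntity}
import org.apache.http.impl.client.HttpClients

object PostJsonFile extends App {
  // read the whole JSON file from disk (placeholder path)
  val jsonBody = Source.fromFile("input.json").mkString

  // post it as a raw application/json body
  val post = new HttpPost("http://localhost:8080/posttest")
  post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON))

  val client = HttpClients.createDefault()
  val response = client.execute(post)
  println(response.getStatusLine)
  client.close()
}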

Parse a JSON file on S3 with Play JSON using Scala

I want to access a JSON file on S3 using the Play JSON framework:
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.s3.{AmazonS3Client, AmazonS3URI}
import com.amazonaws.services.s3.model.S3Object
import play.api.libs.json._

val creds: DefaultAWSCredentialsProviderChain = new DefaultAWSCredentialsProviderChain
val s3Client = new AmazonS3Client(creds)
val uri: AmazonS3URI = new AmazonS3URI(conf_file)
val s3Object: S3Object = s3Client.getObject(uri.getBucket, uri.getKey)
val json = Json.parse(s3Object.getObjectContent)
val mylist = (json \ "mydata").get.as[List[JsValue]]
But this line gives an error:
val mylist = (json \ "mydata").get.as[List[JsValue]]
with the message:
no such element "mydata"
Can anyone tell me how to access a JSON file and read its contents using Play JSON in Scala? I am able to access the same file from my local machine, as well as fetch the contents of "mydata" from within the JSON.
Did you try printing the object first and checking that it is properly formatted JSON? Everything seems to work fine for me:
val json: JsValue = Json.parse("""{
  "mydata": [
    {"first": "aa"},
    {"second": "bb"},
    {"third": "cc"}
  ]
}""")
Try implementing something like this, which falls back to an empty sequence instead of throwing when the key is missing:
(json \ "mydata").asOpt[Seq[JsValue]].getOrElse(Seq.empty)
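One likely culprit is that getObjectContent returns a stream that can only be consumed once, so printing it and then parsing it both read from the same stream. A small sketch, reusing the s3Object from the question, that buffers the content into a string first and then extracts "mydata" safely:

import scala.io.Source
import play.api.libs.json._

// buffer the stream into a string so it can be both inspected and parsed
val content = Source.fromInputStream(s3Object.getObjectContent).mkString
println(content) // verify the payload really is the expected JSON

Json.parse(content) \ "mydata" match {
  case JsDefined(JsArray(items)) => items.foreach(println)
  case _                         => println("no \"mydata\" array in this document")
}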

spark-scala: download a list of URLs from a particular column

I have a CSV file which contains details of all the candidates who have applied for a particular position.
Sample data (notice that the resume URLs are of different file types: pdf, docx, doc):
Name  age  Resume_file
A1    20   http://resumeOfcandidateA1.pdf
A2    20   http://resumeOfcandidateA2.docx
I wish to download the contents of the resume URLs given in the 3rd column into my table.
I tried using "wget" + "pdftotext" to download the resumes, but that did not help: for each URL it creates a separate file in my cluster (outside the table), and linking those files back to the rest of the table was not possible for lack of a unique key.
I also tried scala.io.Source, but this required specifying each link explicitly, and the downloaded content again ended up outside the table.
You can implement a Scala function responsible for downloading the content of a URL. An example library you can use for this is scalaj-http (https://github.com/scalaj/scalaj-http).
import scalaj.http._

def downloadURLContent(url: String): Array[Byte] = {
  val request = Http(url)
  val response = request.asBytes
  response.body
}
Then you can use this function with an RDD or Dataset to download the content for each URL using a map transformation:
ds.map(r => downloadURLContent(r.Resume_file))
If you prefer using a DataFrame, you just need to create a udf based on the downloadURLContent function and use the withColumn transformation:
import org.apache.spark.sql.functions.udf

val downloadURLContentUDF = udf((url: String) => downloadURLContent(url))
df.withColumn("content", downloadURLContentUDF(df("Resume_file")))
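Putting the pieces together, here is a minimal end-to-end sketch; the CSV path and the column names (Name, age, Resume_file) are assumptions taken from the sample data. Because the downloaded bytes land in a new column, each resume stays attached to its row, which addresses the linking problem from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import scalaj.http._

val spark = SparkSession.builder().appName("resumes").getOrCreate()

// hypothetical CSV with a header row: Name,age,Resume_file
val df = spark.read.option("header", "true").csv("resumes.csv")

def downloadURLContent(url: String): Array[Byte] =
  Http(url).asBytes.body

val downloadURLContentUDF = udf((url: String) => downloadURLContent(url))

// the bytes become a regular column, keyed to the row they belong to
val withContent = df.withColumn("content", downloadURLContentUDF(df("Resume_file")))
withContent.show()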
Partial answer: I downloaded each file to a particular location with the proper extension, using the user_id as the file name.
Pending part: extracting the text of all the files and then joining those text files with the original CSV, using user_id as the key (a join sketch follows the code below).
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import sys.process._
import java.net.URL
import java.io.File

object wikipedia {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wiki").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.textFile("E:/new_data/resume.txt")

    def fileDownloader(url: String, filename: String) = {
      new URL(url) #> new File(filename) !!
    }

    input.foreach { x =>
      // user_id is the first field, the URL is the second field;
      // skip rows whose URL field is too short to be a real link
      if (x.split(",")(1).isDefinedAt(12)) {
        // extension of the document, taken from the end of the URL
        val ex = x.substring(x.lastIndexOf('.'))
        // replace spaces in the URL with "%20" and store the file
        // under the user_id with the right extension
        fileDownloader(x.split(",")(1).replace(" ", "%20"), "E:/new_data/resume_list/" + x.split(",")(0) + ex)
      }
    }
  }
}
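As for the pending join, here is a rough sketch of the plumbing, assuming the downloaded files have already been converted to plain text (the text extraction itself, e.g. with pdftotext, is not shown). The paths and column names are carried over from the code above, and it relies on the sqlContext.implicits._ import already there:

// read each downloaded file back, recover the user_id from the file name,
// and join onto the original CSV by user_id
val resumes = sc.wholeTextFiles("E:/new_data/resume_list/*")
  .map { case (path, text) =>
    // the file was stored as "<user_id>.<ext>"
    val name = path.substring(path.lastIndexOf('/') + 1)
    (name.substring(0, name.lastIndexOf('.')), text)
  }
  .toDF("user_id", "resume_text")

val original = sc.textFile("E:/new_data/resume.txt")
  .map(_.split(","))
  .map(a => (a(0), a(1)))
  .toDF("user_id", "resume_url")

val joined = original.join(resumes, "user_id")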

How can I prettyprint a JSON Dataframe in spark with Scala?

I have a DataFrame that I want to write to a JSON file as valid JSON.
My current code looks like:
val df: DataFrame = myFun(...)
df.toJSON.saveAsTextFile("myFile.json")
The format of the output is:
{}{}{}
How can I get the file contents organized as valid JSON, like this?
[{},{},{}]
My workaround using Spray JSON:
import org.apache.spark.sql.DataFrame
import spray.json._

def apply(df: DataFrame): Option[String] = {
  val collectedData = df.toJSON.coalesce(1).collect().mkString("\n")
  // turn the newline-separated objects into a comma-separated JSON array
  val json = "[" + ("}\n".r replaceAllIn (collectedData, "},\n")) + "]"
  val pretty = json.parseJson.prettyPrint
  Some(s"$pretty\n")
}
Ugly and inefficient, but it does what I want, provided the final result isn't big-data huge; in that case I wouldn't want a single JSON file anyway.
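For what it's worth, the regex step can be avoided: toJSON already yields one JSON object per record, so collecting and joining with commas inside brackets produces a valid JSON array directly. Same caveat about collecting everything to the driver:

import spray.json._

// Array[String] of JSON objects -> "[{...},{...},{...}]"
val jsonArray = df.toJSON.collect().mkString("[", ",", "]")
val pretty = jsonArray.parseJson.prettyPrint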
I am using this (Python):
import json
from bson import json_util
from bson.json_util import dumps
with open('myJson.json', 'w') as outfile:
json.dump(myDF, outfile)
I am sure you will find an equivalent in Scala.

twitterStream not found

I'm trying to compile my first Scala program, and I'm using twitterStream to get tweets. Here is a snippet of my code:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._

object Tutorial {
  def main(args: Array[String]) {
    // Location of the Spark directory
    val sparkHome = "/home/shaza90/spark-1.1.0"
    // URL of the Spark cluster
    val sparkUrl = TutorialHelper.getSparkUrl()
    // Location of the required JAR files
    val jarFile = "target/scala-2.10/tutorial_2.10-0.1-SNAPSHOT.jar"
    // HDFS directory for checkpointing
    val checkpointDir = TutorialHelper.getHdfsUrl() + "/checkpoint/"
    // Configure Twitter credentials using twitter.txt
    TutorialHelper.configureTwitterCredentials()

    val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
    val tweets = ssc.twitterStream()
    val statuses = tweets.map(status => status.getText())
    statuses.print()
    ssc.checkpoint(checkpointDir)
    ssc.start()
  }
}
When compiling I'm getting this error message:
value twitterStream is not a member of org.apache.spark.streaming.StreamingContext
Do you know if I'm missing any library or dependency?
In this case you want a stream of tweets. We know that Spark provides streams. Now, let's check whether Spark itself provides something for interacting with Twitter specifically.
Open the Spark API docs: http://spark.apache.org/docs/1.2.0/api/scala/index.html#package
Now search for "twitter" and bingo... there is something called TwitterUtils in the package org.apache.spark.streaming. Since it is called TwitterUtils and lives in org.apache.spark.streaming, I'd expect it to provide helpers for creating streams from the Twitter API.
Now let's click through to TwitterUtils and go to: http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream
And yes, it has a method with the following signature:
def createStream(
    ssc: StreamingContext,
    twitterAuth: Option[Authorization],
    filters: Seq[String] = Nil,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[Status]
It returns a ReceiverInputDStream[Status], where Status is twitter4j.Status.
The parameters are explained further:
ssc: the StreamingContext object
twitterAuth: Twitter4J authentication, or None to use Twitter4J's default OAuth authorization; this uses the system properties twitter4j.oauth.consumerKey, twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and twitter4j.oauth.accessTokenSecret
filters: set of filter strings to get only those tweets that match them
storageLevel: storage level to use for storing the received objects
See, API docs are simple. I believe you should now be a little more motivated to read them.
This also means you need to look a little into the twitter4j documentation too (at least the getting-started part).
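To make that concrete, here is roughly how the snippet from the question would look rewritten against TwitterUtils; a sketch only, with the rest of the setup unchanged:

import org.apache.spark.streaming.twitter.TwitterUtils

// createStream replaces the removed ssc.twitterStream(); passing None makes
// Twitter4J fall back to the twitter4j.oauth.* system properties set by
// configureTwitterCredentials()
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(_.getText)
statuses.print()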
NOTE: This answer was written specifically to explain "why not to shy away from API docs?", and was written after careful thought. So please do not edit unless your edit makes some significant contribution.