I need to fetch a date from each row of a file.
Below is my Spark program:
import org.apache.spark.sql.SparkSession
import scala.xml.XML
import java.text.SimpleDateFormat

object Active6Month {
  def main(args: Array[String]) {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss.SSS")
    val format1 = new SimpleDateFormat("yyyy-MM")
    val spark = SparkSession.builder.appName("Active6Months").master("local").getOrCreate()
    val data = spark.read.textFile("D:\\BGH\\StackOverFlow\\Posts.xml").rdd

    val date = data.filter { line =>
      line.toString().trim().startsWith("<row")
    }.filter { line =>
      line.contains("PostTypeId=\"1\"")
    }.map { line =>
      val xml = XML.loadString(line)
      val closedDate = format1.format(format.parse(xml.attribute("ClosedDate").toString())).toString()
      (closedDate, 1)
    }.reduceByKey(_ + _)

    date.foreach(println)
    spark.stop
  }
}
And I am getting this error:
java.text.ParseException: Unparseable date: "Some(2014-05-14T14:40:25.950)"
The format of the date in the file is fine, e.g.:
CreationDate="2014-05-13T23:58:30.457"
But in the error it shows the string "Some" attached to it.
And my other question is why the same thing works in the code below:
val date = data.filter { line =>
  line.toString().trim().startsWith("<row")
}.filter { line =>
  line.contains("PostTypeId=\"1\"")
}.flatMap { line =>
  val xml = XML.loadString(line)
  xml.attribute("ClosedDate")
}.map { line =>
  (format1.format(format.parse(line.toString())).toString(), 1)
}.reduceByKey(_ + _)
My guess is that xml.attribute("ClosedDate").toString() is actually returning a string containing Some attached to it. Have you debugged that to make sure?
Maybe you shouldn't use toString(); instead, get the attribute value using the proper method.
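A minimal sketch of that approach, assuming the scala.xml API, where Node.attribute returns an Option[Seq[Node]] (it is the Option's toString that renders the "Some(...)" wrapper):

val xml = XML.loadString(line)
// attribute(...) returns Option[Seq[Node]]; .map(_.text) extracts the raw attribute text
val closedDate: Option[String] = xml.attribute("ClosedDate").map(_.text)
// or use the attribute projection, which returns "" when the attribute is missing
val closedDateText: String = (xml \ "@ClosedDate").text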
Or you can do it the "ugly" way and include "Some" in the pattern:
val format = new SimpleDateFormat("'Some('yyyy-MM-dd'T'hh:mm:ss.SSS')'")
Your second approach works because xml.attribute("ClosedDate") returns an Option (Some(value) when the attribute is present, None when it is not), and calling toString() on that wrapper renders the "Some(...)" text around the value. When you use flatMap on it instead, the Option is flattened away: a Some contributes its unwrapped value and a None contributes nothing, so the following map stage sees the bare attribute value without the "Some" part (and rows with no ClosedDate are silently dropped instead of breaking the parse).
Related: how to print the date of a file in Scala is explained here.
My question is how I can get a variable containing this information which can be returned as a column in a DataFrame. None of the conversions I would expect to be allowed actually are allowed.
My code (using Scala 2.11):
import org.apache.spark.sql.functions._
import java.nio.file.{Files, Paths} // Needed for file time
import java.nio.file.attribute.BasicFileAttributes
import java.util.Date

def GetFileTimeFunc(pathStr: String): String = {
  // From: https://stackoverflow.com/questions/47453193/how-to-get-creation-date-of-a-file-using-scala
  val FileTime = Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime
  val JavaDate = Date.from(FileTime.toInstant)
  return JavaDate.toString()
}

@transient val GetFileTime = udf(GetFileTimeFunc _)

val filePath = "dbfs:/mnt/myData/" // location of data
val file_df = dbutils.fs.ls(filePath).toDF // Output columns are $"path", $"name", and $"size"
  .withColumn("FileTimeCreated", GetFileTime($"path"))

display(file_df) //.select("name", "size"))
Output:
SparkException: Failed to execute user defined function($anonfun$2: (string) => string)
For some reason, Instant is not allowed as a column type, so I cannot use it as a return type. The same goes for FileTime, JavaDate, etc.
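As an aside on the return-type question (this is only a sketch of Spark's UDF type mapping, not necessarily a fix for the exception above): java.time.Instant and java.nio FileTime have no Spark SQL mapping in Spark 2.x, but java.sql.Timestamp maps to TimestampType, so a hypothetical variant like this would at least be a legal return type:

import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributes
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

// Hypothetical variant: return java.sql.Timestamp, which Spark maps to TimestampType
def getFileTimeFunc(pathStr: String): Timestamp = {
  val fileTime = Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime
  Timestamp.from(fileTime.toInstant)
}

val getFileTime = udf(getFileTimeFunc _)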
I have a Dataset of an Event case class, and I would like to save the JSON string element inside it into a file on S3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz
So, for example, the Event case class looks like this:
case class Event(
hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
json: String // The json line that represents this particular event.
... // Other properties used in earlier transformations.
)
Is there a way to save from the Dataset so that the events that belong to a particular hour are written into a file on S3?
Calling partitionBy on the DataFrameWriter is the closest I can get, but the file path isn't exactly what I want.
You can iterate over each item and write it to a file in S3. Doing this with Spark can be efficient because the writes can run in parallel on the executors (note that the snippet below collects everything to the driver first and writes serially; see the parallel variant sketched after the imports).
This code is working for me:
val tempDS = eventsDS.rdd.collect.map(x => saveJSONtoS3(x.hourPath, x.json))

def saveJSONtoS3(path: String, jsonString: String): Unit = {
  val bucketName = path.substring(0, path.indexOf('/'))
  val file = path.substring(bucketName.length() + 1)
  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val amazonS3Client = new AmazonS3Client(creds)
  val meta = new ObjectMetadata()
  amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata
You also need to include the aws-java-sdk library.
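To actually run the writes in parallel on the executors, you can push the iteration into foreachPartition so each partition writes its own events with a single client. This is only a sketch under the same assumptions as above (AWS SDK v1 on the classpath, AWS_ACCESS_KEY and AWS_SECRET_KEY available to the executor code):

import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

eventsDS.foreachPartition { events =>
  // One client per partition, created on the executor
  val amazonS3Client = new AmazonS3Client(new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY))
  events.foreach { e =>
    val bucketName = e.hourPath.substring(0, e.hourPath.indexOf('/'))
    val key = e.hourPath.substring(bucketName.length() + 1)
    amazonS3Client.putObject(bucketName, key, new ByteArrayInputStream(e.json.getBytes), new ObjectMetadata())
  }
}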
I'm trying to map strings to doubles from an ArrayBuffer that I parsed with Play Framework, but I keep getting the following error:
Exception in thread "main" java.lang.NumberFormatException: For input string: ""0.04245800""
I'm not sure why it's doing this; I'm new to Scala, coming from a Python background.
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import play.api.libs.json._

object main extends App {
  val url = "https://api.binance.com/api/v1/aggTrades?symbol=ETHBTC"
  val httpClient = new DefaultHttpClient()
  val httpResponse = httpClient.execute(new HttpGet(url))
  val entity = httpResponse.getEntity()
  val content = ""

  if (entity != null) {
    val inputStream = entity.getContent()
    val result = io.Source.fromInputStream(inputStream).getLines.mkString
    inputStream.close
    println("REST API: " + url)

    val json: JsValue = Json.parse(result)
    val prices = json \\ "p"
    println(prices.map(_.toString()).map(_.toDouble))
  }
}
If you know for sure your list contains only strings, you can cast them like this and use the underlying value to get the Double from:
println(prices.map(_.as[JsString].value.toDouble))
A JsString is not a String, so you cannot call toDouble on it directly; calling toString on it serializes it back to JSON, quotes included, which is exactly the ""0.04245800"" in your NumberFormatException.
Just for completeness: if you are not certain your list contains only strings, you should add a type check or pattern matching.
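A minimal sketch of that safer handling, assuming prices is the Seq[JsValue] returned by json \\ "p":

// Keep only the string values, then convert; non-string elements are silently dropped
val doubles: Seq[Double] = prices.flatMap(_.asOpt[String]).map(_.toDouble)

// Or make the handling explicit with pattern matching
val doubles2: Seq[Double] = prices.collect {
  case JsString(s) => s.toDouble
  case JsNumber(n) => n.toDouble
}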
I have a JSON string:
val message = "{\"me\":\"a\",
\"version\":\"1.0\",
\"message_metadata\": \"{
\"event_type\":\"UpdateName\",
\"start_date\":\"1515\"}\"
}"
I want to extract the value of the field event_type from this JSON string.
I have used the code below to extract the value:
val mapper = new ObjectMapper
val root = mapper.readTree(message)
val metadata = root.at("/message_metadata").asText()
val root1 = mapper.readTree(metadata)
val event_type = root1.at("/event_type").asText()
print("eventType: " + event_type.toString) // UpdateName
This works fine and I get the value UpdateName. But when I want to get the event type in a single line, as below:
val mapper = new ObjectMapper
val root = mapper.readTree(message)
val event_type = root.at("/message_metadata/event_type").asText()
print("eventType: " + event_type.toString) // Empty string
Here event_type returns an empty string. This might be because message_metadata holds its JSON object as a string value. Is there a way I can get the value of event_type in a single line?
The problem is that your JSON message contains an object whose message_metadata field itself contains JSON as a string, so it must be decoded separately. I'd suggest that you don't put JSON inside JSON, but encode the data structure only once.
Example:
val message = "{\"me\":\"a\",
\"version\":\"1.0\",
\"message_metadata\": {
\"event_type\":\"UpdateName\",
\"start_date\":\"1515\"
}
}"
You can parse your JSON using case classes and then get your event_type field from there.
case class Json(me: String, version: String, message_metadata: Message)
case class Message(event_type: String, start_date: String)

object Mapping {
  def main(args: Array[String]): Unit = {
    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule
    import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

    val objectMapper = new ObjectMapper() with ScalaObjectMapper
    objectMapper.registerModule(DefaultScalaModule)

    val str = "{\n \"me\": \"a\",\n \"version\": \"1.0\",\n \"message_metadata\": {\n \"event_type\": \"UpdateName\",\n \"start_date\": \"1515\"\n }\n}"
    val json = objectMapper.readValue(str, classOf[Json])

    // to print event_type
    println(json.message_metadata.event_type)
    // output: UpdateName
  }
}
You can also convert the JSON to a Scala case class and then get the particular field from the case class.
Please find a working and detailed answer, which I have provided using generics, here.
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path)
val studentdetailsRDD = sc.textFile(file path)
val studentB = sc.broadcast(studentdetailsRDD.collect)

// Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

// Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue in your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit (they are declared without an =, so Scala treats them as procedures and discards the result), and using an if without an else means that for some inputs there is no value to return anyway.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, which makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
  val details = studentB.value(student.StudentId)
  Array(details.StudentName, details.Course, student.City)
}.map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joins and for writing out files. A join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
import spark.implicits._ // needed for toDF (already in scope in spark-shell / notebooks)

val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")).toDF("id", "city")

val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")).toDF("id", "name", "course")

// join on id, then select the columns in the required output order (name, course, city)
val dfResult = df1.join(df2, "id").select("name", "course", "city")

dfResult.repartition(1).write.csv("hello.csv")
A directory will be created; it contains a single part file, which is the final result.
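If you also want a header row in the output file (a small addition, assuming the Spark 2.x CSV writer options):

dfResult.repartition(1).write.option("header", "true").csv("hello.csv")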