My JSON file (input.json) looks like below:
{"first_name":"Sabrina","last_name":"Mayert","email":"donny54#yahoo.com"}
{"first_name":"Taryn","last_name":"Dietrich","email":"donny54#yahoo.com"}
My Scala code looks like below. Here I am trying to return first_name and last_name based on email.
val conf = new SparkConf().setAppName("RowCount").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
val data = input
.select("first_name", "last_name")
.where("email=='donny54#yahoo.com'")
.toJSON
data.write.json("input2")
sc.stop
complete(data.toString)
data.write.json("input2") creates a file that looks like below:
{"value":"{\"first_name\":\"Sabrina\",\"last_name\":\"Mayert\"}"}
{"value":"{\"first_name\":\"Taryn\",\"last_name\":\"Dietrich\"}"}
complete(data.toString) is returning the response [value: string].
How can I get the response as an array of JSON objects, like below?
[{"first_name":"Sabrina","last_name":"Mayer"},{"first_name":"Taryn","last_name":"Dietrich"}]
Thanks for help in advance.
You are converting to JSON twice. Drop the extra toJSON call and you should get your desired output:
val data = input
.select("first_name", "last_name")
.where("email=='donny54#yahoo.com'")
data.write.json("input2")
Output:
{"first_name":"Sabrina","last_name":"Mayert"}
{"first_name":"Taryn","last_name":"Dietrich"}
Does this solve your issue, or do you specifically need to convert it to an array?
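If you do specifically need a single JSON array, one minimal sketch (assuming the filtered result is small enough to collect to the driver) is to collect the per-row JSON strings and wrap them yourself, then pass that to the same complete(...) you use in your route:

// Collect the per-row JSON strings on the driver and wrap them in brackets.
// Only safe when the filtered result is small enough to fit in driver memory.
val jsonArray = "[" + data.toJSON.collect().mkString(",") + "]"
// [{"first_name":"Sabrina","last_name":"Mayert"},{"first_name":"Taryn","last_name":"Dietrich"}]
complete(jsonArray)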
Related
I have the below code, where I am pulling data from an API and storing it into a JSON file; further on, I will be loading it into an Oracle table. Also, the data value of the ID column is the column name under VelocityEntries. With the below code I am able to print the data in completedEntries.values, but I need help putting it into one DataFrame and combining it with the embedded_df.
val inputStream = scala.io.Source.fromInputStream(connection.getInputStream).mkString
val fileWriter1 = new FileWriter(new File(filename))
fileWriter1.write(inputStream) // inputStream is already a String
fileWriter1.close()
val json_df = spark.read.option("multiLine", true).json(filename)
val embedded_df = json_df.select(explode(col("sprints")) as "x").select("x.*")
val list_df = json_df.select("velocityStatEntries.*").columns.toList
for (i <- list_df) {
  val completed_df = json_df.select(s"velocityStatEntries.$i.completed.value")
  completed_df.show()
}
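To get all of these per-id values into one DataFrame that can later be combined with embedded_df, one sketch (I have not tested this against the real API payload, and the join key is an assumption) is to tag each per-id value with its id and union everything:

// Tag each velocityStatEntries.<id>.completed.value with its id and union
// the results into a single DataFrame (column names are my own choice).
import org.apache.spark.sql.functions.{col, lit}

val completed_df = list_df
  .map { id =>
    json_df
      .select(col(s"velocityStatEntries.$id.completed.value").as("completed_value"))
      .withColumn("id", lit(id))
  }
  .reduce(_ union _)

completed_df.show()
// It can then be joined with embedded_df, e.g. on the sprint id (assumed key):
// completed_df.join(embedded_df, completed_df("id") === embedded_df("id"))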
val conf = new SparkConf().setAppName("test").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
input.select("email", "first_name").where("email=='donny54#yahoo.com'").show()
I am getting the response as a plain table from show(). How can I get the response as a JSON object?
You can write it to a JSON file: https://www.tutorialkart.com/apache-spark/spark-write-dataset-to-json-file-example/
Or, if you prefer to show it as a Dataset of JSON strings, use the toJSON function:
input
.select("email", "first_name")
.where("email=='donny54#yahoo.com'")
.toJSON
.show()
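If you need the JSON strings on the driver (for example to build an HTTP response) rather than just displaying them, a small sketch, assuming the filtered result is tiny:

// collect() brings the JSON strings to the driver; only safe for small results
val jsonStrings: Array[String] = input
  .select("email", "first_name")
  .where("email=='donny54#yahoo.com'")
  .toJSON
  .collect()

jsonStrings.foreach(println)
// e.g. {"email":"donny54#yahoo.com","first_name":"Sabrina"}
//      {"email":"donny54#yahoo.com","first_name":"Taryn"}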
Hi, I am trying to read tweets from Twitter using Apache Spark Streaming and convert them to a DataFrame. My approach is pasted below, but I am not able to get it right. Some pointers would be welcome.
As you can see, converting to a DF inside the foreachRDD does not get me a single DF from tweetStream. I probably have the wrong approach, as I am new to this. How should I approach this?
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth).filter(status=>status.getLang=="en")
.map(status=>gson.toJson(status))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
tweetStream.foreachRDD { status => val DF = status.toDF() }
I have not tried it, but maybe something like this works:
var df_tweets: DataFrame = null
dstream_tweets.foreachRDD { rdd =>
  if (df_tweets != null) {
    df_tweets = df_tweets.unionAll(rdd.toDF()) // combine with the previous dataframe
  } else {
    df_tweets = rdd.toDF() // create new dataframe
  }
}
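Note that on Spark 2.x, unionAll on DataFrames is deprecated in favour of union, so the same idea would look like the sketch below. I have not run it; it assumes a SparkSession named spark with its implicits imported so that rdd.toDF() works.

import org.apache.spark.sql.DataFrame
import spark.implicits._ // spark is assumed to be your SparkSession

var df_tweets: DataFrame = null
dstream_tweets.foreachRDD { rdd =>
  // union replaces the deprecated unionAll on Spark 2.x
  df_tweets = if (df_tweets != null) df_tweets.union(rdd.toDF()) else rdd.toDF()
}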
I am writing a Scala code that requires me to write to a file in HDFS.
When I use FileWriter.write locally, it works. The same thing does not work on HDFS.
Upon checking, I found the following options for writing in Apache Spark:
RDD.saveAsTextFile and DataFrame.write.format.
My question is: what if I just want to write an int or string to a file in Apache Spark?
Follow-up:
I need to write a header and the DataFrame contents to an output file, and then append some string.
Does sc.parallelize(Seq(<String>)) help?
Create an RDD with your data (Int/String) using Seq; see parallelized collections for details:
sc.parallelize(Seq(5)) //for writing int (5)
sc.parallelize(Seq("Test String")) // for writing string
val conf = new SparkConf().setAppName("Writing Int to File").setMaster("local")
val sc = new SparkContext(conf)
val intRdd = sc.parallelize(Seq(5))
intRdd.saveAsTextFile("out\\int\\test")
val conf = new SparkConf().setAppName("Writing string to File").setMaster("local")
val sc = new SparkContext(conf)
val stringRdd = sc.parallelize(Seq("Test String"))
stringRdd.saveAsTextFile("out\\string\\test")
Follow-up example (tested as below):
val conf = new SparkConf().setAppName("Total Countries having Icon").setMaster("local")
val sc = new SparkContext(conf)
val headerRDD = sc.parallelize(Seq("HEADER"))
//Replace BODY part with your DF (see the sketch after the output below)
val bodyRDD= sc.parallelize(Seq("BODY"))
val footerRDD = sc.parallelize(Seq("FOOTER"))
//combine all rdds to final
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
//finalRDD.foreach(line => println(line))
//output to one file
finalRDD.coalesce(1, true).saveAsTextFile("test")
output:
HEADER
BODY
FOOTER
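To replace the BODY placeholder with actual DataFrame contents, one option (a sketch, assuming a DataFrame named df and an arbitrary comma separator) is to turn each Row into a delimited string first:

// Convert the DataFrame rows to strings so they can be concatenated with the
// header and footer RDDs (df is assumed to exist in scope).
val bodyRDD = df.rdd.map(_.mkString(","))

val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
finalRDD.coalesce(1, true).saveAsTextFile("test")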
More examples here...
How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?
One common problem seems to be getting a weird cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes is a totally pointless pile of nonsense which doesn't get your bytes at all: what getBytes does is get your bytes and then add a ton of zeros on the end! You have to use copyBytes.
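To make the pitfall concrete, here is a tiny illustration of my own (not part of the original snippet): BytesWritable keeps a backing buffer that can be larger than the payload, which happens naturally when Hadoop reuses the writable while reading; getBytes hands back the whole padded buffer, while copyBytes returns exactly the getLength valid bytes.

import org.apache.hadoop.io.BytesWritable

val bw = new BytesWritable("foo".getBytes)
bw.setCapacity(8)                   // grow the backing buffer beyond the payload
println(bw.getBytes.length)         // 8 -- payload plus zero padding
println(bw.copyBytes().length)      // 3 -- exactly the valid bytes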
val rdd: RDD[Array[Byte]] = ???
// To write
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
.saveAsSequenceFile("/output/path", codecOpt)
// To read
val rdd: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
.map(_._2.copyBytes())
Here is a snippet with all required imports that you can run from spark-shell, as requested by @Choix:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable
val path = "/tmp/path"
val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map{str => (NullWritable.get, new BytesWritable(str.getBytes) ) }
bytesRdd.saveAsSequenceFile(path)
val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
val recoveredAsString = recovered.map( new String(_) )
recoveredAsString.collect()
// result is: Array[String] = Array(foo)