val conf = new SparkConf().setAppName("test").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
input.select("email", "first_name").where("email=='donny54#yahoo.com'").show()
I am getting a tabular response from show().
How can I get the response as a JSON object instead?
You can write it to a JSON file: https://www.tutorialkart.com/apache-spark/spark-write-dataset-to-json-file-example/
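For example, a minimal sketch (the output path here is just a placeholder):
input
  .select("email", "first_name")
  .where("email=='donny54#yahoo.com'")
  .write
  .json("output_folder")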
Or, if you prefer to show it as a Dataset of JSON strings, use the toJSON function:
input
.select("email", "first_name")
.where("email=='donny54#yahoo.com'")
.toJSON
.show()
My JSON file (input.json) looks like this:
{"first_name":"Sabrina","last_name":"Mayert","email":"donny54#yahoo.com"}
{"first_name":"Taryn","last_name":"Dietrich","email":"donny54#yahoo.com"}
My Scala code looks like this. I am trying to return first_name and last_name based on email.
val conf = new SparkConf().setAppName("RowCount").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
val data = input
.select("first_name", "last_name")
.where("email=='donny54#yahoo.com'")
.toJSON
data.write.json("input2")
sc.stop
complete(data.toString)
data.write.json("input2") creating file looks like below
{"value":"{\"first_name\":\"Sabrina\",\"last_name\":\"Mayert\"}"}
{"value":"{\"first_name\":\"Taryn\",\"last_name\":\"Dietrich\"}"}
complete(data.toString) returns the response [value: string].
How can I get the response as an array of JSON objects, like the following?
[{"first_name":"Sabrina","last_name":"Mayer"},{"first_name":"Taryn","last_name":"Dietrich"}]
Thanks for the help in advance.
You are converting to JSON twice: once with toJSON and again with write.json. Drop the extra conversion and you should get your desired output:
val data = input
.select("first_name", "last_name")
.where("email=='donny54#yahoo.com'")
data.write.json("input2")
Output:
{"first_name":"Sabrina","last_name":"Mayert"}
{"first_name":"Taryn","last_name":"Dietrich"}
Does this solve your issue, or do you specifically need to convert it to an array?
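If you do need a single JSON array (for example, to return it from your route), one possible sketch is to collect the JSON strings and join them yourself; this assumes the result is small enough to bring back to the driver:
val jsonArray = input
  .select("first_name", "last_name")
  .where("email=='donny54#yahoo.com'")
  .toJSON
  .collect()                  // Array[String], one JSON object per row
  .mkString("[", ",", "]")    // "[{...},{...}]"
complete(jsonArray)           // assuming the complete call from your route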
I am learning how to read and write files in HDFS using Spark/Scala.
I am unable to write to an HDFS file: the file is created, but it stays empty.
I don't know how to create a loop for writing to a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace it with:
result.foreach{println}
then it works!
But when I use saveAsTextFile, the following error is thrown:
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help, please?
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all you need to do. You don't need to loop through the rows yourself.
Hope this helps!
What does this do?
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
foreach expects a function that is applied to each element, not a block that calls an RDD action, which is why the compiler reports the type mismatch; nested RDD operations are not allowed in Spark anyway.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
If you need the data written in a different format, transform the RDD itself before writing.
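For example, a rough sketch that formats each Map row as a comma-separated line in the original column order before writing (the output path is just a placeholder):
val lines = result.map(row => header.map(col => row(col)).mkString(","))
lines.saveAsTextFile("hdfs://zobbi01:9000/input/test2_csv")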
I have a JSON file of the format
{"x1": 2, "y1": 6, "x2":3, "y2":7}
I have a Scala class:
class Test(x: Int, y: Int)
Using Spark, I am trying to read this file and create two Test objects for each line of the JSON file. For example,
{"x1": 2, "y1": 6, "x2":3, "y2":7} should create
test1 = new Test(2, 6) and
test2 = new Test(3, 7)
Then, for each line of the JSON file, I want to call a function that takes two Test objects as parameters, e.g. callFunction(test1, test2).
How do I do this with Spark? I see methods that will convert the rows of a JSON file to a list of objects, but no way to create multiple objects from the attributes of a single row.
val conf = new SparkConf()
.setAppName("Example")
.setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val coordinates = sqlContext.read.json("c:/testfile.json")
//NOT SURE HOW TO DO THE FOLLOWING
//test1 = new Test(attr1 of json file, attr2 of json file)
//test2 = new Test(attr3 of json file, attr4 of json file)
//callFunction(test1,test2)
//collect the result of callFunction
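One possible sketch (untested, assuming the Test class and callFunction described above, both of which must be serializable so the closure can run on the executors; note that Spark's JSON reader infers whole numbers as Long):
val results = coordinates
  .select("x1", "y1", "x2", "y2")
  .rdd
  .map { row =>
    val test1 = new Test(row.getAs[Long]("x1").toInt, row.getAs[Long]("y1").toInt)
    val test2 = new Test(row.getAs[Long]("x2").toInt, row.getAs[Long]("y2").toInt)
    callFunction(test1, test2)   // the user's own function
  }
results.collect()                // collect the results of callFunction on the driver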
I am writing Scala code that requires me to write to a file in HDFS.
When I use FileWriter.write locally, it works. The same thing does not work on HDFS.
Upon checking, I found that the options for writing in Apache Spark are
RDD.saveAsTextFile and DataFrame.write.format.
My question is: what if I just want to write an int or a string to a file in Apache Spark?
Follow-up:
I need to write a header, the DataFrame contents, and then some appended string to an output file.
Does sc.parallelize(Seq(<String>)) help?
Create an RDD from your data (int/string) using Seq (see the Spark docs on parallelized collections for details):
sc.parallelize(Seq(5)) //for writing int (5)
sc.parallelize(Seq("Test String")) // for writing string
val conf = new SparkConf().setAppName("Writing Int to File").setMaster("local")
val sc = new SparkContext(conf)
val intRdd= sc.parallelize(Seq(5))
intRdd.saveAsTextFile("out\\int\\test")
val conf = new SparkConf().setAppName("Writing string to File").setMaster("local")
val sc = new SparkContext(conf)
val stringRdd = sc.parallelize(Seq("Test String"))
stringRdd.saveAsTextFile("out\\string\\test")
Follow-up example (tested as below):
val conf = new SparkConf().setAppName("Total Countries having Icon").setMaster("local")
val sc = new SparkContext(conf)
val headerRDD= sc.parallelize(Seq("HEADER"))
//Replace BODY part with your DF
val bodyRDD= sc.parallelize(Seq("BODY"))
val footerRDD = sc.parallelize(Seq("FOOTER"))
//combine all rdds to final
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
//finalRDD.foreach(line => println(line))
//output to one file
finalRDD.coalesce(1, true).saveAsTextFile("test")
output:
HEADER
BODY
FOOTER
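If the BODY part should come from a DataFrame, one possible approach (a sketch, assuming a DataFrame named df whose columns convert cleanly to strings) is to turn its rows into strings first:
val bodyRDD = df.rdd.map(row => row.mkString(","))
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD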
Perhaps this question seems a bit abstract; here it is:
val originalAvroSchema : Schema = // read from a file
val rdd : RDD[GenericData.Record] = // From some streaming source
// Looking for a handy:
val df: DataFrame = rdd.toDF(schema)
I explored spark-avro, but it only supports reading from a file, not from an existing RDD.
import com.databricks.spark.avro._
val sqlContext = new SQLContext(sc)
val rdd : RDD[MyAvroRecord] = ...
val df = rdd.toAvroDF(sqlContext)
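As far as I know, spark-avro does not ship an rdd.toAvroDF method. One possible workaround, sketched below for the RDD[GenericData.Record] from the question and assuming SchemaConverters.toSqlType is accessible in your spark-avro version, is to derive the Spark schema from the Avro schema and build the DataFrame by hand:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

// Derive a Spark StructType from the Avro schema.
val sparkSchema = SchemaConverters.toSqlType(originalAvroSchema).dataType.asInstanceOf[StructType]
// Naive field-by-field copy; Avro strings (Utf8) and nested types would need extra conversion.
val rowRdd = rdd.map { record =>
  Row.fromSeq(sparkSchema.fieldNames.map(name => record.get(name)))
}
val df = sqlContext.createDataFrame(rowRdd, sparkSchema)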