Converting JSON to CSV in Spark - scala

I have a JSON object like below:
{"Event":"xyz","Name":"test","Prog":0,"AId":"367","CId":"11522"}
Using the Spark script below, I have converted it into CSV:
import org.apache.spark.sql.SaveMode

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.load("org.apache.spark.sql.json", Map("path" -> "test1.json"))
df.save("com.databricks.spark.csv", SaveMode.ErrorIfExists, Map("path" -> "datascv", "header" -> "true"))
I am able to convert it into a CSV file. My output is:
AId,CId,Event,Name,Prog
367,11522,xyz,test,0
But here the CSV header is in ascending (alphabetical) order, while I want to keep the header in a customized order like below, i.e. the same order as in my JSON:
Event,Name,Prog,AId,CId
Please help me with this.
Thanks in advance.

You can try the following: select the columns in the order you want before saving.
val selectedData = df.select("Event", "Name", "Prog", "AId", "CId")
selectedData.save("com.databricks.spark.csv", SaveMode.ErrorIfExists,
  Map("path" -> "datascv", "header" -> "true"))

Related

Write empty DF with header to csv

Spark creates an empty file without a header when you try to write a CSV file from an empty DataFrame, even though the header option is set to true (header=true).
import ss.implicits._ // ss is the SparkSession
val df = List((1, "kishore", 22000)).toDF("id", "name", "salary")
val emptyDF = df.where("id != 1")
emptyDF.show()
emptyDF.write.option("header", true).csv("folder/filename.csv")
Is it possible to create a CSV file with a header for emptyDF?
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

if (emptyDF.take(1).isEmpty) {
  // Empty DF: write a one-row DataFrame holding the column names, with header=false,
  // so the names become the only line of the CSV.
  ss.createDataFrame(
      List(Row.fromSeq(emptyDF.schema.fieldNames.toSeq)).asJava,
      StructType(emptyDF.schema.fieldNames.map(n => StructField(n, StringType))))
    .write.option("header", false).csv("folder/filename.csv")
} else {
  emptyDF.write.option("header", true).csv("folder/filename.csv")
}
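The same idea can be wrapped in a small helper; a minimal reusable sketch, assuming a SparkSession is available (the helper name writeCsvWithHeader is just for illustration):
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def writeCsvWithHeader(spark: SparkSession, df: DataFrame, path: String): Unit = {
  if (df.take(1).isEmpty) {
    // Empty input: emit a single row that contains the column names, so the
    // output file still starts with a header line.
    val headerRow = Row.fromSeq(df.schema.fieldNames.toSeq)
    val headerSchema = StructType(df.schema.fieldNames.map(StructField(_, StringType)))
    spark.createDataFrame(List(headerRow).asJava, headerSchema)
      .write.option("header", false).csv(path)
  } else {
    df.write.option("header", true).csv(path)
  }
}
For the example above this would be called as writeCsvWithHeader(ss, emptyDF, "folder/filename.csv").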

How to send JSON response in Spark

My JSON file (input.json) looks like below:
{"first_name":"Sabrina","last_name":"Mayert","email":"donny54#yahoo.com"}
{"first_name":"Taryn","last_name":"Dietrich","email":"donny54#yahoo.com"}
My Scala code looks like below. Here I am trying to return first_name and last_name based on the email:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("RowCount").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val input = sqlContext.read.json("input.json")
val data = input
  .select("first_name", "last_name")
  .where("email=='donny54@yahoo.com'")
  .toJSON
data.write.json("input2")
sc.stop
// complete(...) presumably comes from the HTTP routing layer (e.g. akka-http), not from Spark.
complete(data.toString)
data.write.json("input2") creating file looks like below
{"value":"{\"first_name\":\"Sabrina\",\"last_name\":\"Mayert\"}"}
{"value":"{\"first_name\":\"Taryn\",\"last_name\":\"Dietrich\"}"}
complete(data.toString) returns the response [value: string].
How can I get the response as an array of JSON objects, like below?
[{"first_name":"Sabrina","last_name":"Mayert"},{"first_name":"Taryn","last_name":"Dietrich"}]
Thanks for the help in advance.
You are converting to JSON twice (once with .toJSON and once with write.json). Drop one of the conversions and you should get your desired output:
val data = input
  .select("first_name", "last_name")
  .where("email=='donny54@yahoo.com'")
data.write.json("input2")
Output:
{"first_name":"Sabrina","last_name":"Mayert"}
{"first_name":"Taryn","last_name":"Dietrich"}
Does this solve your issue, or do you specifically need to convert it to an array?
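If an array is specifically needed, one option (a sketch, assuming the result set is small enough to collect to the driver) is to build the JSON array string from the collected records and pass that to complete:
// Collect the per-row JSON strings and join them into a single JSON array.
val jsonArray = "[" + data.toJSON.collect().mkString(",") + "]"
complete(jsonArray)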

Create a DF after registering a previous DF in Spark Scala

I am new to Spark and Scala, and I want to ask you about my problem.
I have two huge dataframes; the second one is computed from the first (it contains a distinct column from the first one).
To optimize my code, I thought of this approach:
Save my first dataframe as a .csv file in HDFS.
Then simply read this .csv file to compute the second dataframe.
So I wrote this:
// temp1 is my first DF
writeAsTextFileAndMerge("result1.csv", "/user/result", temp1, spark.sparkContext.hadoopConfiguration)
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("/user/result/result1.csv").select("ID").distinct
writeAsTextFileAndMerge("result2.csv", "/user/result",
  temp2, spark.sparkContext.hadoopConfiguration)
And this is my save function:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.DataFrame

def writeAsTextFileAndMerge(fileName: String, outputPath: String, df: DataFrame, conf: Configuration) {
  val sourceFile = WorkingDirectory
  df.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv(sourceFile)
  merge(fileName, sourceFile, outputPath, conf)
}

def merge(fileName: String, srcPath: String, dstPath: String, conf: Configuration) {
  val hdfs = FileSystem.get(conf)
  val destinationPath = new Path(dstPath)
  if (!hdfs.exists(destinationPath))
    hdfs.mkdirs(destinationPath)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName),
    true, conf, null)
}
It seems "logical" to me but I got errors doing this. I guess it's not possible for Spark to "wait" until registering my first DF in HDFS and AFTER read this new file (or maybe I have some errors on my save function ?).
Here is the exception that I got :
19/02/16 17:27:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 1
java.lang.ArrayIndexOutOfBoundsException: 1
Can you help me fix this, please?
The problem is the merge: Spark is not aware of, and therefore not synchronized with, the HDFS operations you are doing behind its back.
The good news is that you don't need to do that. Just do df.write and then create a new dataframe with the read (Spark will read all the part files into a single DataFrame).
I.e. the following would work just fine:
temp1.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv("/user/result/result1.csv")
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("/user/result/result1.csv").select("ID").distinct

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
import org.apache.hadoop.io.LongWritable

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map((record: (LongWritable, String)) => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format as the sequence files (a timestamp, a tab char, then the JSON). But the problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method does.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are multiple options you can provide depending upon your requirement. Check the CSV data source documentation for more details.
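For the tab-separated layout described in the question (a timestamp, a tab, then the JSON), the delimiter option could be set accordingly; a sketch, assuming a SparkSession named spark and with quoting disabled so the quotes inside the JSON text survive:
// Read the tab-separated file into two string columns (_c0 = timestamp, _c1 = JSON).
val df = spark.read
  .option("header", "false")
  .option("delimiter", "\t")
  .option("quote", "")
  .csv("file.txt")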
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  // Build (timestamp, json) pairs from the tab-separated lines
  sc.textFile(src).map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1))
  }
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
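Either branch yields an RDD of (key, JSON) pairs, so the rest of the original pipeline can stay the same; a short sketch, assuming the sqlContext from the question is in scope:
// Keep only the JSON part of each pair and parse it into a DataFrame.
val jsonRecs = file.map { case (_, json) => json }
val df = sqlContext.read.json(jsonRecs)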

Connect to SQLite in Apache Spark

I want to run a custom function on all tables in a SQLite database. The function is more or less the same, but depends on the schema of the individual table. Also, the tables and their schemata are only known at runtime (the program is called with an argument that specifies the path of the database).
This is what I have so far:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// somehow bind sqlContext to DB
val allTables = sqlContext.tableNames
for (t <- allTables) {
  val df = sqlContext.table(t)
  val schema = df.columns
  sqlContext.sql("SELECT * FROM " + t + "...").map(x => myFunc(x, schema))
}
The only hint I found so far requires knowing the table in advance, which is not the case in my scenario:
val tableData =
  sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
    .load()
I am using the xerial SQLite JDBC driver. So how can I connect solely to a database, not to a table?
Edit: Using Beryllium's answer as a starting point, I updated my code to this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val metaData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> "(SELECT * FROM sqlite_master) AS t")).load()
val myTableNames = metaData.select("tbl_name").distinct()
for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.table(t.toString)
  for (record <- tableData.select("*")) {
    println(record)
  }
}
At least I can read the table names at runtime, which is a huge step forward for me. But I can't read the tables. I tried both
val tableData = sqlContext.table(t.toString)
and
val tableData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> t.toString)).load()
in the loop, but in both cases I get a NullPointerException. Although I can print the table names, it seems I cannot connect to them.
Last but not least, I always get an SQLITE_ERROR: Connection is closed error. It looks like the same issue described in this question: SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database
There are two options you can try.
Option 1: Use JDBC directly. Open a separate, plain JDBC connection in your Spark job, get the table names from the JDBC metadata, and feed these into your for comprehension.
Option 2: Use a SQL query for the "dbtable" argument. You can specify a query as the value for the dbtable argument. Syntactically, this query must "look" like a table, so it must be wrapped in a sub-query.
In that query, get the metadata from the database:
val df = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql:xxx",
    "user" -> "x",
    "password" -> "x",
    "dbtable" -> "(select * from pg_tables) as t")).load()
This example works with PostgreSQL; you have to adapt it for SQLite.
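A sketch of that adaptation for SQLite, assuming the database path from the question; sqlite_master plays the role of pg_tables here, and the type filter keeps only real tables (no indexes or views):
val metaData = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> "(SELECT tbl_name FROM sqlite_master WHERE type = 'table') AS t")).load()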
Update
It seems that the JDBC driver only supports iterating over one result set.
Anyway, when you materialize the list of table names using collect(), the following snippet should work:
val myTableNames = metaData.select("tbl_name").map(_.getString(0)).collect()
for (t <- myTableNames) {
  println(t)
  val tableData = sqlContext.read.format("jdbc")
    .options(
      Map(
        "url" -> "jdbc:sqlite:/x.db",
        "dbtable" -> t)).load()
  tableData.show()
}
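To tie this back to the original goal of running a custom function on every table, the same loop can pass each table's columns to myFunc (a sketch; myFunc and its row/schema arguments are the question's own placeholders):
for (t <- myTableNames) {
  val tableData = sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:sqlite:/x.db", "dbtable" -> t))
    .load()
  // Each table may have a different schema, so pass the column names along.
  val schema = tableData.columns
  val result = tableData.rdd.map(row => myFunc(row, schema))
}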