Parsing nested XML in Databricks - scala

I am trying to read the XML into a DataFrame and flatten the data using explode, as below.
val df = spark.read.format("xml").option("rowTag","on").option("inferschema","true").load("filepath")
val parsxml = df
  .withColumn("exploded_element", explode(("prgSvc.element")))
I am getting the below error.
command-5246708674960:4: error: type mismatch;
found : String("prgSvc.element")
required: org.apache.spark.sql.Column
.withColumn("exploded_element", explode(("prgSvc.element")))
Before reading the XML into the data frame, I also tried manually assigning a custom schema and reading the XML file, but the output was all NULL. Could you please let me know whether my approach is valid, and how to resolve this issue and achieve the expected output?
Thank you.

Use this:
import spark.implicits._
val parsxml = df.withColumn("exploded_element", explode($"prgSvc.element"))
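For reference, explode expects a Column, not a String: the $ interpolator from spark.implicits._ builds a Column from a name, and so does org.apache.spark.sql.functions.col. A minimal sketch along those lines, reusing the prgSvc.element path from the question (the flattened select is illustrative and depends on your XML schema):

import org.apache.spark.sql.functions.{col, explode}

// col("...") builds a Column, just like $"..." does with spark.implicits._ in scope
val parsxml = df.withColumn("exploded_element", explode(col("prgSvc.element")))

// The exploded struct's nested fields can then be pulled up with star notation;
// the resulting column names depend on the XML schema.
val flattened = parsxml.select("exploded_element.*")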

Related

What is PySpark SQL equivalent function for pyspark.pandas.DataFrame.to_string?

Pandas API function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_string.html
There is another answer, though it doesn't work for me: "pyspark: Convert DataFrame to RDD[string]".
Following that post's advice, I tried:
data.rdd.map(lambda row: [str(c) for c in row])
Then I get this error:
TypeError: 'PipelinedRDD' object is not iterable
I would like it to output rows of strings, similar to to_string() above. Is this possible?
Would pyspark.sql.DataFrame.show satisfy your expectations about the console output? You can sort the df via pyspark.sql.DataFrame.sort before printing if required.

Spark SQL : is it possible to read the custom schema from an external source instead of creating it in within the spark code?

I'm trying to load a CSV file without schema inference.
Usually we create the schema as a StructType within the Spark code.
Is it possible to save the schema in an external file (maybe a property/config file) and read it dynamically while creating the DataFrame?
val customSchema_v2 = new StructType()
.add("PROPERTY_ID_2222", "int" )
.add("OWNER_ID_2222", "int")
Is it possible to save the schema, i.e. "PROPERTY_ID_2222", "int" and "OWNER_ID_2222", "int", in a file and load the schema from there?
Both StructType and StructField are Serializable, so you can serialize a StructType to a file and deserialize it when you need it.
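A minimal sketch of that Java-serialization approach (not part of the original answer; schema.ser is an illustrative path, and customSchema_v2 is the StructType defined in the question):

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.sql.types.StructType

// Write the StructType out with plain Java serialization.
val out = new ObjectOutputStream(new FileOutputStream("schema.ser"))
out.writeObject(customSchema_v2)
out.close()

// Read it back later and cast to StructType before passing it to a DataFrameReader.
val in = new ObjectInputStream(new FileInputStream("schema.ser"))
val restoredSchema = in.readObject().asInstanceOf[StructType]
in.close()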
You can use JSON for schemas.
import org.apache.spark.sql.types._
val customSchema_v2 = new StructType()
.add("PROPERTY_ID_2222", "int" )
.add("OWNER_ID_2222", "int")
val schemaString = customSchema_v2.json
println(schemaString)
val loadedSchema = DataType.fromJson(schemaString)
CONSOLE Output:
{"type":"struct","fields":[{"name":"PROPERTY_ID_2222","type":"integer","nullable":true,"metadata":{}},{"name":"OWNER_ID_2222","type":"integer","nullable":true,"metadata":{}}]}
You would need to add code that reads the schema from the JSON file; a sketch follows below.
The JSON file could also be created manually, and it can be pretty-printed. To understand the format better, add more columns with different data types and use customSchema_v2.prettyJson to learn the syntax.
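A minimal sketch of that missing read/write step, assuming a SparkSession named spark and illustrative file names (schema.json, properties.csv):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

// Persist the schema as JSON once.
Files.write(Paths.get("schema.json"), customSchema_v2.json.getBytes(StandardCharsets.UTF_8))

// Later, read the JSON back and use it instead of inferring the schema.
val schemaJson = new String(Files.readAllBytes(Paths.get("schema.json")), StandardCharsets.UTF_8)
val loadedSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val df = spark.read
  .option("header", "true")
  .schema(loadedSchema)
  .csv("properties.csv")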

SparkSQL Dataframe Error: value show is not a member of org.apache.spark.sql.DataFrameReader

I'm new to Spark/Scala/DataFrames. I'm using Scala 2.10.5 and Spark 1.6.0. I am trying to load a CSV file and then create a DataFrame from it. Using the Scala shell, I execute the following in the order below. Once I execute line 6, I get an error that says:
error: value show is not a member of org.apache.spark.sql.DataFrameReader
Could someone advise what I might be missing? I understand I don't need to import SparkContext when using the REPL (shell), since sc is created automatically, but any ideas what I'm doing wrong?
1. import org.apache.spark.sql.SQLContext
2. import sqlContext.implicits._
3. val sqlContext = new SQLContext(sc)
4. val csvfile = "path_to_filename in hdfs...."
5. val df = sqlContext.read.format(csvfile).option("header", "true").option("inferSchema", "true")
6. df.show()
Try this:
val df = sqlContext.read.option("header", "true").option("inferSchema", "true").csv(csvfile)
sqlContext.read gives you a DataFrameReader; option and format both set options and give you back a DataFrameReader. You need to call one of the methods that returns a DataFrame (such as csv or load) before you can call show on it.
See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader for more info.
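Note that DataFrameReader.csv only exists from Spark 2.0 onwards; on Spark 1.6 (which the question uses) CSV support comes from the external spark-csv data source, as in the spark-csv answer further down. A sketch under that assumption:

// Spark 1.6 style: the spark-csv package provides the CSV data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvfile)

df.show()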

Importing a SparkSession DataFrame on DSX

I'm currently working on Data Science Experience and would like to import a CSV file as a SparkSession DataFrame. I am able to import the DataFrame successfully; however, all of the column attributes are converted to string type. How do you make this DSX feature recognize the types present in the CSV file?
Currently, the generated code for the actual creation of the pyspark.sql.DataFrame looks like this:
df_data_1 = spark.read\
.format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
.option('header', 'true')\
.load('swift://container_name.' + name + '/test.csv')
df_data_1.take(5)
You have to add the following option; then the schema will be inferred:
.option('inferSchema', 'true')\

spark scala issue uploading csv

I am trying to load a CSV file into a temp table so that I can query it, and I am having two issues.
First, I tried loading the CSV into a DataFrame; the CSV has some empty fields, and I didn't find a way to handle them. I found a suggestion in another post to use:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"
Then I uploaded the file and read it as a text file, without the headers, as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
This gives an error because one of the licence fields is empty... I tried to handle this when building the DataFrame, but it did not work.
I can obviously go into the CSV file and fix it by adding a null value, but I do not want to do this because there are a lot of fields and it could be problematic. I want to fix it programmatically, either when I create the DataFrame or in the case class...
If you have any other thoughts, please let me know as well.
To be able to use spark-csv, you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect csv dialect, including method of string quoting
how to handle quotes and new line characters inside strings
how to handle malformed lines
In the case of the example file you have to at least (see the sketch after this list):
filter empty lines
read header
map lines to fields providing default value if field is missing
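A minimal sketch of that manual approach, assuming the three-column cars.csv plus the sc and sqlContext from the question; the defaults used for missing or blank fields are illustrative:

import sqlContext.implicits._

case class Cars(id: Int, name: String, licence: String)

val raw = sc.textFile("../myTests/cars.csv")
val header = raw.first()

val carsDF = raw
  .filter(line => line != header && line.trim.nonEmpty)  // drop the header and empty lines
  .map(_.split(",", -1).map(_.trim))                      // -1 keeps trailing empty fields
  .map { fields =>
    // provide defaults when a field is missing or blank
    val id = if (fields.length > 0 && fields(0).nonEmpty) fields(0).toInt else -1
    val name = if (fields.length > 1) fields(1) else ""
    val licence = if (fields.length > 2) fields(2) else ""
    Cars(id, name, licence)
  }
  .toDF()

carsDF.registerTempTable("cars")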
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate()
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display sql results (display is available in Databricks notebooks; use sqlResults.show() elsewhere)
display(sqlResults)