Importing a SparkSession DataFrame on DSX - pyspark

I'm currently working in Data Science Experience and would like to import a CSV file as a SparkSession DataFrame. I am able to import the DataFrame successfully; however, all of the columns are read as string type. How do you make this DSX feature recognize the types present in the CSV file?

Currently, the generated code for the actual creation of the pyspark.sql.DataFrame looks like this:
df_data_1 = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .load('swift://container_name.' + name + '/test.csv')
df_data_1.take(5)
You have to add the following option, then the schema will be inferred:
.option('inferSchema', 'true')\

Related

How to import and read data from a dataset without using transform or transform_df in palantir foundry?

I want to know whether there are any ways to import the file without using transform_df or transform in a code repository.
Basically I want to extract the data from the dataset and return all the values as a list. If I use the transform or transform_df decorators, then I won't be able to access that input file while calling the return function.
Are you trying to access the raw files in the dataset? That is possible using the filesystem API. Search your stack's documentation for "Raw File Access", where you can find example Python code. You still use the @transform decorator, except instead of calling .dataframe() you call .filesystem(). Here's some example code.
import csv

with hair_eye_color.filesystem().open('students.csv') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)  # ['id', 'hair', 'eye', 'sex']
    next(reader)  # ['1', 'brown', 'brown', 'M']
You can then create a Spark dataframe from the file data and write it to the output.

Adding Column In sparkdataframe

Hi, I am trying to add one column to my Spark dataframe, calculating its value based on an existing dataframe column. I am writing the code below.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always shows the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
Incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online and also the org.apache.spark.sql.functions API documentation, because your use of WHEN, CONCAT, and IN is incorrect.
Scala strings are enclosed by double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), 'dd-MM- yyyy')

Spark SQL : is it possible to read the custom schema from an external source instead of creating it in within the spark code?

Trying to load a csv file without schema inference.
Usually we create the schema as StructType within the spark code.
Is it possible to save the schema in an external file (maybe a property/config file) and read it dynamically while creating the dataframe?
val customSchema_v2 = new StructType()
  .add("PROPERTY_ID_2222", "int")
  .add("OWNER_ID_2222", "int")
Is it possible to save the schema, i.e. "PROPERTY_ID_2222", "int" and "OWNER_ID_2222", "int", in a file and load the schema from there?
Both StructType and StructField are Serializable, so you can serialize a StructType to a file and deserialize it when you need it.
You can use JSON for schemas.
import org.apache.spark.sql.types._
val customSchema_v2 = new StructType()
  .add("PROPERTY_ID_2222", "int")
  .add("OWNER_ID_2222", "int")
val schemaString = customSchema_v2.json
println(schemaString)
val loadedSchema = DataType.fromJson(schemaString)
CONSOLE Output:
{"type":"struct","fields":[{"name":"PROPERTY_ID_2222","type":"integer","nullable":true,"metadata":{}},{"name":"OWNER_ID_2222","type":"integer","nullable":true,"metadata":{}}]}
You would need to add code that reads the schema from the JSON file.
JSON schema files can also be created manually and can be pretty-printed. To understand the format better, add more columns with different data types and use customSchema_v2.prettyJson to learn the syntax.
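A minimal sketch of that missing piece, assuming the schema JSON has been saved to a hypothetical file schema.json:
import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

// Read the schema definition from an external file (the path is an assumption)
val schemaJson = Source.fromFile("/path/to/schema.json").mkString
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Apply the loaded schema instead of inferring one
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv")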

What is the fastest way to transform a very large JSON file with Spark?

I have a rather large JSON file (Amazon product data) with a lot of individual JSON objects. Those JSON objects contain text that I want to preprocess for a specific training task, but it is the preprocessing that I need to speed up here. One JSON object looks like this:
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
The task would be to extract reviewText from each JSON object and perform some preprocessing like lemmatizing etc.
My problem is that I don't know how I could use Spark to speed this task up on a cluster. I am actually not even sure whether I can read that JSON file as a stream, object by object, and parallelize the main task.
What would be the best way to get started with this?
As you have a single JSON object per line, you can use SparkContext's textFile to get an RDD[String] of lines. Then use map to parse each JSON object with something like json4s and extract the necessary field.
Your whole code will look as simple as this (assuming you have a SparkContext as sc):
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit def formats = DefaultFormats
val r = sc.textFile("input_path").map(l => (parse(l) \ "reviewText").extract[String])
You can use a JSON dataset and then execute a simple sql query to retrieve the reviewText column's value:
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/reviews.json"
val people = sqlContext.read.json(path)
// Register this DataFrame as a table.
people.registerTempTable("reviews")
val reviewTexts = sqlContext.sql("SELECT reviewText FROM reviews")
Built from examples at the SparkSQL docs (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)
I would load the JSON data into a DataFrame and then select the field that I need; you can also use map to apply preprocessing like lemmatising.
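A minimal sketch of that approach (preprocess here is a hypothetical stand-in for the actual lemmatisation step):
import spark.implicits._

// Load the reviews and keep only the reviewText column
val reviews = spark.read.json("path/reviews.json")

// Hypothetical placeholder for lemmatisation or other text cleanup
def preprocess(text: String): String = text.toLowerCase

// Apply the preprocessing to each review in parallel
val processed = reviews
  .select("reviewText")
  .as[String]
  .map(preprocess)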

spark scala issue uploading csv

I am trying to load a csv file into a tempTable so that I can query it, and I am having two issues.
First: I tried loading the csv into a DataFrame, and this csv has some empty fields... and I didn't find a way to do it. I found someone in another post suggesting to use:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"
Then I uploaded the file and read it as a text file, without the headers, as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
gives an error because one of the licence fields is empty... I tried to handle this issue when I build the data frame but it did not work.
I can obviously go into the csv file and fix it by adding a null, but I do not want to do this because there are a lot of fields and it could be problematic. I want to fix it programmatically, either when I create the dataframe or in the class...
Any other thoughts, please let me know as well.
To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing: working with csv files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect the csv dialect, including the method of string quoting
how to handle quotes and newline characters inside strings
how to handle malformed lines
In the case of the example file you have to at least (see the sketch after this list):
filter empty lines
read the header
map lines to fields, providing a default value if a field is missing
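A rough sketch of that minimal approach, under the assumption that cars.csv is plain comma-separated with a header row and no quoted fields (the Car case class and the "unknown" default are illustrative):
// Spark 1.x style, matching the question: an existing SparkContext sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Car(id: Int, name: String, licence: String)

val lines  = sc.textFile("../myTests/cars.csv")
val header = lines.first()

val carsDF = lines
  .filter(_.trim.nonEmpty)            // filter empty lines
  .filter(_ != header)                // skip the header row
  .map(_.split(",", -1).map(_.trim))  // -1 keeps trailing empty fields
  .map(fields => Car(
    fields(0).toInt,
    if (fields.length > 1) fields(1) else "",
    if (fields.length > 2 && fields(2).nonEmpty) fields(2) else "unknown"))
  .toDF()

carsDF.registerTempTable("cars")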
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local")
  .appName("Spark CSV Reader")
  .getOrCreate()

// read csv
val df = spark.read
  .format("csv")
  .option("header", "true")          // read the headers
  .option("mode", "DROPMALFORMED")   // drop malformed lines
  .option("delimiter", ",")
  .load("/your/csv/dir/simplecsv.csv")

// create a table from the dataframe
df.createOrReplaceTempView("tableName")

// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")

// display sql results (display() is Databricks-specific; show() works everywhere)
sqlResults.show()