How to create an empty dataFrame in Spark - scala

I have a set of Avro based hive tables and I need to read data from them. As Spark-SQL uses hive serdes to read the data from HDFS, it is much slower than reading HDFS directly. So I have used data bricks Spark-Avro jar to read the Avro files from underlying HDFS dir.
Everything works fine except when the table is empty. I have managed to get the schema from the .avsc file of hive table using the following command but I am getting an error "No Avro files found"
val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"));
val schema = new Schema.Parser().parse(schemaFile);
spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()
Workarounds:
I have placed an empty file in that directory and the same thing works fine.
Are there any other ways to achieve the same? like conf setting or something?

You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:
empty_df = spark.createDataFrame([], schema) # spark is the Spark Session
If you already have a schema from another dataframe, you can just do this:
schema = some_other_df.schema
If you don't, then manually create the schema of the empty dataframe, for example:
schema = StructType([StructField("col_1", StringType(), True),
StructField("col_2", DateType(), True),
StructField("col_3", StringType(), True),
StructField("col_4", IntegerType(), False)]
)
I hope this helps.

Similar to EmiCareOfCell44's answer, just a little bit more elegant and more "empty"
val emptySchema = StructType(Seq())
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
emptySchema)

To create an empty DataFrame:
val my_schema = StructType(Seq(
StructField("field1", StringType, nullable = false),
StructField("field2", StringType, nullable = false)
))
val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
Maybe this may help

Depending on your Spark version, you can use the reflection way.. There is a private method in SchemaConverters which does the job to convert the Schema to a StructType.. (not sure why it is private to be honest, it would be really useful in other situations). Using scala reflection you should be able to do it in the following way
import scala.reflect.runtime.{universe => ru}
import org.apache.avro.Schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
var schemaStr = "{\n \"type\": \"record\",\n \"namespace\": \"com.example\",\n \"name\": \"FullName\",\n \"fields\": [\n { \"name\": \"first\", \"type\": \"string\" },\n { \"name\": \"last\", \"type\": \"string\" }\n ]\n }"
val schema = new Schema.Parser().parse(schemaStr);
val m = ru.runtimeMirror(getClass.getClassLoader)
val module = m.staticModule("com.databricks.spark.avro.SchemaConverters")
val im = m.reflectModule(module)
val method = im.symbol.info.decl(ru.TermName("toSqlType")).asMethod
val objMirror = m.reflect(im.instance)
val structure = objMirror.reflectMethod(method)(schema).asInstanceOf[com.databricks.spark.avro.SchemaConverters.SchemaType]
val sqlSchema = structure.dataType.asInstanceOf[StructType]
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], sqlSchema)
empty.printSchema

Related

How to define a schema for json to be used in from_json to parse out values

I am trying to come up with a schema definition to parse out information from dataframe string column I am using from_json for that . I need help in defining schema which I am somehow not getting it right.
Here is the Json I have
[
{
"sectionid":"838096e332d4419191877a3fd40ed1f4",
"sequence":0,
"questions":[
{
"xid":"urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
"questionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"quizquestionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"qtype":"3",
"sequence":0,
"subsectionsequence":-1,
"type":"80",
"question":"<p>This is a simple, 1 question assessment for automation testing</p>",
"totalpoints":"5.0",
"scoring":"1",
"scoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"inputoption":"0",
"casesensitive":"0",
"suggestedscoring":"1",
"suggestedscoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"answers":[
"1"
],
"options":[
]
}
]
}
]
I want to parse this information out which will result in columns
sectionid , sequence, xid, question.sequence, question.question(question text), answers
Here is what I have I have defined a schema for testing like this
import org.apache.spark.sql.types.{StringType, ArrayType, StructType,
StructField}
val schema = new StructType()
.add("sectionid", StringType, true)
.add("sequence", StringType, true)
.add("questions", StringType, true)
.add("answers", StringType, true)
finalDF = finalDF
.withColumn( "parsed", from_json(col("enriched_payload.transformed"),schema) )
But I am getting NULL in result columns the reason I think is my schema is not right.
I am struggling to come up with right definition . How do I come up with correct json schema definition ?
I am using spark 3.0
Try below code.
import org.apache.spark.sql.types._
val schema = ArrayType(
new StructType()
.add("sectionid",StringType,true)
.add("sequence",LongType,true)
.add("questions", ArrayType(
new StructType()
.add("answers",ArrayType(StringType,true),true)
.add("casesensitive",StringType,true)
.add("inputoption",StringType,true)
.add("options",ArrayType(StringType,true),true)
.add("qtype",StringType,true)
.add("question",StringType,true)
.add("questionid",StringType,true)
.add("quizquestionid",StringType,true)
.add("scoring",StringType,true)
.add("scoringrules",StringType,true)
.add("sequence",LongType,true)
.add("subsectionsequence",LongType,true)
.add("suggestedscoring",StringType,true)
.add("suggestedscoringrules",StringType,true)
.add("totalpoints",StringType,true)
.add("type",StringType,true)
.add("xid",StringType,true)
)
)
)

Parquet file is not partitioned Spark

I am trying to save a parquet Spark dataframe with partitioning to the temporary directory for unit tests, however, for some reason partitions are not created. The data itself is saved into the directory and can be used for tests.
Here is the method I have created for that:
def saveParquet(df: DataFrame, partitions: String*): String = {
val path = createTempDir()
df.repartition(1).parquet(path)(partitions: _*)
path
}
val feedPath: String = saveParquet(feedDF.select(feed.schema), "processing_time")
This method works for various dataframe with various schemas but for some reason does not generate partitions for this one. I have logged out the resulting path and it looks like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515
But it should look like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515/processing_time=1591714800000/part-some-random-numbersnappy.parquet
I have checked that the data and all the columns are read just fine before partitioning, as soon as partition call is created this problem occurs. Also, I ran a regex on directories which failed with match error on test samples - s".*processing_time=([0-9]+)/.*parquet".r
So what could be the reason of this problem? How else can I partition the dataframe?
Dataframe schema looks like this:
val schema: StructType = StructType(
Seq(
StructField("field1", StringType),
StructField("field2", LongType),
StructField("field3", StringType),
StructField("field4Id", IntegerType, nullable = true),
StructField("field4", FloatType, nullable = true),
StructField("field5Id", IntegerType, nullable = true),
StructField("field5", FloatType, nullable = true),
StructField("field6Id", IntegerType, nullable = true),
StructField("field6", FloatType, nullable = true),
//partition keys
StructField("processing_time", LongType)
)
)

Cleaning CSV/Dataframe of size ~40GB using Spark and Scala

I am kind of newbie to big data world. I have a initial CSV which has a data size of ~40GB but in some kind of shifted order. I mean if you see initial CSV, for Jenny there is no age, so sex column value is shifted to age and remaining column value keeps shifting till the last element in the row.
I want clean/process this CVS using dataframe with Spark in Scala. I tried quite a few solution with withColumn() API and all, but nothing worked for me.
If anyone can suggest me some sort of logic or API available which is out there to solve this in a cleaner way. I might not need proper solution but pointers will also do. Help much appreciated!!
Initial CSV/Dataframe
Required CSV/Dataframe
EDIT:
This is how I'm reading the data:
val spark = SparkSession .builder .appName("SparkSQL")
.master("local[*]") .config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
import spark.implicits._
val df = spark.read.option("header", true").csv("path/to/csv.csv")
This pretty much looks like the data is flawed. To handle this, I would suggest reading each line of the csv file as a single string and the applying a map() function to handle the data
case class myClass(name: String, age: Integer, sex: String, siblings: Integer)
val myNewDf = myDf.map(row => {
val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
val myRowValues = myRow.split(",")
if (4 == myRowValues.size()) {
//everything as expected
return myClass(myRowValues[0], myRowValues[1], myRowValues[2], myRowValues[3])
} else {
//do foo to guess missing values
}
}
As in your case Data is not properly formatted. To handle this first data has to be cleansed, i.e all rows of CSV should have same Schema or same no of delimiter/columns.
Basic approach to do this in spark could be:
Load data as Text
Apply map operation on loaded DF/DS to clean it
Create Schema manually
Apply Schema on the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
val schema = StructType(
List(
StructField("name", StringType, nullable = true),
StructField("age", IntegerType, nullable = true),
StructField("sex", StringType, nullable = true),
StructField("sib", IntegerType, nullable = true)
)
)
import spark.implicits._
val rawdf = spark.read.text("test.csv")
rawdf.show(10)
val rdd = rawdf.map(row => {
val raw = row.getAs[String]("value")
//TODO: Data cleansing has to be done.
val values = raw.split(",")
if (values.length != 4) {
s"${values(0)},,${values(1)},${values(2)}"
} else {
raw
}
})
val df = spark.read.schema(schema).csv(rdd)
df.show(10)
You can try to define a case class with Optional field for age and load your csv with schema directly into a Dataset.
Something like that :
import org.apache.spark.sql.{Encoders}
import sparkSession.implicits._
case class Person(name: String, age: Option[Int], sex: String, siblings: Int)
val schema = Encoders.product[Person].schema
val dfInput = sparkSession.read
.format("csv")
.schema(schema)
.option("header", "true")
.load("path/to/csv.csv")
.as[Person]

How can I get Spark SQL to import a Long without an "L" suffix?

I have a set of CSV's that I produced by Sqoop'ing a mySQL database. I am trying to define them the source for a dataframe in Spark.
The schema in the source db contains several fields with a Long datatype, and actually stores giant numbers in those fields.
When trying to access the dataframe, Scala chokes on interpreting these because I do not have an L suffix on the long integers.
As an example, this throws an error: val test: Long = 20130102180600
While this succeeds: val test: Long = 20130102180600L
Is there any way to force Scala to interpret these as Long Integers without that suffix? Due to the scale of the data, I do not believe it is feasible to post-process the fields as they come out of the database.
Give the schema explicitly, as in the example in README:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true),
StructField("comment", StringType, true),
StructField("blank", StringType, true)))
val df = sqlContext.load(
"com.databricks.spark.csv",
schema = customSchema,
Map("path" -> "cars.csv", "header" -> "true"))
val selectedData = df.select("year", "model")
selectedData.save("newcars.csv", "com.databricks.spark.csv")
Except using LongType for the large integer fields, of course.
Looking at the code, this definitely looks like it should work: fields are converted from String to desired type using TypeCast.castTo, and TypeCast.castTo for LongType just calls datum.toLong which works as desired (you can check "20130102180600".toLong in Scala REPL). In fact, InferSchema handles this case as well. I strongly suspect that the issue is different: perhaps the numbers are even out of Long range?
(I haven't actually tried this, but I expect it to work; if it doesn't, you should report the bug. Start by reading https://stackoverflow.com/help/mcve.)

Not able to create parquet files in hdfs using spark shell

I want to create parquet file in hdfs and then read it through hive as external table. I'm struck with stage failures in spark-shell while writing parquet files.
Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7
Input file:(employee.txt)
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23
In Spark-Shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
When I type the last command I get,
ERROR
SPARK APPLICATION MANAGER
I even tried increasing the executor memory, its still failing.
Also Importantly , finalDF.show() is producing the same error.
So, I believe I have made a logical error here.
Thanks for supporting
The issue here is you are creating a schema with all the fields/columns type defaulted to StringType. But while passing the values in the schema, the value of Id and Age is being converted to Integer as per the code.Hence, throwing the Matcherror while running.
The data types of columns in the schema should match the data type of values being passed to it. Try the below code.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(" ")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
This code should run fine.