I'm a bit new to Spark and Scala. I have a large (~1 million rows) Scala Spark DataFrame, and I need to turn it into a JSON string.
The schema of the DataFrame looks like this:
root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)
 |    |-- valKey (String)
 |    |-- vslScore (Double)
key is a product id, and value is a set of products and their score values that I read from a Parquet file.
I only manage to get something like this; the curly brackets are simply concatenated onto the result.
3434343<tab>{smartphones:apple:0.4564879,smartphones:samsung:0.723643 }
But I expect a value like this, where each key is wrapped in double quotes:
3434343<tab>{"smartphones:apple":0.4564879, "smartphones:samsung":0.723643 }
Is there any way to convert this directly into a JSON string without concatenating anything? I want to write the output files in .csv format. This is the code I'm using:
val df = parquetReaderDF
  .withColumn("key", col("productId"))
  .withColumn("value", struct(
    col("productType"),
    col("brand"),
    col("score")))
  .select("key", "value")

val df2 = df
  .withColumn("valKey", concat(
    col("productType"), lit(":"),
    col("brand"), lit(":"),
    col("score")))
  .groupBy("key")
  .agg(collect_list(col("valKey")))
  .map { r =>
    val key = r.getAs[String]("key")
    val value = r.getAs[Seq[String]]("collect_list(valKey)").mkString(",")
    (key, value)
  }
  .toDF("key", "valKey")
  .withColumn("valKey", concat(lit("{"), col("valKey"), lit("}")))

df.coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "false")
  .option("quoteMode", "yes")
  .save("data.csv")
I'm using Spark Structured Streaming (3.2.1) with Kafka.
I'm trying to simply read JSON from Kafka using a defined schema.
My problem is that the defined schema has a non-nullable field that is ignored when I read messages from Kafka. I use the from_json function, which seems to ignore that some fields can't be null.
Here is my code example:
val schemaTest = new StructType()
.add("firstName", StringType)
.add("lastName", StringType)
.add("birthDate", LongType, nullable = false)
val loader = spark
  .readStream
  .format("kafka")
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", "BROKER:PORT")
  .option("subscribe", "TOPIC")
  .load()

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))

df.printSchema()

val q = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
I get this when printing the schema of df, which is different from my schemaTest:
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- birthDate: long (nullable = true)
And the received data looks like this:
+---------+--------+----------+
|firstName|lastName|birthDate |
+---------+--------+----------+
|Toto     |Titi    |1643799912|
|Tutu     |Tata    |null      |
+---------+--------+----------+
We also tried adding an option to change the mode in the from_json function from the default PERMISSIVE to the others (DROPMALFORMED, FAILFAST), but the second record, which doesn't respect the defined schema, is simply not considered corrupted because the field birthDate is nullable.
Maybe I missed something, but if not, I have the following questions.
Do you know why the printSchema of df does not match my schemaTest (i.e. why the non-nullable field is lost)?
Also, how can I manage non-nullable values in my case? I know that I can filter, but I would like to know if there is an alternative using the schema, the way it's supposed to work. Filtering is also not that simple if I have a schema with lots of non-nullable fields.
This is actually the intended behavior of the from_json function. You can read the following in its source code:
// The JSON input data might be missing certain fields. We force the nullability
// of the user-provided schema to avoid data corruptions. In particular, the parquet-mr encoder
// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = schema.asNullable
override def nullable: Boolean = true
If you have multiple fields which are mandatory, then you can construct the filter expression from your schemaTest (or a list of columns) and use it like this:
val filterExpr = schemaTest.fields
  .filter(!_.nullable)
  .map(f => col(f.name).isNotNull)
  .reduce(_ and _)

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .filter(filterExpr)
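For the schemaTest above, only birthDate is non-nullable, so filterExpr reduces to col("birthDate").isNotNull and the second sample record (the one with a null birthDate) is filtered out.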
I would like to propose a different way of doing it:
def isCorrupted(df: DataFrame): DataFrame = {
  val filterNullable = schemaTest
    .filter(e => !e.nullable)
    .map(_.name)

  filterNullable
    .foldLeft(df.withColumn("isCorrupted", lit(0))) { (accumulator, columnName) =>
      // keep the flag once any non-nullable column has been found null
      accumulator.withColumn("isCorrupted",
        when(col(columnName).isNull, 1).otherwise(col("isCorrupted")))
    }
    .filter(col("isCorrupted") === lit(0))
    .drop(col("isCorrupted"))
}
val df = loader
  .selectExpr("CAST(value as STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .transform(isCorrupted)
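With the sample records above, the Tutu/Tata row (null birthDate) gets isCorrupted = 1 and is dropped, so only the first record reaches the console sink.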
I have duplicate columns in a text file, and when I load that text file using Spark Scala code, it loads successfully into a DataFrame and I can see the first 20 rows with df.show().
Full code:
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.show()
Up to this point my code works fine.
But as soon as I try the code below, it gives me an error.
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
So I have two columns with the same name in the text file, i.e. sequence.
How can I access those two columns? I tried sequence#2 but it's still not working.
Spark version: 1.6.0
Scala version: 2.10.5
Result of df.printSchema():
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)
I second @smart_coder's approach; I have a slightly different approach though. Please find it below.
You need unique column names to be able to query via hivesql.sql.
You can rename the columns dynamically using the code below:
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point, we need to remove the ambiguity; below is the solution:
var list = df.schema.map(_.name).toList
for (i <- 0 until list.size) {
  val cont = list.count(_ == list(i))
  val name = list(i)
  if (cont != 1) {
    list = list.take(i) ++ List(name + i) ++ list.drop(i + 1)
  }
}
val df1 = df.toDF(list: _*)
You would get the output below from df1.printSchema():
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we get all the column names as a list and check whether any column appears more than once; if a column repeats, we append its index to its name, and then we create a new DataFrame df1 from the list of renamed columns.
I have tested this in Spark 2.4, but it should work in 1.6 as well.
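As a hypothetical usage example, with the duplicates renamed, the original query only needs the new name sequence1 for one of the duplicated columns:

df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id, sequence1, sequence from Sample_File")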
The code below might help you resolve your problem. I have tested it in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")
I have a list of defined columns as:
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
And a file (a CSV with the header columns Products Selled and Total Value) which is read as a DataFrame:
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(filePath)
// the csv file has the header as colNames
var finalDf = df
  .withColumn("row_id", monotonically_increasing_id)
  .select(cols
    .map(_.colName.trim)
    .map(col): _*)

// convert df column names to colCodes (for the Kudu table columns)
cols.foreach(c => finalDf = finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
In the last line, I change the DataFrame column names, e.g. from Products Selled to products_selled. Because of this, finalDf is a var.
I want to know if there is a solution that lets me declare finalDf as a val rather than a var.
I tried something like the code below, but withColumnRenamed returns a new DataFrame, and I cannot capture the result outside cols.foreach:
cols.foreach(c => finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
Using select you can rename columns.
Renaming columns inside select is faster than foldLeft; check this post for a comparison.
Try the code below.
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "string", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val colExpr = cols.map(c => trim(col(c.colName)).as(c.colCode.trim))
If you store a valid column data type in the ExcelColumn case class, you can apply that data type like below.
val colExpr = cols.map(c => trim(col(c.colName).cast(c.colType)).as(c.colCode.trim))
finalDf.select(colExpr:_*)
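As a usage sketch (assuming the df and colExpr from above, and keeping the asker's row_id column explicitly since the select would otherwise drop it), finalDf can now be declared as a val:

val finalDf = df
  .withColumn("row_id", monotonically_increasing_id)
  .select((col("row_id") +: colExpr): _*)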
The better way is to use foldLeft with withColumnRenamed:
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val resultDF = cols.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c.colName.trim, c.colCode.trim)
}
Original Schema:
root
|-- Products Selled: integer (nullable = false)
|-- Total Value: string (nullable = true)
|-- value: integer (nullable = false)
New Schema:
root
|-- products_selled: integer (nullable = false)
|-- total_value: string (nullable = true)
|-- value: integer (nullable = false)
I have a piece of code where, at the end, I write a DataFrame to a Parquet file.
The logic is such that the DataFrame could sometimes be empty, and hence I get the error below.
df.write.format("parquet").mode("overwrite").save(somePath)
org.apache.spark.sql.AnalysisException: Parquet data source does not support null data type.;
When I print the schema of df, I get the following:
df.schema
res2: org.apache.spark.sql.types.StructType =
StructType(
StructField(rpt_date_id,IntegerType,true),
StructField(rpt_hour_no,ShortType,true),
StructField(kpi_id,IntegerType,false),
StructField(kpi_scnr_cd,StringType,false),
StructField(channel_x_id,IntegerType,false),
StructField(brand_id,ShortType,true),
StructField(kpi_value,FloatType,false),
StructField(src_lst_updt_dt,NullType,true),
StructField(etl_insrt_dt,DateType,false),
StructField(etl_updt_dt,DateType,false)
)
Is there a workaround to just write the empty file with schema, or not write the file at all when empty?
Thanks
The error you are getting is not related to the fact that your DataFrame is empty. I don't see the point of saving an empty DataFrame, but you can do it if you want. Try this if you don't believe me:
val schema = StructType(
  Array(
    StructField("col1", StringType, true),
    StructField("col2", StringType, false)
  )
)

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write
  .format("parquet")
  .save("/tmp/test_empty_df")
You are getting that error because one of your columns is of NullType, and as the thrown exception indicates, "Parquet data source does not support null data type".
I can't know for sure why you have a column of NullType, but that usually happens when you read data from a source and let Spark infer the schema. If there is an empty column in that source, Spark won't be able to infer its type and will set it to NullType.
If this is what's happening, my advice is that you specify the schema on read.
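A minimal sketch of that (with a made-up path and column list, adjust both to your source), so that Spark never falls back to NullType for an all-empty column:

import org.apache.spark.sql.types._

// hypothetical explicit schema for the source file
val explicitSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

val dfWithSchema = spark.read
  .schema(explicitSchema)
  .option("header", "true")
  .csv("/path/to/source.csv")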
If this is not the case, a possible solution is to cast all the columns of NullType to a Parquet-compatible type (like StringType). Here is an example of how to do it:
//df is a dataframe with a column of NullType
val df = Seq(("abc",null)).toDF("col1", "col2")
df.printSchema
root
|-- col1: string (nullable = true)
|-- col2: null (nullable = true)
// fold left to cast every NullType column to StringType
val df1 = df.columns.foldLeft(df) { (acc, cur) =>
  if (df.schema(cur).dataType == NullType)
    acc.withColumn(cur, col(cur).cast(StringType))
  else
    acc
}
df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
Hope this helps
'Or not write the file at all when empty?' Check that df is not empty and only write it in that case:
if (!df.isEmpty)
  df.write.format("parquet").mode("overwrite").save("somePath")
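Note that Dataset.isEmpty is only available from Spark 2.4 onwards; on older versions you can use df.head(1).isEmpty for the same check.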
While I am trying to create a DataFrame using a decimal type, it throws the error below.
I am performing the following steps:
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.DataTypes._;
//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)
//Created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)
df1 gets created without any errors. But when I run the df1.collect() action, it gives me the error below:
scala.MatchError: 0 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
test_file.txt content:
test1,0
test2,0.67
test3,10.65
test4,-10.1234567890
Is there any issue with the way that I am creating DecimalType?
You should have an instance of BigDecimal to convert to DecimalType.
val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()
The result looks like this:
[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
|-- COL1: string (nullable = true)
|-- COL2: decimal(15,10) (nullable = true)
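The key point is that the values placed in each Row must already match the Catalyst type: DecimalType expects a BigDecimal (or java.math.BigDecimal), not the raw string produced by split, which is why the original Row.fromSeq(x) of split strings fails with scala.MatchError.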
When you read a file with sc.textFile, it reads all the values as strings, so the error comes from applying the schema while creating the DataFrame.
To handle this, you can convert the second value to a decimal before applying the schema:
val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))
Or, if you are reading a CSV file, you can use spark-csv to read it and provide the schema while reading the file.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For Spark > 2.0:
spark.read
.option("header", true)
.schema(sch)
.csv(file)
Hope this helps!
A simpler way to solve your problem would be to load the CSV file directly as a DataFrame. You can do that like this:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false") // no header
.option("inferSchema", "true")
.load("/file/path/")
Or for Spark > 2.0:
val spark = SparkSession.builder.getOrCreate()
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "false") // no headers
.load("/file/path")
Output:
df.show()
+-----+--------------+
| _c0| _c1|
+-----+--------------+
|test1| 0|
|test2| 0.67|
|test3| 10.65|
|test4|-10.1234567890|
+-----+--------------+
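If you then need the original column names and the decimal(15,10) type from the question, a possible (hypothetical, untested) follow-up is to rename and cast the inferred columns:

val typedDf = df
  .withColumnRenamed("_c0", "COL1")
  .withColumn("COL2", col("_c1").cast(DataTypes.createDecimalType(15, 10)))
  .drop("_c1")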