DecimalType issue while creating Dataframe - scala

When I try to create a dataframe using a decimal type, it throws the error below.
I am performing the following steps:
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.DataTypes._;
//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)
//Created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)
df1 gets created without any errors, but when I call the df1.collect() action it gives me the error below:
scala.MatchError: 0 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
test_file.txt content:
test1,0
test2,0.67
test3,10.65
test4,-10.1234567890
Is there any issue with the way that I am creating DecimalType?

You should pass an instance of BigDecimal so the value can be converted to DecimalType.
val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()
The result looks like this:
[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
|-- COL1: string (nullable = true)
|-- COL2: decimal(15,10) (nullable = true)

When you read a file with sc.textFile, every value is read as a string, so the error comes from applying the schema while creating the dataframe.
To fix this, you can convert the second value to a BigDecimal before applying the schema:
val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))
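If some rows may have a missing or non-numeric second column, a null-safe variant (a minimal sketch, using scala.util.Try; not from the original answer) avoids a NumberFormatException:
import scala.util.Try
// Parse the second column defensively: invalid or missing values become null,
// which is fine because COL2 is declared nullable in the schema.
val safeRow = src.map(_.split(",", -1)).map { x =>
  val dec = Try(BigDecimal(x(1).trim)).toOption.orNull
  Row(x(0), dec)
}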
Or, since you are reading a csv file, you can use spark-csv to read it and provide the schema while reading the file:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For Spark 2.0 and later:
spark.read
.option("header", true)
.schema(sch)
.csv(file)
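Note that test_file.txt in the question has no header row, so a minimal sketch for that exact file (reusing the schema types and imports from the question, and assuming a SparkSession named spark) would be:
// Apply the DecimalType schema while reading, so no manual string-to-BigDecimal conversion is needed.
val decSchema = StructType(
  StructField("COL1", StringType, true) ::
  StructField("COL2", DataTypes.createDecimalType(15, 10), true) :: Nil)

val typed = spark.read
  .option("header", false)
  .schema(decSchema)
  .csv("test_file.txt")
typed.printSchema()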
Hope this helps!

A simpler way to solve your problem would be to load the csv file directly as a dataframe. You can do that like this:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false") // no header
.option("inferSchema", "true")
.load("/file/path/")
Or for Spark 2.0 and later:
val spark = SparkSession.builder.getOrCreate()
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "false") // no headers
.load("/file/path")
Output:
df.show()
+-----+--------------+
|  _c0|           _c1|
+-----+--------------+
|test1|             0|
|test2|          0.67|
|test3|         10.65|
|test4|-10.1234567890|
+-----+--------------+
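If you still need COL2 as decimal(15,10) after loading this way, a minimal sketch (column names _c0/_c1 as shown above; the rename is just for readability) that casts the loaded column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Rename the default columns and cast the second one to the desired precision and scale.
val typedDf = df
  .withColumnRenamed("_c0", "COL1")
  .withColumn("COL2", col("_c1").cast(DecimalType(15, 10)))
  .drop("_c1")
typedDf.printSchema()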

Related

Read a csv file with JSON as a string in the csv and convert it to JSON (Apache Spark, Scala)

I have a csv file like this:
id|jsonData
1|"[{""phone"":1989731788,""sources"":[""ecventa/clientecompania""],""lastDate"":1227532475000},{""phone"":374660,""sources"":[""ecventa/clientecompania""],""lastDate"":1227532475000}]"
and then I have a StructType like this:
val nestedPhone = new StructType()
.add("phone",StringType,true)
.add("sources",ArrayType(StringType),true)
.add("lastDate",StringType,true)
val myStructType = new StructType()
.add("id",StringType,true)
.add("formatedData",ArrayType(nestedPhone),true)
var batchDF = spark.read.format("csv")
.option("header", "true")
.option("delimiter", "|")
.load("mycsvPath")
Can I create a new Dataframe with myStructType using batchDF?
I tested this:
val result = batchDF
.withColumn("formatedData",from_json(expr("substring(jsonData, 2, length(jsonData) - 2)"),ArrayType(nestedPhone)))
but that does not work: formatedData returns null rows.
Try it this way:
val nestedPhone = new StructType()
.add("phone", StringType, true)
.add("sources", ArrayType(StringType), true)
.add("lastDate", StringType, true)
val batchDF = sparkSession.read.format("csv")
.option("header", "true")
.option("delimiter", "|")
.load("src/main/resources/file.csv")
val result = batchDF
.withColumn("formatedData",
from_json(
expr(
"substring(jsonData, 2, length(jsonData) - 2)"
), ArrayType(nestedPhone)
)
)
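Once formatedData is parsed, a small usage sketch (column names taken from the question) to flatten the array and pull out the nested fields:
import org.apache.spark.sql.functions.{col, explode}

// One row per phone entry, with the nested struct fields as top-level columns.
val phones = result
  .select(col("id"), explode(col("formatedData")).alias("entry"))
  .select(col("id"), col("entry.phone"), col("entry.sources"), col("entry.lastDate"))
phones.show(false)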

I don't know how to do the same using a parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val tempDF=spark.read.csv("data.csv")
tempDF.coalesce(1).write.parquet("Parquet")
val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised parquet file and then loaded it. Now I want to do all the transformations on the parquet file, just like I did on the csv file shown below, and then save the result as a parquet file. Link of (data.csv) and (output.csv)
val header = rdd.first
val rdd1 = rdd.filter(_ != header)
val resultRDD = rdd1.map { r =>
val Array(country, values) = r.split(",")
country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))
import spark.sqlContext.implicits._
val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
dataSet.coalesce(1).write.option("header","true").csv("output")
}
case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements of the array per Country. I have done this with the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("path", "/path/to/input/data.csv")
.load()
val df1 = df.select(
$"Country",
(split($"Values", ";"))(0).alias("c1"),
(split($"Values", ";"))(1).alias("c2"),
(split($"Values", ";"))(2).alias("c3"),
(split($"Values", ";"))(3).alias("c4"),
(split($"Values", ";"))(4).alias("c5")
)
.groupBy($"Country")
.agg(
sum($"c1" cast "int").alias("s1"),
sum($"c2" cast "int").alias("s2"),
sum($"c3" cast "int").alias("s3"),
sum($"c4" cast "int").alias("s4"),
sum($"c5" cast "int").alias("s5")
)
.select(
$"Country",
concat(
$"s1", lit(";"),
$"s2", lit(";"),
$"s3", lit(";"),
$"s4", lit(";"),
$"s5"
).alias("Values")
)
df1.repartition(1)
.write
.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "/path/to/output")
.save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country|             Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
|  China| 218;239;234;209;75|
|  India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything you wish (see the sketch below).
I have repartitioned df1 into 1 partition just so that you get a single output file; you can choose whether or not to repartition based on your use case.
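For example, a minimal sketch of the same write as Parquet instead of CSV (the output path here is just a placeholder):
// Same aggregated result, written as Parquet; the header and delimiter options do not apply here.
df1.repartition(1)
  .write
  .format("parquet")
  .option("path", "/path/to/output_parquet")
  .save()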
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as parquet, like you have in your question.
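One caveat: parquetFileDF.rdd is an RDD[Row], not an RDD[String] as with sc.textFile, so the line-splitting code needs a small adjustment. A minimal sketch (assuming, as in your original code, that the CSV header ended up stored as a data row because the header option was not set):
// Rows already carry parsed fields, so take them by position instead of splitting strings.
val headerRow = parquetFileDF.first()
val resultRDD = parquetFileDF.rdd
  .filter(_ != headerRow)
  .map(row => row.getString(0) -> row.getString(1))
  .reduceByKey((a, b) =>
    a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))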

How to read a Redis map in Spark using spark-redis

I have a normal Scala map in Redis (key and value). Now I want to read that map in one of my spark-streaming programs and use it as a broadcast variable, so that my slaves can use the map to resolve key mappings. I am using the spark-redis 2.3.1 library, but I am not sure how to read it.
Map in redis table "employee" -
name | value
-----+------
123  | David
124  | John
125  | Alex
This is how I am trying to read it in Spark (not sure if this is correct, please correct me):
val loadedDf = spark.read
.format("org.apache.spark.sql.redis")
.schema(
StructType(Array(
StructField("name", IntegerType),
StructField("value", StringType)
)
))
.option("table", "employee")
.option("key.column", "name")
.load()
loadedDf.show()
The above code does not show anything; I get empty output.
You could use the code below for your task, but you need to work through the Spark DataFrame/Dataset API (mapping the DataFrame to a case class). Below is a full example of writing to and reading from Redis.
object DataFrameExample {
case class employee(name: String, value: Int)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val personSeq = Seq(employee("John", 30), employee("Peter", 45))
val df = spark.createDataFrame(personSeq)
df.write
.format("org.apache.spark.sql.redis")
.option("table", "person")
.mode(SaveMode.Overwrite)
.save()
val loadedDf = spark.read
.format("org.apache.spark.sql.redis")
.option("table", "person")
.load()
loadedDf.printSchema()
loadedDf.show()
}
}
The output is below:
root
|-- name: string (nullable = true)
|-- value: integer (nullable = false)
+-----+-----+
| name|value|
+-----+-----+
| John|   30|
|Peter|   45|
+-----+-----+
You can also check the Redis documentation for more details.
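To then use such a map as a broadcast variable, as described in the question, one option is to collect the loaded DataFrame into a plain Scala Map and broadcast it. A minimal sketch: employeeDf here is just a placeholder for the DataFrame loaded from the "employee" table with the IntegerType/StringType schema from the question, and it assumes the map is small enough to collect on the driver.
// Collect the key/value pairs on the driver and broadcast them to the executors.
val employeeMap: Map[Int, String] = employeeDf
  .collect()
  .map(row => row.getAs[Int]("name") -> row.getAs[String]("value"))
  .toMap

val employeeBroadcast = spark.sparkContext.broadcast(employeeMap)
// Inside executor-side code: employeeBroadcast.value.get(123)  // Some("David")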

Convert a dataframe into json string in Spark

I'm a bit new to Spark and Scala. I have a large (~1 million rows) Scala Spark DataFrame, and I need to turn it into a JSON string.
The schema of the df looks like this:
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
|-- valKey (String)
|-- vslScore (Double)
key is the product id, and value is a product set and its score values that I get from a parquet file.
I only managed to get something like this; for the curly brackets I simply concatenate them onto the result.
3434343<tab>{smartphones:apple:0.4564879,smartphones:samsung:0.723643 }
But I expect a value like this, where each key is wrapped in double quotes:
3434343<tab>{"smartphones:apple":0.4564879, "smartphones:samsung":0.723643 }
Is there any way to convert this directly into a JSON string without concatenating anything? I want to write the output files in .csv format. This is the code I'm using:
val df = parquetReaderDF.withColumn("key",col("productId"))
.withColumn("value", struct(
col("productType"),
col("brand"),
col("score")))
.select("key","value")
val df2 = df.withColumn("valKey", concat(
col("productType"),lit(":")
,col("brand"),lit(":"),
col("score")))
.groupBy("key")
.agg(collect_list(col("valKey")))
.map{ r =>
val key = r.getAs[String]("key")
val value = r.getAs[Seq[String]] ("collect_list(valKey)").mkString(",")
(key,value)
}
.toDF("key", "valKey")
.withColumn("valKey", concat(lit("{"), col("valKey"), lit("}")))
df.coalesce(1)
.write.mode(SaveMode.Overwrite)
.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.option("header", "false")
.option("quoteMode", "yes")
.save("data.csv")
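One way to get properly quoted keys without concatenating strings by hand is to build a map column and let to_json do the quoting. A minimal sketch (it needs Spark 2.4+ for map_from_entries; the column names are taken from your code above):
import org.apache.spark.sql.functions._

// Collect (key, score) pairs per product, turn them into a map column,
// and let to_json produce {"smartphones:apple":0.4564879,...} with quoted keys.
val jsonDf = parquetReaderDF
  .withColumn("valKey", concat_ws(":", col("productType"), col("brand")))
  .groupBy(col("productId").alias("key"))
  .agg(to_json(map_from_entries(collect_list(struct(col("valKey"), col("score"))))).alias("value"))

jsonDf.show(false)
You can then write jsonDf out the same way you already do; depending on the CSV writer's quoting options, you may need to tune them so the embedded double quotes are kept as-is.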

How to create a Dataframe programmatically that isn't StringType

I'm building a schema that is rather large, so I am using the example of programmatic schema creation from the documentation.
val schemaString = "field1,...,field126"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.trim, StringType, true)))
This works fine but I need to have all fields as DoubleType for my ML function. I changed the StringType to DoubleType and I get an error.
val schemaString = "field1,...,field126"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.trim, DoubleType, true)))
Error:
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
I know I can shift to creating the schema manually but with 126 fields the code gets bulky.
val schema = new StructType()
.add("ColumnA", IntegerType)
.add("ColumnB", StringType)
val df = sqlContext.read
.schema(schema)
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.load("/path/to/file.csv")
I think there is no need to pass your own schema; it will be inferred automatically, and if your csv file contains the column names it will pick those up too when you set header to true.
This should work (not tested):
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data/sample.csv")
It will give you a dataframe, and if your file has column names, just set header to true!
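If you genuinely need every one of the 126 fields as DoubleType (e.g. for the ML function), a minimal sketch is to load as above and then cast all columns programmatically:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast every column of the loaded dataframe to double in one pass.
val doubleDf = df.select(df.columns.map(c => col(c).cast(DoubleType)): _*)
doubleDf.printSchema()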