How to read redis map in spark using spark-redis - scala

I have a normal scala map in Redis (key and value). Now I want to read that map in one of my spark-streaming programs and use it as a broadcast variable so that my slaves can use that map to resolve key mappings. I am using the spark-redis 2.3.1 library, but I am not sure how to read it.
Map in Redis table "employee":

name | value
-----|------
123  | David
124  | John
125  | Alex
This is how I am trying to read it in Spark (not sure if this is correct, please correct me):
val loadedDf = spark.read
  .format("org.apache.spark.sql.redis")
  .schema(
    StructType(Array(
      StructField("name", IntegerType),
      StructField("value", StringType)
    ))
  )
  .option("table", "employee")
  .option("key.column", "name")
  .load()
loadedDf.show()
The above code does not show anything; I get empty output.

You could use the code below for your task, but you need to use a Spark Dataset (i.e., map the DataFrame to a case class). Below is a full example of writing to and reading from Redis.
import org.apache.spark.sql.{SaveMode, SparkSession}

object DataFrameExample {
  case class employee(name: String, value: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("redis-df")
      .master("local[*]")
      .config("spark.redis.host", "localhost")
      .config("spark.redis.port", "6379")
      .getOrCreate()

    val personSeq = Seq(employee("John", 30), employee("Peter", 45))
    val df = spark.createDataFrame(personSeq)

    // write the dataframe to Redis
    df.write
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .mode(SaveMode.Overwrite)
      .save()

    // read it back
    val loadedDf = spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .load()
    loadedDf.printSchema()
    loadedDf.show()
  }
}
Output is below
root
|-- name: string (nullable = true)
|-- value: integer (nullable = false)
+-----+-----+
| name|value|
+-----+-----+
| John|   30|
|Peter|   45|
+-----+-----+
You can also find more details in the Redis documentation.
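On the broadcast part of the question: once the table loads correctly, a minimal sketch (assuming the hash keys and values can be collected to the driver as plain strings) is to turn the small lookup DataFrame into a Scala Map and broadcast it:

// Sketch: collect the small lookup table to the driver and broadcast it to the executors.
// Column types are assumed to be strings here (no explicit schema); adjust getAs[...] if you supply one.
val employeeDf = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "employee")
  .option("key.column", "name")
  .load()

val employeeMap: Map[String, String] = employeeDf
  .collect()
  .map(row => row.getAs[String]("name") -> row.getAs[String]("value"))
  .toMap

val employeeBroadcast = spark.sparkContext.broadcast(employeeMap)
// Executors can then resolve a key with employeeBroadcast.value.get("123")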


Spark Structured Streaming: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`

I'm using Spark Structured Streaming (3.2.1) with Kafka.
I'm trying to simply read JSON from Kafka using a defined schema.
My problem is that the defined schema has a non-nullable field that is ignored when I read messages from Kafka. I use the from_json function, which seems to ignore that some fields can't be null.
Here is my code example:
val schemaTest = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)
  .add("birthDate", LongType, nullable = false)

val loader = spark
  .readStream
  .format("kafka")
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", "BROKER:PORT")
  .option("subscribe", "TOPIC")
  .load()

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))

df.printSchema()

val q = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
This is what I get when printing the schema of df, which is different from my schemaTest:
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- birthDate: long (nullable = true)
And the received data looks like this:
+---------+--------+----------+
|firstName|lastName|birthDate |
+---------+--------+----------+
|Toto     |Titi    |1643799912|
|Tutu     |Tata    |null      |
+---------+--------+----------+
We also tried changing the mode option of the from_json function from the default PERMISSIVE to the others (DROPMALFORMED, FAILFAST), but the second record, which does not respect the defined schema, is simply not considered corrupted because the field birthDate is nullable.
Maybe I missed something, but if not, I have the following questions:
Do you know why the printSchema of df does not match my schemaTest (i.e., why the field is not non-nullable)?
Also, how can I manage non-nullable values in my case? I know that I can filter, but I would like to know if there is an alternative using the schema, the way it is supposed to work. Filtering is also not that simple if the schema has lots of non-nullable fields.
This is actually the intended behavior of the from_json function. You can read the following in the source code:
// The JSON input data might be missing certain fields. We force the nullability
// of the user-provided schema to avoid data corruptions. In particular, the parquet-mr encoder
// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = schema.asNullable
override def nullable: Boolean = true
If you have multiple fields that are mandatory, you can construct the filter expression from your schemaTest (or a list of columns) and use it like this:
val filterExpr = schemaTest.fields
  .filter(!_.nullable)
  .map(f => col(f.name).isNotNull)
  .reduce(_ and _)

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .filter(filterExpr)
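For the schemaTest above, where birthDate is the only non-nullable field, filterExpr simply reduces to col("birthDate").isNotNull, so the second record from the example output would be dropped.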
I would like to propose a different way of doing it:
def isCorrupted(df: DataFrame): DataFrame = {
  val nonNullableColumns = schemaTest
    .filter(field => !field.nullable)
    .map(_.name)

  // Start from a zero flag and raise it whenever a mandatory column is null,
  // so the per-column checks do not overwrite each other.
  nonNullableColumns
    .foldLeft(df.withColumn("isCorrupted", lit(0))) { case (accumulator, columnName) =>
      accumulator.withColumn(
        "isCorrupted",
        when(col(columnName).isNull, lit(1)).otherwise(col("isCorrupted"))
      )
    }
    .filter(col("isCorrupted") === lit(0))
    .drop(col("isCorrupted"))
}

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .transform(isCorrupted)

How to wrap multiple sql functions into a UDF in Spark?

I am working with Spark 2.3.2.
On one column within my DataFrame I am applying many spark.sql.functions sequentially. How can I wrap this sequence of functions into a user-defined function (UDF) to make it reusable?
Here is my example, focusing on the one column "columnName". First I create my test data:
val testSchema = new StructType()
  .add("columnName", new StructType()
    .add("2020-11", LongType)
    .add("2020-12", LongType)
  )
val testRow = Seq(Row(Row(1L, 2L)))
val testRDD = spark.sparkContext.parallelize(testRow)
val testDF = spark.createDataFrame(testRDD, testSchema)
testDF.printSchema()
/*
root
|-- columnName: struct (nullable = true)
| |-- 2020-11: long (nullable = true)
| |-- 2020-12: long (nullable = true)
*/
testDF.show(false)
/*
+----------+
|columnName|
+----------+
|[1, 2] |
+----------+
*/
And here is the sequence of applied Spark SQL functions (just as an example):
val testResult = testDF
  .select(explode(split(regexp_replace(to_json(col("columnName")), "[\"{}]", ""), ",")).as("result"))
I am failing to create a UDF "myUDF" such that I can get the same result when calling
val testResultWithUDF = testDF.select(myUDF(col("columnName")))
This is what I "would like" to do:
def parseAndExplode(spalte: Column): Column = {
  explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
val myUDF = udf(parseAndExplode _)
testDF.withColumn("udf_result", myUDF(col("columnName"))).show(false)
but it is throwing an Exception:
Schema for type org.apache.spark.sql.Column is not supported
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
I also tried using a Row as the input parameter, but then failed again when trying to apply the built-in SQL functions.
There is no need to use a UDF here. explode, split, and most other functions from org.apache.spark.sql.functions already return an object of type Column.
def parseAndExplode(spalte: Column): Column = {
  explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}

testDF.withColumn("udf_result", parseAndExplode('columnName)).show(false)
prints
+----------+----------+
|columnName|udf_result|
+----------+----------+
|[1, 2] |2020-11:1 |
|[1, 2] |2020-12:2 |
+----------+----------+
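If the goal is a reusable building block, another option (a sketch using a hypothetical helper name, not something from the original answer) is to package the same expression as a function passed to Dataset.transform:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, regexp_replace, split, to_json}

// Hypothetical helper: the same column expression, wrapped as a whole-DataFrame transformation
def parseAndExplodeDF(colName: String)(df: DataFrame): DataFrame =
  df.select(explode(split(regexp_replace(to_json(col(colName)), "[\"{}]", ""), ",")).as("result"))

val testResultWithTransform = testDF.transform(parseAndExplodeDF("columnName"))
testResultWithTransform.show(false)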

I don't know how to do the same using a parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._

object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val tempDF = spark.read.csv("data.csv")
    tempDF.coalesce(1).write.parquet("Parquet")

    val rdd = sc.textFile("Parquet")
I convert data.csv into an optimised parquet file and then load it. Now I want to do all the transformations on the parquet file, just like I did on the csv file below, and then save the result as a parquet file. Link to (data.csv) and (output.csv)
    val header = rdd.first
    val rdd1 = rdd.filter(_ != header)

    val resultRDD = rdd1.map { r =>
      val Array(country, values) = r.split(",")
      country -> values
    }.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))

    import spark.sqlContext.implicits._
    val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
    dataSet.coalesce(1).write.option("header", "true").csv("output")
  }

  case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements from the array based on Country. I have done this using the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("path", "/path/to/input/data.csv")
  .load()

val df1 = df.select(
    $"Country",
    (split($"Values", ";"))(0).alias("c1"),
    (split($"Values", ";"))(1).alias("c2"),
    (split($"Values", ";"))(2).alias("c3"),
    (split($"Values", ";"))(3).alias("c4"),
    (split($"Values", ";"))(4).alias("c5")
  )
  .groupBy($"Country")
  .agg(
    sum($"c1" cast "int").alias("s1"),
    sum($"c2" cast "int").alias("s2"),
    sum($"c3" cast "int").alias("s3"),
    sum($"c4" cast "int").alias("s4"),
    sum($"c5" cast "int").alias("s5")
  )
  .select(
    $"Country",
    concat(
      $"s1", lit(";"),
      $"s2", lit(";"),
      $"s3", lit(";"),
      $"s4", lit(";"),
      $"s5"
    ).alias("Values")
  )

df1.repartition(1)
  .write
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("path", "/path/to/output")
  .save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/ORC or anything else you wish; see the snippet below.
I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose whether or not to repartition based on your use case.
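For example, writing the same aggregated result as parquet instead of csv would look like this (the output path is just illustrative):

// Same result as above, written as parquet (path is illustrative)
df1.repartition(1)
  .write
  .format("parquet")
  .option("path", "/path/to/output_parquet")
  .save()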
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
  .appName("Test")
  .master("local[*]")
  .getOrCreate()

// Read in the parquet file created above.
// Parquet files are self-describing, so the schema is preserved.
// The result of loading a parquet file is also a DataFrame.
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write as parquet like you have in your question; see the sketch below.
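One caveat (the column positions and the header-row filter below are assumptions based on the CSV in the question): parquetFileDF.rdd gives an RDD of Rows rather than of comma-separated strings, so the rows need to be mapped to (country, values) pairs before reusing the reduceByKey step:

import spark.implicits._

// Build the (country, values) pairs from the parquet rows.
// Column 0 = country, column 1 = values (assumed from the original CSV layout);
// the CSV header row became a data row because header=true was not set on read.
val pairs = parquetFileDF.rdd
  .map(row => (row.getString(0), row.getString(1)))
  .filter { case (country, _) => country != "Country" } // drop the header row (assumption)

// Same aggregation as in the question
val resultRDD = pairs.reduceByKey((a, b) =>
  a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";")
)

// Write the result as parquet this time
resultRDD.toDF("country", "values").coalesce(1).write.parquet("output_parquet")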

DecimalType issue while creating Dataframe

While trying to create a dataframe using a decimal type, I am getting the error below.
I am performing the following steps:
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.DataTypes._;
//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)
//Created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)
df1 is created without any errors. But when I issue the df1.collect() action, it gives me the error below:
scala.MatchError: 0 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
test_file.txt content:
test1,0
test2,0.67
test3,10.65
test4,-10.1234567890
Is there any issue with the way that I am creating DecimalType?
You should have an instance of BigDecimal to convert to DecimalType.
val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()
The result looks like this:
[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
|-- COL1: string (nullable = true)
|-- COL2: decimal(15,10) (nullable = true)
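A small variant (my suggestion, not part of the original answer): the BigDecimal can also be built from the string directly, avoiding the round trip through Double:

// Parse the decimal text directly instead of converting to Double first
val row = src.map(_.split(",")).map(x => Row(x(0), BigDecimal(x(1))))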
When you read a file with sc.textFile, it reads all the values as strings, so the error comes from applying the schema while creating the dataframe.
To fix this, you can convert the second value to a Decimal before applying the schema:
val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))
Or, if you are reading a csv file, you can use spark-csv to read the file and provide the schema while reading it:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("cars.csv")
For Spark > 2.0
spark.read
  .option("header", true)
  .schema(sch)
  .csv(file)
Hope this helps!
A simpler way to solve your problem would be to load the csv file directly as a dataframe. You can do that like this:
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false") // no header
  .option("inferSchema", "true")
  .load("/file/path/")
Or for Spark > 2.0:
val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false") // no header
  .load("/file/path")
Output:
df.show()
+-----+--------------+
| _c0| _c1|
+-----+--------------+
|test1| 0|
|test2| 0.67|
|test3| 10.65|
|test4|-10.1234567890|
+-----+--------------+

Unable to find encoder for type stored in a Dataset. in spark structured streaming

I am trying the Spark Structured Streaming example given on the Spark website, but it is throwing these errors:
1. Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
2. not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[data])org.apache.spark.sql.Dataset[data].
Unspecified value parameter evidence$2.
val ds: Dataset[data] = df.as[data]
Here is my code
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders
object final_stream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("kafka-consumer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    case class data(name: String, id: String)

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "172.21.0.187:9093")
      .option("subscribe", "test")
      .load()

    println(df.isStreaming)

    val ds: Dataset[data] = df.as[data]
    val value = ds.select("name").where("id > 10")

    value.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
Any help on how to make this work?
I want the final output to look like this:
+-----+---+
| name| id|
+-----+---+
|Jacek|  1|
+-----+---+
The reason for the error is that you are dealing with Array[Byte] values coming from Kafka, and there are no fields that match the data case class.
scala> println(schema.treeString)
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Change the line df.as[data] to the following:
df.
  select($"value" cast "string").
  map(value => ...parse the value to get name and id here...).
  as[data]
I strongly recommend using select and the functions object to deal with the incoming data.
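For example, if the value is a JSON payload like {"name":"Jacek","id":"1"} (an assumption, since the question does not show the actual messages), the parsing step could be sketched as:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema matching the data case class (adjust to the real payload)
val dataSchema = new StructType()
  .add("name", StringType)
  .add("id", StringType)

val ds: Dataset[data] = df
  .select(from_json(col("value").cast("string"), dataSchema).as("value"))
  .select(col("value.*"))
  .as[data] // relies on import spark.implicits._, as in the question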
The error is due to a mismatch between the number of columns in the dataframe and your case class.
You have the [topic, timestamp, value, key, offset, timestampType, partition] columns in the dataframe,
whereas your case class has only two fields:
case class data(name: String, id: String)
You can display the content of the dataframe as:
val display = df.writeStream.format("console").start()
Sleep for some seconds and then
display.stop()
Also use option("startingOffsets", "earliest"), as mentioned here. For example:
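This is just the reader from the question with that option added:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "172.21.0.187:9093")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest") // read the topic from the beginning
  .load()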
Then create a case class as per your data.
Hope this helps!