How to convert rdd object to dataframe in Scala - scala

I read data from ElasticSearch and save into an RDD.
val es_rdd = sc.esRDD("indexname/typename",query="?q=*")
The rdd has the next example data:
(uniqueId,Map(field -> value))
(uniqueId2,Map(field2 -> value2))
How can I convert this RDD (String, Map to a Dataframe (String, String, String)?

You can use explode to achieve it.
import spark.implicits._
import org.apache.spark.sql.functions._
val rdd = sc.range(1, 10).map(s => (s, Map(s -> s)))
val ds = spark.createDataset(rdd)
val df = ds.toDF()
df.printSchema()
df.show()
df.select('_1,explode('_2)).show()
output:
root
|-- _1: long (nullable = false)
|-- _2: map (nullable = true)
| |-- key: long
| |-- value: long (valueContainsNull = false)
+---+--------+
| _1| _2|
+---+--------+
| 1|[1 -> 1]|
| 2|[2 -> 2]|
| 3|[3 -> 3]|
| 4|[4 -> 4]|
| 5|[5 -> 5]|
| 6|[6 -> 6]|
| 7|[7 -> 7]|
| 8|[8 -> 8]|
| 9|[9 -> 9]|
+---+--------+
+---+---+-----+
| _1|key|value|
+---+---+-----+
| 1| 1| 1|
| 2| 2| 2|
| 3| 3| 3|
| 4| 4| 4|
| 5| 5| 5|
| 6| 6| 6|
| 7| 7| 7|
| 8| 8| 8|
| 9| 9| 9|
+---+---+-----+

I readed directly in Spark.SQL format using the next call to elastic:
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", "?q=*")
.option("pushdown", "true")
.load("indexname/typename")

Related

Drop rows in spark which dont follow schema

currently, schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the below schema on the above table and delete all the rows which do not follow the below schema:
val productsSchema = StructType(Seq(
StructField("product_id",IntegerType,nullable = true),
StructField("product_name",StringType,nullable = true),
StructField("aisle_id",IntegerType,nullable = true),
StructField("department_id",IntegerType,nullable = true)
))
Use option "DROPMALFORMED" while loading the data which ignores corrupted records.
spark.read.format("json")
.option("mode", "DROPMALFORMED")
.option("header", "true")
.schema(productsSchema)
.load("sample.json")
If data is not matching with schema, spark will put null as value in that column. We just have to filter the null values for all columns.
Used filter to filter ```null`` values for all columns.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Loading Json data & if schema is not matching we will be getting null rows for all columns.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
|null |null |null |null |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Filter null rows.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
+--------+-------------+----------+------------+
scala>
do check out na.drop functions on data-frame, you can drop rows based on null values, min nulls in a row, and also based on a specific column which has nulls.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//dropping row if a null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//drops only if `minNonNulls = 3` if accepted to each row
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//not dropping any
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//drops row based on nulls in `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+

Drop duplicates if reverse is present between two columns

I have dataframe contain (around 20000000 rows) and I'd like to drop duplicates from a dataframe for two columns if those columns have the same values, or even if those values are in the reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the column as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method remove duplicates if the values in the same order
I followed the accepted answer to this question Pandas: remove reverse duplicates from dataframe but it took more time.
You can use this:
Hope this helps.
Note : In 'col3' 'D' will be removed istead of 'C', because 'C' is positioned before 'D'.
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack2.csv', header = 'True')
df2 = df.select(F.least(df.col1,df.col2).alias('col1'),F.greatest(df.col1,df.col2).alias('col2'),df.col3)
df2.dropDuplicates(['col1','col2']).show()

Scala/Spark: How to select columns to read ONLY when list of columns > 0

I'm passing in a parameter fieldsToLoad: List[String] and I want to load ALL columns if this list is empty and load only the columns specified in the list if the list has more one or more columns. I have this now which reads the columns passed in the list:
val parquetDf = sparkSession.read.parquet(inputPath:_*).select(fieldsToLoad.head, fieldsToLoadList.tail:_*)
But how do I add a condition to load * (all columns) when the list is empty?
#Andy Hayden answer is correct but I want to introduce how to use selectExpr function to simplify the selection
scala> val df = Range(1, 4).toList.map(x => (x, x + 1, x + 2)).toDF("c1", "c2", "c3")
df: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
scala> df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
scala> val fieldsToLoad = List("c2", "c3")
fieldsToLoad: List[String] = List(c2, c3) ^
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+
| c2| c3|
+---+---+
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
scala> val fieldsToLoad = List()
fieldsToLoad: List[Nothing] = List()
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
You could use an if statement first to replace the empty with just *:
val cols = if (fieldsToLoadList.nonEmpty) fieldsToLoadList else Array("*")
sparkSession.read.parquet(inputPath:_*).select(cols.head, cols.tail:_*).

How to write a nested query?

I have following table:
+-----+---+----+
|type | t |code|
+-----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+-----+---+----+
And what I want:
+-----+---+----+
|t1 | t2|code|
+-----+---+----+
| 25| 88| 11|
| 114|520| 11|
+-----+---+----+
There are two types of events A and B.
Event A is the start, Event B is the end.
I want to connect the start with the next end dependence of the code.
It's quite easy in SQL to do this:
SELECT a.t AS t1,
(SELECT b.t FROM events AS b WHERE a.code == b.code AND a.t < b.t LIMIT 1) AS t2, a.code AS code
FROM events AS a
But I have to problem to implement this in Spark because it looks like that this kind of nested query isn't supported...
I tried it with:
df.createOrReplaceTempView("events")
val sqlDF = spark.sql(/* SQL-query above */)
Error i get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in:
Do you have any other ideas to solve that problem?
It's quite easy in SQL to do this
And so is in Spark SQL, luckily.
val events = ...
scala> events.show
+----+---+----+
|type| t|code|
+----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+----+---+----+
// assumed that t is int
scala> events.printSchema
root
|-- type: string (nullable = true)
|-- t: integer (nullable = true)
|-- code: integer (nullable = true)
val eventsA = events.
where($"type" === "A").
as("a")
val eventsB = events.
where($"type" === "B").
as("b")
val solution = eventsA.
join(eventsB, "code").
where($"a.t" < $"b.t").
select($"a.t" as "t1", $"b.t" as "t2", $"a.code").
orderBy($"t1".asc, $"t2".asc).
dropDuplicates("t1", "code").
orderBy($"t1".asc)
That should give you the requested output.
scala> solution.show
+---+---+----+
| t1| t2|code|
+---+---+----+
| 25| 88| 11|
|114|520| 11|
+---+---+----+

Aggregation the derived column spark

DF.groupBy("id")
.agg(
sum((when(upper($"col_name") === "text", 1)
.otherwise(0)))
.alias("df_count")
.when($"df_count"> 1, 1)
.otherwise(0)
)
Can I do aggregation on the column which was named as alias? ,i.e if the sum is greater than one then return 1 else 0
Thanks in advance.
I think you could wrap another when.otherwise around the sum result:
val df = Seq((1, "a"), (1, "a"), (2, "b"), (3, "a")).toDF("id", "col_name")
df.show
+---+--------+
| id|col_name|
+---+--------+
| 1| a|
| 1| a|
| 2| b|
| 3| a|
+---+--------+
df.groupBy("id").agg(
sum(when(upper($"col_name") === "A", 1).otherwise(0)).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 2|
| 3| 1|
| 2| 0|
+---+--------+
df.groupBy("id").agg(
when(sum(when(upper($"col_name")==="A", 1).otherwise(0)) > 1, 1).otherwise(0).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 1|
| 3| 0|
| 2| 0|
+---+--------+