Not able to display parquet file output using Spark - Scala

Installed Spark locally (on a Linux file system, with no Hadoop). Everything seems to be working fine except reading a parquet file and displaying the results on the console. After reading the parquet file, when I do dataframe.show() it just hangs indefinitely at this:
[Stage 4:> (0 + 1) / 1]
Not sure if I need to configure anything specific on a local installation?
Can someone please help me with this?
I installed Spark locally (on Linux) and followed the steps below using spark-shell (Scala):
scala> val columns = Seq("firstname","middlename","lastname","dob","gender","salary")
val columns: Seq[String] = List(firstname, middlename, lastname, dob, gender, salary)
scala> val data = Seq(("James ","x","Smith","36636","M",3000),
| ("Michael ","Rose","cc","40288","M",4000),
| ("Robert ","c","Williams","42114","M",4000),
| ("Maria ","Anne","Jones","39192","F",4000),
| ("Jen","Mary","Brown","34552","F",-1)
| )
val data: Seq[(String, String, String, String, String, Int)] = List(("James ",x,Smith,36636,M,3000), ("Michael ",Rose,cc,40288,M,4000), ("Robert ",c,Williams,42114,M,4000), ("Maria ",Anne,Jones,39192,F,4000), (Jen,Mary,Brown,34552,F,-1))
scala> val df1 = data.toDF(columns:_*)
val df1: org.apache.spark.sql.DataFrame = [firstname: string, middlename: string ... 4 more fields]
scala> df1.show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| dob|gender|salary|
+---------+----------+--------+-----+------+------+
| James | x| Smith|36636| M| 3000|
| Michael | Rose| cc|40288| M| 4000|
| Robert | c|Williams|42114| M| 4000|
| Maria | Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown|34552| F| -1|
+---------+----------+--------+-----+------+------+
scala> df1.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- dob: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = false)
scala> df1.write.partitionBy("gender","salary").parquet("/home/usalla/spark_test/emppq.parquet")
scala> val parqDF2 = spark.read.parquet("/home/usalla/spark_test/emppq.parquet")
val parqDF2: org.apache.spark.sql.DataFrame = [firstname: string, middlename: string ... 4 more fields]
scala> parqDF2.createOrReplaceTempView("ParquetTable2")
scala> val df3 = spark.sql("select * from ParquetTable2 where gender='M' and salary >= 4000")
val df3: org.apache.spark.sql.DataFrame = [firstname: string, middlename: string ... 4 more fields]
scala> df3.explain
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [firstname#144,middlename#145,lastname#146,dob#147,gender#148,salary#149] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/usalla/spark_test/emppq.parquet], PartitionFilters: [isnotnull(gender#148), isnotnull(salary#149), (gender#148 = M), (salary#149 >= 4000)], PushedFilters: [], ReadSchema: struct<firstname:string,middlename:string,lastname:string,dob:string>
scala> df3.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- dob: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
scala> df3.show()
[Stage 4:> (0 + 1) / 1]
^Z
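One thing worth checking on a bare local install (a guess, not a confirmed fix): how the shell was started and how many local cores it got. A single-threaded master can make a one-task stage look hung on slow I/O. A quick sanity check:
// Launch with all local cores:
//   spark-shell --master "local[*]"

// Inside the shell, confirm the master and parallelism:
spark.sparkContext.master              // e.g. "local[*]"
spark.sparkContext.defaultParallelism

// Reading one partition directory back directly helps isolate the problem:
val probe = spark.read.parquet("/home/usalla/spark_test/emppq.parquet/gender=M/salary=4000")
probe.show()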

Related

How to handle duplicate column(s) in table definition

Given a dataframe with the following schema. The problem is that the dataframe is dynamic and so are its fields, so you cannot pre-assume a given schema.
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- a: string (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: long (nullable = true)
|-- a: long (nullable = true)
The following error is shown:
Found duplicate column(s) in table definition
How should we rename the columns to remove the ambiguity?
Here is how you can rename them:
import spark.implicits._
val df = Seq(
("a", 1, "a"),
("a", 1, "a"),
("a", 1, "a")
).toDF("a", "x", "a")
val columns = List("a", "b", "c")
val newDF = df.toDF(columns: _*)
newDF.show(false)
newDF.printSchema()
New Output:
+---+---+---+
|a |b |c |
+---+---+---+
|a |1 |a |
|a |1 |a |
|a |1 |a |
+---+---+---+
New Schema:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
|-- c: string (nullable = true)
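Since the schema is dynamic, a hardcoded list like List("a", "b", "c") won't always fit. A sketch (not from the original answer) that de-duplicates names programmatically by suffixing repeats:
import org.apache.spark.sql.DataFrame

def dedupColumns(df: DataFrame): DataFrame = {
  val seen = scala.collection.mutable.Map.empty[String, Int]
  val renamed = df.columns.map { name =>
    val n = seen.getOrElse(name, 0)
    seen(name) = n + 1
    if (n == 0) name else s"${name}_$n" // first "a" stays "a", the next becomes "a_1"
  }
  df.toDF(renamed: _*)
}

// dedupColumns(df).printSchema() // columns: a, x, a_1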

Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

I converted a dataframe (say, the child dataframe) into JSON using df.toJSON.
After the JSON conversion the schema looks like this:
root
|-- value: string (nullable = true)
I used the following suggestion to get the child dataframe into the intermediate parent schema/dataframe:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
|-- id
|-- version
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value (i.e. value.id needs to be represented as id)?
Q2) I need to bring version from a different dataframe (say versionDF), where version is a constant in all rows. Do I fetch one row from this versionDF to read the value of the version column and then populate it as a literal in the parentDF?
Please help with any code snippets on this.
Instead of toJSON, use to_json in the select statement and select the required columns along with the to_json function.
Check below code.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- version: double (nullable = false)
Updated (wrapping the row in a named value struct before converting to JSON):
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data |version|
+---+------------------------------------------+-------+
|1 |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1 |
|2 |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1 |
|3 |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1 |
+---+------------------------------------------+-------+
Update-1 (without the extra value wrapper, producing flat JSON):
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data |version|
+---+--------------------------------+-------+
|1 |{"id":1,"col1":"a1","col2":"b1"}|1 |
|2 |{"id":2,"col1":"a2","col2":"b2"}|1 |
|3 |{"id":3,"col1":"a3","col2":"b3"}|1 |
+---+--------------------------------+-------+
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
| 1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2|
| 3| a3| b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- id: integer (nullable = true)
|-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1 |1.0 |
|[{"id":2,"col1":"a2","col2":"b2"}]|2 |1.0 |
|[{"id":3,"col1":"a3","col2":"b3"}]|3 |1.0 |
+----------------------------------+---+-------+
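Note that version in the transcript above comes back typed as Any (Row.get returns Any). If the type is known, a typed getter is cleaner (assuming here that version is a Double):
val version: Double = versionDF.first.getAs[Double]("version")
// or by position: versionDF.first.getDouble(0)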

Transpose DataFrame single row to column in Spark with scala

I saw this question here:
Transpose DataFrame Without Aggregation in Spark with scala, and I wanted to do exactly the opposite.
I have this dataframe with a single row, whose values are string, int, bool, and array:
+-----+-------+-----+------+-----+
|col1 | col2 |col3 | col4 |col5 |
+-----+-------+-----+------+-----+
|val1 | val2 |val3 | val4 |val5 |
+-----+-------+-----+------+-----+
And I want to transpose it like this:
+-----------+-------+
|Columns | values|
+-----------+-------+
|col1 | val1 |
|col2 | val2 |
|col3 | val3 |
|col4 | val4 |
|col5 | val5 |
+-----------+-------+
I am using Apache Spark 2.4.3 with Scala 2.11
Edit: Values can be of any type (int, double, bool, array), not only strings.
Thought about it differently, without using arrays_zip (which is only available in Spark 2.4+), and got the below.
It will work for Spark 2.0 onwards in a simpler way, using the flatMap, map, and explode functions.
Here the map function (used inside withColumn) creates a new map column; the input columns must be grouped as key-value pairs.
Case: String data type in the data:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("val1", "val2", "val3", "val4", "val5")).toDF("col1", "col2", "col3", "col4", "col5")
val columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()
df.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)
Result:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |val1 |
|col2 |val2 |
|col3 |val3 |
|col4 |val4 |
|col5 |val5 |
+-------+------+
Case: Mix of data types in the data:
If you have different types, convert them all to String first; the remaining steps won't change.
import org.apache.spark.sql.types.StringType

val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
Full example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()

// convert all columns to string type, since map() needs uniform value types
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()

val columnsAndValues = df1.columns.flatMap { c => Array(lit(c), col(c)) }
df1.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)
Result:
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: boolean (nullable = false)
|-- col4: double (nullable = false)
|-- col5: string (nullable = true)
root
|-- col1: string (nullable = false)
|-- col2: string (nullable = false)
|-- col3: string (nullable = false)
|-- col4: string (nullable = false)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |2 |
|col2 |3 |
|col3 |true |
|col4 |2.4 |
|col5 |val |
+-------+------+
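For what it's worth, the SQL stack() generator (available well before Spark 2.4) does the same unpivot in a single expression. A sketch built on the string-cast df1 from above, since stack() needs a uniform value type:
// Builds "stack(5, 'col1', `col1`, 'col2', `col2`, ...) as (Columns, Values)"
val stackExpr = df1.columns
  .map(c => s"'$c', `$c`")
  .mkString(s"stack(${df1.columns.length}, ", ", ", ") as (Columns, Values)")
df1.selectExpr(stackExpr).show(false)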
From Spark 2.4, use arrays_zip on an array of the column values and an array of the column names, then explode to get the result.
Example:
val df=Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
val cols=df.columns.map(x => col(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| val1| col1|
//| val2| col2|
//| val3| col3|
//| val4| col4|
//| val5| col5|
//+------+-------+
UPDATE (for mixed data types, cast everything to string first):
val df=Seq(((2),(3),(true),(2.4),("val"))).toDF("col1","col2","col3","col4","col5")
df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)
//cast to string
val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| 2| col1|
//| 3| col2|
//| true| col3|
//| 2.4| col4|
//| val| col5|
//+------+-------+
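(Added for completeness, and well beyond the Spark 2.4.3 in the question: Spark 3.4 ships a built-in unpivot/melt on Dataset, so no manual zipping is needed; value columns must still share one type.)
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Spark 3.4+ only:
val unpivoted = df.select(df.columns.map(col(_).cast("string")): _*)
  .unpivot(Array.empty[Column], "Columns", "Values")
unpivoted.show(false)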

Explode multiple columns of same type with different lengths

I have a Spark data frame in the following format that needs to be exploded. I checked other solutions, such as this one. However, in my case, before and after can be arrays of different lengths.
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
For instance, if the data frame has just one row, before is an array of size 2, and after is an array of size 3, the exploded version should have 5 rows with the following schema:
root
|-- id: string (nullable = true)
|-- type: string (nullable = true)
|-- start_time: string (nullable = true)
|-- end_time: string (nullable = true)
|-- area: string (nullable = true)
where type is a new column which can be "before" or "after".
I can do this with two separate explodes, where I make the type column in each explode and then union them:
val dfSummary1 = df
  .withColumn("before_exp", explode($"before"))
  .withColumn("type", lit("before"))
  .withColumn("start_time", $"before_exp.start_time")
  .withColumn("end_time", $"before_exp.end_time")
  .withColumn("area", $"before_exp.area")
  .drop("before_exp", "before", "after")

val dfSummary2 = df
  .withColumn("after_exp", explode($"after"))
  .withColumn("type", lit("after"))
  .withColumn("start_time", $"after_exp.start_time")
  .withColumn("end_time", $"after_exp.end_time")
  .withColumn("area", $"after_exp.area")
  .drop("after_exp", "after", "before")

val dfResult = dfSummary1.unionAll(dfSummary2)
But, I was wondering if there is a more elegant way to do this. Thanks.
You can also achieve this without a union. With the data:
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
you can do:
df
.select($"id",
explode(
array(
struct(lit("before").as("type"), $"before".as("data")),
struct(lit("after").as("type"), $"after".as("data"))
)
).as("step1")
)
.select($"id",$"step1.type", explode($"step1.data").as("step2"))
.select($"id",$"type", $"step2.*")
.show()
+---+------+----------+--------+----+
| id| type|start_time|end_time|area|
+---+------+----------+--------+----+
| 1|before| 01:00| 01:30| 10|
| 1|before| 02:00| 02:30| 20|
| 1| after| 07:00| 07:30| 70|
| 1| after| 08:00| 08:30| 80|
| 1| after| 09:00| 09:30| 90|
+---+------+----------+--------+----+
I think exploding the two columns separately, followed by a union, is a decent, straightforward approach. You could simplify the struct-field selection a little and create a simple method for the repetitive explode process, like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
def explodeCol(df: DataFrame, colName: String): DataFrame = {
val expColName = colName + "_exp"
df.
withColumn("type", lit(colName)).
withColumn(expColName, explode(col(colName))).
select("id", "type", expColName + ".*")
}
val dfResult = explodeCol(df, "before") union explodeCol(df, "after")
dfResult.show
// +---+------+----------+--------+----+
// | id| type|start_time|end_time|area|
// +---+------+----------+--------+----+
// | 1|before| 01:00| 01:30| 10|
// | 1|before| 02:00| 02:30| 20|
// | 1| after| 07:00| 07:30| 70|
// | 1| after| 08:00| 08:30| 80|
// | 1| after| 09:00| 09:30| 90|
// +---+------+----------+--------+----+
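Because explodeCol is driven purely by the column name, it also generalizes to any number of same-shaped array columns (a small extension under the same assumptions):
val typeCols = Seq("before", "after")
val dfAll = typeCols.map(explodeCol(df, _)).reduce(_ union _)
dfAll.show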

How to transform Spark Dataframe columns to a single column of a string array

I want to know how I can "merge" multiple dataframe columns into one, as a string array.
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
It can also be used with withColumn, adding the array as a new List column:
import org.apache.spark.sql.{functions => F}
val result = df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
// withColumn keeps the source columns; drop Name, Number and Comment afterwards
// if you want exactly Id and List.
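If the source columns were not all strings (an assumption for illustration; Name, Number and Comment are already strings here), cast before building the array so the element type is uniform:
import org.apache.spark.sql.functions._
val asStrings = Seq("Name", "Number", "Comment").map(c => col(c).cast("string"))
val result2 = df.select(col("Id"), array(asStrings: _*).as("List"))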