Converting to RDD fails - Scala

My code is as below. I read a CSV file (each row holds a table name and two select statements), convert the DataFrame to an RDD, and loop through its elements. Inside the loop I want to create a DataFrame from each element. The code below fails. Can anyone please help?
val df1 = spark.read.format("csv").load("c:\\file.csv") // CSV has 3 columns
for (row <- df1.rdd.collect)
{
  var tab1 = row.mkString(",").split(",")(0) // Has the table name
  var tab2 = row.mkString(",").split(",")(1) // One select statement
  var tab3 = row.mkString(",").split(",")(2) // Another select statement
  val newdf = spark.createDataFrame(tab1).toDF("Col") // This is not working
}
I want to join the tab2 DataFrame with tab3 and append the table name. For example, executing the queries held in tab2 and tab3 gives the result below:

Col1 Col2
---- ----
A    B
C    D
E    F
G    H

I want it as below:

Col0 Col1 Col2
---- ---- ----
Tab1 A    B
Tab1 C    D
Tab2 E    F
Tab3 G    H

The table names (tab1, tab2, tab3, etc.) come from the CSV file I am reading. I want to turn that Col0 value into a DataFrame so that I can query it in Spark SQL.

I was able to resolve this by replacing:
val newdf = spark.createDataFrame(tab1).toDF("Col") // This is not working
with:
val newDf = spark.sparkContext.parallelize(Seq(tab1)).toDF("Col") // needs import spark.implicits._
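
For the Col0 output described above, here is a minimal sketch of one way to put the pieces together (not from the original post): run each stored query, tag its result with the table name via lit, and union the tagged results. It assumes a SparkSession named spark, that the CSV's second column really holds a SELECT statement that spark.sql can execute, and it uses row.getString instead of the mkString/split pattern.

import org.apache.spark.sql.functions.lit

val df1 = spark.read.format("csv").load("c:\\file.csv")

// Run every stored query and append its table name as Col0.
val perTable = df1.rdd.collect.map { row =>
  val tableName = row.getString(0) // e.g. "Tab1"
  val query     = row.getString(1) // assumed to be a runnable SELECT statement
  spark.sql(query).withColumn("Col0", lit(tableName))
}

// Combine the per-table results (their schemas must match for union to work)
// into one DataFrame that Spark SQL can query.
val combined = perTable.reduce(_ union _)
combined.createOrReplaceTempView("combined")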

Union can be done on tables with the same number of columns in Scala

Below is my code:
var adf = spark.emptyDataFrame
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.*, B.* from tab1 a join tab2 b on a.id = b.id")
  adf = adf.union(res)
}
adf.show()
The union is failing with "Union can be done on tables with same no of columns".
Can anyone please help?
The schema of an empty DataFrame doesn't have any columns, but your result DataFrame has 4 columns; that is why the union operation fails. You can declare adf as a var and track it with a null or an Option value. Example using Option:
import org.apache.spark.sql.DataFrame

var adf: Option[DataFrame] = None
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.*, B.* from tab1 a join tab2 b on a.id = b.id")
  if (adf.isDefined) {
    adf = Option(adf.get.union(res))
  } else adf = Option(res)
}
if (adf.isDefined) adf.get.show()
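To see why the original adf = spark.emptyDataFrame version cannot work, a quick check (assuming a SparkSession named spark):

// An empty DataFrame has no columns at all, while the join result has four;
// union requires both sides to have the same number of columns.
spark.emptyDataFrame.columns.length // 0
spark.emptyDataFrame.printSchema()  // prints just "root"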
Resolved the issue by creating an empty DataFrame with a schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Create the schema
val schema = StructType(Array(
  StructField("Col1", StringType, true),
  StructField("Col2", StringType, true),
  StructField("Col3", StringType, true),
  StructField("Col4", StringType, true)))

// Create the empty DataFrame
var adf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.*, B.* from tab1 a join tab2 b on a.id = b.id")
  adf = adf.union(res)
}
adf.show()

Spark Dataframe With for Loop: Optimization Technique

I am trying to implement the logic below:
1. Take some records from one table.
2. Based on the resulting data, run a loop.
3. Inside the loop, pull data from other tables into two different DataFrames.
4. Join these two DataFrames and load the data into a third table.
var id_chck1 = s"select distinct id, id1, id2 from table WHERE type = 'N'"
val id_chck = hive.executeQuery(id_chck1)
for (data <- id_chck.collect) {
  var id = data(0)
  var id1 = data(1)
  var id2 = data(2)
  val values_1 = "select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
  val values_1_data = hive.executeQuery(values_1)
  for (row <- values_1_data.collect) {
    val bill = row.mkString(",").split(",")(0)
    val bil_num = row.mkString(",").split(",")(1)
    val id_num = row.mkString(",").split(",")(2)
    val bill_date = row.mkString(",").split(",")(3)
    var df1 = s"select column name from tablename where id=222"
    val df1_data = hive.executeQuery(df1)
    var df2 = s"select column name from tablename2 where id=222"
    val df2_data = hive.executeQuery(df2)
    val df3 = df1_data.join(df2_data) // joining df1 and df2 (join condition not shown in the original)
    df3.write.format("orc").mode("Append").save("hdfslocation")
  }
  var load1 = s"load data inpath 'hdfslocation' into table tablename"
  val load1_data = hive.executeUpdate(load1)
}
But this process is taking 6+ hours. Is there any other way of doing the same thing so that it completes in a shorter time, for example using RDDs or setting some Spark/Hive properties to improve performance?
I have 500,000 records in the test1 table.
Could you please add the input and expected output as an example?
It's hard to see what exactly you're trying to achieve.
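
As a rough illustration of the kind of rewrite being asked about, here is a hedged sketch that expresses the per-row lookups as DataFrames and performs one join and one write instead of a query and a write per collected row. It is only a sketch: spark.sql stands in for hive.executeQuery, and the join key is a placeholder, since the post does not spell out how the tables relate.

// Placeholder table/column names taken from the question where possible.
val df1 = spark.sql("select /* column name */ * from tablename where id = 222")
val df2 = spark.sql("select /* column name */ * from tablename2 where id = 222")

// One join and one write for the whole set, instead of one per collected row.
val df3 = df1.join(df2, Seq("id")) // "id" is an assumed join key
df3.write.format("orc").mode("Append").save("hdfslocation")

// The outer lookups (the distinct id and bill queries) can likewise be kept as
// DataFrames and joined, rather than collected to the driver and looped over.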

Merging Dataframes in Spark

I have two DataFrames, say A and B. I would like to join them on a key column and create another DataFrame. When the keys match in A and B, I need to get the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         coalesce(dfB.salary, dfA.salary)
  FROM dfA FULL OUTER JOIN dfB
    ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
              ELSE dfA.salary
         END
  FROM dfA FULL OUTER JOIN dfB
    ON dfA.employee = dfB.employee""")
Assuming dfA and dfB each have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
import spark.implicits._ // for the 'col symbol syntax

val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")
val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: you should have only one row per DataFrame for a given value of emp.
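
A self-contained sketch of that second approach applied to the sample data from the question (assumptions: a SparkSession named spark, salaries modelled as plain integers, and local mode used only to make the example runnable):

import org.apache.spark.sql.{functions => f, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val dfA = Seq(("Employee1", 100), ("Employee2", 50), ("Employee3", 200)).toDF("emp", "sal")
val dfB = Seq(("Employee1", 150), ("Employee2", 100), ("Employee4", 300)).toDF("emp", "sal")

val dfB1 = dfB.withColumnRenamed("sal", "salB").withColumnRenamed("emp", "empB")

// B wins whenever both sides have the employee; otherwise keep whichever side exists.
val dfC = dfA
  .join(dfB1, $"emp" === $"empB", "outer")
  .select(
    f.coalesce($"empB", $"emp").as("emp"),
    f.coalesce($"salB", $"sal").as("sal")
  )

dfC.show()
// Expected content (row order may differ):
// Employee1 -> 150, Employee2 -> 100, Employee3 -> 200, Employee4 -> 300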

How to zip two (or more) DataFrame in Spark

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}

val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")

// Merge rows
val rows = a.rdd.zip(b.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}

// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)

// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)

// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")
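For the sample data above, a quick check of either variant (in a spark-shell session where sc and sqlContext are available, as the answer assumes):

ab.show()
// Expected content (row order may vary for the join variant):
// +--------+--------+--------+
// |column_1|column_2|column_3|
// +--------+--------+--------+
// |     abc|     123|       1|
// |     cde|      23|       2|
// +--------+--------+--------+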
In Scala's implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can work around this limitation by adding an index to each row of the DataFrames and then doing an inner join on those indices. This is my stub code for this approach (note that monotonicallyIncreasingId does not generate consecutive values, so it only pairs rows up reliably when both DataFrames have identical partitioning; in the general case the zipWithIndex approach above is safer):
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonicallyIncreasingId)

aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL ?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
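That query is written against a different schema (a messages table), but the same row_number idea applied to the a and b DataFrames here would look roughly like this sketch. The ORDER BY columns are assumptions standing in for "original row order", which plain SQL has no notion of, and it presumes a Spark version whose SQL supports window functions:

a.registerTempTable("a")
b.registerTempTable("b")

val zipped = sqlContext.sql("""
  SELECT a.column_1, a.column_2, b.column_3
  FROM (SELECT *, row_number() OVER (ORDER BY column_1) AS rn FROM a) a
  JOIN (SELECT *, row_number() OVER (ORDER BY column_3) AS rn FROM b) b
    ON a.rn = b.rn""")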
I know the OP was using Scala, but if, like me, you need to know how to do this in PySpark, then try the Python code below. Like @zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType

def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)

    def flattenRow(row):
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)

    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)

joined = zipDataFrames(a, b)