How can I make a Dataframe in Spark from a String instead of a file? [duplicate] - scala

This question already has answers here:
Can I read a CSV represented as a string into Apache Spark using spark-csv
(3 answers)
Closed 3 years ago.
At the moment, I am making a dataframe from a tab separated file with a header, like this.
val df = sqlContext.read.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("inferSchema","true").load(pathToFile)
I want to do exactly the same thing but with a String instead of a file. How can I do that?

To the best of my knowledge, there is no built-in way to build a dataframe from a string. However, for prototyping purposes, you can create a dataframe from a Seq of tuples, and you can use that to your advantage to build a dataframe from a string.
scala> val s ="x,y,z\n1,2,3\n4,5,6\n7,8,9"
s: String =
x,y,z
1,2,3
4,5,6
7,8,9
scala> val data = s.split('\n')
// Then we extract the first element to use it as a header.
scala> val header = data.head.split(',')
scala> val df = data.tail.toSeq
// converting the seq of strings to a DF with only one column
.toDF("X")
// splitting the string
.select(split('X, ",") as "X")
// extracting each column from the array and renaming them
.select( header.indices.map( i => 'X.getItem(i).as(header(i))) : _*)
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
PS: if you are not in the Spark REPL, make sure to add import spark.implicits._ so that you can use toDF().
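Alternatively, if you are on Spark 2.2 or later, another option worth trying (a sketch, assuming spark is your SparkSession and spark.implicits._ is in scope) is to wrap the string in a Dataset[String] and hand it straight to the CSV reader, keeping the header/inferSchema options from the question:
import spark.implicits._

val s = "x\ty\tz\n1\t2\t3\n4\t5\t6\n7\t8\t9"

// one Dataset element per line of the original string
val lines = s.split('\n').toSeq.toDS()

// DataFrameReader.csv(Dataset[String]) is available from Spark 2.2 onwards
val df = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv(lines)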

Related

How to make the first row the header in PySpark when reading a text file with the Spark context

This is the data frame I get after reading the text file with the Spark context:
+----+---+------+
| _1| _2| _3|
+----+---+------+
|name|age|salary|
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
the dataframe I require is
+----+---+------+
|name|age|salary|
+----+---+------+
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
Here is the code:
## from spark context
df_txt=spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
df_txt1=df_txt.map(lambda x: x.split(" "))
ddf=df_txt1.toDF().show()
You can use the Spark CSV reader to read your comma-separated file.
For reading a text file, you have to take the first row as the header, create a Seq of String from it, and pass that to the toDF function. Also, remove the header row from the RDD.
Note: the code below is written in Spark Scala; you can convert it into lambda functions to make it work in PySpark.
import org.apache.spark.sql.functions._
val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
val filteredRDD = df.filter(x=> x!= header)
val finaldf = filteredRDD.map( _.split(",")).map(w => (w(0),w(1),w(2))).toDF(headerCol: _*)
finaldf.show()
w(0), w(1), w(2) - you have to define a fixed number of columns from your file.
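If you would rather not hard-code the number of columns, here is a rough sketch (same assumptions about the comma-delimited file with a header row) that builds the schema from the header and creates the dataframe from an RDD[Row]:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val headerLine = rdd.first()

// one StringType field per header column, so the column count is not hard-coded
val schema = StructType(headerLine.split(",").map(name => StructField(name, StringType, nullable = true)))

// split with limit -1 keeps trailing empty fields so every Row matches the schema length
val rows = rdd.filter(_ != headerLine).map(line => Row.fromSeq(line.split(",", -1).toSeq))
val finaldf = spark.createDataFrame(rows, schema)
finaldf.show()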

How to copy the "first" row of a spark data frame to another data frame? Why does my minimal example fails?

Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
I do not understand what goes wrong in the following code.
Hence I am looking for a solution and an explanation of what fails in my minimal example.
A minimal example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
As row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
the expected result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
But I get :
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
| 2| b|
+---+---+
Question:
What confused me is the following: using val row = sdf.limit(1) I thought I had created a permanent/unchangeable/well-defined object, such that when I print it once and add it to something, I get the same results.
Remark: (thanks a lot to Daniel's remarks)
I know that in the distributed world of Spark there is no well-defined notion of a "first row". I put it there for simplicity and I hope that people struggling with something similar will "accidentally" use the term "first".
What I try to achieve is the following: (in a simplified example)
I have a data frame with 2 columns A and B. Column A is partially ordered and column B is totally ordered.
I want to filter the data w.r.t. the columns. So the idea is some kind of divide and conquer: split the data frame into pieces such that both columns are totally ordered within each piece, and then filter as usual (and do the obvious iterations).
To achieve this I need to pick a well-defined row and split the data w.r.t. row.A. But as the minimal example shows, my commands do not produce a well-defined object.
Thanks a lot
Spark is distributed, so the notion of 'first' is not something we can rely on. Depending on the partitioning, we can get a different result when calling limit or first.
To have consistent results, your data has to have an underlying order which we can use - which makes a lot of sense, since unless there is a logical ordering to your data, we can't really say what it means to take the first row.
Assuming you want to take the first row with respect to column A, you can just run orderBy("A").first() (*). Although if column A has more than one row with the same smallest value, there is no guarantee which row you will get.
(* I assume the Scala API has the same naming as Python, so please correct me if they are named differently)
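In Scala that would look roughly like this (a sketch; first() returns a Row, so you have to wrap it back into a dataframe if you need one):
import org.apache.spark.sql.Row

// deterministic "first" with respect to column A (up to ties on the smallest value)
val first: Row = sdf.orderBy("A").first()

// if a DataFrame is needed rather than a Row, rebuild one with the original schema
val firstDf = spark.createDataFrame(sc.parallelize(Seq(first)), sdf.schema)
firstDf.show()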
@Christian you can achieve this result by using the take function.
take(num) takes the first num elements of the RDD. It works by first scanning one partition, and uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Here is the code snippet:
scala> import org.apache.spark.sql.types._
scala> val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
scala> import org.apache.spark.sql._
scala> var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
scala> var first1 =sdf.rdd.take(1)
scala> val first_row = spark.createDataFrame(sc.parallelize(first1), sdf.schema)
scala> sdfEmpty.union(first_row).show
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
For more about the take() and first() functions, just read the Spark documentation. Let me know if you have any query related to this.
I am posting this answer as it contains the solution suggested by Daniel. Once I am through the literature provided by mahesh gupta or have done some more testing, I'll update this answer and give remarks on the runtimes of the different approaches in "real life".
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
In the distributed world of Spark there is no well-defined notion of "first", but something similar can be achieved using orderBy.
A minimal working example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1).collect()
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
The row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
and the result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
Remark: explanation given by Daniel (see comments above): .limit(n) is a transformation - it does not get evaluated until an action such as show or collect runs. Hence, depending on the context, it can return a different value. To use the result of .limit consistently, one can .collect it to the driver and use it as a local variable.

Remove all records which are duplicate in spark dataframe

I have a Spark dataframe with multiple columns in it. I want to find and remove the rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all entries which initially contained duplicate entries.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove duplicate id rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()
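For example, on a small dataframe where id 4 appears twice (a sketch, assuming spark.implicits._ is in scope), only the ids that occur exactly once survive:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq((1, 1), (2, 2), (3, 3), (4, 4), (4, 5)).toDF("id", "num")

df.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1)
  .drop("cnt")
  .show()
// only ids 1, 2 and 3 remain; both rows with id 4 are removed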
This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done using a join. It will be slower, but if there are a lot of columns, there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
| a| [1]|
| b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|nums |
+---+-------+
| a| [1,5] |
| b| [1,5] |
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
import org.apache.spark.sql.functions.{lit, array, array_union}
import spark.implicits._
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id| nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, 7 months after you asked the question :) see https://spark.apache.org/news/index.html
You can do it using a udf function:
import org.apache.spark.sql.functions.{col, udf}
def addValue = udf((array: Seq[Int]) => array ++ Array(5))
df1.withColumn("nums", addValue(col("nums")))
.show(false)
and you should get
+---+------+
|id |nums |
+---+------+
|a |[1, 5]|
|b |[1, 5]|
+---+------+
Updated
An alternative way is to go the Dataset route and use map:
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums")++Seq(5)))
.show(false)
where add is a case class
case class add(id: String, nums: Seq[Int])
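For reference, here is a self-contained sketch of that Dataset approach (the case class name here is just illustrative, and spark.implicits._ is assumed to be in scope):
import spark.implicits._

// illustrative case class for the resulting rows
case class NumsRow(id: String, nums: Seq[Int])

val appended = df1.map { row =>
  NumsRow(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums") :+ 5)
}
appended.show(false)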
I hope the answer is helpful
If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
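The same idea from the Scala API (a sketch, assuming Spark 2.4+ and the df1 from the question): wrap the literal in array() before passing it to array_union.
import org.apache.spark.sql.functions.{array, array_union, col, lit}

// wrap the value to append in array() so both arguments of array_union are arrays
val df2 = df1.withColumn("nums", array_union(col("nums"), array(lit(5))))
df2.show()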
Be careful when using Spark's array_union: it removes duplicates, so you will not get the expected results if you have duplicated entries in your array. It also costs at least O(N), so when I used it with an array aggregate it became an O(N^2) operation and took forever for some large arrays.

Overwrite Spark dataframe schema

LATER EDIT:
Based on this article it seems that Spark cannot edit an RDD or column in place. A new one has to be created with the new type and the old one deleted. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.
ORIGINAL QUESTION:
Is there a simple way (for both human and machine) to convert multiple columns to a different data type?
I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file, but I get "Job aborted."..."Task failed while writing rows" every time and on every DF. Somewhat easy for me, laborious for Spark ... and it does not work.
Another option is using:
df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")
A bit more work for me, as there are close to 100 columns, and if Spark has to duplicate each column in memory, that doesn't sound optimal either. Is there an easier way?
Depending on how complicated the casting rules are, you can accomplish what you are asking with this loop:
scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}
scala> df.show
+---+---+
| a| b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+
This should be as efficient as any other column operation.
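If you prefer to avoid looping over withColumn, an equivalent single select (a sketch under the same assumptions) casts every column in one pass:
import org.apache.spark.sql.types.DoubleType

// cast every column to DoubleType in a single projection instead of repeated withColumn calls
val casted = df.select(df.columns.map(c => df(c).cast(DoubleType).as(c)): _*)
casted.show()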