Scala spark Select as not working as expected - scala

Hope someone can help. Fairly certain this is something I'm doing wrong.
I have a dataframe called uuidvar with 1 column called 'uuid' and another dataframe, df1, with a number of columns, one of which is also 'uuid'. I would like to select from df1 all of the rows which have a uuid that appears in uuidvar. Now, having the same column names is not ideal so I tried to do it with
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")
However, when I show uuidselection I have 2 columns called "uuid". Furthermore, if I try to select the specific columns I want, I am told
cannot resolve 'uuidvar' given input columns
or similar, depending on what I try to select.
I have tried to make it simpler and just do
val uuidvar2=uuidvar.select("uuid").as("uuidvar")
and this doesn't rename the column in uuidvar.
Does 'as' not operate as I am expecting it to, am I making some other fundamental error or is it broken?
I'm using Spark 1.5.1 and Scala 2.10.

Answer
You can't use as when specifying the join criterion.
First, use withColumnRenamed to rename the column before the join.
Second, use the generic col function to access columns by name (instead of the dataframe's apply method, e.g. df1(<columnname>)).
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

case class UUID1(uuid: String)
case class UUID2(uuid: String, b: Int)

class UnsortedTestSuite2 extends SparkFunSuite {

  configuredUnitTest("SO - uuid") { sc =>

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val uuidvar = sc.parallelize(Seq(
      UUID1("cafe-babe-001"),
      UUID1("cafe-babe-002"),
      UUID1("cafe-babe-003"),
      UUID1("cafe-babe-004")
    )).toDF()

    val df1 = sc.parallelize(Seq(
      UUID2("cafe-babe-001", 1),
      UUID2("cafe-babe-002", 2),
      UUID2("cafe-babe-003", 3)
    )).toDF()

    val uuidselection = df1.join(uuidvar.withColumnRenamed("uuid", "another_uuid"), col("uuid") === col("another_uuid"), "right_outer")
    uuidselection.show()
  }
}
delivers
+-------------+----+-------------+
| uuid| b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001| 1|cafe-babe-001|
|cafe-babe-002| 2|cafe-babe-002|
|cafe-babe-003| 3|cafe-babe-003|
| null|null|cafe-babe-004|
+-------------+----+-------------+
Comment
.select("*") does not have any effect. So
df.select("*") =^= df
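A quick way to convince yourself (a minimal sketch, assuming a sqlContext as in the test above):
val df = sqlContext.range(3)
// same schema, same rows: select("*") is a no-op projection
assert(df.select("*").schema == df.schema)
assert(df.select("*").count() == df.count())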

I've always used the withColumnRenamed api to rename columns:
Take this table as an example:
| Name | Age |
df.withColumnRenamed("Age", "newAge").show()
| Name | newAge |
So to make it work with your code, something like this should work:
val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection = df1.join(uuidvar_another, df1("uuid") === uuidvar_another("another_uuid"), "right_outer").select("*")

Related

Spark: generate a list of column names that contain (SQL LIKE) a string

Below is a simple syntax to search for a string in a particular column using SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME that contained the particular string in its VALUES and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I'm stuck as I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
  (0, "mango", "man", "dit"),
  (1, "i-man", "man2", "mane"),
  (2, "iman", "mango", "ho"),
  (3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")

val df2 = df1.columns.foldLeft(df1) { (acc: DataFrame, colName: String) =>
  acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}

val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c => col(c)): _*), "X"))
Here is a sample of the expected output. Note that there are only 3 columns here, but in the real job I'll be reading multiple tables which can contain a dynamic number of columns.
+---+------+------+-----+-----------------+
| id| col1 | col2 | col3| merged_cols     |
+---+------+------+-----+-----------------+
|  0| mango| man  | dit | col1, col2      |
|  1| i-man| man2 | mane| col1, col2, col3|
|  2| iman | mango| ho  | col1, col2      |
|  3| dim  | kim  | sim |                 |
+---+------+------+-----+-----------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All column names whose values satisfy the pattern e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is sent into the foldLeft.
The last line simply removes the trailing , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to use a UDF instead. This could be preferred when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit is available in Spark 2.2+; for older versions, use array(df1.columns.map(lit(_)): _*) instead.
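For those older versions, the whole call might look like the following sketch (this is just the note above applied, reusing concat_cols and df1):
// Spark < 2.2: build the column-name array with lit instead of typedLit
val df2_old = df1.withColumn(
  "merged_cols",
  concat_cols(array(df1.columns.map(col(_)): _*), array(df1.columns.map(lit(_)): _*))
)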

Return combined Dataset after joinWith in Spark Scala

Given the below two Spark Datasets, flights and capitals, what would be the most efficient way to return the combined (i.e. "joined") result without converting first to a DataFrame or writing all of the columns out by name in a .select() method? I know, for example, that I can access either side of the tuple (e.g. .map(x => x._1)) or use the * operator with:
result.select("_1.*","_2.*")
But the latter may result in duplicate column names and I'm hoping for a cleaner solution.
Thank you for your help.
case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)

val flights = Seq(
  (55, "New York"),
  (3, "Georgia"),
  (12, "Oregon")
).toDF("tripNumber", "destination").as[Flights]

val capitals = Seq(
  ("New York", "Albany"),
  ("Georgia", "Atlanta"),
  ("Oregon", "Salem")
).toDF("state", "capital").as[Capitals]

val result = flights.joinWith(capitals, flights.col("destination") === capitals.col("state"))
There are 2 options, but you will have to use join instead of joinWith:
Drop one of the join columns (that is the best part of the Dataset API), so there is no need to repeat the projection columns in a select:
val result = flights.join(capitals, flights("destination") === capitals("state")).drop(capitals("state"))
Rename the join column to be the same in both datasets and use a slightly different way of specifying the join:
val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))
Output:
result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
| New York| 55| Albany|
| Georgia| 3|Atlanta|
| Oregon| 12| Salem|
+-----------+----------+-------+

check data size spark dataframes

I have the following question:
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark dataframe (the loading code is shown below).
My aim is to check the length and type of each field in the dataframe, following the set of rules below:
col      type
job      char10
marital  char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types, I don't have any idea how to implement it as we are using dataframes. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a Dataframe, but if there are multiple double quotes then you can read it as a textFile, remove them, and convert to a Dataframe as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))

val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job       |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and the length function, and play around as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10     |7          |
|technician|single |10     |6          |
+----------+-------+-------+-----------+
All the column types are String.
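To also check the types programmatically (the second part of the question), a small sketch could use the DataFrame's own schema information, assuming the data DataFrame built above:
import org.apache.spark.sql.types.StringType
// dtypes gives (columnName, typeName) pairs, e.g. ("job", "StringType")
data.dtypes.foreach { case (name, typeName) => println(s"$name -> $typeName") }
// or compare against an expected type directly
val allStrings = data.schema.fields.forall(_.dataType == StringType)
println(s"all columns are strings: $allStrings")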
Hope this helps!
You are using a dataframe. So when you use the map method, you are processing a Row in your lambda.
So line is a Row.
Row.toString will return a string representing the Row, so in your case 2 StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String].
Usually when you use Dataframes, you have to work in column logic as in SQL, using select, where... or directly the SQL syntax.
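For illustration, a minimal sketch of that map-based approach with getAs[String] (assuming a data DataFrame with string columns job and marital, and spark.implicits._ in scope for the tuple encoder):
import spark.implicits._
// each element passed to the lambda is a Row; read the fields by name and return both lengths
val sizes = data.map { row =>
  (row.getAs[String]("job").length, row.getAs[String]("marital").length)
}
sizes.show()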

Adding a constant value to columns in a Spark dataframe

I have a Spark data frame which looks like the one below
id person age
1 naveen 24
I want to add a constant "del" to each column value except the last column in the dataframe, like below:
id person age
1del naveendel 24
Can someone assist me with how to implement this on a Spark df using Scala?
You can use the lit and concat functions:
import org.apache.spark.sql.functions._
// add suffix to all but last column (would work for any number of cols):
val colsWithSuffix = df.columns.dropRight(1).map(c => concat(col(c), lit("del")) as c)
val result = df.select(colsWithSuffix :+ $"age": _*)
result.show()
// +----+---------+---+
// |id |person |age|
// +----+---------+---+
// |1del|naveendel|24 |
// +----+---------+---+
EDIT: to also accommodate null values, you can wrap the column with coalesce before appending the suffix; replace the line calculating colsWithSuffix with:
val colsWithSuffix = df.columns.dropRight(1)
  .map(c => concat(coalesce(col(c), lit("")), lit("del")) as c)
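Putting the EDIT together, here is a small self-contained sketch; the dfNulls input (with a null person value) is hypothetical, just to show what coalesce does:
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical input with a null in the person column
val dfNulls = Seq((1, Some("naveen"), 24), (2, None: Option[String], 30)).toDF("id", "person", "age")

val colsWithSuffix = dfNulls.columns.dropRight(1)
  .map(c => concat(coalesce(col(c), lit("")), lit("del")) as c)

dfNulls.select(colsWithSuffix :+ col("age"): _*).show()
// +----+---------+---+
// |  id|   person|age|
// +----+---------+---+
// |1del|naveendel| 24|
// |2del|      del| 30|
// +----+---------+---+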

Filter dataframe by value NOT present in column of other dataframe [duplicate]

This question already has answers here:
Filter Spark DataFrame based on another DataFrame that specifies denylist criteria
Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.
I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains or the "isin" keyword, or one of the join methods.
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried
Anyone got any ideas?
I've discovered that I can solve this using a simpler method - it seems that an antijoin is possible as a parameter to the join method, but the Spark Scaladoc does not describe it:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df1.join(df2, df1("city") === df2("cities"), "leftanti").show
Results in:
+----------+-------+
|  location|   city|
+----------+-------+
|Chittagong|Chennai|
+----------+-------+
P.S. thanks for the pointer to the duplicate - duly marked as such
If you are trying to filter a DataFrame using another, you should use join (or any of its variants). If what you need is to filter it using a List or any data structure that fits in your master and workers, you could broadcast it and then reference it inside the filter or where method (a sketch of that list-based variant is shown after the join example below).
For instance I would do something like:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"), ("Amsterdam"), ("New York")).toDF("cities")

df2.join(df1, joinExprs = df1("city") === df2("cities"), joinType = "full_outer")
  .select("city", "cities")
  .where(isnull($"cities"))
  .drop("cities").show()