Reference column by id in Spark Dataframe - scala

I have multiple duplicate columns (due to joins) If I try to call them by alias, I get an ambiguous reference error:
Reference 'customers_id' is ambiguous, could be: customers_id#13, customers_id#85, customers_id#130
Is there a way to reference a column in a Scala Spark Dataframe by it's order in the Dataframe or by numeric ID, not by an alias? Sanitized names suggest that columns do have an id assigned (13, 85, 130 in the example below)
LATER EDIT:
I found out that I can reference a specific column by the original dataframe it was in. But, while I can use OriginalDataframe.customer_id in select function, the withColumnRename function only accepts string alias so I cannot rename the duplicate column in the final dataframe.
So, I guess the end question is:
Is there a way to reference a column that has a duplicate alias, that works with all functions that require a string alias as argument?
LATER EDIT 2:
Renaming seemed to have worked via adding a new column and dropping one of the current ones:
joined_dataframe = joined_dataframe.withColumn("renamed_customers_id", original_dataframe("customers_id")).drop(original_dataframe("customers_id"))
But, I'd like to keep my question open:
Is there a way to reference a column that has a duplicate alias (so, using something other than alias) in a way that all functions which expect a string alias accept it?

One way to get out of such a situation would be to create a new Dataframe using the old one's rdd, but with a new schema, where you can name each column as you'd like. This, of course, requires you to explicitly describe the entire schema, including the type of each column. As long as the new schema you provides matches the number of columns, and the column types, of the old Dataframe - this should work.
For example - starting with a Dataframe with two columns named type we can rename them type1 and type2:
df.show()
// +---+----+----+
// | id|type|type|
// +---+----+----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+----+----+
val newDF = sqlContext.createDataFrame(df.rdd, new StructType()
.add("id", IntegerType)
.add("type1", StringType)
.add("type2", StringType)
)
newDF.show()
// +---+-----+-----+
// | id|type1|type2|
// +---+-----+-----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+-----+-----+

The main problem is join, ı use python.
h1.createOrReplaceTempView("h1")
h2.createOrReplaceTempView("h2")
h3.createOrReplaceTempView("h3")
joined1 = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
Result dataframe columns:
A B Column1 Column2 A B Column3 ...
I don't like this , but join must be implement like this:
joined1 = h1.join(h2, [*argv], 'inner')
We assume argv = ["A", "B", "C"]
Result columns:
A B column1 column2 column3 ...

Related

Spark scala selecting multiple columns from a list and single columns

I'm attempting to do a select on a dataframe but I'm having a little bit of trouble.
I have this initial dataframe
+----------+-------+-------+-------+
|id|value_a|value_b|value_c|value_d|
+----------+-------+-------+-------+
And what I have to do is sum value_a with value_b and keep the others the same. So I have this list
val select_list = List(id, value_c, value_d)
and after this I do the select
df.select(select_list.map(col):_*, (col(value_a) + col(value_b)).as("value_b"))
And I'm expecting to get this:
+----------+-------+-------+
|id|value_c|value_d|value_b| --- that value_b is the sum of value_a and value_b (original)
+----------+-------+-------+
But i'm getting "a no _* annotation allowed here". Keep in mind that in reality I have a lot of columns so I need to use a list, I can't simply select each column. I'm running into this trouble because the new column that is the result of the sum has the same name of an existing column, so I can't just select(column("*"), sum....).drop(value_b) or I'd be dropping the old column and the new one with the sum.
What is the correct syntax to add multiple and single columns in a single select, or how else can I solve this?
for now I decided to do this:
df.select(col("*"), (col(value_a) + col(value_b)).as("value_b_tmp")).
drop("value_a", "value_b").withColumnRenamed("value_b_tmp", "value_b")
Which works fine but I understand the withColumn and withColumnRenamed is expensive because I'm creating pretty much a new dataframe with a new or renamed column and I'm looking for the less expensive operation possible.
Thanks in advance!
Simply use .withColumn function, it will replace the column if it exists:
df
.withColumn("value_b", col("value_a") + col("value_b"))
.select(select_list.map(col):_*)
You can create a new sum field and collect the result of the operation for the sum of the n columns as:
val df: DataFrame =
spark.createDataFrame(
spark.sparkContext.parallelize(Seq(Row(1,2,3),Row(1,2,3))),
StructType(List(
StructField("field1", IntegerType),
StructField("field2", IntegerType),
StructField("field3", IntegerType))))
val columnsToSum = df.schema.fieldNames
columnsToSum.filter(name => name != "field1")
.foldLeft(df.withColumn("sum", lit(0)))((df, column) =>
df.withColumn("sum", col("sum") + col(column)))
Gives:
+------+------+------+---+
|field1|field2|field3|sum|
+------+------+------+---+
| 1| 2| 3| 5|
| 1| 2| 3| 5|
+------+------+------+---+

How can I split a column off of a DataFrame, but keep its association with the initial DataFrame?

I have a dataframe dataDF that is:
+-------+------+-----+-----+-----------+
|TEST_PK| COL_1|COL_2|COL_3|h_timestamp|
+-------+------+-----+-----+-----------+
| 1| apple| 10| 1.79| 1111|
| 1| apple| 11| 1.79| 1114|
| 2|banana| 15| 1.79| 1112|
| 2|banana| 16| 1.79| 1115|
| 3|orange| 7| 1.79| 1113|
+-------+------+-----+-----+-----------+
And I need to run this function:
operation(row, h_timestamp)
On each row, but row can not contain h_timestamp, so my first thought is to split the dataframe like:
val columns = dataDF.drop("h_timestamp")
val timestamp = dataDF.select("h_timestamp")
But that doesn't help when I want to perform the operation on every row like:
dataDF.map(row => {
...
val rowWithoutTimestamp = ???
val timestamp = ???
operation(rowWithoutTimestamp, timestamp)
...
})
But now those two dataframes are not linked and I don't know how to get the right timestamp for each row. The TEST_PK column is not necessarily unique.
Is there a way to use .drop() or .select() on just a row or some other way to do this?
Edit: Also, the table could have any number of columns, but will always have the timestamp column and at least one more that is not the timestamp
Since you have what appears to be a primary key column, just fork the timestamp with the id column into it's own dataframe to re-join it later.
val tsDF = dataDF.select("TEST_PK", "h_timestamp")
Then, drop the column from dataDF, do your operation, and re-join the h_timestamp back onto a new dataframe.
val finalDF = postopDF.join(tsDF, "TEST_PK")
Update
The sample code is helpful, you should be able to essentially dissasemble your row, and rebuild a new row with the desired values with something like this:
dataDF.map(row => {
val rowWithoutTimestamp = Row(
row.getAs[Long]("TEST_PK"),
row.getAs[String]("COL_1"),
row.getAs[Long]("COL_2"),
row.getAs[Double]("COL_3")
)
val timestamp = row.getAs[Long]("h_timestamp")
val result = operation(rowWithoutTimestamp, timestamp)
Row(result, timestamp)
})
Of course, I'm not certain what your operation() returns, so it may be necessary to disassemble result into individual values and compose a new row with those and the timestamp.
Update 2
OK, here is a more generic method. It wraps "all columns except" h_timestamp into a struct, and maps over the (struct, ts) tuple. Actually more elegant than the previous solution anyway.
val cols = df.drop("h_timestamp").columns.toSeq
dataDF
.select(struct(cols.map(c => col(c)):_*).as("row_no_ts"), $"h_timestamp")
.map(row => {
val rowWithoutTimestamp = row.getAs[Row]("row_no_ts")
val timestamp = row.getAs[Long]("h_timestamp")
operation(rowWithoutTimestamp, timestamp)
})
I'm not sure if you are mapping to just the output of operation() or some combination with the timestamp again, but both are available to modify to suit your needs.

Spark SQL Dataframe API -build filter condition dynamically

I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non matching records from df1, based on a number of columns which is specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The idea is how to build a where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists are correct, the columns on the same location in the list will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correct, then following shall be your solution.
val df = Seq(
(12345, "fnord"),
(42, "baz"))
.toDF("foo", "bar")
This creates dataframe which you already have.
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
Next step is to extract dataType from the schema of the dataFrame and create a iterator.
val fieldTypesList = df.schema.map(struct => struct.dataType)
Next step is to convert the dataframe rows into rdd list and map each value to dataType from the list created above
val dfList = df.rdd.map(row => row.toString().replace("[","").replace("]","").split(",").toList)
val tuples = dfList.map(list => list.map(value => (value, fieldTypesList(list.indexOf(value)))))
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
Which you can iterate over and work with programmatically

Apache Spark SQL identifier expected exception

My question is quite similar to this one: Apache Spark SQL issue : java.lang.RuntimeException: [1.517] failure: identifier expected But I just can't figure out where my problem lies. I am using SQLite as database backend. Connecting and simple select statements work fine.
The offending line:
val df = tableData.selectExpr(tablesMap(t).toSeq:_*).map(r => myMapFunc(r))
tablesMap contains the table name as key and an array of strings as expressions. Printed, the array looks like this:
WrappedArray([My Col A], [ColB] || [Col C] AS ColB)
The table name is also included in square brackets since it contains spaces. The exception I get:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: identifier expected
I already made sure not to use any Spark Sql keywords. In my opinion there are 2 possible reasons why this code fails: 1) I somehow handle spaces in column names wrong. 2) I handle concatenation wrong.
I am using a resource file, CSV-like, which contains the expressions I want to be evaluated on my tables. Apart from this file, I want to allow the user to specify additional tables and their respective column expressions at runtime. The file looks like this:
TableName,`Col A`,`ColB`,CONCAT(`ColB`, ' ', `Col C`)
Appartently this does not work. Nevertheless I would like to reuse this file, modified of course. My idea was to map the columns with the expressions from an array of strings, like now, to a sequence of spark columns. (This is the only solution for me I could think of, since I want to avoid pulling in all hive dependecies just for this one feature.) I would introduce a small syntax for my expressions to mark raw column names with a $ and some keywords for functions like concat and as. But how could I do this? I tried something like this but it's far far away from even compiling.
def columnsMapFunc( expr: String) : Column = {
if(expr(0) == '$')
return expr.drop(1)
else
return concat(extractedColumnNames).as(newName)
}
Generally speaking using names containing whitespaces is asking for problems but replacing square brackets with backticks should solve the problem:
val df = sc.parallelize(Seq((1,"A"), (2, "B"))).toDF("f o o", "b a r")
df.registerTempTable("foo bar")
df.selectExpr("`f o o`").show
// +-----+
// |f o o|
// +-----+
// | 1|
// | 2|
// +-----+
sqlContext.sql("SELECT `b a r` FROM `foo bar`").show
// +-----+
// |b a r|
// +-----+
// | A|
// | B|
// +-----+
For concatenation you have to use concat function:
df.selectExpr("""concat(`f o o`, " ", `b a r`)""").show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
but it requires HiveContext in Spark 1.4.0.
In practice I would simply rename columns after loading data
df.toDF("foo", "bar")
// org.apache.spark.sql.DataFrame = [foo: int, bar: string]
and use functions instead of expression strings (concat function is available only in Spark >= 1.5.0, for 1.4 and earlier you'll need an UDF):
import org.apache.spark.sql.functions.concat
df.select($"f o o", concat($"f o o", lit(" "), $"b a r")).show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
There is also concat_ws function which takes separator as the first argument:
df.selectExpr("""concat_ws(" ", `f o o`, `b a r`)""")
df.select($"f o o", concat_ws(" ", $"f o o", $"b a r"))