How to select all columns that start with a common label - scala

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can select specific columns like this:
df.select("colA", "colB", "colE")
but how can I select, say, "colA", "colB", and all the colF-* columns at once? Is there a way to do this like in Pandas?

The process can be broken down into the following steps:
First grab the column names with df.columns,
then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings.
For Strings, however, select takes select(String, String*); luckily there is also a select(Column*) overload, so convert the Strings into Columns with .map(df(_)),
and finally turn the Array of Columns into varargs with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
The filter can be made more complex (just as in Pandas), although it is a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed, you can also merge a fixed array of column names with the filtered array:
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
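Putting it together, here is a minimal self-contained sketch; the sample data and the spark.implicits._ import (Spark 2.x SparkSession named spark) are assumptions, not from the question:

import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a SparkSession named spark

// Hypothetical sample data matching the column names from the question
val df = Seq((1, 2, 3, 4), (5, 6, 7, 8)).toDF("colA", "colB", "colF-0", "colF-1")

// Fixed names merged with the filtered ones, mapped to Columns, expanded as varargs
val wanted = (Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_))
df.select(wanted: _*).show()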

Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)

In PySpark, use colRegex to select columns starting with colF.
With the sample columns:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
Apply:
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()
The result is:
colA, colB, colF-0, colF-1, colF-2
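On the Scala side, Dataset exposes the same colRegex method (Spark 2.3+); a hedged sketch of the equivalent call:

import org.apache.spark.sql.functions.col

// colRegex takes the regular expression wrapped in backquotes, as in the PySpark example
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()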

I wrote a function that does that. Read the comments to see how it works.
import org.apache.spark.sql.DataFrame

/**
 * Given a sequence of prefixes, select the matching columns from a [[DataFrame]].
 * @param columnPrefixes Sequence of prefixes
 * @param dF Incoming [[DataFrame]]
 * @return [[DataFrame]] with only the prefixed columns selected
 */
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
    // Check whether a given column name starts with any of the provided prefixes
    def colNameStartsWith(colName: String): Boolean =
        columnPrefixes.exists(prefix => colName.startsWith(prefix))
    // Filter the column list against the prefix sequence
    val columns = dF.columns.filter(colNameStartsWith)
    // Select only the filtered columns
    dF.select(columns.head, columns.tail: _*)
}
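For example, with the sample columns from the question, a call might look like this (the prefix list is just an illustration):

// Keeps colA, colB and every colF-* column; colC, colD, colE are dropped
val selected = selectPrefixedColumns(Seq("colA", "colB", "colF"), df)
selected.printSchema()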

Related

How to append 'explode'd columns to a dataframe keeping all existing columns?

I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

schema = StructType().add("a", MapType(StringType(), IntegerType()))

events = jsonToDataFrame("""
{
  "a": {
    "b": 1,
    "c": 2
  }
}
""", schema)

display(
    events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies that all existing columns should be included in the select:
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
Simpler Method
The API docs specify:
Parameters
cols : str, Column, or list
column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
events.select("*", explode("a").alias("x", "y"))
)
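For completeness, a rough Scala equivalent of the simpler method, assuming an analogous Scala DataFrame named events; the two aliases for the exploded map go through the Seq overload of as:

import org.apache.spark.sql.functions.{col, explode}

// "*" keeps every existing column; the exploded map adds key/value columns named x and y
events.select(col("*"), explode(col("a")).as(Seq("x", "y"))).show()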

Dataframe column substring based on the value during join

I have a dataframe with a column whose values look like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx".
I need to compare this column with a column in a different dataframe, and the comparison depends on the column value.
If the column value has the form "COR//xxxxx-xx-xxxx", I need to use substring($"column", 4, length($"column")).
If the column value has the form "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how we can achieve this with Spark dataframes?
You can try adding a new column based on the conditions and join on the new column. Something like this.
import org.apache.spark.sql.functions.{col, expr, when}
import spark.implicits._

val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = spark.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol",
  when(col("column1").like("%COR%"),
       expr("substring(column1, 6, length(column1)-1)"))
    .otherwise(col("column1")))
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
  when($"column1" rlike "COR//.*",
       $"column1".substr(lit(4), length($"column1")))
    .otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.
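For reference, a sketch of that messier inline variant, reusing the DF2/DF3 names from the thread and the lit(6) offset that strips the whole "COR//" prefix:

import org.apache.spark.sql.functions.{length, lit, trim, upper, when}

// The conditional substring is computed inside the join condition itself
val DF1 = DF2.join(DF3,
  upper(trim(
    when(DF2("column1") rlike "COR//.*",
         DF2("column1").substr(lit(6), length(DF2("column1"))))
      .otherwise(DF2("column1"))
  )) === upper(trim(DF3("column1"))))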

Dataframe : GroupBy by list of column names [duplicate]

This question already has an answer here:
Scala-Spark Dynamically call groupby and agg with parameter values
I have a Dataframe with multiple columns and a List of column names.
I want to process my Dataframe by grouping it according to my list.
Here is an example of what I am trying to do :
val tagList = List("col1","col3","col5")
var tagsForGroupBy = tagList(0)
if(tagList.length>1){
for(i <- 1 to tagList.length-1){
tagsForGroupBy = tagsForGroupBy+","+tags(i)
}
}
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy("col0",tagsForGroupBy)
I understand why it does not work, but I don't know how to make it work.
What is the best solution to do that ?
EDIT :
Here is a more complete example of what I am doing (including SCouto solution) :
I have my tagList that contains some column names ("col3","col5"). I also want to include "col0" and "col1" in my groupBy, independently of my list.
After the groupBy and the aggregations, I want to select all the columns used for the groupBy plus the new columns produced by the aggregation.
val tagList = List("col3","col5")
val tmpListForGroup = new ListBuffer[String]()
val tmpListForSelect = new ListBuffer[String]()
tmpListForGroup +=tagList (0)
tmpListForSelect +=tagList (0)
for(i <- 1 to tagList .length-1){
tmpListForGroup +=(tagList (i))
tmpListForSelect +=(tagList (i))
}
tmpListForGroup +="col0"
tmpListForGroup +="col1"
tmpListForSelect +="aggValue1"
tmpListForSelect +="aggValue2"
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy(tmpListForGroup.head,tmpListForGroup.tail:_*)
.agg(
[aggFunction].as("aggValue1"),
[aggFunction].as("aggValue1"))
)
.select(tmpListForSelect .head,tmpListForSelect .tail:_*)
This code does exactly what I want, but it looks very ugly and complicated for something that (I think) should be simple.
Is there another solution for that ?
When passing column names as Strings, groupBy takes one column name as its first parameter and a sequence of further names as its second:
def groupBy(col1: String, cols: String*)
So you need to send two arguments and convert the second one to a sequence:
This will work fine for you (using the original tagList rather than the comma-joined String):
df.groupBy(tagList.head, tagList.tail: _*)
Or, if you want to keep col0 separate from the list as in your example:
df.groupBy("col0", tagList: _*)

Filter columns having count equal to the input file rdd Spark

I'm filtering Integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add a validation that checks whether any of the input columns has a count equal to the input parquet file's row count. I want to filter out any such column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get a file.
The objective is to also filter out any column whose count of values is equal to the input file row count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file row count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of StructFields whose type is integer
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select just those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the array of their names (Strings)
val m = z.columns
The new logic would be something like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns where the column's count is not equal to cnt   // pseudocode
I do not want to pass the column names explicitly to the new condition, since the columns having a count equal to the input file count will change (val d = .. above).
How do we write logic for this ?
According to my understanding of your question, you are trying to keep columns with integer as dataType whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
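Putting the two pieces together, a sketch of the combined filter-and-select; the "inputfile" path and column logic come from the thread, everything else is an assumption:

import org.apache.spark.sql.functions.col

val cnt = spark.read.parquet("inputfile").count()

// Keep integer columns whose distinct count differs from the input file row count
val kept = df.schema.fields.filter { f =>
  f.dataType.typeName.contains("integer") &&
    df.select(f.name).distinct().count() != cnt
}
val z = df.select(kept.map(f => col(f.name)): _*)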
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I used a full outer join, but it generates two cust_id columns and I need only one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
val features = df1.columns.filter(_ != "cust_id") // Remove "cust_id", keeping column order so the positional union lines up
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)
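If nulls from the full outer join are acceptable as an intermediate state, another option is to join first and then backfill the feature columns with na.fill; a sketch:

// Join on cust_id, then replace the nulls in every feature column with 0
val featureCols = df1.columns.filter(_ != "cust_id").toSeq
val result = df1.join(df2, Seq("cust_id"), "full_outer").na.fill(0, featureCols)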