How to cast String column to List - scala

My dataframe looks like this:
df.schema results in:
StructType(
StructField(a,StringType,true),
StructField(b,StringType,true),
StructField(c,IntegerType,true),
StructField(d,StringType,true)
)
I want to cast column b to a List of Ints and column d to a List of Strings. How do I do this?

Strip the surrounding [] (with regexp_extract) and split on ,:
import org.apache.spark.sql.functions._
val p = "^\\[(.*)\\]$"
df
.withColumn("b", split(regexp_extract(col("b"), p, 1), "\\s*,\\s*").cast("array<int>"))
.withColumn("d", split(regexp_extract(col("d"), p, 1), "\\s*,\\s*"))

Related

How to make a new column by pairing elements of another column?

I have a big data dataframe and I want to make pairs from the elements of one of its columns.
col
['summer','book','hot']
['g','o','p']
output:
the pairs for the rows above:
new_col
['summer','book'],['summer','hot'],['hot','book']
['g','o'],['g','p'],['p','o']
Note that tuples will work instead of lists, e.g. ('summer','book').
I know in pandas I can do this:
df['col'].apply(lambda x: list(itertools.combinations(x, 2)))
but I'm not sure how to do it in PySpark.
You can use a UDF to do the same thing you would do in plain Python, declaring the return type as an array of arrays of strings.
import itertools
from pyspark.sql import functions as F

# UDF returning every 2-element combination, typed as array<array<string>>
combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)), "array<array<string>>"
)

df = spark.createDataFrame([(['hot', 'summer', 'book'],),
                            (['g', 'o', 'p'],),
                            ], ['col1'])

df1 = df.withColumn("new_col", combinations_udf(F.col("col1")))
display(df1)
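Since the rest of this page is Scala, here is a hedged Scala equivalent of the same idea, assuming df is the analogous Scala DataFrame with an array column col1; Seq.combinations(2) plays the role of itertools.combinations:
import org.apache.spark.sql.functions.{col, udf}

// Scala UDF returning every 2-element combination of the array column
val combinationsUdf = udf { xs: Seq[String] =>
  xs.combinations(2).map(_.toArray).toArray
}

val df1 = df.withColumn("new_col", combinationsUdf(col("col1")))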

Can I filter columns of a dataframe based on a schema in Databricks Scala?

I have a dataframe which has 7 columns (A, B, C, D, E, F, G)
df.schema // output
StructType(
StructField(A,StringType,true),
StructField(B,StringType,true),
StructField(C,true),
StructField(D,true),
StructField(E,StringType,true),
StructField(F,StringType,true),
StructField(G,true)
)
Is there any way I can filter the columns of a dataframe by using another schema, as below?
val newSchema = StructType(Seq(
  StructField("A", StringType, true),
  StructField("B", StringType, true),
  StructField("C", StringType, true)
))
In the end, I would like to select columns A, B, C from dataframe df using newSchema. Please suggest possible ways.
val cols = newSchema.fields.map(_.name)
display(df.select(cols.head, cols.tail: _*))
This works.
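A slightly fuller sketch of the same idea; newSchema.fieldNames is shorthand for mapping over fields, and the intersection with df.columns (an extra guard not in the original answer) avoids an AnalysisException if the schema mentions a column the dataframe doesn't have:
import org.apache.spark.sql.functions.col

// Keep only schema fields that actually exist in df, then select them
val wanted = newSchema.fieldNames.filter(df.columns.contains)
val filteredDf = df.select(wanted.map(col): _*)

filteredDf.printSchema()  // A, B, C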

How can I insert values of a DataFrame column into a list

I want to add the values of a DataFrame column (named prediction) to a List, so that I can use those values to write a CSV file, which will further split that column into 3 more columns.
I have tried creating a new list and assigning the column to it, but that only captures the schema of the column instead of the data.
// This is the prediction column, which is basically the model output stored in the value PredictionModel
val PredictionModel = model.transform(testDF)
PredictionModel.select("features", "label", "prediction")
val ListOfPredictions: List[String] = List(PredictionModel.select("prediction").toString())
The expected result is basically the data of the column being assigned to the list so that it can be used further.
But the actual outcome is only the schema of the column being assigned to the list as follows:
[prediction: double]
You can write the whole DataFrame as CSV:
PredictionModel.select("features","label","prediction")
.write
.option("header","true")
.option("delimiter",",")
.csv("C:/yourfile.csv")
But if you want the dataframe as a List of concatenated column values, you can try this:
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._  // for .toDF and the String encoder used by .map

val data = Seq(
  (1, 99),
  (1, 99),
  (1, 70),
  (1, 20)
).toDF("id", "value")

val ok: List[String] = data
  .select(concat_ws(",", data.columns.map(data(_)): _*))
  .map(s => s.getString(0))
  .collect()
  .toList
output:
ok.foreach(println(_))
1,99
1,99
1,70
1,20
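If the goal is really just to get the prediction column itself into a Scala List, a more direct sketch (assuming prediction is a Double, as the [prediction: double] output suggests, and a SparkSession named spark) is to collect the typed column:
import spark.implicits._

// Collect the single column as a typed Dataset, then to a local List
val predictions: List[Double] = PredictionModel
  .select("prediction")
  .as[Double]
  .collect()
  .toList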

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how would I do that?
The following selects all columns from dataframe df whose names are listed in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is a similar array colNos, which has
colNos = Array(10,20,25,45)
how do I adapt the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and impose no performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks / the logical plan is similar to my approach below. But my approach is a bit faster.
So I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut method too:
val colNames = df.schema.fieldNames
Example: grab the first 14 columns of a Spark DataFrame by index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that select does not accept an Array[String] directly; you have to convert it to a sequence of org.apache.spark.sql.Column (or use the head/tail varargs form) for the selection to work.
Or wrap it in a curried function (high five to my colleague for this):
// Subsets a Dataframe to the columns between beg_val and end_val (by index).
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))

How to convert all columns of a dataframe to numeric in Spark Scala?

I loaded a CSV as a dataframe. I would like to cast all columns to float, knowing that the file is too big to write out all the column names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
Given this DataFrame as an example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over the DF columns with the .columns function:
import org.apache.spark.sql.functions.col

val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
So the new DF schema looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
EDIT:
If you want to exclude some columns from casting, you could do something like this (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all columns we want to exclude from casting.
So the schema of this new DF is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
Note that this may not be the best way to do it, but it can be a starting point.
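As a hedged alternative sketch, a single select with a per-column cast builds the same result in one projection instead of a chain of withColumn calls; the exclude array here is the same illustrative one as above:
import org.apache.spark.sql.functions.col

val exclude = Array("id")

// One projection: cast every non-excluded column to float, keep the rest as-is
val castedDF = df.select(df.columns.map { c =>
  if (exclude.contains(c)) col(c) else col(c).cast("float").as(c)
}: _*)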