How to convert all columns of a dataframe to numeric in Spark Scala?

I loaded a CSV as a dataframe. I would like to cast all columns to float, knowing that the file is too big to write out all the column names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")

Given this DataFrame as example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over the DF columns using the .columns function:
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
So the new DF schema looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
EDIT:
If you want to exclude some columns from the cast, you could do something like this (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all columns we want to exclude from casting.
So the schema of this new DF is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
Note that this may not be the best way to do it, but it can be a starting point.
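As an alternative starting point, the same cast can be done in a single select instead of repeated withColumn calls; this is a minimal sketch, assuming df is the DataFrame loaded above:
import org.apache.spark.sql.functions.col
// build one cast-to-float expression per column and select them all at once
val castedDF = df.select(df.columns.map(c => col(c).cast("float").as(c)): _*)
castedDF.printSchema()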

Related

spark.createDataFrame() not working with Seq RDD

createDataFrame takes two arguments, an RDD and a schema.
My schema is like this:
val schemas= StructType(
Seq(
StructField("number",IntegerType,false),
StructField("notation", StringType,false)
)
)
In one case I am able to create a dataframe from an RDD, like below:
val data1 = Seq(Row(1, "one"), Row(2, "two"))
val rdd = spark.sparkContext.parallelize(data1)
val final_df = spark.createDataFrame(rdd, schemas)
In the other case, like below, I am not able to:
val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)
val final_df = spark.createDataFrame(rdd, schemas)
What's wrong with data2 that it cannot become a valid RDD for a dataframe?
However, we are able to create a dataframe from data2 using toDF(), just not with createDataFrame:
val data2_DF=Seq((1,"one"),(2,"two")).toDF("number", "notation")
Please help me understand this behaviour.
Is Row mandatory while creating a dataframe?
In the second case, just do :
val final_df = spark.createDataFrame(rdd)
Because your RDD is an RDD of Tuple2 (which is a Product), the schema is known at compile time, so you don't need to specify one. The createDataFrame overload that takes an explicit schema, on the other hand, expects an RDD[Row].
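If you do want to use the explicit schema with data2, a minimal sketch (assuming the same spark session and the schemas value defined above) is to map the tuples to Rows first:
import org.apache.spark.sql.Row
val data2 = Seq((1, "one"), (2, "two"))
// convert each tuple to a Row so the RDD matches the RDD[Row] signature
val rowRdd = spark.sparkContext.parallelize(data2).map { case (n, s) => Row(n, s) }
val final_df = spark.createDataFrame(rowRdd, schemas)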

Convert header (column names) to new dataframe

I have a dataframe with headers, for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row, which contains the column names.
I then want to union both these dataframes with option("header", "false"), which Spark can then write to HDFS.
How do I do that?
Below is an example:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
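One possible approach, sketched below with illustrative variable names, is to cast every data column to string, build a one-row DataFrame from the column names, and union the two (note that union does not guarantee row order):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// cast every column to string so it can be unioned with the header row
val dataAsString = df.select(df.columns.map(c => col(c).cast("string")): _*)
// a one-row DataFrame whose values are the column names themselves
val headerDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(df.columns.toSeq))),
  StructType(df.columns.map(c => StructField(c, StringType))))
val withHeader = headerDf.union(dataAsString)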

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a dataframe, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DF.
There are many ways to specify the schema; here is one option:
// assumes `file` is an RDD of tab-separated lines and spark.implicits._ is in scope
val df = file
.map(line => line.split("\t"))
.map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
.toDF("a", "b") // give the columns names for ease of use
df
.groupBy("a")
.count()
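If the goal is the reduceByKey(_ + _) sum rather than a row count, the DataFrame equivalent would be an aggregation; a sketch, assuming the same column names "a" and "b" as above:
import org.apache.spark.sql.functions.sum
// group by the key column and sum the value column, mirroring reduceByKey(_ + _)
df.groupBy("a").agg(sum("b").as("total"))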

Join two text file with one column different in their schema in spark scala

I have two text files and I am creating dataframes out of them. Both files have the same number of columns except for one column.
When I create the schema and join both, I get an error like:
java.lang.ArrayIndexOutOfBoundsException
Basically my schema has 6 columns and one of my text files has only 5 columns.
Now, how do I append some null value to the already created schema and then do the join?
Here is my code:
val schema = StructType(Array(
StructField("TimeStamp", StringType),
StructField("Id", StringType),
StructField("Name", StringType),
StructField("Val", StringType),
StructField("Age", StringType),
StructField("Dept", StringType)))
val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
val df3 = df1.join(df2)
TimeStamp column is not present in the first text file ...
Why don't you just exclude TimeStamp field from schema for first DataFrame?
val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
As mentioned in the comments, the schemas need not be identical. You can also specify your join condition and select the columns to join on.
Add the TimeStamp column to the first dataframe:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(LongType))
Then proceed with the join
You can create a new schema without this field and use that schema. What Dmitri was suggesting is to use the original schema and remove the field that you don't need, to save you writing a second schema definition.
Once you have the two files loaded into DataFrames, you perform the join based on the common fields and remove the duplicate columns, which I guess is what you want, like this:
val df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
.drop(df2("Id"))
.drop(df2("Name"))
.drop(df2("Val"))
.drop(df2("Age"))
.drop(df2("Dept"))
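Alternatively, since the key columns carry the same names in both DataFrames, joining on a sequence of column names keeps only one copy of each key, so the drops are not needed; a minimal sketch under that assumption:
// a "using columns" join de-duplicates the join columns automatically
val df3 = df1.join(df2, Seq("Id", "Name", "Val", "Age", "Dept"))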

How to join multiple dataFrames in spark with different column names and types without converting into RDD

My df1 has a column of type Double, df2 has a column of type Timestamp, and df3 has a column of type Integer.
I'm trying to achieve something like this:
df1 = ...
df2 = ...
df3 = ...
val df4 = df1.zip(df2).zip(df3)
However, there is no such function as "zip". How can I achieve such a result?
There's no explicit zip for DataFrames, but you can do a workaround:
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy('someColumn)))
// the same for the other DataFrames
// now join those DataFrames
val newDF = df1Ordered.join(df2Ordered, "rowNr").join(df3Ordered, "rowNr")
However, it will be quite slow, because there is no partitionBy in the Window operation, so all rows end up in a single partition.
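A self-contained sketch of that workaround with the required imports; the ordering column names are placeholders for whatever defines row order in each of your DataFrames:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// number the rows of each DataFrame by its own ordering column, then join on that number
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy("someColumn1")))
val df2Ordered = df2.withColumn("rowNr", row_number().over(Window.orderBy("someColumn2")))
val df3Ordered = df3.withColumn("rowNr", row_number().over(Window.orderBy("someColumn3")))
val newDF = df1Ordered.join(df2Ordered, "rowNr").join(df3Ordered, "rowNr")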