I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a DataFrame, but I'm having some trouble with the syntax:
val df = file.map(line => line.split("\t")).toDF
df.groupby(line(0))
  .count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use
df
  .groupBy('a)
  .count()
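Note that count() counts the rows in each group; if you actually want the per-key sum that your reduceByKey(_ + _) computes, a small sketch using the same df:
import org.apache.spark.sql.functions.sum

// sum column "b" for each value of "a", mirroring reduceByKey(_ + _)
df.groupBy("a")
  .agg(sum("b").as("total"))
  .show()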
createDataFrame takes 2 arguments: an RDD and a schema.
My schema is like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val schemas = StructType(
  Seq(
    StructField("number", IntegerType, false),
    StructField("notation", StringType, false)
  )
)
In one case I am able to create a DataFrame from an RDD, like below:
val data1 = Seq(Row(1, "one"), Row(2, "two"))
val rdd = spark.sparkContext.parallelize(data1)
val final_df = spark.createDataFrame(rdd, schemas)
In the other case, like below, I am not able to:
val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)
val final_df = spark.createDataFrame(rdd, schemas)
What's wrong with data2 that prevents it from becoming a valid RDD for createDataFrame?
We are, however, able to create a DataFrame from data2 using toDF(), just not with createDataFrame:
val data2_DF = Seq((1, "one"), (2, "two")).toDF("number", "notation")
Please help me understand this behaviour.
Is Row mandatory while creating a DataFrame?
In the second case, just do:
val final_df = spark.createDataFrame(rdd)
Because your RDD is an RDD of Tuple2 (which is a Product), the schema is known at compile time, so you don't need to specify one. The createDataFrame overload that takes an explicit schema expects an RDD[Row], which is why passing your tuple RDD together with schemas does not work.
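Note that the tuple-based overload infers the default column names _1 and _2; if you want the names from your schema, you can rename the columns afterwards, for example:
val final_df = spark.createDataFrame(rdd).toDF("number", "notation")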
I have a DataFrame with headers, for example outputDF. I now want to take outputDF.columns and create a new DataFrame with just one row that contains the column names.
I then want to union both of these DataFrames, with option("header", "false"), so that Spark writes the result to HDFS with the header as an ordinary first row.
How do I do that?
Below is an example:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
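One way to do this, as a sketch (assuming Spark 2.x; the data columns are cast to string so the union lines up, and the output path is a placeholder):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// one-row DataFrame whose single row holds the column names of df
val headerSchema = StructType(df.columns.map(c => StructField(c, StringType, nullable = true)))
val headerDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(df.columns: _*))), headerSchema)
// cast the data columns to string so the schemas match, then put the header row first
val dataAsStrings = df.select(df.columns.map(c => df(c).cast("string")): _*)
val withHeader = headerDf.union(dataAsStrings)
// coalesce(1) keeps the header row at the top of a single output file
withHeader.coalesce(1).write.option("header", "false").csv("output path")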
I am using Spark 1.6 with Scala.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a few of the columns.
I tried running some sample code, but I have hardcoded the column numbers in the attached pic.
I also want to write a UDF which could take any set of columns, apply a transformation such as replacing a special character, and give the output.
Any suggestion is appreciated.
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val rdd1 = sc.textFile("../inp2.txt")
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF // column index 1 is hardcoded here
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names:
import scala.io.Source

// this reads the schema file and creates a list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// read the text file as an RDD of Rows and convert it to a DataFrame with those column names
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
val rdd1 = sc.textFile("../inp2.txt")
val df = sqlContext.createDataFrame(rdd1.map(line => Row(line.split("\t"): _*)), schema)
This creates a DataFrame with the column names that you have in the separate file.
Hope this helps!
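For the second part of the question, applying the same transformation (for example, removing a special character) to an arbitrary set of columns, a minimal sketch; the list targetCols is a hypothetical subset of columnNames:
import org.apache.spark.sql.functions.udf

// hypothetical: the columns you want to clean
val targetCols = Seq("col_a", "col_b")
val removeDots = udf { s: String => if (s == null) null else s.replace(".", "") }

// apply the same UDF to every column in the list, overwriting each column in place
val cleaned = targetCols.foldLeft(df) { (acc, colName) =>
  acc.withColumn(colName, removeDots(acc(colName)))
}
cleaned.show()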
I have a .csv file that I read into an RDD:
val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))
I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. It is not possible to iterate over RDDs directly, so the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.
You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):
val sc = new SparkContext(conf)
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)
How do I obtain just the one column that I am interested in from dataH and thereafter create a dataset just from it?
I am using Spark 1.6.0.
You can just map each Array to the desired index, e.g.:
dataH.map(arr => arr(0)).toDF("col1")
Or, more safely (avoiding an exception in case the index is out of bounds); lift(0) returns an Option, and orNull turns a missing value into a null String so it can still go into the DataFrame:
dataH.map(arr => arr.lift(0).orNull).toDF("col1")
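If you specifically want a Dataset of just that column rather than a DataFrame (you mention Spark 1.6, where Datasets are still experimental), a sketch along the same lines, reusing the sqc from your snippet; the implicits you already import provide the String encoder:
// build a Dataset[String] from the first column of each array
val colDs = sqc.createDataset(dataH.map(arr => arr(0)))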
I am reading 2 different .csv files, each of which has only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether each dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me null pointer exception.
but if I collect dF1 into a list outside the UDF and use that list inside it, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the null pointer exception here.
Thanks.
Please check this question about accessing a dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and it is not possible in Spark. The solution is either to use a join, or to collect outside of the transformation and broadcast the result.
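A minimal sketch of the join alternative, using the column names from your question (a left join plus a null check to build the hasId flag):
import org.apache.spark.sql.functions.col

// deduplicate dF1 first so the join does not multiply dF2 rows
val dF1Distinct = dF1.dropDuplicates()
val dfFinal = dF2
  .join(dF1Distinct, dF2("PID") === dF1Distinct("ID"), "left_outer")
  .withColumn("hasId", col("ID").isNotNull)
  .drop("ID")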