Spark write columns of array<double> to Hive table - scala

With Spark 1.6, I am trying to save arrays to a Hive table, myTable, consisting of two columns, each of type array<double>:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)
val mySeq = Seq(x,y)
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")
But then I get the message:
error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[Double]]
val df = sc.parallelize(mySeq).toDF("x","y")
What is the correct way to do this simple task?

I'm assuming the actual structure you're going for looks like this:
x|y
1.0|-1.0
2.0|-2.0
3.0|-3.0
4.0|-4.0
For this the code you want is this:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)
val mySeq = x.zip(y)
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")
Essentially, you need a collection of row-like objects, such as tuples or case class instances, rather than a collection of arrays. It would be better to use a case class, as mentioned in another comment, than a bare tuple.
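The case class suggestion could be sketched roughly as follows. This is only an illustration in plain Scala: Point and Arrays are hypothetical names that do not appear in the question, and the sketch only builds the row objects, without touching Spark.

```scala
object RowsSketch {
  // Hypothetical case classes; their field names would become the column names.
  final case class Point(x: Double, y: Double)
  final case class Arrays(x: Seq[Double], y: Seq[Double])

  val x = Array(1.0, 2.0, 3.0, 4.0)
  val y = Array(-1.0, -2.0, -3.0, -4.0)

  // Four rows, each pairing the i-th element of x with the i-th element of y.
  val points: Seq[Point] = x.zip(y).map { case (a, b) => Point(a, b) }.toSeq

  // One row whose two fields each hold a whole array, matching a table
  // with two array<double> columns as described in the question.
  val arrays: Seq[Arrays] = Seq(Arrays(x.toSeq, y.toSeq))
}
```

In a Spark shell with import sqlContext.implicits._ in scope, sc.parallelize(RowsSketch.points).toDF() should give the x|y layout shown above, while sc.parallelize(RowsSketch.arrays).toDF() should give a single row with two array columns.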

Modify udf to display values beyond 99999 in databricks spark scala

I created a Dataset with the schema below:
org.apache.spark.sql.Dataset[Records] = [value: string, RowNo: int]
Here the value field holds fixed-width data, which I would like to split into individual columns, adding RowNo as the last column, using a UDF.
def ReadFixWidthFileWithRDD(SrcFileType: String, rdd: org.apache.spark.rdd.RDD[(String, String)], inputFileLength: Int = 6): DataFrame = {
  val postapendSchemaRowNo = StructType(Array(StructField("RowNo", StringType, true)))
  val inputLength = List(inputFileLength)
  val FileInfoList = FixWidth_Dictionary.get(SrcFileType).toList
  val fileSchema = FileInfoList(0)._1
  val fileColumnSize = FileInfoList(0)._2
  val fileSchemaWithFileName = StructType(fileSchema ++ postapendSchemaRowNo)
  val fileColumnSizeWithFileNameLength = fileColumnSize ::: inputLength
  val data = rdd
  var retDF = spark.createDataFrame(data.map { x =>
    lsplit(fileColumnSizeWithFileNameLength, x._1 + x._2)
  }, fileSchemaWithFileName)
  retDF
}
In the function above, I want to use a Dataset instead of an RDD, because my RowNo does not display values beyond 99999.
Can someone suggest an alternative?
I found a solution.
I created a hash key with an associated sequence number in one DataFrame.
The hash key is also associated with a second DataFrame.
I joined the two after splitting the fixed-width value.

dataframe.select, select dataframe columns from file

I am trying to create a child DataFrame from a parent DataFrame, but I have more than 100 columns to select.
Can I supply the columns for the select statement from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
How can I pass the column names from a file in all_cols?
I assume you would read the file from HDFS or from a shared config file? The reason I ask is that on a cluster this code may be executed on an individual node.
In that case I would approach it with the following piece of code:
import scala.io.Source
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map(col(_)).toArray
val df3 = df2.select(cols: _*)
Essentially, you just have to provide an array of columns and use the : _* notation to pass it as a variable number of arguments.
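The : _* notation itself can be tried out without Spark. The sketch below is a minimal illustration: select here is a hypothetical stand-in for a varargs method and is not the DataFrame API, and the column names are made up.

```scala
object VarargsSketch {
  // Hypothetical stand-in for a varargs method such as select(cols: Column*).
  def select(cols: String*): String = cols.mkString("SELECT ", ", ", "")

  // Pretend this line was read from the column file.
  val names: Array[String] = "id,name,steps".split(",")

  // `: _*` expands the array into individual arguments,
  // so this is the same call as select("id", "name", "steps").
  val query: String = select(names: _*)
}
```

The same expansion applies to any Seq or Array passed to a method that declares a repeated parameter.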
This finally worked for me:
import org.apache.spark.sql.Column
val Raw_input_schema = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load("headerFile").schema
val Raw_input_data = spark.read.format("csv").schema(Raw_input_schema).option("delimiter", "\t").load("dataFile")
// Option 1: keep the column names as strings and pass them as head plus tail
val filtered_file = sc.textFile("filter_columns_file").flatMap(_.split("\t")).collect().toList
val final_df = Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
// Option 2: convert the names to Column objects and pass them all at once
val filtered_cols = sc.textFile(filterFile).flatMap(_.split("\t")).collect().toList.map(x => new Column(x))
val final_df = Raw_input_data.select(filtered_cols: _*)

Spark: How can DataFrame be Dataset[Row] if DataFrame's have a schema

This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as
val rddToDF = rdd.map(value => Row(value))
But instead it shows that it's this
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]
Clearly a dataframe is actually a dataset of rows and a schema.
In Spark 2.0, the code contains:
type DataFrame = Dataset[Row]
So it is a Dataset[Row] by definition.
A Dataset also has a schema; you can print it with the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself, but it is still there ;)
You can also call createTempView(name) and use the Dataset in SQL queries, just as with DataFrames.
In other words, Dataset = DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset without a specific encoder.
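The effect of a type alias can be seen in a toy model. The classes below are hypothetical stand-ins, not Spark's real Row and Dataset; the sketch only shows that when the schema lives on the generic class, an alias for one of its instantiations carries the schema too.

```scala
object ToyModel {
  // Hypothetical stand-ins, NOT Spark's real classes.
  final case class ToyRow(values: Seq[Any])
  final case class ToyDataset[T](data: Seq[T], schema: Seq[String])

  // Mirrors the shape of Spark's `type DataFrame = Dataset[Row]`.
  type ToyDataFrame = ToyDataset[ToyRow]

  val df: ToyDataFrame = ToyDataset(Seq(ToyRow(Seq("a")), ToyRow(Seq("b"))), Seq("value"))

  // The alias and the aliased type are interchangeable, so no conversion is needed.
  val ds: ToyDataset[ToyRow] = df
}
```

Because ToyDataFrame is only another name for ToyDataset[ToyRow], ds.schema and df.schema are the same field on the same object.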
About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:
// toDF and the String encoder come from the session's implicits
import sparkSession.implicits._
// Dataset[Row] = DataFrame, without an encoder for a specific class
val rddToDF = rdd.toDF("value")
// Now it knows that the encoder for String should be used, so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// However, this can be shortened to:
val dataset = sparkSession.createDataset(rdd)
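Note (in addition to the answer of T Gaweda) that there is a schema associated with each Row (Row.schema). However, this schema is not set until the Row is integrated into a DataFrame (or Dataset[Row]). In the session below, schema is assumed to be a StructType with a single nullable integer field named a: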
scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null
scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
scala> spark.createDataFrame(rdd,schema).first
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum incorrectly?
Do I need to use the function map first? And if so, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, you can also use the DataFrame's underlying RDD and reduce; whether this is faster than agg(sum(...)) depends on the workload.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column (PySpark syntax; an empty groupBy() aggregates over the whole DataFrame):
df.groupBy().sum('steps').show()
See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives the count, mean, stddev, min, and max stats for a column. I think it covers all columns if you call df.describe().show() with no arguments.
Using a Spark SQL query, just in case it helps anyone:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
// Needed for toDF and for the Long encoder used by map
import spark.implicits._
// Name the column "steps" so that the SQL query below can refer to it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) // prints 28

Dynamically create dataframes in scala

I am new to Scala and Spark. I need to create DataFrames dynamically by reading a file, where each line of the file is a query. At the end, I want to join all the DataFrames and store the result in a file.
I wrote the basic code below, but I am having trouble creating the DataFrames dynamically.
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql.SQLConf
import scala.io.Source
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val empFile = "/user/sri/sample2.txt"
sqlContext.load("com.databricks.spark.csv", Map("path" -> empFile, "header" -> "true")).registerTempTable("emp")
var cnt=0;
val filename ="emp.sql"
for (line <- Source.fromFile(filename).getLines)
{
println(line)
cnt += 1
//var dis: String = "emp"+cnt
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
println(dis)
//val dis = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
}
println(cnt)
exit
Please help me, and suggest whether it can be done in a better way.
What error are you getting? I assume your code won't compile, considering this line:
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
In Scala, defining the name of some construct (val "emp"+cnt in your case) programmatically is not easily possible.
In your case, you could use a collection to hold the results.
// getLines returns a one-shot Iterator, so materialize the results with toList
val queries = (for (line <- Source.fromFile(filename).getLines) yield {
sqlContext.sql(line)
}).toList
val cnt = queries.length
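If the results need to be looked up by a generated name such as "emp1", a Map can stand in for the dynamically named vals. The sketch below is one possible shape, not part of the original answer: runAll and its run parameter are hypothetical names, and the executor is a stand-in so that the example runs without Spark.

```scala
object QueriesSketch {
  // Runs every query and keeps each result under a generated name ("emp1", "emp2", ...).
  // `run` is a placeholder for whatever executes a query, for example sqlContext.sql.
  def runAll[A](queries: Seq[String])(run: String => A): Map[String, A] =
    queries.zipWithIndex.map { case (q, i) => s"emp${i + 1}" -> run(q) }.toMap

  // Stand-in executor so the sketch runs without Spark: it only measures the query text.
  val results: Map[String, Int] =
    runAll(Seq("SELECT 1", "SELECT count(*) FROM emp"))(_.length)
}
```

With Spark, the call would presumably look like runAll(Source.fromFile(filename).getLines.toList)(sqlContext.sql), giving a Map[String, DataFrame]; how the resulting DataFrames are then joined depends on their schemas.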