How to pass a schema (metadata) as an argument to a Spark dataframe/dataset read (schema name as an argument) - Scala

I want to pass a schema (metadata) as an argument to a Spark dataframe/dataset read.
I'm using Spark 2.x.
Code (sample):
// Define metadata like below.
import org.apache.spark.sql.types._

val df_emp_metadata = StructType(List(
  StructField("emp_id", StringType, true),
  StructField("emp_hier_dt", DateType, true),
  StructField("dept_id", IntegerType, true)
))

val df_dept_metadata = StructType(List(
  StructField("dept_id", IntegerType, true),
  StructField("dept_name", StringType, true)
))
I want to pass df_emp_metadata or df_dept_metadata as an argument to spark-submit and use it as the schema variable below.
val meta_Data = args(0) // (df_emp_metadata or df_dept_metadata from spark-submit)
val readFileIn = spark.sqlContext.read
  .format("csv")
  .schema($meta_Data)
  .load("data/source_file.csv")
Spark does not allow passing the schema variable name as an argument like this.
Please suggest any alternative ways to do this in Spark/Scala programming.

A simple if-else statement works. You can select the schema by passing "1", "2", or anything else as the argument.
val argument = args(0)
val schema = if (argument == "1") df_emp_metadata else df_dept_metadata
val readFileIn = spark.sqlContext.read
  .format("csv")
  .schema(schema)
  .load("data/source_file.csv")

Related

Unable to filter CSV columns stored in dataframe in spark 2.2.0

I am reading a CSV file from my local machine using Spark and Scala and storing it in a dataframe (called df). I have to select only a few columns, with new alias names, from df and save them to a new dataframe newDf. I tried to do the same but I am getting the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`history_temp.time`' given input columns: [history_temp.time, history_temp.poc]
Below is the code written to read the CSV file from my local machine.
import org.apache.spark.sql.SparkSession

object DataLoadConversion {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
    val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()

    val df = spark.read.format("com.databricks.spark.csv")
      .option("quote", "\"")
      .option("escape", "\"")
      .option("delimiter", ",")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("inferSchema", "true")
      .load("file:///C:/Users/an/Desktop/ct_temp.csv")

    df.show(5) // working fine up to this point

    val newDf = df.select("history_temp.time", "history_temp.poc")
Below are the lines I tried, but none of them work.
    // val newDf = df.select($"history_temp.time", $"history_temp.poc")
    // val newDf = df.select("history_temp.time", "history_temp.poc")
    // val newDf = df.select(df("history_temp.time").as("TIME"))
    // val newDf = df.select(df.col("history_temp.time"))
    // df.select(df.col("*")) // This is working

    newDf.show(10)
  }
}
From the looks of it, your column name format is the issue here. I am guessing the columns are just regular StringType, but when a name contains a dot, like history_temp.time, Spark treats it as a nested (struct) field access, which is not the case here. I would rename all of the columns and replace "." with "_"; then you can run the same select and it should work. You can use foldLeft to replace every "." with "_", like below.
val replacedDF = df.columns.foldLeft(df) { (newDf, colName) =>
  newDf.withColumnRenamed(colName, colName.replace(".", "_"))
}
With that done, you can select from replacedDF as below:
val newDf = replacedDF.select("history_temp_time", "history_temp_poc")
Let me know how it works out for you.
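As a side note (not part of the answer above): Spark also lets you keep the dotted names and escape them with backticks when selecting, so renaming is not strictly required. A small sketch of that alternative:

// Backticks make Spark treat the whole dotted string as one column name
// rather than as a struct field access.
val newDf = df.select(
  df("`history_temp.time`").as("TIME"),
  df("`history_temp.poc`").as("POC")
)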

How to write file to cassandra from spark

I am new to Spark and Cassandra. I use this code, but it gives me an error.
val dfprev = df.select(col = "se","hu")
val a = dfprev.select("se")
val b = dfprev.select("hu")
val collection = sc.parallelize(Seq(a,b))
collection.saveToCassandra("keyspace", "table", SomeColumns("se","hu"))
When I run this code, saveToCassandra gives me the following error:
java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
  at com.datastax.spark.connector.util.Reflect$.methodSymbol(Reflect.scala:16)
  at com.datastax.spark.connector.util.ReflectionUtil$.constructorParams(ReflectionUtil.scala:63)
  at com.datastax.spark.connector.mapper.DefaultColumnMapper.<init>(DefaultColumnMapper.scala:45)
  at com.datastax.spark.connector.mapper.LowPriorityColumnMapper$class.defaultColumnMapper(ColumnMapper.scala:51)
  at com.datastax.spark.connector.mapper.ColumnMapper$.defaultColumnMapper(ColumnMapper.scala:55)
val dfprev = df.select("se","hu")
dfprev.write.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace"->"YOUR_KEYSPACE_NAME","table"->"YOUR_TABLE_NAME"))
.mode(SaveMode.Append)
.save()
Variables a and b are DataFrames. sc.parallelize creates an RDD from a collection of elements; it does not accept a DataFrame as input.
Note: set spark.cassandra.connection.host, plus spark.cassandra.auth.username and spark.cassandra.auth.password (if authentication is enabled), in the SparkConf.
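A minimal sketch of setting those connector properties when building the session (the host and credential values below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CassandraWrite")
  .config("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point
  .config("spark.cassandra.auth.username", "cassandra")   // only if authentication is enabled
  .config("spark.cassandra.auth.password", "cassandra")   // only if authentication is enabled
  .getOrCreate()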

How to get a DataFrame from a list in Spark Scala

I have a list of RDDs. I have iterated over each RDD and, for each element, I am applying some parsing logic. Finally I am getting:
val mRdd = nRdd.map { ele =>
  // parsing logic; I have the fields below
  colum = Array[String]       // example: ['id','name','dept']
  c_type = Array[String]      // example: ['Int','String','String']
  value = ArrayBuffer[String] // [1,lucy,it] [2,denis,cs]
}
How can I get the list of DataFrames from mRdd?
I tried some logic to create a DataFrame, but in that case I would have to create an RDD first, and I can't create an RDD inside an RDD.
I am new to Spark. I am using Spark 1.6.3.
Please help me.
In order to convert an RDD into a DataFrame, you would need to do one of the following:
Approach 1 - use the createDataFrame function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here; must end up as an RDD[Row]
  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("dept", StringType)
  ))
  sqlContext.createDataFrame(parsedRDD, schema)
}
Read more about this approach here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
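For the "apply parse logic here" step, here is a minimal sketch of producing the RDD[Row] that createDataFrame expects, assuming (hypothetically) that each ele is an RDD[Array[String]] with elements like Array("1", "lucy", "it"):

import org.apache.spark.sql.Row

// RDD[Array[String]] -> RDD[Row], casting each value to the type
// declared in the schema (id: Int, name: String, dept: String).
val parsedRDD = ele.map(fields => Row(fields(0).toInt, fields(1), fields(2)))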
Approach 2 - use the toDF implicit function:
import sqlContext.implicits._
val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here; must be an RDD of tuples or case classes
  val columns = Seq("id", "name", "dept")
  parsedRDD.toDF(columns: _*)
}

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
val spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
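Since the file in the question is split on ";", you would also set the separator; a small sketch using the built-in csv reader's "sep" option:

val df = spark.read
  .option("header", "false")
  .option("sep", ";") // the file in the question is semicolon-delimited
  .csv("file.txt")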
For Spark versions < 2.0:
The easiest way is to use spark-csv - include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (at the cost of an extra scan of the data).
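A sketch of that spark-csv read (assuming the spark-csv package is already on the classpath; adjust the options to your file):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")      // custom delimiter
  .option("header", "false")     // set to "true" if the file has a header row
  .option("inferSchema", "true") // extra pass over the data to infer column types
  .load("file.txt")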
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)

val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}

myFile1.toDF() // DataFrame will have columns "id" and "name"
Here are different ways to create a DataFrame from a text file.
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(","))
  .map { case Array(a, b, c) => (a, b.toInt, c) }
  .toDF("name", "age", "city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession

val sparkSess = SparkSession.builder().appName("SparkSessionZipsExample")
  .config(conf).getOrCreate()

val df = sparkSess.read.option("header", "false")
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._

val schemaString = "name age city"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val dfWithSchema = sparkSess.read.option("header", "false")
  .schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
  .map(_.split(","))
  .map(x => org.apache.spark.sql.Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not be able to convert it into a DataFrame until you use the implicit conversions.
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
Only after this can you convert it to a DataFrame.
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt") // Dataset[String], one line per record
case class Abc(amount: Int, types: String, id: Int) // columns and data types
val df2 = df.map { rec => val f = rec.split(","); Abc(f(0).toInt, f(1), f(2).toInt) }
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A text file with PIPE (|) delimited fields can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this, but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show()
You can read a file into an RDD and then assign a schema to it. Two common ways of creating the schema are using either a case class or a Schema object [my preferred one]. Quick snippets of code that you may use follow.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since a case class is limited to at most 22 fields (in Scala 2.10; later Scala versions lift this limit), and this will be a problem if your file has more than 22 fields!

How to convert RDD of Avro's GenericData.Record to DataFrame?

Perhaps this question may seem a bit abstract; here it is:
val originalAvroSchema : Schema = // read from a file
val rdd : RDD[GenericData.Record] = // From some streaming source
// Looking for a handy:
val df: DataFrame = rdd.toDF(schema)
I explored spark-avro, but it only supports reading from a file, not from an existing RDD.
import com.databricks.spark.avro._
val sqlContext = new SQLContext(sc)
val rdd : RDD[MyAvroRecord] = ...
val df = rdd.toAvroDF(sqlContext)
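If a helper like toAvroDF is not available in your spark-avro version, an alternative sketch (assuming spark-avro's SchemaConverters is accessible and that every Avro field maps cleanly to a Spark SQL type) is to convert the Avro schema to a StructType and build the Rows yourself:

import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import scala.collection.JavaConverters._

// Convert the Avro schema into the equivalent Spark SQL schema.
val sqlSchema = SchemaConverters.toSqlType(originalAvroSchema)
  .dataType.asInstanceOf[StructType]

// Capture only the field names (the Avro Schema object itself may not be serializable).
val fieldNames = originalAvroSchema.getFields.asScala.map(_.name).toArray

// Map each GenericData.Record to a Row in schema order. Avro runtime types
// such as Utf8 strings are converted to plain String values.
val rowRdd = rdd.map { record =>
  Row(fieldNames.map { name =>
    record.get(name) match {
      case utf8: org.apache.avro.util.Utf8 => utf8.toString
      case other                           => other
    }
  }: _*)
}

val df = sqlContext.createDataFrame(rowRdd, sqlSchema)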