I have 2 data files: one is a header file and the other is a data file.
The header file (header.txt) has 2 columns (Id, Tags):
Id,Tags
Now I am trying to create a DataFrame schema out of the header file. (I have to use this approach because, in the real data, there are thousands of columns in header.txt and data.txt, so manually creating a case class with 1000 fields is not possible.)
val dataFile = sparkSession.read.format("text").load("data.txt")
val headerFile = sparkSession.sparkContext.textFile("header.txt")
val fields = headerFile
  .flatMap(x => x.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields)
But the line above fails with: Cannot resolve overloaded method StructType.
Can someone please help?
StructType needs an array of StructField, but your fields variable is an RDD[StructField], so collect the RDD to create the StructType:
val fields = headerFile
  .flatMap(x => x.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields.collect)
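For completeness, here is a minimal end-to-end sketch of the same idea, assuming data.txt is comma-separated: build the schema from the header, split each data line into a Row, and apply the schema with createDataFrame.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Build the schema from the header file; collect brings the fields to the driver
val fields = sparkSession.sparkContext.textFile("header.txt")
  .flatMap(_.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields.collect)

// Split each data line into a Row and apply the schema
val rowRDD = sparkSession.sparkContext.textFile("data.txt")
  .map(_.split(",", -1))
  .map(values => Row.fromSeq(values))
val df = sparkSession.createDataFrame(rowRDD, schema)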
I have column names in one .csv file and want to assign them as column headers to a DataFrame in Scala. Since it is a generic script, I don't want to hard-code the names in the script, but rather pass them in from the csv file.
You can do it like this:
val columns = spark.read.option("header","true").csv("path_to_csv").schema.fieldNames
val df: DataFrame = ???
df.toDF(columns:_*).write.format("orc").save("your_orc_dir")
In pyspark:
columns = spark.read.option("header","true").csv("path_to_csv").columns
df.toDF(*columns).write.format("orc").save("your_orc_dir")
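Note that toDF requires the number of names you pass to match the number of columns in the existing dataframe, so the header file has to stay in sync with the data file.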
but storing the data schema separately from the data is a bad idea
When should I use StructType, and when should I use a case class?
I am trying to create a Spark Dataset. I have an input CSV file, and I am trying to create a dataframe first and then convert it to a dataset using df.as[].
Now, in order to generate the schema, should I use StructType or a case class? Please help.
You don't have to use StructType when reading your CSV file, but:
By default all fields will be strings unless you specify the inferSchema option.
You'll have to name every field like this if you don't have a header:
sparkSession.read.csv("my/csv/path.csv").toDF("id","product","customer","time").as[Transaction]
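To make the trade-off concrete, here is a minimal sketch of both approaches; the Transaction case class and the column names are assumptions for illustration. A case class gives you a typed Dataset via as[], while an explicit StructType lets you declare names and types up front without inferSchema.

import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Approach 1: case class + as[] yields a typed Dataset[Transaction]
case class Transaction(id: String, product: String, customer: String, time: String)
import sparkSession.implicits._
val ds = sparkSession.read.csv("my/csv/path.csv")
  .toDF("id", "product", "customer", "time")
  .as[Transaction]

// Approach 2: an explicit StructType avoids both inferSchema and renaming
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("product", StringType),
  StructField("customer", StringType),
  StructField("time", StringType)))
val df = sparkSession.read.schema(schema).csv("my/csv/path.csv")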
I don't want to use the inferSchema and header options. The only way should be to read a file containing only the column headers and use it dynamically to create a dataframe.
I am using Spark 2 to load a single csv file with a user-defined schema, but I want to handle this dynamically, so that once I provide the path of the schema file it will read it, use it as the headers for the data, and convert the data to a dataframe with the schema provided in the schema file.
Suppose the folder I have provided contains 2 files. One file has only the data (a header is not compulsory), and the 2nd file has the schema (the column names). So I have to read the schema file first, then the file containing the data, apply the schema to the data file, and show the result in a dataframe.
Small example, schema.txt contains:
Custid,Name,Product
while the data file has:
1,Ravi,Mobile
From your comments I'm assuming the schema file only contains the column names and is formatted like a csv file (with the column names as the header and without any data rows). The column types will be inferred from the actual data file and are not specified by the schema file.
In this case, the easiest solution is to read the schema file as a csv, setting header to true. This gives an empty dataframe, but with the correct header. Then read the data file and change its default column names to the ones in the schema dataframe.
val schemaFile = ...
val dataFile = ...

// Reading the schema file with header = true gives an empty dataframe
// whose column names come from the schema file's single header line
val colNames = spark.read.option("header", true).csv(schemaFile).columns

// Read the data without a header, infer the types, and rename the columns
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(dataFile)
  .toDF(colNames: _*)
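For the small example above, df.show() should then print something like:

+------+----+-------+
|Custid|Name|Product|
+------+----+-------+
|     1|Ravi| Mobile|
+------+----+-------+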
I would like to convert an RDD containing records of strings, like the ones below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is: how do I use the above two to create a dataframe in Spark? I am using Spark version 2.2.
I searched and found a post: Can I read a CSV represented as a string into Apache Spark using spark-csv. However, it's not exactly what I need, and I can't figure out a way to modify that code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns of the dataframe. In this case it will be an RDD[(String, String, String, Int)], since there are four columns (the last age column is changed to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
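Alternatively, if you prefer to keep every column as a string and avoid hard-coding the tuple arity, here is a sketch using createDataFrame with a schema built from the header variable (assuming rdd is the RDD[String] from above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// One StringType field per name in the header line
val schema = StructType(header.split(",").map(name => StructField(name, StringType, true)))
val rowRDD = rdd.map(_.split(",")).map(values => Row.fromSeq(values))
val df = spark.createDataFrame(rowRDD, schema)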
How do I create an RDD from a CSV file which does not have a header, and how do I combine 2 RDDs on a column, without using Spark SQL?
rdd1 = sc.textFile('transactions.csv')
It depends on whether you want a DataFrame or an RDD. If it is the former, try:
spark.read.format("csv").option("header", "false").load("transactions.csv")
The column names will be auto-generated. If it is the latter, do something like:
rdd1 = sc.textFile('transactions.csv').map(lambda row: row.split(","))
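As for combining 2 RDDs on a column: one way is to key each RDD by the join column and use join on the resulting pair RDDs. A minimal sketch in Scala (keyBy and join work the same way in pyspark); the second file name and the choice of the first field as the key are assumptions for illustration:

// Key each RDD by the join column (here: the first field), then join
val transactions = sc.textFile("transactions.csv").map(_.split(","))
val customers = sc.textFile("customers.csv").map(_.split(","))

// joined has type RDD[(String, (Array[String], Array[String]))]
val joined = transactions.keyBy(_(0)).join(customers.keyBy(_(0)))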