Link to (data.csv) and (output.csv)
import org.apache.spark.sql._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val tempDF = spark.read.csv("data.csv")
tempDF.coalesce(1).write.parquet("Parquet")
val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised parquet file and then loaded it; now I want to do all the transformations on the parquet file, just like I did on the csv file given below, and then save the result as a parquet file. Link of (data.csv) and (output.csv)
val header = rdd.first
val rdd1 = rdd.filter(_ != header)
val resultRDD = rdd1.map { r =>
val Array(country, values) = r.split(",")
country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))
import spark.sqlContext.implicits._
val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
dataSet.coalesce(1).write.option("header","true").csv("output")
}
case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements of the array based on Country. I have done this using the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("path", "/path/to/input/data.csv")
.load()
val df1 = df.select(
$"Country",
(split($"Values", ";"))(0).alias("c1"),
(split($"Values", ";"))(1).alias("c2"),
(split($"Values", ";"))(2).alias("c3"),
(split($"Values", ";"))(3).alias("c4"),
(split($"Values", ";"))(4).alias("c5")
)
.groupBy($"Country")
.agg(
sum($"c1" cast "int").alias("s1"),
sum($"c2" cast "int").alias("s2"),
sum($"c3" cast "int").alias("s3"),
sum($"c4" cast "int").alias("s4"),
sum($"c5" cast "int").alias("s5")
)
.select(
$"Country",
concat(
$"s1", lit(";"),
$"s2", lit(";"),
$"s3", lit(";"),
$"s4", lit(";"),
$"s5"
).alias("Values")
)
df1.repartition(1)
.write
.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "/path/to/output")
.save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything else you wish; see the sketch below.
I have repartitioned df1 into a single partition just so that you get a single output file. You can choose whether or not to repartition based on your use case.
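For example, a minimal sketch of writing the same df1 as Parquet instead of CSV (the output path below is just a placeholder, not from the original answer):
df1.repartition(1)
.write
.format("parquet")
.option("path", "/path/to/output_parquet") // placeholder path
.save()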
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as parquet, like you have in your question.
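A minimal sketch of that glue, assuming the Parquet directory was written from data.csv exactly as in the question (so the columns are the default _c0/_c1 string columns and the original CSV header line is still present as a data row):
val parquetFileDF = spark.read.parquet("Parquet")
// parquetFileDF.rdd is an RDD[Row], so rebuild the "country,values" strings
// that the original CSV-based code expects before reusing it unchanged
val rdd = parquetFileDF.rdd.map(row => s"${row.getString(0)},${row.getString(1)}")
val header = rdd.first()
val rdd1 = rdd.filter(_ != header)
// ... apply the same reduceByKey aggregation and toDS() as in the question,
// then write the result as Parquet instead of CSV:
// dataSet.coalesce(1).write.parquet("output_parquet")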
I am reading a CSV file from my local machine using Spark and Scala and storing it into a dataframe (called df). I have to select only a few columns from df, with new alias names, and save them to a new dataframe newDf. I have tried to do the same but I am getting the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`history_temp.time`' given input columns: [history_temp.time, history_temp.poc]
Below is the code written to read the csv file from my local machine.
import org.apache.spark.sql.SparkSession
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
df.show(5) // up to this point the code works fine
val newDf = df.select("history_temp.time","history_temp.poc")
Below are the other variants I tried, but they are not working either.
// val newDf = df.select($"history_temp.time",$"history_temp.poc")
// val newDf = df.select("history_temp.time","history_temp.poc")
// val newDf = df.select( df("history_temp.time").as("TIME"))
// val newDf = df.select(df.col("history_temp.time"))
// df.select(df.col("*")) // This is working
newDf.show(10)
}
}
From the looks of it, your column name format is the issue here. I am guessing the columns are just regular StringType, but when you have a name like history_temp.time, Spark treats it as a reference to a nested (struct) column, which is not the case here. I would rename all of the columns, replacing "." with "_"; then you can run the same select and it should work. You can use foldLeft to replace every "." with "_" like below.
val replacedDf = df.columns.foldLeft(df) { (newdf, colname) =>
newdf.withColumnRenamed(colname, colname.replace(".", "_"))
}
With that done, you can select from replacedDf as below:
val newDf = replacedDf.select("history_temp_time", "history_temp_poc")
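Alternatively, and just as a hedged sketch rather than part of the renaming approach above: I believe Spark also lets you keep the dotted names and escape them with backticks when selecting, e.g.:
// Backticks tell Spark the dot is part of the column name, not a struct field access
// (newDf2 and the TIME/POC aliases are just illustrative names)
val newDf2 = df.select(df.col("`history_temp.time`").as("TIME"), df.col("`history_temp.poc`").as("POC"))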
Let me know how it works out for you.
I have the following file which I need to read using spark in scala -
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
I am currently trying to extract the fields using the following -
spark.read.csv(filepath)
I am new to Spark + Scala and wanted to know if there is a better way to extract the fields based on the #Fields row at the top of the file.
You should use sparkContext's textFile API to read the text file and then pick the field names out of the #Fields line:
val rdd = sc.textFile("filePath")
val header = rdd
.filter(line => line.toLowerCase.contains("#fields:"))
.map(line => line.split(" ").tail)
.first()
That should be it.
Now if you want to create a dataframe, you should parse the header to form a schema, then filter the data lines to form Rows, and finally use SQLContext to create the dataframe:
import org.apache.spark.sql.types._
val schema = StructType(header.map(title => StructField(title, StringType, true)))
val dataRdd = rdd.filter(line => !line.contains("#")).map(line => Row.fromSeq(line.split(" ")))
val df = sqlContext.createDataFrame(dataRdd, schema)
df.show(false)
This should give you
+----------+--------+--------+--------+
|date |time |location|timezone|
+----------+--------+--------+--------+
|2018-02-02|07:27:42|US |LA |
|2018-02-02|07:27:42|UK |LN |
+----------+--------+--------+--------+
Note: if the file is tab delimited, then instead of
line.split(" ")
you should use
line.split("\t")
Sample input file "sample.csv"
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
Test.scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession.Builder
import org.apache.spark.sql._
import scala.util.Try
object Test extends App {
// create spark session and sql context
val builder: Builder = SparkSession.builder.appName("testAvroSpark")
val sparkSession: SparkSession = builder.master("local[1]").getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
val sqlContext: SQLContext = sparkSession.sqlContext
case class CsvRow(date: String, time: String, location: String, timezone: String)
// path of your csv file
val path: String =
"sample.csv"
// read csv file and skip the first two lines
val csvString: Seq[String] =
sc.textFile(path).toLocalIterator.drop(2).toSeq
// try to read only valid rows
val csvRdd: RDD[(String, String, String, String)] =
sc.parallelize(csvString).flatMap(r =>
Try {
val row: Array[String] = r.split(" ")
CsvRow(row(0), row(1), row(2), row(3))
}.toOption)
.map(csvRow => (csvRow.date, csvRow.time, csvRow.location, csvRow.timezone))
import sqlContext.implicits._
// make data frame
val df: DataFrame =
csvRdd.toDF("date", "time", "location", "timezone")
// display data frame
df.show()
}
While trying to create a dataframe using a decimal type, I am getting the error below.
I am performing the following steps:
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.DataTypes._;
//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)
//Created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)
df1 is created without any errors. But when I issue the df1.collect() action, it gives me the below error:
scala.MatchError: 0 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
test_file.txt content:
test1,0
test2,0.67
test3,10.65
test4,-10.1234567890
Is there any issue with the way that I am creating DecimalType?
You should have an instance of BigDecimal to convert to DecimalType.
val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)
val src = sc.textFile("test_file.txt")
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()
The result looks like this:
[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
|-- COL1: string (nullable = true)
|-- COL2: decimal(15,10) (nullable = true)
When you read a file with sc.textFile it reads all the values as strings, so the error is due to applying the schema while creating the dataframe.
To fix this, you can convert the second value to a BigDecimal before applying the schema:
val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))
Or, since you are reading a csv file, you can use spark-csv to read it and provide the schema while reading the file:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For Spark > 2.0
spark.read
.option("header", true)
.schema(sch)
.csv(file)
Hope this helps!
A simpler way to solve your problem would be to load the csv file directly as a dataframe. You can do that like this:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false") // no header
.option("inferSchema", "true")
.load("/file/path/")
Or for Spark > 2.0:
val spark = SparkSession.builder.getOrCreate()
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "false") // no headers
.load("/file/path")
Output:
df.show()
+-----+--------------+
| _c0| _c1|
+-----+--------------+
|test1| 0|
|test2| 0.67|
|test3| 10.65|
|test4|-10.1234567890|
+-----+--------------+
I have a csv file containing double-type values. When I load it into a dataframe I get a message telling me that java.lang.String cannot be cast to java.lang.Double, although my data are numeric. How do I get a dataframe from this csv file with double-typed columns? How should I modify my code?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
import scala.collection.mutable._
object Example extends App {
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val data=spark.read.csv("C://lpsa.data").toDF("col1","col2","col3","col4","col5","col6","col7","col8","col9")
val data2=data.select("col2","col3","col4","col5","col6","col7")
What should I do to transform each row in the dataframe into double type? Thanks
Use select with cast:
import org.apache.spark.sql.functions.col
data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
c => col(c).cast("double")
): _*)
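As a small usage sketch for the DataFrame in the question (data2 is simply the asker's own variable name), you could assign the result and verify the types:
val data2 = data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
c => col(c).cast("double")
): _*)
// the selected columns should now be reported as double
data2.printSchema()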
or pass schema to the reader:
define the schema:
import org.apache.spark.sql.types._
val cols = Seq(
"col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
)
val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")
val schema = StructType(cols.map(
c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
))
and use it as an argument for the schema method:
spark.read.schema(schema).csv(path)
It is also possible to use schema inference:
spark.read.option("inferSchema", "true").csv(path)
but it is much more expensive.
I believe using Spark's inferSchema option comes in handy while reading the csv file. Below is the code to automatically detect your columns as double type:
val data = spark.read
.format("csv")
.option("header", "false")
.option("inferSchema", "true")
.load("C://lpsa.data").toDF()
Note: I am using spark version 2.2.0
origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..
WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...
I loaded the csv with the read method (the origin.csv dataframe), but I am unable to convert it.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
Any idea how to do this?
Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val df = Seq((1,"A1","B1","C1","D1"), (2,"A2","B2","C2","D2"), (3,"A3","B3","C3","D2")).toDF("no", "key1", "key2","key3","key4")
df.show
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
// columns to unpivot: everything except the "by" (id) columns; they must all share one type
val (columns, types) = df.dtypes.filter { case (clm, _) => !by.contains(clm) }.unzip
require(types.distinct.size == 1)
// build an array of (key, val) structs, one per unpivoted column, and explode it into rows
val keys = explode(array(
columns.map(clm => struct(lit(clm).alias("key"), col(clm).alias("val"))): _*
))
val byValue = by.map(col(_))
df.select(byValue :+ keys.alias("_key"): _*).select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}
val df1 = myUDF(df, Seq("no"))
df1.show
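For reference, I believe the same unpivot can also be sketched without the helper function, using Spark SQL's stack generator (column names taken from the example df above; unpivoted is just an illustrative name):
// stack(n, label1, value1, label2, value2, ...) emits one (key, value) row per pair
val unpivoted = df.selectExpr(
"no",
"stack(4, 'key1', key1, 'key2', key2, 'key3', key3, 'key4', key4) as (key, value)"
).select(col("no"), col("value"), col("key")) // same column order as myUDF's output
unpivoted.show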