Why is there a difference when importing a CSV with Spark - postgresql

I have this CSV file, payments.csv; for some particular rows the timestamp changes by itself on import. The screenshots show the first 3 lines for easier understanding.
import spark.implicits._
import org.apache.spark.sql.functions.{col,when,to_date,row_number,date_add,expr}
import org.apache.spark.sql.expressions.{Window}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
//Importing the csv
val df = spark.read.option("header","true").option("inferSchema","true").csv("payment.csv")
val df2 = df.filter($"payment_id" === 21112)
df2.show()
val time_value = df2.collect()(0)(5)
println(time_value)
I am clueless about it as of now.
Screenshots:
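One way to narrow down where the change happens is to read the same file with schema inference turned off, so the column keeps its raw string value, and compare that with the inferred result (timestamp parsing and the session time zone are the usual suspects when inferSchema is on). A minimal sketch, assuming the same file and the same column index used in the snippet above:

// Hedged debugging sketch: load the CSV without inferSchema so every column stays a string.
val rawDf = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .csv("payment.csv")

// Same row and same column position (index 5) the original snippet reads.
rawDf.filter($"payment_id" === 21112)
  .select(rawDf.columns(5))
  .show(false)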

Related

Spark badRecordsPath is not writing records to the Path as expected

I have the following sample CSV data:
id,name,salary
1,"Raju",1000
2,"Gautam",15000
3,"Kishan",30000
4,"Mike",two hundread
The salary field in the last record is corrupted.
I am trying to handle the corrupt record with badRecordsPath as shown in the code below, but it is not working. I am using Spark 3.0.3, Scala 2.12 and Windows 10.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
object BadDataPathExample extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "BadDataPathExample")
  sparkConf.set("spark.master", "local[2]")

  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  val schema_string = "id int, name String, salary int"

  Logger.getLogger(getClass.getName).info(">> Starting to read Data")

  // read CSV
  val badDF = spark.read
    .format("csv")
    .option("header", true)
    .schema(schema_string)
    .option("badRecordsPath", "D:/spark_practice/bad_dir")
    .option("path", "D:/spark_practice/data/bad_emp.csv")
    .load

  badDF.show()
  badDF.printSchema()
}
The Output from the above code is as below:
As we can see, the record is present with the corrupted column value set to null, which comes from the default behavior of "PERMISSIVE" mode. Also, no record is being written to the bad records path specified.
But the same code works as expected in Databricks, as shown below.
What am I doing wrong? Or is badRecordsPath a Databricks-specific feature?
badRecordsPath is a Databricks-specific feature.
We can see the logic in the source code of FailureSafeParser:
class FailureSafeParser[IN](...) {
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError(e)
      }
    }
  }
}
Hmm, I have an idea for refactoring this code: when the badRecordsPath option is set, force the mode to DropMalformedMode and ignore whatever mode the user configured. DropMalformedMode would then parse the rows, write the ones that throw an exception to badRecordsPath, and return an empty Iterator for them.
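For open-source Spark, a rough stand-in (just a sketch, not something the answer above proposes) is PERMISSIVE mode together with columnNameOfCorruptRecord: add an extra string column to the schema, let malformed rows land there, and write them out yourself. The paths below reuse the ones from the question.

import org.apache.spark.sql.functions.col

// Extra string column to hold the raw text of malformed rows.
val schemaWithCorrupt = "id int, name string, salary int, _corrupt_record string"

val parsed = spark.read
  .format("csv")
  .option("header", true)
  .option("mode", "PERMISSIVE") // the default mode, stated explicitly
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load("D:/spark_practice/data/bad_emp.csv")
  .cache() // avoids the restriction on querying only the corrupt-record column

val goodDF = parsed.filter(col("_corrupt_record").isNull).drop("_corrupt_record")
val badRows = parsed.filter(col("_corrupt_record").isNotNull)

// Roughly what badRecordsPath does on Databricks: persist the rejected rows somewhere.
badRows.write.mode("overwrite").json("D:/spark_practice/bad_dir")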

Dropping external table in spark is dropping the location or data too

import org.apache.hadoop.fs.{Path,FileSystem}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{current_date, from_unixtime, unix_timestamp}
import spark.implicits._
spark.sql("drop table if exists Staging.test_partiton_drop")
val df = Seq(("Aravind", "DIP"), ("Karthik", "DTP")).toDF("name", "dept")
  .withColumn("hdp_load_dt", current_date())
  .withColumn("hdp_load_hr", from_unixtime(unix_timestamp(), "HH"))

val partitionBySeq = Seq("hdp_load_dt", "hdp_load_hr")

df.write
  .partitionBy(partitionBySeq: _*)
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", "/krnoir/streaming_test")
  .option("compression", "snappy")
  .saveAsTable("Staging.test_partiton_drop")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val finalPath = new Path("/krnoir/streaming_test/_REPLICATION.DONE")
fs.create(finalPath,false)
When I run the above code, /krnoir/streaming_test/ gets overwritten on each run, while I am expecting new files to be added, not overwritten.
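One thing worth checking (a hedged guess at the cause, not a confirmed answer): whether Spark actually registered the table as EXTERNAL. If the catalog records it as MANAGED, the drop table at the top removes the underlying files on every run, which would look exactly like the path being overwritten. A quick check:

// Look at the "Type" row of the table description: EXTERNAL vs MANAGED.
spark.sql("DESCRIBE EXTENDED Staging.test_partiton_drop")
  .filter($"col_name" === "Type")
  .show(false)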

select the first element after sorting column and convert it to list in scala

What is the most efficient way to sort one column in a data frame, convert it to a list, and assign the first element to a variable in Scala? I tried the following:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, regexp_replace}
import org.apache.spark.sql.functions._
println(CONFIG.getString("spark.appName"))
val conf = new SparkConf()
.setAppName(CONFIG.getString("spark.appName"))
.setMaster(CONFIG.getString("spark.master"))
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.format("com.databricks.spark.csv").option("delimiter", ",").load("file.csv")
val dfb = df.sort(desc("_c0"))
val list = df.select(df("_c0")).distinct
but I'm still not able to save the first element as a variable.
Use select, orderBy, map & head.
Assuming column _c0 is of type string; if it is a different type, change the type parameter in _.getAs[<your column datatype>].
Check the code below.
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.map(_.getAs[String](0))
.head
Or
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.head
.getAs[String](0)
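Since only the top value is needed, a full sort can also be skipped entirely; an aggregate is usually cheaper than ordering the whole column. A sketch, again assuming _c0 is a string:

import org.apache.spark.sql.functions.max

// max over a string column uses the same ordering as orderBy(desc),
// so this returns the same value as taking the head of the descending sort.
val first = df.agg(max($"_c0")).head.getAs[String](0)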

Spark dataframe join is failing if the key column contains a period (".") at the end

I am getting the below exception when I join two dataframes in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column in the dataframe does not contain a period. Please do help me out.
You can find the code that I am using below.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might just solve the issue.
That said, you can simply use withColumnRenamed to rename the column to something without a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3 = dfTmp.join(df2, dfTmp.col("JOIN_COL").equalTo(df2.col("col3")), "outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so in your original code you probably meant df3 to be the join expression without .show, and then call df3.show separately.
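Another option worth knowing about, hedged because it may not behave on a release as old as 1.5: escape the dotted name with backticks so the resolver treats it as a single column name rather than a struct-field access.

// Backticks keep "col.1" from being parsed as struct access col -> field 1.
val quoted = "`" + val1 + "`"
val df3 = df1.join(df2, df1.col(quoted).equalTo(df2.col("col3")), "outer")
df3.show()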

How to use Spark-Scala to download a CSV file from the web?

Hello world,
How to use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame?
Currently I depend on curl in a shell command to get my CSV file.
Here is the syntax I want to enhance:
/* fb_csv.scala
This script should load FB prices from Yahoo.
Demo:
spark-shell -i fb_csv.scala
*/
// I should get prices:
import sys.process._
"/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB"!
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val fb_df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")
fb_df.head(9)
I want to enhance the above script so it is pure Scala with no shell syntax inside.
// Note: this produces a single-column DataFrame of raw lines; the CSV is not parsed into columns here.
val content = scala.io.Source.fromURL("http://ichart.finance.yahoo.com/table.csv?s=FB").mkString
val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(list)
val df = rdd.toDF
Found a better answer in Process CSV from REST API into Spark.
Here you go:
import scala.io.Source._
import org.apache.spark.sql.{Dataset, SparkSession}
import spark.implicits._ // needed for .toDS()

val url = "http://ichart.finance.yahoo.com/table.csv?s=FB"
val res = fromURL(url).mkString.stripMargin.lines.toList
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()
val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
frame.printSchema()
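One note on this approach (my own observation, not part of the original answer): the whole file is fetched on the driver before being parallelized, which is fine for a small quote history like this but not for large downloads. To mirror the final step of the original script:

// Same quick inspection as the curl-based version of the script.
frame.head(9)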