Not able to read pipe-delimited CSV - Scala

Input data:
Ord_value|other_data
12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;
Details of the u variables:
u1 = order id (single value)
u5 = pid (a list)
u15 = name (a list)
u16 = price (a list)
u17 = quantity (a list)
Output:
Ord_value|orderid|pid|name|price|quantity
12345|876435|4356|Mr. Noodles|0.77|4
12345|876435|4357|n/a|4.92|1
I tried reading the file using a semicolon delimiter:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._

object example3 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop\\")
    val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("Error")
    import spark.implicits._
    // val rdd1 = sc.textFile("file:///C://Users//User//Desktop//example3.txt")
    // rdd1.map(x => x.split(";")).foreach(println)
    spark.read.option("delimiter", ";").option("header", "true").load("file:///C://Users//User//Desktop//example3.txt").show()
  }
}
I am not able to read the file; I get the error below. It looks like a complex file.
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=file:/C:/Users/User/Desktop/example3.txt; isDirectory=false; length=95; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I tried again with a pipe delimiter:
val df1=spark.read.option("delimiter","|").csv("file:///C://Users//User//Desktop//example3.txt")
+-----+-----------------+----+------------------------------------------------------------------+
|_c0 |_c1 |_c2 |_c3 |
+-----+-----------------+----+------------------------------------------------------------------+
|12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;|
+-----+-----------------+----+------------------------------------------------------------------+
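The "Could not read footer" error from the first attempt happens because spark.read ... .load() without .format("csv") uses Spark's default Parquet source. A plain pipe-delimited CSV read cannot work either, because the u5 list uses "|" as its own internal separator, so the row gets split in the wrong places (as the second attempt shows). One possible approach, sketched below and untested against your exact file (names such as parsed and kv are only illustrative), is to read each line as text, split off Ord_value at the first "|", parse the u key=value pairs, and zip the parallel lists into one output row per position:

import spark.implicits._

val lines = spark.read.textFile("file:///C://Users//User//Desktop//example3.txt")

val parsed = lines
  .filter(!_.startsWith("Ord_value"))              // drop the header line, if present
  .map { line =>
    val Array(ordValue, rest) = line.split("\\|", 2)
    // the u key=value pairs are separated by ';'
    val kv = rest.split(";").filter(_.contains("=")).map { pair =>
      val Array(k, v) = pair.split("=", 2)
      k -> v
    }.toMap
    val pids   = kv("u5").split("\\|")              // u5 uses '|' inside its list
    val names  = kv("u15").split(",")
    val prices = kv("u16").split(",")
    val qtys   = kv("u17").split(",")
    (ordValue, kv("u1"), pids, names, prices, qtys)
  }
  .flatMap { case (ordValue, orderId, pids, names, prices, qtys) =>
    pids.indices.map(i => (ordValue, orderId, pids(i), names(i), prices(i), qtys(i)))
  }
  .toDF("Ord_value", "orderid", "pid", "name", "price", "quantity")

parsed.show(false)

This yields one row per element of the u5/u15/u16/u17 lists (three for the sample line), in the layout of the expected output.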

Related

Spark badRecordsPath is not writing records to the Path as expected

I have the following sample CSV data:
id,name,salary
1,"Raju",1000
2,"Gautam",15000
3,"Kishan",30000
4,"Mike",two hundread
The salary field in the last record is corrupted.
I am trying to handle the corrupt record with badRecordsPath as shown in the code below, but it is not working. I am using Spark 3.0.3, Scala 2.12 and Windows 10.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
object BadDataPathExample extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "BadDataPathExample")
  sparkConf.set("spark.master", "local[2]")

  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  val schema_string = "id int, name String, salary int"

  Logger.getLogger(getClass.getName).info(">> Starting to read Data")

  // read CSV
  val badDF = spark.read
    .format("csv")
    .option("header", true)
    .schema(schema_string)
    .option("badRecordsPath", "D:/spark_practice/bad_dir")
    .option("path", "D:/spark_practice/data/bad_emp.csv")
    .load

  badDF.show()
  badDF.printSchema()
}
The output from the above code shows the corrupted record present with its salary value set to null, which comes from the default "PERMISSIVE" mode, and no record is written to the bad records path I specified. The same code works as expected on Databricks.
What am I doing wrong? Or is badRecordsPath a Databricks-specific feature?
badRecordsPath is a Databricks-specific feature.
We can see the relevant logic in Spark's FailureSafeParser source:
class FailureSafeParser[IN](/* ... */) {
  // ...
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError(e)
      }
    }
  }
}
Hmm... I have an idea for refactoring this code: when the badRecordsPath option is present, force the mode to DropMalformedMode and ignore whatever mode the user set. DropMalformedMode would then catch the rows that fail to parse, write them to badRecordsPath, and return an empty iterator.
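In the meantime, a workaround commonly used on open-source Spark (a sketch, not a drop-in replacement for badRecordsPath; the column name _corrupt_record and the JSON output below are just one way to do it) is to read in PERMISSIVE mode with a corrupt-record column and split the good and bad rows yourself:

import org.apache.spark.sql.functions.col

// extra string column that holds the raw text of any malformed row
val schemaWithCorrupt = "id int, name string, salary int, _corrupt_record string"

val df = spark.read
  .format("csv")
  .option("header", true)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load("D:/spark_practice/data/bad_emp.csv")
  .cache()  // caching avoids the restriction on querying only the corrupt-record column

val badDF  = df.filter(col("_corrupt_record").isNotNull)
val goodDF = df.filter(col("_corrupt_record").isNull).drop("_corrupt_record")

badDF.write.mode("overwrite").json("D:/spark_practice/bad_dir")
goodDF.show()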

Apache Spark shortest job scala

I am new to Apache Spark and Scala programming. I am writing code in Scala using the Apache Spark API docs. My goal is to create a graph, place objects on it, and compute shortest paths. I have written a program to generate a CSV file of the objects I want to use; it consists of vehicleID, source, and destination.
It is as follows:
[My sample csv file][1]
[1]: https://i.stack.imgur.com/KtSVz.png
My code to generate the CSV file:
import java.io.BufferedWriter
import java.io.FileWriter
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import scala.util.Random
import au.com.bytecode.opencsv.CSVWriter
import scala.collection.mutable

class MakeCSV() {
  def csvBuilder(dx: Int): Unit = {
    val outputfile = new BufferedWriter(new FileWriter("vehicles.csv"))
    val csvWriter = new CSVWriter(outputfile)
    val csvFields = Array("Vehicle-id", "Source", "Destination")
    val vehicleID = (0 to dx).toList
    val sourceList = mutable.MutableList[String]()
    val destinationList = mutable.MutableList[String]()
    var i, sx, sy, dsx, dsy = 0
    for (i <- 0 to dx) {
      sx = Random.nextInt(dx)
      sy = Random.nextInt(dx)
      dsx = Random.nextInt(dx)
      dsy = Random.nextInt(dx)
      sourceList.+=((sx, sy).toString())
      destinationList.+=((dsx, dsy).toString())
    }
    var listOfRecords = new ListBuffer[Array[String]]()
    listOfRecords += csvFields
    for (i <- 0 to dx) {
      listOfRecords += Array(i.toString, sourceList(Random.nextInt(sourceList.length)), destinationList(Random.nextInt(destinationList.length)))
    }
    csvWriter.writeAll(listOfRecords.asJava)
    csvWriter.close()
  }
}
My main file:
import java.io.PrintWriter
import scala.io.StdIn
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators

object MainFile {
  def main(args: Array[String]): Unit = {
    // Vehicle CSV file generation
    println("Enter the number of cars")
    val input = StdIn.readInt()
    val makecsv = new MakeCSV()
    makecsv.csvBuilder(input)

    // Spark job configuration
    val conf = new SparkConf().setAppName("DjikstraShortestPath")
    val sc = new SparkContext(conf)

    // Graph generation
    println("Enter the number of rows for grid")
    val row = StdIn.readInt()
    println("Enter the number of columns for grid")
    val column = StdIn.readInt()
    val graph: Graph[(Int, Int), Double] = GraphGenerators.gridGraph(sc, row, column)

    // Vehicle file opening
    // For each vehicle, compute the shortest path using the source/destination in the CSV file
  }
}
Now I want to open that CSV file and, using each vehicle's source and destination, compute the shortest path for each vehicle on the graph generated above. Can anyone help me? How do I open and read the CSV file and find the shortest paths?
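One possible direction (a rough, untested sketch; it assumes the coordinates written by MakeCSV actually correspond to cells of the grid, and names such as coordToId and trips are only illustrative) is to read the CSV with Spark's CSV reader, map each "(row,col)" string to a GraphX vertex id via the grid graph's (row, col) vertex attributes, and then use GraphX's built-in ShortestPaths, which computes unweighted hop-count distances to a set of landmark vertices. For weighted Dijkstra-style paths you would need a custom Pregel program instead.

import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.sql.SparkSession

// reuse the existing SparkContext for a SparkSession so the CSV can be read as a DataFrame
val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
val vehicles = spark.read.option("header", "true").csv("vehicles.csv")

// gridGraph vertex attributes are (row, col); build a lookup from the coordinate
// string written by MakeCSV, e.g. "(3,5)", to the GraphX vertex id
val coordToId = graph.vertices
  .map { case (id, (r, c)) => (s"($r,$c)", id) }
  .collectAsMap()

// resolve each vehicle's source and destination to vertex ids (skipping unknown coordinates)
val trips = vehicles.collect().flatMap { row =>
  for {
    src <- coordToId.get(row.getString(1))
    dst <- coordToId.get(row.getString(2))
  } yield (row.getString(0), src, dst)
}

// hop-count shortest paths from every vertex to the destination (landmark) vertices
val landmarks = trips.map(_._3).distinct.toSeq
val result = ShortestPaths.run(graph, landmarks)

trips.foreach { case (vehicleId, src, dst) =>
  val hops = result.vertices.lookup(src).headOption.flatMap(_.get(dst))
  println(s"Vehicle $vehicleId: $src -> $dst, hops = ${hops.getOrElse("unreachable")}")
}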

how to fix Scala error with "Not found type"

I'm a newbie in Scala, just trying to learn it with Spark. I'm writing a Scala app to load a CSV file from Hadoop into a DataFrame, and then I want to add a new column to that DataFrame. There is a function that populates the content of the new column; for testing, the function just uppercases the column from the CSV file. The CSV file contains only one column, emp_id, and it's a string. The function is defined in the object TestService. My IDE is Eclipse. Now I get the error: not found: type TestService.
I would really appreciate it if anyone could help me.
This is the main:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import com.poc.spark.service.TestService

object SparkIntTest {
  def main(args: Array[String]): Unit = {
    sys.props.+=(("hadoop.home.dir", "C:\\OpenSource\\Hadoop"))
    val sparkConf = new SparkConf().setMaster("local").setAppName("employee").set("spark.testing.memory", "2147480000")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder().appName("employee").getOrCreate()

    val df = spark.read.option("header", "true").csv(".\\src\\main\\resources\\employee.csv")
    df.show()
    println(df.schema)

    val df_Applied = df.withColumn("award_rule", runAllRulesUDF(df("emp_id")))
    df_Applied.show()
    println(df_Applied.schema)
  }

  def runAllRulesUDF = udf(new TestService().runAllRulesForUDF(_: String))
}
Here is the Object TestService:
package com.poc.spark.service

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._

object TestService {
  def runAllRulesForUDF(empid: String): String = {
    empid.toUpperCase()
  }
}
TestService is an object, which means it is a statically created singleton. So instead of
new TestService()
you can just write
TestService
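Applied to the code above, the UDF definition would then look like this (a minimal sketch of the fix):

// reference the singleton directly instead of instantiating it
def runAllRulesUDF = udf(TestService.runAllRulesForUDF(_: String))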

Spark-Scala writing the output in a textfile

I am executing the word count program in Spark and trying to store the result in a text file.
I have a Scala script to count the words, SparkWordCount.scala. I am trying to execute the script from the Spark console as below.
scala> :load /opt/spark-2.0.2-bin-hadoop2.7/bin/SparkWordCount.scala
Loading /opt/spark-2.0.2-bin-hadoop2.7/bin/SparkWordCount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._
defined object SparkWordCount
scala>
After the script is executed I get the message "defined object SparkWordCount", but I am not able to see the output result in the text file.
My word count program is below.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Word Count", "/opt/spark-2.0.2-bin-hadoop2.7", Seq("/opt/spark-2.0.2-bin-hadoop2.7/jars"), Map())
    val input = sc.textFile("demo.txt")
    val count = input.flatMap(line ⇒ line.split(" ")).map(word ⇒ (word, 1)).reduceByKey(_ + _)
    count.saveAsTextFile("outfile")
  }
}
Can anyone please suggest a fix? Thanks.
Once the object is defined, you have to call its method to execute your code; spark-shell won't run the main method automatically. In your case you can use SparkWordCount.main(Array()) to execute your word count program.
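For example, the shell session would look something like this (a sketch; as an aside, spark-shell already provides a SparkContext as sc, so reusing it instead of constructing a new one inside main may also be worth considering):

scala> :load /opt/spark-2.0.2-bin-hadoop2.7/bin/SparkWordCount.scala
scala> SparkWordCount.main(Array())

The result is then written as part files under the "outfile" directory.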

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I do a join between two DataFrames in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the DataFrame column does not contain a period. Please help me out.
The code I am using is below.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might solve the issue on its own.
That said, you can simply use withColumnRenamed to rename the column to something without a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the join expression without .show, and then to call df3.show separately.
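Another option that may work, depending on your Spark version (a suggestion I have not verified on 1.5 specifically), is to escape the column name with backticks when referencing it, so the period is not treated as struct-field access:

// backticks make the resolver treat the whole name, period included, as one column
val df3 = df1.join(df2, df1.col("`col.1`").equalTo(df2.col("col3")), "outer")
df3.show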