Trying to create a DataFrame from a file with delimiter '|' - Scala

I want to load a text file that uses the delimiter "|" into a DataFrame in Spark.
One way is to create an RDD and use toDF to create the DataFrame. However, I was wondering if I can create the DF directly.
As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")

For Spark 2.x
val df = spark.read.format("csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0 (using the databricks spark-csv package):
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
You can add more options in the same statement, e.g. option("header", "true") to read the header row.

You can specify the delimiter in the 'read' options:
spark.read
.option("delimiter", "|")
.csv("/user/danishdshadab786/paper2/products/")
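The CSV reader handles the pipe delimiter for you, but if you take the RDD + toDF route mentioned in the question, note that String.split takes a regex, so an unescaped "|" means alternation and splits between every character. A minimal, Spark-free sketch (the product record is hypothetical):

```scala
// Hypothetical pipe-delimited product record
val line = "1001|Laptop|799.99"

// "|" is regex alternation: this splits between every character
val wrong = line.split("|")

// Escaping the pipe splits on the literal delimiter
val right = line.split("\\|")

println(wrong.mkString("[", ", ", "]"))  // one-character fields
println(right.mkString("[", ", ", "]"))  // [1001, Laptop, 799.99]
```

This is why splitting manually on "|" often silently produces one-character fields; the built-in CSV reader's delimiter option avoids the pitfall entirely.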

Related

Load CSVs - unable to pass file paths from dataframe

The code below works fine:
val Path = Seq (
"dbfs:/mnt/testdata/2019/02/Calls2019-02-03.tsv",
"dbfs:/mnt/testdata/2019/02/Calls2019-02-02.tsv"
)
val Calls = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.schema(schema)
.load(Path: _*)
But I want to get the paths from a DataFrame, and the code below is not working.
val tsvPath =
Seq(
FinalFileList
.select($"Path")
.filter($"FileDate">MaxStartTime)
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
)
val Calls = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.schema(schema)
.load(tsvPath: _*)
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/mnt/testdata/2019/02/Calls2019-02-03.tsv,dbfs:/mnt/testdata/2019/02/Calls2019-02-02.tsv;
Looks like it is taking the path as "/mnt/file1.tsv, /mnt/file2.tsv" instead of "/mnt/file1.tsv","/mnt/file2.tsv"
I suspect your problem is here:
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
.mkString combines the strings together into one. One possible solution here is to split again after replacing:
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
.split(",")
Another would be to clean each element instead of combining them into a single string (use map rather than foreach, since foreach discards its results and strings are immutable):
.collect.map(_.toString.replaceAll("[\\[\\]]",""))
Whichever is better suited to you.
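The failure mode is easy to reproduce without Spark: collect() yields Rows that print as "[value]", and mkString collapses them into one string, which load() then treats as a single (nonexistent) path. A plain-Scala sketch of the broken and fixed versions, using made-up path strings:

```scala
// Simulating the collected rows: a Row with one column prints as "[path]"
val collected = Seq("[dbfs:/mnt/file1.tsv]", "[dbfs:/mnt/file2.tsv]")

// Original code: mkString produces ONE string holding both paths
val combined = Seq(collected.mkString(",").replaceAll("[\\[\\]]", ""))

// The fix: split after cleaning, recovering one element per file
val paths = collected.mkString(",").replaceAll("[\\[\\]]", "").split(",").toSeq

println(combined)  // a single comma-joined pseudo-path
println(paths)     // two separate paths, as load() expects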

How to read only new files in spark

I'm reading CSV files with Spark and Scala; the files are coming from another Spark Streaming job.
How can I read only the new files?
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().cache()
df.registerTempTable("test")
I resolved the problem by adding a checkpoint on the DataFrame, like this (note that checkpoint() requires a checkpoint directory to be set beforehand, e.g. spark.sparkContext.setCheckpointDir("/tmp/checkpoints")):
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().checkpoint().cache()

Create dataframe with header using header and data file

I have two files data.csv and headers.csv. I want to create dataframe in Spark/Scala with headers.
var data = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(data_path)
Can you help me customizing above lines to do this?
You can read headers.csv using the method above, then use the schema of the headers DataFrame to read data.csv, as below:
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
I hope the answer is helpful.

spark parquet conversion issue with malformed lines in file

I have a "\u0001"-delimited file that I am reading with Spark for Parquet conversion. I have no issues with the schema, but the data contains quotes (") without a closing quote. I tried different solutions but couldn't figure one out.
val df = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
//.option("quote", "\"")
//.option("quote", null)
//.option("quoteMode", "ALL")
.option("header", "false")
.option("mode","FAILFAST")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
Thanks in advance; I appreciate your help.
You can set sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\u0001")
and read the file as text:
val sentences = sparkContext.textFile(directoryPath)
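Reading the file as plain text side-steps the CSV quote handling entirely, so an unbalanced quote is just another character; each record can then be split on the control character yourself. A minimal, Spark-free sketch of the split step (the sample line is hypothetical):

```scala
// Hypothetical "\u0001"-delimited record with an unbalanced quote in a field
val line = "id1\u0001some \"unbalanced text\u0001more"

// Limit -1 preserves trailing empty fields; the quote passes through untouched
val fields = line.split("\u0001", -1)

println(fields.mkString(" | "))
```

In the Spark job, the same split would run inside a map over the text RDD before building the DataFrame.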

Spark - CSV text loading parsing error

I am using the following code to load a CSV file that has text/notes in it.
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
Notes are not in any specific format. During loading I am getting this error:
com.univocity.parsers.common.TextParsingException: Error processing input: null
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'.
I'd appreciate any help. Thanks.
I do not have the privilege to comment on the question, so I'm adding an answer.
Since you are already doing na.drop(), you may use option("mode", "DROPMALFORMED") as well:
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
BTW, the Databricks spark-csv reader is built into Spark 2.0+.