How to read .DAT file using SCALA

How to read .DAT file using SCALA - scala

I am trying to read DAT file using below syntax but getting below error:
spark.read.format("dat").option("header", "true").option("delimiter","!^")
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: dat. Please find packages at http://spark.apache.org/third-party-projects.html

You can try this:
spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.csv(spark.read.textFile("filename")
.map(line => line.split("YOUR DOUBLE DELIMITER").mkString("\t")))
The answer is from here:
How to use double pipe as delimiter in CSV?

Related

Getting error in spark-sftp, no such file

In a databricks cluster Spark 2.4.5, Scala 2.1.1 I am trying to read a file into a spark data frame using the following code.
val df = spark.read
.format("com.springml.spark.sftp")
.option("host", "*")
.option("username", "*")
.option("password", "*")
.option("fileType", "csv")
.option("delimiter", ";")
.option("inferSchema", "true")
.load("/my_file.csv")
However, I get the following error
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to save that file temporarily, but I can't find a way to do so. How can I solve that?

ERROR: org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV

I try this basic command to read a CSV in scala:
val df = spark.read
.option("header", "true")
.option("sep","|")
.option("inferSchema", "true")
.csv("path/to/_34File.csv")
And I get:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
What could be the solution?

The solution is to rename de file from "_34File.csv" to "34File.csv". It's a peculiar case and that worked for me.

how to read a .dat file with delimiter /u0001 and record next record will be separating by next line in spark with scala

I have .dat extension file which not having any header
1.fields separated by '\u0001'
2.next record will be in new line
how can i read this file in spark with scala and convert to a dataframe.

Try below code, I assume you are using spark > 2.x version -
val df = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\01")
.csv("<CSV_FILE_PATH_GOES_HERE>")

Reading CSV using Spark from Zeppelin in EMR

I have a simple CSV file in S3 where I have read it many times using Spark in EMR.
Now I want to use Zeppelin so, I can do some analysis.
My code is very simple
val path="s3://somewhere/some.csv"
val df=
_spark
.read
.format("csv")
.option("delimiter", "\t")
.option("header", false)
.option("mode", ParseModes.DROP_MALFORMED_MODE)
.option("nullValue", "NULL")
.option("charset", "UTF-8")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException:
org.apache.commons.lang3.time.FastDateFormat; local class
incompatible: stream classdesc serialVersionUID = 1, local class
serialVersionUID = 2
which is the different versions commons-lang3 between Zeppelin and Spark use.
reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
I have used many different EMR version from 5.3.1 to 5.7.0
I have tried to add in --jars in spark
commons-lang3-3.4.jar
but with no luck.
Has anyone, had the same error?

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in local windows HDFS (hdfs://localhost:54310), under path /tmp/home/.
I would like to load this file from HDFS to spark Dataframe. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import sparkSession.implicits._
spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
.show()
But fails at runtime with below exception Stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with plain rdd
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
I am missing something? Why spark is looking at my local file system & not the HDFS?
I am using spark 2.0 on hadoop-hdfs 2.7.2 with scala 2.11.
EDIT: Just one additional info I tried to downgrade to spark 1.6.2. I was able to make it work. So I think this is a bug in spark 2.0

Just to close the loop.This seems to be issue in spark 2.0 and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to read .DAT file using SCALA - scala

You can try this: spark.read .option("header", "true") .option("inferSchema", "true") .option("delimiter", "\t") .csv(spark.read.textFile("filename") .map(line => line.split("YOUR DOUBLE DELIMITER").mkString("\t"))) The answer is from here: How to use double pipe as delimiter in CSV?

Related

Getting error in spark-sftp, no such file

ERROR: org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV

how to read a .dat file with delimiter /u0001 and record next record will be separating by next line in spark with scala

Reading CSV using Spark from Zeppelin in EMR

Not able to load file from HDFS in spark Dataframe

Categories

Resources