Read CSV from HDFS with Spark/Scala

I'm using Spark 2.3.0 and Hadoop 2.9.1.
I'm trying to load a CSV file located in HDFS with Spark:
scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("hdfs://127.0.0.1:50075/filesHDFS/data.csv")
But I get the following error:
2018-11-14 11:47:58 WARN FileStreamSink:66 - Error while looking for metadata directory.
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "Desktop-Presario-CQ42-Notebook-PC/127.0.0.1"; destination host is: "localhost":50070;

Instead of using 127.0.0.1 with a web UI port (50070 and 50075 are HTTP ports, not the NameNode RPC port, which is why the RPC client fails with the protobuf error above), use the default FS name. You can find it in core-site.xml under the property fs.defaultFS.
That should solve your problem.
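For example, if core-site.xml has fs.defaultFS set to hdfs://localhost:9000 (illustrative; use whatever value your file actually contains), the read would look like the sketch below. Since Spark 2.x you can also use the built-in csv source instead of com.databricks.spark.csv:
val dataframe = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("hdfs://localhost:9000/filesHDFS/data.csv") // host:port taken from fs.defaultFS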

Related

Spark DataFrame writes part files to _temporary instead of directly creating part files in the output directory

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
Writing df to HDFS and reading it back works fine:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local Parquet or CSV file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, for both CSV and Parquet and on two different EMR servers: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works with Spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you then have a shared file system).
A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but, depending on the commit algorithm, it might not even be finalized (moved from the temporary directory).
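A minimal sketch of one common workaround consistent with this explanation (not from the answer itself): write to distributed storage that all executors share, then copy the committed result to the driver's local disk with the Hadoop FileSystem API; the paths below are illustrative.
import org.apache.hadoop.fs.{FileSystem, Path}

// write where every executor sees the same file system (HDFS on EMR)
df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

// then pull the committed result down to the driver's local disk
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(new Path("/tmp/topVendors"), new Path("file:///tmp/topVendors"))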
This error usually occurs when you try to read an empty directory as Parquet.
You could check:
1. whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it;
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are currently running in cluster mode.
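A minimal sketch of check (1), assuming outcome in the answer corresponds to the df from the question:
if (df.rdd.isEmpty()) {
  println("DataFrame is empty; nothing to write")
} else {
  df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
}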

Spark 2.2.0 unable to connect to Phoenix 4.11.0 when loading a table into a DataFrame

I'm using the tech stack below and trying to connect to Phoenix tables using PySpark code. I have downloaded the following jars from the URL below and tried executing the code that follows. In the logs the connection to HBase is established, but the console is stuck without doing anything. Please let me know if anybody has encountered and fixed a similar issue.
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/4.11.0-HBase-1.2
jars:
phoenix-spark-4.11.0-HBase-1.2.jar
phoenix-client.jar
Tech stack, all running on the same host:
Apache Spark 2.2.0 Version
Hbase 1.2 Version
Phoenix 4.11.0 Version
Copied hbase-site.xml to /spark/conf/hbase-site.xml.
Command executed:
usr/local/spark> spark-submit phoenix.py --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
Phoenix.py:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark").option("table",
"schema.table1").option("zkUrl", "localhost:2181").load()
df.show()
Error log: the HBase connection is established; however, the console hangs and a timeout error is thrown:
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Take a look at these answers:
phoenix jdbc doesn't work, no exceptions and stuck
HBase Java client - unknown host: localhost.localdomain
Both of those issues happened in Java (with JDBC), but it looks like a similar issue here.
Try adding the ZooKeeper hostname (master01, as I see in the error message) to your /etc/hosts:
127.0.0.1 master01
if you are running your whole stack locally.
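Once master01 resolves, the same load should also be reproducible from spark-shell; a minimal Scala sketch of the equivalent read (format and options as in the question, with the zkUrl host assumed to be master01:2181):
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "schema.table1")
  .option("zkUrl", "master01:2181")
  .load()
df.show()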

How to use S3 with Apache Spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I have replaced the actual access key and secret key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar: Spark 2.2.0 is built against the Hadoop 2.7.x client libraries, and mixing in hadoop-aws 2.8.x leads to class-compatibility errors like the NoClassDefFoundError and IllegalAccessError shown above.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, you will be able to load data from the S3 bucket in the shell.
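Alternatively, the --packages route should also work once the version matches the Hadoop 2.7 client bundled with Spark 2.2.0 (hadoop-aws 2.7.3 pulls in a compatible aws-java-sdk transitively); a sketch, keeping the credentials in conf/spark-defaults.conf as in the question:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3

// in the shell, the same read as before
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)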

Spark 2.2.0 - unable to read recursively into directory structure

Problem summary:
I am unable to read from nested subdirectories with my Spark program, despite setting the required Hadoop configuration (see the attempts below).
I get the error pasted below.
Any help is appreciated.
Version:
Spark 2.2.0
Input directory layout:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939225073/part-00000-3a44cd00-e895-4a01-9ab9-946064b739d4-c000.parquet
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939234036/part-00000-cbd47353-0590-4cc1-b10d-c18886df1c25-c000.parquet
...
Input directory parameter passed:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*
Attempted (1):
Set parameter in code...
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
//Recursive glob support & loglevel
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.setBoolean("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", true)
I did not see the configuration in place in the Spark UI.
Attempted (2):
Passed the config from the CLI (spark-submit) and also set it in code (see below).
spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \...
I do see the configuration in the Spark UI, but I get the same error: it cannot traverse into the directory structure.
Code:
//Spark Session
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
//Recursive glob support
val conf = new SparkConf()
val cliRecursiveGlobConf = conf.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", cliRecursiveGlobConf)
Error & overall output:
Full error is at - https://gist.github.com/airawat/77fbdb821410a5a87dfd29ffaf60fdf9
17/08/18 15:59:29 INFO state.StateStoreCoordinatorRef: Registered
StateStoreCoordinator endpoint
Exception in thread "main" java.io.FileNotFoundException: File /user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=*/* does not exist.

Spark JobServer JDBC-ClassNotFound error

I have:
- Hadoop
- Spark JobServer
- SQL Database
I have created a file that accesses my SQL database from a local instance of the Spark JobServer. In order to do this, I first have to load my JDBC driver with this command: Class.forName("com.mysql.jdbc.Driver");. However, when I try to execute the file on the Spark JobServer, I get a ClassNotFound error:
"message": "com.mysql.jdbc.Driver",
"errorClass": "java.lang.ClassNotFoundException",
I have read that in order to load the JDBC driver, you have to change some configuration in either the application.conf file of the Spark JobServer or its server_start.sh file. I have done this as follows. In server_start.sh I have changed the cmd value which is sent as the spark-submit command:
cmd='$SPARK_HOME/bin/spark-submit --class $MAIN --driver-memory $JOBSERVER_MEMORY
--conf "spark.executor.extraJavaOptions=$LOGGING_OPTS spark.executor.extraClassPath = hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES"
--driver-class-path "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--jars "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
$# $appdir/spark-job-server.jar $conffile'
I also changed some lines in the application.conf file of the Spark JobServer, which is used when starting the instance:
# JDBC driver, full classpath
jdbc-driver = com.mysql.jdbc.Driver
# dependent-jar-uris = ["hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"]
But the error that the JDBC class cannot be found still comes back.
I have already checked for the following possible errors:
ERROR1:
In case somebody thinks that I just have the wrong file path (which could very well be the case for all I know), I have checked for the correct file on HDFS with hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/ and the file was there:
-rw-r--r-- 1 cloudera cloudera 983914 2016-01-26 02:23 hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar
ERROR2:
I have the necessary dependency in my build.sbt file, libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.+", and the import statement import java.sql._ in my Scala file.
How can I solve this ClassNotFound error?
Are there any good alternatives to JDBC to connect to SQL?
We have something like this in local.conf:
# JDBC driver, full classpath
jdbc-driver = org.postgresql.Driver
# Directory where default H2 driver stores its data. Only needed for H2.
rootdir = "/var/spark-jobserver/sqldao/data"
jdbc {
url = "jdbc:postgresql://dbserver/spark_jobserver"
user = "****"
password = "****"
}
dbcp {
maxactive = 20
maxidle = 10
initialsize = 10
}
And in the start script I have:
EXTRA_JARS="/opt/spark-jobserver/lib/*"
CLASSPATH="$appdir:$appdir/spark-job-server.jar:$EXTRA_JARS:$(dse spark-classpath)"
All dependent files that are used by the Spark JobServer are put in /opt/spark-jobserver/lib.
I have not used HDFS to load jars for the job server.
But if you need the MySQL driver to be loaded on the Spark worker nodes, then you should do it via dependent-jar-uris. I think that is what you are doing now.
I have packaged the project using sbt assembly and it finally works, and I am happy.
It turns out that having HDFS files in dependent-jar-uris does not actually work, so don't use HDFS links as your dependent-jar-uris.
Also, read this link in case you are curious: https://github.com/spark-jobserver/spark-jobserver/issues/372
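For reference, a minimal sbt-assembly sketch of the approach that ended up working (plugin version and dependency coordinates are illustrative, not taken from the question):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt (excerpt): bundle the MySQL driver into the job jar;
// Spark and job-server dependencies stay "provided" since the JobServer supplies them at runtime
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"

// build the fat jar with: sbt assembly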