df.coalesce(1)
  .write
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("header", "true")
  .mode(SaveMode.Append)
  .save(s"s3://$bucket/$etlFolderPrefix/a.xlsx")
ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:38)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.newPackage(XSSFWorkbook.java:528)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:245)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:241)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:229)
    at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:45)
    at com.crealytics.spark.excel.DefaultSource.createRelation
Exception in User Class: java.lang.NoClassDefFoundError: org/apache/commons/io/output/UnsynchronizedByteArrayOutputStream
I tried different versions; currently I am using Spark with Scala 2.12.
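For what it's worth, a NoSuchMethodError on org.apache.commons.io.IOUtils.byteArray usually indicates that an older commons-io on the classpath is shadowing the version that POI (shaded inside spark-excel) expects. Below is a small diagnostic sketch (a hypothetical check, not a confirmed fix) that prints which commons-io jar is actually loaded at runtime:

// Hypothetical diagnostic: show which jar org.apache.commons.io.IOUtils is loaded from,
// to check whether an older commons-io is shadowing the one POI/spark-excel needs.
val commonsIoSource = classOf[org.apache.commons.io.IOUtils]
  .getProtectionDomain.getCodeSource.getLocation
println(s"commons-io loaded from: $commonsIoSource")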
I am trying to load data from HDFS into Hive using Spark; below is the Spark configuration and code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MigrationdatainoHive1 {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        SparkSession spark = SparkSession
                .builder()
                .appName("Read csv File to DataSet")
                .config("spark.driver.bindAddress", "0.0.0.0")
                .config("spark.master", "local")
                .enableHiveSupport()
                .getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        String files = "D:\\zomato-india-data-set\\Agra\\1-Agrahotels.csv";

        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")
                .option("multiline", true)
                .option("sep", "|")
                .option("quote", "\"")
                .option("dateFormat", "M/d/y")
                .option("inferSchema", true)
                .csv(files);

        df.write().mode(SaveMode.Append).saveAsTable("hadoop.Sample");
        df.printSchema();
    }
}
When I add .enableHiveSupport() to the Spark configuration, I get the error below:
**Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: D:/tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;**
I have also run the command C:\Hadoop\Hadoop-2.7.1\bin>winutils.exe chmod 777 C:\tmp\hive, but after that I still face the same issue. Kindly help me resolve this.
Also, if possible, kindly share sample code for loading data from HDFS into Hive using Spark.
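For reference, here is a minimal Scala sketch of the same flow (assumptions: Hive support is configured for the session, the database hadoop already exists, and the HDFS path below is a hypothetical placeholder):

import org.apache.spark.sql.{SaveMode, SparkSession}

object HdfsCsvToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HDFS CSV to Hive")
      .enableHiveSupport() // needs a reachable Hive metastore
      .getOrCreate()

    // Hypothetical HDFS location; replace with the real path.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/agra/1-Agrahotels.csv")

    // Append into the Hive database `hadoop` (assumed to exist), table `sample`.
    df.write.mode(SaveMode.Append).saveAsTable("hadoop.sample")
    spark.stop()
  }
}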
I have the following simple Scala class, which I will later modify to fit some machine learning models.
I need to create a jar file out of this, as I am going to run these models in amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following CSV file and write it to another file by creating a jar file with the Scala class mentioned below.
The CSV file looks like this, and it includes a Date column as one of the variables.
+-------------------+-------------+-------+---------+-----+
| Date| x1 | y | x2 | x3 |
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29| 57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41| 53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49| 61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
Data set link : https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOTPMriGgE
I created a jar file out of this using IntelliJ IDEA, and it was built successfully.
import org.apache.spark.sql.SparkSession

object jar1 {
  def main(args: Array[String]): Unit = {
    val sc: SparkSession = SparkSession.builder()
      .appName("SparkByExample")
      .getOrCreate()

    val data = sc.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    data.write.format("text").save(args(1))
  }
}
After that I uploaded this jar file, along with the CSV file mentioned above, to amazon-s3 and tried to run it on an amazon-emr cluster.
But it failed, and I got the following error message:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;
I am sure this error has something to do with the Date variable in the data set, but I don't know how to fix it.
Can anyone help me figure this out?
Update:
I tried the same CSV file mentioned earlier without the date column. In this case I am getting this error:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;
Thank you
As I noticed later, you are writing to a text file. Spark's .format("text") does not support any types other than String. So to achieve your goal you first need to convert all the columns to String and then store them:
df.rdd.map(_.toString().replace("[","").replace("]", "")).saveAsTextFile("textfilename")
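An equivalent sketch with the DataFrame API (assuming the same df): concatenate all columns into a single string column, since the text data source expects exactly one string column.

import org.apache.spark.sql.functions.{col, concat_ws}

// Build a single string column from all columns; the text data source
// accepts exactly one column of string type.
val asText = df.select(concat_ws(",", df.columns.map(col): _*).alias("value"))
asText.write.format("text").save("textfilename")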
If you can consider other file-based options for storing the data, then you keep the benefit of typed columns, for example CSV or JSON.
Here is a working code example for CSV, based on your CSV file:
val spark = SparkSession.builder
.appName("Simple Application")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
import spark.sqlContext.implicits._
val df = spark.read
.format("csv")
.option("delimiter", ",")
.option("header", "true")
.option("inferSchema", "true")
.option("dateFormat", "yyyy-MM-dd")
.load("datat.csv")
df.printSchema()
df.show()
df.write
.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "\t")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
.option("escape", "\\")
.save("another")
There is no need for a custom encoder/decoder.
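If JSON is acceptable, a one-line sketch (reusing the same df and writing to a hypothetical another_json directory) avoids the string conversion entirely:

// Each row is written as a JSON object, so numeric columns keep their types.
df.write.mode("overwrite").json("another_json")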
I am trying to load data from an Excel sheet into a Hive table. It throws the error below:
.Map(treatemptyvaluesasnulls -> true, location -> "input", useheader -> true, inferschema -> true, addcolorcolumns -> false, sheetname ->"INPUT") (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
Code used:
val df = spark.read.format("com.crealytics.spark.excel")
  .option("location", tname)
  .option("sheetName", fname)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "false")
  .load()

//df.printSchema()
//df.show(100)

df.createOrReplaceTempView(s"""$fname""")
//val d = hqlContext.sql(s"select * from $fname")
spark.sql(s"""drop table if exists $tdb.$ttab PURGE""")
I tried with different dependencies.
Dependencies used:
com.crealytics:spark-excel_2.11:0.10.2
com.crealytics:spark-excel_2.10:0.8.3
Can anyone help?
Solved the issue: I used --packages com.crealytics:spark-excel_2.11:0.10.2 while running spark-submit, and it worked fine.
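If the job is built with sbt rather than pulling the package at submit time, a minimal build.sbt sketch (assuming a Scala 2.11 build, with the same version as above) could look like this:

// %% appends the project's Scala suffix (_2.11 here), keeping the
// spark-excel artifact aligned with the Scala version Spark was built for.
scalaVersion := "2.11.12"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.10.2"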
I am joining two DataFrames created by reading two very large CSV files to compute some statistics. The code is running on a web server and it is triggered by a request, that's why Spark's session is kept always alive without calling sparkSession.close().
Sporadically, the code throws java.lang.IllegalArgumentException: spark.sql.execution.id is already set. I tried to make sure that the code doesn't get executed more than once at a time, but the problem wasn't resolved.
I am using Spark 2.1.0 and I know that there is an issue here which would hopefully be resolved in Spark 2.2.0.
Could you please suggest any workarounds in the meantime to avoid this problem?
A simplified version of the code that throws the exception:
val spark = SparkSession.builder().appName("application").master("local[*]").getOrCreate()
val itemCountry = spark.read.format("csv")
.option("header", "true")
.schema(StructType(Array(
StructField("itemId", IntegerType, false),
StructField("countryId", IntegerType, false))))
.csv("/item_country.csv") // This file matches the schema provided
val itemPerformance = spark.read.format("csv")
.option("header", "true")
.schema(StructType(Array(
StructField("itemId", IntegerType, false),
StructField("date", TimestampType, false),
StructField("performance", IntegerType, false))))
.csv("/item_performance.csv") // This file matches the schema provided
itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
.groupBy("countryId")
.agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")), itemPerformance("performance")).otherwise(0)).alias("performance")).show()
The stack trace for the exception:
java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at .... [Custom caller functions]
Sample CSV files:
item_country.csv
itemId,countryId
1,1
2,1
3,2
4,3
item_performance.csv
itemId,date,performance
1,2017-04-15,10
1,2017-04-16,10
1,2017-04-17,10
2,2017-04-15,15
3,2017-04-20,12
4,2017-04-18,18
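Regarding the workaround question above: one approach that has been suggested for this Spark 2.1 issue (a hedged sketch, not verified for this exact setup) is to clear a stale execution-id local property on the calling thread before triggering the action:

import org.apache.spark.sql.functions.{lit, sum, to_date, when}

// Passing null to setLocalProperty removes the property, so a stale
// spark.sql.execution.id left by a previous failed query is cleared.
spark.sparkContext.setLocalProperty("spark.sql.execution.id", null)

itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
  .groupBy("countryId")
  .agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")),
    itemPerformance("performance")).otherwise(0)).alias("performance"))
  .show()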
I have been trying to get the Databricks library for reading CSVs to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.
Here is an example that you can run in the Spark shell (I made the sample data public so it can work for you):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed in the option() function as below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You may also try inferSchema and check the resulting schema:
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("sep","\t")
.option("header", "true")
.load(tmp_loc)
df.printSchema()