Error saving RDD into HDFS - scala

I'm trying to save an RDD into HDFS using Scala, and I get this error:
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, quickstart.cloudera, executor 3): java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1020)
at java.lang.Float.parseFloat(Float.java:452)
at scala.collection.immutable.StringLike$class.toFloat(StringLike.scala:231)
at scala.collection.immutable.StringOps.toFloat(StringOps.scala:31)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1196)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1195)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1195)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1279)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
First, I read a file from HDFS, and it reads correctly. Then I try to make some transformations, such as changing the field delimiter to pipes, and write it back to HDFS. Here is my code, if someone can help me:
val productsRDD= sc.textFile("/user/cloudera/products/products")
val products2RDD=productsRDD.map(a=>a.split(","))
case class clas1(product_id: Int,product_category_id: Int,product_name: String,product_description: String,product_price: Float,product_image: String)
val products = products2RDD.map(b => clas1(Integer.parseInt(b(0)), Integer.parseInt(b(1)), b(2).toString, b(3).toString, b(4).toFloat, b(5).toString))
val r = products.toDF()
r.registerTempTable("productsDF")
val prodDF = sqlContext.sql("select * from productsDF where product_price > 100")
/* everything goes fine until this line*/
prodDF.map(c => c(0)+"|"+c(1)+"|"+c(2)+"|"+c(3)+"|"+c(4)+"|"+c(5)).saveAsTextFile("/user/cloudera/problem1/pipes1")
The fields of the Data Frame:
+----------------------+--------------+------+-----+---------+----------------+
| Field                | Type         | Null | Key | Default | Extra          |
+----------------------+--------------+------+-----+---------+----------------+
| product_id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| product_category_id  | int(11)      | NO   |     | NULL    |                |
| product_name         | varchar(45)  | NO   |     | NULL    |                |
| product_description  | varchar(255) | NO   |     | NULL    |                |
| product_price        | float        | NO   |     | NULL    |                |
| product_image        | varchar(255) | NO   |     | NULL    |                |
+----------------------+--------------+------+-----+---------+----------------+
I'm new to Scala and I appreciate your help.
Thank you!

Looking at your error - java.lang.NumberFormatException: empty String -
it looks like the failure happens when you try to parse a number from a String that is empty, which produces this particular error.
What you can do is clean the data after splitting and before the conversion. Create a dataframe and use the coalesce feature in spark-sql to replace your null values with "NULL".
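For example, a minimal sketch of cleaning the split records before the numeric conversions (reusing productsRDD and clas1 from your question; the field positions are assumed from your table):
val products = productsRDD
  .map(_.split(",", -1)) // -1 keeps trailing empty fields
  .filter(a => a.length >= 6 && a(0).nonEmpty && a(1).nonEmpty && a(4).nonEmpty) // drop rows with empty numeric fields
  .map(a => clas1(a(0).toInt, a(1).toInt, a(2), a(3), a(4).toFloat, a(5)))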

Depending on your version of CDH, Spark 2 has a built-in CSV reader.
case class Product(product_id: Int, product_category_id: Int, product_name: String, product_description: String, product_price: Float, product_image: String)
// note: you may need to supply an explicit schema (or rename the columns) so that
// the CSV columns line up with the case class fields before calling .as[Product]
val productsDs = spark.read.csv("/user/cloudera/products/products").as[Product]
val expensiveProducts = productsDs.where($"product_price" > 100.0)
If you are not using Spark 2, you could install newer Spark clients locally and point them at the same YARN cluster, or use the spark-csv package so you don't have to rely on a fragile map(... split(",")) parser.
Note: I don't know whether the case class will work anyway if some of your columns are empty, as the error suggests.
And if all you're trying to do is change the delimiter, you can write the result back out with the CSV writer as well:
expensiveProducts.write
.option("sep", "|")
.csv("/user/cloudera/problem1/pipes1")

Related

Spark: choose default value for MergeSchema fields

I have a Parquet file with an old schema like this:
| name | gender | age |
| Tom  | Male   | 30  |
Our schema has since been updated to:
| name | gender | age | office |
so we use mergeSchema when reading the old Parquet files:
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
But when reading these old Parquet files, I get the following output:
| name | gender | age | office |
| Tom  | Male   | 30  | null   |
which is expected. But I would like to use a default value for office (e.g. "California"), if and only if the field is not present in the old schema. Is that possible?
There is no simple way to set a default value when a column exists in some Parquet files but not in others.
In the Parquet format, each file contains its own schema definition. By default, when reading Parquet, Spark takes the schema from one file. The only effect of the mergeSchema option is that, instead of retrieving the schema from one arbitrary Parquet file, Spark reads the schemas of all the files and merges them.
So you can't set a default value without modifying the Parquet files.
Another possibility is to provide your own schema when reading the Parquet files, using .schema() like this:
spark.read.schema(StructType(Array(StructField("name", StringType), ...))).parquet(...)
But in this case, there is no option to set a default value.
So the only remaining solution is to add the column default value manually.
Say we have two Parquet files, the first containing data with the old schema:
+----+------+---+
|name|gender|age|
+----+------+---+
|Tom |Male |30 |
+----+------+---+
and the second containing data with the new schema:
+-----+------+---+------+
|name |gender|age|office|
+-----+------+---+------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
+-----+------+---+------+
If you don't mind replacing all the null values in the "office" column (not only those coming from the old files), you can use .na.fill as follows:
spark.read.option("mergeSchema", "true").parquet(path).na.fill("California", Array("office"))
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |California|
|Tom |Male |30 |California|
+-----+------+---+----------+
If you want only the old data to get the default value, you have to read each Parquet file into a dataframe, add the column with the default value where necessary, and union all the resulting dataframes:
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.util.CaseInsensitiveStringMap

ParquetTable("my_table",
  sparkSession = spark,
  options = CaseInsensitiveStringMap.empty(),
  paths = Seq(path),
  userSpecifiedSchema = None,
  fallbackFileFormat = classOf[ParquetFileFormat]
).fileIndex.allFiles().map(file => {
  val dataframe = spark.read.parquet(file.getPath.toString)
  if (dataframe.columns.contains("office")) {
    dataframe
  } else {
    dataframe.withColumn("office", lit("California"))
  }
}).reduce(_ unionByName _)
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
|Tom |Male |30 |California|
+-----+------+---+----------+
Note that the whole ParquetTable(...).fileIndex.allFiles() part is only there to retrieve the list of Parquet files. It can be simplified if you are on Hadoop or on a local file system.
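For instance, if the files sit on HDFS, listing them with the Hadoop FileSystem API is one way to simplify that part (a sketch, reusing the path variable and the "office"/"California" example from above):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.lit

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parquetFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

val merged = parquetFiles
  .map { file =>
    val df = spark.read.parquet(file)
    if (df.columns.contains("office")) df
    else df.withColumn("office", lit("California"))
  }
  .reduce(_ unionByName _)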

Spark 1.6: Store dataframe into multiple csv file in hdfs (partition by id)

I'm trying to save a DataFrame to CSV, partitioned by id, using Spark 1.6 and Scala.
The function partitionBy("id") doesn't give me the right result.
Here is my code:
validDf.write
.partitionBy("id")
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ";")
.mode("overwrite")
.save("path_hdfs_csv")
My DataFrame looks like:
-----------------------------------------
| ID | NAME | STATUS |
-----------------------------------------
| 1 | N1 | S1 |
| 2 | N2 | S2 |
| 3 | N3 | S1 |
| 4 | N4 | S3 |
| 5 | N5 | S2 |
-----------------------------------------
This code creates 3 default CSV partitions (part_0, part_1, part_2) that are not based on the ID column.
What I expect is a subdirectory or partition for each id.
Any help?
spark-csv in Spark 1.6 (and all Spark versions lower than 2) does not support partitioning.
Your code would work on Spark 2.0.0 and above.
For your Spark version, you will need to prepare the CSV lines first and save them as text (partitioning works for the text writer):
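For reference, on Spark 2.x the built-in CSV writer supports partitioning directly, so something close to your original code should work there (a sketch):
validDf.write
  .partitionBy("id")
  .option("header", "true")
  .option("delimiter", ";")
  .mode("overwrite")
  .csv("path_hdfs_csv")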
import org.apache.spark.sql.functions.{col,concat_ws}
val key = col("ID")
val concat_col = concat_ws(",",df.columns.map(c=>col(c)):_*) // concat cols to one col
val final_df = df.select(col("ID"),concat_col) // dataframe with 2 columns: id and string
final_df.write.partitionBy("ID").text("path_hdfs_csv") //save to hdfs
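With this workaround each distinct ID value ends up in its own subdirectory, roughly like this (file names are illustrative):
path_hdfs_csv/ID=1/part-00000
path_hdfs_csv/ID=2/part-00000
...
Note that the text writer does not emit a header row, so the column names are lost in the output.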

how to update all the values of a column in a dataFrame

I have a data frame with an unformatted Date column:
+--------+-----------+--------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+--------+
| AAA|bbbbbbbbbbb|13190326|
| AAA|bbbbbbbbbbb|10190309|
| AAA|bbbbbbbbbbb|36190908|
| AAA|bbbbbbbbbbb|07190214|
| AAA|bbbbbbbbbbb|13190328|
| AAA|bbbbbbbbbbb|23190608|
| AAA|bbbbbbbbbbb|13190330|
| AAA|bbbbbbbbbbb|26190630|
+--------+-----------+--------+
The Date column is formatted as wwyyMMdd (week, year, month, day), which I want to reformat as yyyy/MM/dd; for that I have a method, format, that does the conversion.
So my question is: how can I map all the values of the Date column to the needed format? Here is the output I want:
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
On Spark 2.4.3, using unix_timestamp you can convert the data to the expected output:
scala> import org.apache.spark.sql.functions._
scala> var df2 = spark.createDataFrame(Seq(("AAA","bbbbbbbbbbb","13190326"),("AAA","bbbbbbbbbbb","10190309"),("AAA","bbbbbbbbbbb","36190908"),("AAA","bbbbbbbbbbb","07190214"),("AAA","bbbbbbbbbbb","13190328"),("AAA","bbbbbbbbbbb","23190608"),("AAA","bbbbbbbbbbb","13190330"),("AAA","bbbbbbbbbbb","26190630"))).toDF("CDOPEINT","bbbbbbbbbb","Date")
scala> df2.withColumn("Date",from_unixtime(unix_timestamp(substring(col("Date"),3,7),"yyMMdd"),"yyyy/MM/dd")).show
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
Let me know if you have any questions about this.
If the dates all fall in the 2000s and the Date column in your original dataframe is of Integer type, you could try something like this:
def getDate = (date: Int) => {
  val dateString = date.toString.drop(2).sliding(2, 2)
  dateString.zipWithIndex.map {
    case (value, index) => if (index == 0) "20" + value else value
  }.mkString("/")
}
Then create a UDF which calls this function
val updateDateUdf = udf(getDate)
If originalDF is the original Dataframe that you have, you could then change the dataframe like this
val updatedDF = originalDF.withColumn("Date",updateDateUdf(col("Date")))
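Note that if the Date column is actually stored as a String (as in the sample data in the other answer), keeping it as a String avoids losing leading zeros such as in 07190214; a hedged sketch of the same idea on Strings:
def getDateFromString = (date: String) => {
  // drop the week prefix, split into yy / MM / dd pairs, prepend the century
  date.drop(2).sliding(2, 2).zipWithIndex.map {
    case (value, index) => if (index == 0) "20" + value else value
  }.mkString("/")
}
val updateDateStringUdf = udf(getDateFromString)
val updatedDF = originalDF.withColumn("Date", updateDateStringUdf(col("Date")))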

Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me Cartesian product error (I made sure to define the join as "left").
I tried and failed to pivot summaryDF (as there are no aggregations to do??) then join both dataframes together.
This is the sort of thing I've tried, and always getting a NullPointerException. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because it's small and a "constant" lookup table
summaryBroadcast
  .value
  .foreach { x =>
    // searchValue = Value (e.g. `U`),
    // replaceValue = ValueDescription (e.g. `Unknown`)
    val field = x(0).toString
    val searchValue = x(1).toString
    val replaceValue = x(2).toString
    // error catching because the summary data does not map exactly onto the field names
    // (the joys of business people working in Excel...)
    try {
      // I'm using regexp_replace because I'm lazy
      valuesDF = valuesDF
        .withColumn(field, regexp_replace(col(field), searchValue, replaceValue))
    }
    catch { case _: Exception =>
      null
    }
  }
Any ideas? Advice? Thanks.
First, we'll need a function that joins valuesDF with summaryDF on Value and on the respective pair of Favourite* column and Field:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
  sourceDf.as("src") // alias it to help selecting appropriate columns in the result
    // the join
    .join(summaryDF, $"Value" === col(colName) && $"Field" === colName, "left")
    // we do not need the original `Favourite*` column, so drop it
    .drop(colName)
    // select all previous columns, plus the one that contains the match
    .select("src.*", "ValueDesc")
    // rename the resulting column to have the name of the source one
    .withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
If a value from valuesDF does not match anything in summaryDF, the resulting cell in this solution will contain null. If you just want to replace it with the Unknown value, then instead of the .select and .withColumnRenamed lines above, use:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)

Date type null value in dataframe not storing in cassandra

I am working with Apache Spark 1.6.0. I have a dataframe with 280 columns, some of which are of type timestamp. A few values of the timestamp fields are null. When I try to write the dataframe to Cassandra, I get an IllegalArgumentException.
The column looks like this:
+--------------------+
|           LoginDate|
+--------------------+
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
|                null|
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
+--------------------+
When I try to save the whole dataframe to Cassandra, it fails with this error:
05:39:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 106.0 (TID 5136,): java.lang.IllegalArgumentException: Invalid date:
at com.datastax.spark.connector.types.TimestampParser$.parse(TimestampParser.scala:50)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$$anonfun$convertPF$13.applyOrElse(TypeConverter.scala:323)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter$$anonfun$convertPF$31.applyOrElse(TypeConverter.scala:812)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:795)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.convert(TypeConverter.scala:795)
at com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$readColumnValues$1.apply$mcVI$sp(SqlRowWriter.scala:26)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:24)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:100)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:157)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The respective field in Cassandra is of timestamp type.
Can anyone help solve this issue?
Add the following parameter to your Spark Cassandra connection settings:
spark.cassandra.output.ignoreNulls=true
It will ignore the null values in the input and also has the benefit of avoiding the creation of corresponding tombstone columns in Cassandra.
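For example, the property can be set on the SparkConf used to build your context (a sketch; the connection host is a placeholder):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "<cassandra-host>")
  .set("spark.cassandra.output.ignoreNulls", "true") // skip null columns when writing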