I'm reading a CSV in Spark with Scala. It handles line one of the example below correctly, but in line two the first column has a closing quote character with no leading quote. This shifts the data over and produces bad|col in the final result, which is incorrect.
"good,col","good,col"
bad,col","good,col"
Is there an option in the read specification to handle quote characters that are missing their leading (or ending) quote when reading the file in Spark with Scala?
Hm... by using an RDD and some string replacements, I can obtain what you want:
val df = rdd
  .map(r => r.replaceAll("\",\"", "|")   // turn the quoted separator "," into a pipe
             .replaceAll("\"", "")       // drop the remaining quote characters
             .split("\\|"))
  .map { case Array(a, b) => (a, b) }
  .toDF("col1", "col2")
df.show()
+--------+--------+
| col1| col2|
+--------+--------+
|good,col|good,col|
| bad,col|good,col|
+--------+--------+
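For context, the rdd above is assumed to hold the raw text lines of the file. A minimal sketch of that setup, assuming a SparkSession named spark and a hypothetical input path, might be:
import spark.implicits._                            // needed for .toDF
val rdd = spark.sparkContext.textFile("input.csv")  // one raw line per record (hypothetical path)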
Related
I'm trying to modify a column in my DataFrame by removing a suffix from all the rows under that column, and I need to do it in Scala.
The values in the column have different lengths, and the suffixes differ as well.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap it in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then that is preferable for performance.[1]
val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
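For comparison, a UDF-based version along the lines of what @AminMal suggested might look like the sketch below; this is my own approximation, not the code from that answer.
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical UDF: keep everything before the first "_" (null-safe).
val stripSuffix = udf((s: String) => if (s == null) null else s.split("_", 2)(0))
df.withColumn("col0", stripSuffix(col("col0"))).show(false)
As noted above, the native substring_index is still preferable when it fits.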
[1] Is there a performance penalty when composing spark UDFs
I am trying to split a string column of a DataFrame in Spark based on the delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
(1, "TEST:|:|:51:|:|:PHT054008056"),
(2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate that it's a literal pipe (not the regex alternation operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
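As an aside (not from the original answer), you could also let Java's Pattern.quote escape the literal delimiter for you instead of escaping each pipe by hand:
// Pattern.quote wraps the delimiter in \Q...\E so every character is treated literally.
df.withColumn("part3", split($"testcolumn", java.util.regex.Pattern.quote(":|:|:")).getItem(2)).show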
I'm trying to create an RDD using a CSV dataset.
The problem is that I have a location column with a structure like (11112,222222) that I don't use.
So when I use the map function with split(","), it results in two columns.
Here is my code:
val header = collisionsRDD.first
case class Collision (date:String,time:String,borogh:String,zip:String,
onStreet:String,crossStreet:String,
offStreet:String,numPersInjured:Int,
numPersKilled:Int,numPedesInjured:Int,numPedesKilled:Int,
numCyclInjured:Int,numCycleKilled:Int,numMotoInjured:Int)
val collisionsPlat = collisionsRDD.filter(h => h != header).
map(x => x.split(",").map(x => x.replace("\"","")))
val collisionsCase = collisionsPlat.map(x => Collision(x(0),
x(1), x(2), x(3),
x(8), x(9), x(10),
x(11).toInt,x(12).toInt,
x(13).toInt,x(14).toInt,
x(15).toInt,x(16).toInt,
x(17).toInt))
collisionsCase.take(5)
How can I handle the , inside this field so that it is not treated as a CSV delimiter?
Use spark-csv to read the file, because it supports the quote option.
For Spark 1.6:
sqlContext.read.format("com.databricks.spark.csv").load(file)
or for Spark 2:
spark.read.csv(file)
From the Docs:
quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored
$ cat abc.csv
a,b,c
1,"2,3,4",5
5,"7,8,9",10
scala> case class ABC (a: String, b: String, c: String)
scala> spark.read.option("header", "true").csv("abc.csv").as[ABC].show
+---+-----+---+
| a| b| c|
+---+-----+---+
| 1|2,3,4| 5|
| 5|7,8,9| 10|
+---+-----+---+
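Applied to the collisions file from the question, a sketch could look like the following; the path is hypothetical, and header/inferSchema are assumptions about the file:
val collisionsDF = spark.read
  .option("header", "true")       // assumes the first line is a header row
  .option("inferSchema", "true")  // let Spark infer the numeric columns
  .csv("collisions.csv")          // hypothetical path
collisionsDF.printSchema()        // a quoted location value like (11112,222222) stays in one column
collisionsDF.show(5)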
I have a CSV file in which one of the columns contains values enclosed in double quotes, and those values also contain commas. How do I read this type of column from the CSV into an RDD in Spark using Scala? The values enclosed in double quotes should be read as an integer type, since they are values like total assets and total debts.
Example records from the CSV:
Jennifer,7/1/2000,0,,0,,151,11,8,"25,950,816","5,527,524",51,45,45,45,48,50,2,,
John,7/1/2003,0,,"200,000",0,151,25,8,"28,255,719","6,289,723",48,46,46,46,48,50,2,"4,766,127,272",169
I would suggest reading it with SQLContext as a CSV file, since that has well-tested mechanisms and flexible APIs to satisfy your needs.
You can do
val dataframe = sqlContext.read.csv("path to your csv file")
Output would be
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| _c0| _c1|_c2| _c3| _c4| _c5|_c6|_c7|_c8| _c9| _c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17| _c18|_c19|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| Jennifer|7/1/2000| 0|null| 0|null|151| 11| 8|25,950,816|5,527,524| 51| 45| 45| 45| 48| 50| 2| null|null|
|       John|7/1/2003| 0|null|200,000| 0|151| 25| 8|28,255,719|6,289,723| 48| 46| 46| 46| 48| 50| 2|4,766,127,272| 169|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
Now you can rename the headers, cast the required columns to integers (see the sketch below), and do a lot more.
You can even convert it to an RDD.
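For instance, to cast one of the quoted numeric columns such as _c9, a sketch (the new column names are my own choice) could be:
import org.apache.spark.sql.functions.{col, regexp_replace}
val withNumbers = dataframe
  .withColumn("total_assets", regexp_replace(col("_c9"), ",", "").cast("long"))   // "25,950,816" -> 25950816
  .withColumn("total_debts", regexp_replace(col("_c10"), ",", "").cast("long"))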
Edited
If you prefer to read into an RDD and stay with RDDs, then
Read the file with sparkContext as a textFile
val rdd = sparkContext.textFile("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/test.csv")
Then split the lines on , while ignoring any , between double quotes:
rdd.map(line => line.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)", -1))
@ibh, this is not Spark- or Scala-specific stuff. In Spark you will read the file the usual way:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("app_name").setMaster("local")
val ctx = new SparkContext(conf)
val file = ctx.textFile("<your file>.csv")

file.foreach { line =>
  // cleanup code as per the regex below
  val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
  // side effect: MyObject and mylist are placeholders for your own class and collection
  val myObject = new MyObject(tokens)
  mylist.add(myObject)
}
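Note that foreach with a shared mutable list generally won't behave as intended on a distributed RDD, because the side effect runs on the executors. A map-based sketch (MyObject is still the hypothetical class from above) would be:
val objects = file
  .map(_.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1))  // split on commas outside quotes
  .map(_.map(_.stripPrefix("\"").stripSuffix("\"")))      // drop the surrounding quotes
  .map(tokens => new MyObject(tokens))                    // MyObject is a placeholder
objects.take(5).foreach(println)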
See this regex also.
My question is quite similar to this one: Apache Spark SQL issue: java.lang.RuntimeException: [1.517] failure: identifier expected. But I just can't figure out where my problem lies. I am using SQLite as the database backend. Connecting and simple select statements work fine.
The offending line:
val df = tableData.selectExpr(tablesMap(t).toSeq:_*).map(r => myMapFunc(r))
tablesMap contains the table name as key and an array of strings as expressions. Printed, the array looks like this:
WrappedArray([My Col A], [ColB] || [Col C] AS ColB)
The table name is also included in square brackets since it contains spaces. The exception I get:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: identifier expected
I already made sure not to use any Spark SQL keywords. In my opinion there are two possible reasons why this code fails: 1) I somehow handle spaces in column names wrong. 2) I handle concatenation wrong.
I am using a resource file, CSV-like, which contains the expressions I want to be evaluated on my tables. Apart from this file, I want to allow the user to specify additional tables and their respective column expressions at runtime. The file looks like this:
TableName,`Col A`,`ColB`,CONCAT(`ColB`, ' ', `Col C`)
Apparently this does not work. Nevertheless, I would like to reuse this file, modified of course. My idea was to map the columns with the expressions from an array of strings, like now, to a sequence of Spark columns. (This is the only solution I could think of, since I want to avoid pulling in all the Hive dependencies just for this one feature.) I would introduce a small syntax for my expressions to mark raw column names with a $ and some keywords for functions like concat and as. But how could I do this? I tried something like the following, but it's far, far away from even compiling.
def columnsMapFunc(expr: String): Column = {
  if (expr(0) == '$')
    return expr.drop(1)
  else
    return concat(extractedColumnNames).as(newName)
}
Generally speaking, using names containing whitespace is asking for problems, but replacing the square brackets with backticks should solve the problem:
val df = sc.parallelize(Seq((1,"A"), (2, "B"))).toDF("f o o", "b a r")
df.registerTempTable("foo bar")
df.selectExpr("`f o o`").show
// +-----+
// |f o o|
// +-----+
// | 1|
// | 2|
// +-----+
sqlContext.sql("SELECT `b a r` FROM `foo bar`").show
// +-----+
// |b a r|
// +-----+
// | A|
// | B|
// +-----+
For concatenation you have to use the concat function:
df.selectExpr("""concat(`f o o`, " ", `b a r`)""").show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
but it requires HiveContext in Spark 1.4.0.
In practice I would simply rename columns after loading data
df.toDF("foo", "bar")
// org.apache.spark.sql.DataFrame = [foo: int, bar: string]
and use functions instead of expression strings (the concat function is available only in Spark >= 1.5.0; for 1.4 and earlier you'll need a UDF):
import org.apache.spark.sql.functions.{concat, lit}
df.select($"f o o", concat($"f o o", lit(" "), $"b a r")).show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
There is also the concat_ws function, which takes the separator as its first argument:
df.selectExpr("""concat_ws(" ", `f o o`, `b a r`)""")
df.select($"f o o", concat_ws(" ", $"f o o", $"b a r"))