To illustrate the problem I have taken a small test CSV file, but in the real scenario the job has to handle more than a terabyte of data.
I have a CSV file whose columns are enclosed in quotes ("col1"). When the data was imported, it turned out that one column contains newline characters (\n). This leads to a lot of problems when I want to save the data as Hive tables.
My idea was to replace the \n character with a "|" pipe in Spark.
What I have achieved so far:
1. val test = sqlContext.load(
     "com.databricks.spark.csv",
     Map("path" -> "test_set.csv", "header" -> "true", "inferSchema" -> "true", "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity")) // read the CSV file
2. val dataframe = test.toDF() // convert to a DataFrame
3. dataframe.foreach(println) // print
4. dataframe.map(row => {
     val row4 = row.getAs[String](4)
     val make = row4.replaceAll("[\r\n]", "|")
     make
   }).collect().foreach(println) // the replace is not working for me
Sample set :
(17 , D73 ,525, 1 ,testing\n , 90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1 ,null , 89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1 ,once \n again, 10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1 ,test3\n , 10 ,20.07.2011 ,null ,F10 , R)
Expected result set :
(17 , D73 ,525, 1 ,testing| , 90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1 ,null , 89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1 ,once | again, 10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1 ,test3| , 10 ,20.07.2011 ,null ,F10 , R)
What worked for me:
val rep = "\n123\n Main Street\n".replaceAll("[\\r\\n]", "|")
rep: String = |123| Main Street|
But why am I not able to do the same on a tuple basis?
val dataRDD = lines_wo_header.map(line => line.split(";")).map(row => (row(0).toLong, row(1).toString,
  row(2).toLong, row(3).toLong,
  row(4).toString, row(5).toLong,
  row(6).toString, row(7).toString, row(8).toString, row(9).toString))

dataRDD.map(row => {
  val wert = row._5.replaceAll("[\\r\\n]", "|")
  (row._1, row._2, row._3, row._4, wert, row._6, row._7, row._8, row._9, row._10)
}).collect().foreach(println)
Spark version: 1.3.1
If you can use Spark SQL 1.5 or higher, you may consider using the functions available for columns. Assuming you don't know (or don't have) the names for the columns, you can do as in the following snippet:
val df = test.toDF()
import org.apache.spark.sql.functions._
val newDF = df.withColumn(df.columns(4), regexp_replace(col(df.columns(4)), "[\\r\\n]", "|"))
If you know the name of the column, you can replace df.columns(4) with its name in both occurrences.
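If the stray newlines can occur in more than one column, a variant of the same idea (just a sketch, still assuming Spark 1.5+ and the df defined above) is to fold regexp_replace over every string column:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// apply the replacement to every StringType column of df
val cleaned = df.schema.fields
  .filter(_.dataType == StringType)
  .map(_.name)
  .foldLeft(df) { (acc, name) =>
    acc.withColumn(name, regexp_replace(col(name), "[\\r\\n]", "|"))
  }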
I hope that helps.
Cheers.
My idea was to replace the \n character with "|" pipe in spark.
I tried the replaceAll method but it was not working. Here is an alternative to achieve the same:
val test = sq.load(
  "com.databricks.spark.csv",
  Map("path" -> "file:///home/veda/sample.csv", "header" -> "false", "inferSchema" -> "true", "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

val dataframe = test.toDF()

val mapped = dataframe.map({ row =>
  val str = row.get(0).toString()
  val fnal = new StringBuilder(str)

  // replace newlines (the literal two-character sequence \n)
  var newLineIndex = fnal.indexOf("\\n")
  while (newLineIndex != -1) {
    fnal.replace(newLineIndex, newLineIndex + 2, "|")
    newLineIndex = fnal.indexOf("\\n")
  }

  // replace carriage returns (the literal two-character sequence \r)
  var cgIndex = fnal.indexOf("\\r")
  while (cgIndex != -1) {
    fnal.replace(cgIndex, cgIndex + 2, "|")
    cgIndex = fnal.indexOf("\\r")
  }

  fnal.toString() // tuple modified
})

mapped.collect().foreach(println)
Note: You might want to move the duplicated code into a separate function, as sketched below.
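For reference, the duplicated while loop could be pulled into a helper along these lines. This is just a sketch; replaceAllLiteral is a hypothetical name, and it keeps the behaviour of the code above, i.e. it searches for the literal two-character sequences \n and \r:

// replace every occurrence of a literal target sequence inside a StringBuilder
def replaceAllLiteral(sb: StringBuilder, target: String, replacement: String): StringBuilder = {
  var idx = sb.indexOf(target)
  while (idx != -1) {
    sb.replace(idx, idx + target.length, replacement)
    idx = sb.indexOf(target)
  }
  sb
}

// usage inside the map above:
// replaceAllLiteral(fnal, "\\n", "|")
// replaceAllLiteral(fnal, "\\r", "|")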
Multi-line support for CSV was added in Spark 2.2 (see the JIRA ticket), and Spark 2.2 is not released yet.
I had faced the same issue and resolved it with the help of a Hadoop InputFormat and record reader.
Copy the InputFormat and reader classes from the git repository and wire them up like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
// implementation
JavaPairRDD<LongWritable, Text> rdd = context
    .newAPIHadoopFile(path, FileCleaningInputFormat.class, null, null, new Configuration());
JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());
Another solution: use CSVInputFormat from Apache Crunch to read the CSV file, then parse each CSV line using opencsv (a short parsing sketch follows the dependency below):
sparkContext.newAPIHadoopFile(path, CSVInputFormat.class, null, null, new Configuration()).map(s -> s._2().toString());
Apache Crunch Maven dependency:
<dependency>
<groupId>org.apache.crunch</groupId>
<artifactId>crunch-core</artifactId>
<version>0.15.0</version>
</dependency>
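A minimal sketch of the opencsv parsing step (an assumption: opencsv 3.x or later, where the parser lives in com.opencsv; in the older 2.x releases the package is au.com.bytecode.opencsv):

import com.opencsv.CSVParser

val parser = new CSVParser()                     // defaults: ',' separator, '"' quote char
val fields: Array[String] = parser.parseLine("17,\"once again\",90")
// fields: Array(17, once again, 90)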
I am working on a problem where I need to add a new column that holds the total number of characters across all columns.
My sample data set:
ItemNumber,StoreNumber,SaleAmount,Quantity, Date
2231 , 1 , 400 , 2 , 19/01/2020
2145 , 3 , 500 , 10 , 14/01/2020
The expected output would be 19 and 20, the total character count of each row.
The ideal output I am expecting to build is the data frame with a new column Length added:
ItemNumber,StoreNumber,SaleAmount,Quantity, Date , Length
2231 , 1 , 400 , 2 , 19/01/2020, 19
2145 , 3 , 500 , 10 , 14/01/2020, 20
My code
val spark = SparkSession.builder()
  .appName("SimpleNewIntColumn").master("local").enableHiveSupport().getOrCreate()

val df = spark.read.option("header", "true").csv("./data/sales.csv")

var schema = new StructType
df.schema.toList.map {
  each => schema = schema.add(each)
}
val encoder = RowEncoder(schema)

val charLength = (row: Row) => {
  var len: Int = 0
  row.toSeq.map(x => {
    x match {
      case a: Int => len = len + a.toString.length
      case a: String => len = len + a.length
    }
  })
  len
}

df.map(row => charLength(row))(encoder) // ERROR - Required Encoder[Int] Found EncoderExpression[Row]
df.withColumn("Length", ?)
I have two issues:
1) How do I solve the error "ERROR - Required Encoder[Int] Found EncoderExpression[Row]"?
2) How do I add the output of the charLength function as the new column value? - df.withColumn("Length", ?)
Thank you.
Gurupraveen
If you are just trying to add a column with the total length of each Row, you can simply concat all the columns cast to String and use the length function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val concatCol = concat(df.columns.map(col(_).cast(StringType)):_*)
df.withColumn("Length", length(concatCol))
Output:
+----------+-----------+----------+--------+----------+------+
|ItemNumber|StoreNumber|SaleAmount|Quantity| Date|length|
+----------+-----------+----------+--------+----------+------+
| 2231| 1| 400| 2|19/01/2020| 19|
| 2145| 3| 500| 10|14/01/2020| 20|
+----------+-----------+----------+--------+----------+------+
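One caveat worth noting: concat returns null as soon as any input column is null, so such a row would get a null Length. If that matters, a variant using concat_ws (which skips nulls) can be used instead, reusing the imports above:

val concatNonNull = concat_ws("", df.columns.map(col(_).cast(StringType)): _*)
df.withColumn("Length", length(concatNonNull))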
I'm running spark-shell to compare 2 CSV files. Each file has the same number of columns and both have 600,000 rows. I'm expecting the 2 files to have exactly the same rows. Here is my script:
val a =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/1.csv")
    .drop("unwanted_column")
    .cache()

val b =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/2.csv")
    .drop("unwanted_column")
    .cache()
val c = a.join(b, Seq("id", "year"), "left_outer").cache()
c.count() // this is returning 600,000
Now I'm trying to find out the difference by randomly picking a line with the same id and year in 2 datasets a and b.
val a1 = a.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()
val b1 = b.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()
Then I try to compare each column in a1 and b1.
(0 to (a1.length - 1)).foreach { i =>
  if (a1.getString(i) != null && !a1.getString(i).equals(b1.getString(i))) {
    System.out.println(i + " = " + a1.getString(i) + " = " + b1.getString(i))
  }
}
It didn't print anything. In other words, there is no difference.
I can't tell why c.count() is returning 600,000 like that.
Sorry guys, I guess it was my fault. What I was actually after was a.subtract(b). My purpose was to find the differences between a and b; I was confused about the left_outer join.
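For reference, a minimal sketch of that approach with the DataFrame API (except is the DataFrame counterpart of the RDD subtract operation):

val onlyInA = a.except(b)   // rows present in a but missing from b
val onlyInB = b.except(a)   // rows present in b but missing from a
onlyInA.count()
onlyInB.count()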
I want to find the distinct values from this query in Scala:
select
  key,
  count(distinct suppKey)
from
  file
group by
  key;
I wrote this code in Scala, but it is not working:
val count= file.map(line=> (line.split('|')(0),line.split('|')(1)).distinct().count())
I split on '|' because key is the first field in the file and suppKey is the second.
File:
1|52|3956|337.0
1|77|4069|357.8
1|7|14|35.2
2|3|8895|378.4
2|3|4969|915.2
2|3|8539|438.3
2|78|3025|306.3
Expected output:
1|3
2|2
Instead of a file, for simpler testing, I use a String:
scala> val s="""1|52|3956|337.0
| 1|77|4069|357.8
| 1|7|14|35.2
| 2|3|8895|378.4
| 2|3|4969|915.2
| 2|3|8539|438.3
| 2|78|3025|306.3"""
scala> s.split("\n").map (line => {val sp = line.split ('|'); (sp(0), sp(1))}).distinct.groupBy (_._1).map (e => (e._1, e._2.size))
res198: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 3)
IMHO, we need a groupBy to specify what to group over, and then count per group.
Done in the Spark REPL; test.txt is a file containing the text you've provided:
val d = sc.textFile("test.txt")
d.map(x => (x.split("\\|")(0), x.split("\\|")(1))).distinct.countByKey
scala.collection.Map[String,Long] = Map(2 -> 2, 1 -> 3)
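If the data is loaded as a DataFrame instead of an RDD, the same result can also be expressed with groupBy and countDistinct. A sketch, assuming Spark 2.x; the column names passed to toDF are made up for illustration:

import org.apache.spark.sql.functions.countDistinct

val df = spark.read.option("delimiter", "|").csv("test.txt")
  .toDF("key", "suppKey", "qty", "amount")   // assumed column names

df.groupBy("key")
  .agg(countDistinct("suppKey").as("suppKeyCount"))
  .show()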
I'm reading a CSV file and adding the data to a Map in Scala.
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import org.apache.commons.csv.{CSVFormat, CSVParser}

val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val inputFormat = CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"')
val csvRecords = new CSVParser(br, inputFormat).getRecords.asScala
val buffer = for (csvRecord <- csvRecords; if csvRecord != null)
  yield csvRecord.toMap.asScala
buffer.toList
But as the Map is not ordered, I'm not able to read the columns in order. Is there any way to read the csvRecords in order?
The CSV file contains comma-separated values along with a header. The code should generate output in List[mutable.LinkedHashMap[String, String]] format, something like [["fname", "A", "lname", "B"], ["fname", "C", "lname", "D"]].
The above code works, but it does not preserve the order. For example, if the CSV file contains the columns in the order fname, lname, the output map has lname first and fname last.
If I understand your question correctly, here's one way to create a list of LinkedHashMaps with the elements in order:
// Assuming your CSV File has the following content:
fname,lname,grade
John,Doe,A
Ann,Cole,B
David,Jones,C
Mike,Duke,D
Jenn,Rivers,E
import collection.mutable.LinkedHashMap
// Get indexed header from CSV
val indexedHeader = io.Source.fromFile("/path/to/csvfile").
getLines.take(1).next.
split(",").
zipWithIndex
indexedHeader: Array[(String, Int)] = Array((fname,0), (lname,1), (grade,2))
// Aggregate LinkedHashMap using foldLeft
val ListOfLHM = for ( csvRecord <- csvRecords ) yield
indexedHeader.foldLeft(LinkedHashMap[String, String]())(
(acc, x) => acc += (x._1 -> csvRecord.get(x._2))
)
ListOfLHM: scala.collection.mutable.Buffer[scala.collection.mutable.LinkedHashMap[String,String]] = ArrayBuffer(
Map(fname -> John, lname -> Doe, grade -> A),
Map(fname -> Ann, lname -> Cole, grade -> B),
Map(fname -> David, lname -> Jones, grade -> C),
Map(fname -> Mike, lname -> Duke, grade -> D),
Map(fname -> Jenn, lname -> Rivers, grade -> E)
)
I have a dataframe df which contains the data below:
**customers** **product** **Val_id**
1 A 1
2 B X
3 C
4 D Z
I have been provided 2 rules, which are as below:
**rule_id** **rule_name** **product value** **priority**
123 ABC A,B 1
456 DEF A,B,D 2
The requirement is to apply these rules on dataframe df in priority order: customers who have passed rule 1 should not be considered for rule 2, and the final dataframe should have two more columns, rule_id and rule_name. I have written the code below to achieve it:
val rule_name = when(col("product").isin("A","B"), "ABC").otherwise(when(col("product").isin("A","B","D"), "DEF").otherwise(""))
val rule_id = when(col("product").isin("A","B"), "123").otherwise(when(col("product").isin("A","B","D"), "456").otherwise(""))
val df1 = df_customers.withColumn("rule_name" , rule_name).withColumn("rule_id" , rule_id)
df1.show()
The final output looks like below:
**customers** **product** **Val_id** **rule_name** **rule_id**
1 A 1 ABC 123
2 B X ABC 123
3 C
4 D Z DEF 456
Is there any better way to achieve it, adding both columns while going through the entire dataset only once instead of twice?
Question: Is there any better way to achieve it, adding both columns while going through the entire dataset only once instead of twice?
Answer: You can use a Map return type in Scala, as shown in the example snippet below.
Limitation: since withColumn adds a single column (for example ruleIDandRuleName), the udf has to pack both values into one value, so you need a Map data type (or another Spark SQL column data type that can hold both). Otherwise you cannot use the approach shown below.
def ruleNameAndruleId = udf((product: String) => {
  if (Seq("A", "B").contains(product)) Map("ruleName" -> "ABC", "ruleId" -> "123")
  else if (Seq("A", "B", "D").contains(product)) Map("ruleName" -> "DEF", "ruleId" -> "456")
  else Map("ruleName" -> "", "ruleId" -> "")
})
The caller will be:
df.withColumn("ruleIDandRuleName", ruleNameAndruleId(col("product"))) // ruleNameAndruleId returns a map containing the rule name and rule id
An alternative to your solution would be to use udf functions. It is almost similar to the when function, as both require serialization and deserialization. It is up to you to test which is faster and more efficient.
def rule_name = udf((product : String) => {
if(Seq("A", "B").contains(product)) "ABC"
else if(Seq("A", "B", "D").contains(product)) "DEF"
else ""
})
def rule_id = udf((product : String) => {
if(Seq("A", "B").contains(product)) "123"
else if(Seq("A", "B", "D").contains(product)) "456"
else ""
})
val df1 = df_customers.withColumn("rule_name" , rule_name(col("product"))).withColumn("rule_id" , rule_id(col("product")))
df1.show()
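If you prefer to stay with the built-in column functions and still touch the dataset only once, one more option is to build a struct holding both values with a single when expression and then unpack it. This is only a sketch; the column and value names are taken from the question:

import org.apache.spark.sql.functions._

val rule = when(col("product").isin("A", "B"),
    struct(lit("ABC").as("rule_name"), lit("123").as("rule_id")))
  .when(col("product").isin("A", "B", "D"),
    struct(lit("DEF").as("rule_name"), lit("456").as("rule_id")))
  .otherwise(struct(lit("").as("rule_name"), lit("").as("rule_id")))

val df1 = df_customers
  .withColumn("rule", rule)
  .withColumn("rule_name", col("rule.rule_name"))
  .withColumn("rule_id", col("rule.rule_id"))
  .drop("rule")

df1.show()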