Split a file into multiple files based on a string in Spark Scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want to split the file based on the string abc, so the output is two files as below:
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should come from the IT name (the line that starts with k7), so the first file should be named IT_1234 and the second file IT_8876.

There is a little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter of your Spark context for reading files, so you could do something like this:
import org.apache.spark.sql.functions.col
// .toDF on an RDD needs the SQL implicits in scope
// (spark.implicits._ in Spark 2.x, sqlContext.implicits._ in 1.x)

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)

val df = sc.textFile("your_original_file.txt")
  .map(x => delimit ++ x)                       // put the delimiter back at the head of each record
  .toDF("delimit_column")
  .filter(col("delimit_column") =!= delimit)    // drop the empty first record; use !== on Spark 1.x
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method, but it might help you!
Have a good day.
PS: The filter at the end drops the first record, which is just the concatenated delimiter on its own (the text before the first abc is empty).
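If you would rather stay in the DataFrame API for the last step, here is a minimal sketch (assuming Spark 2.x; the it_name column, the split_output path and the k7\*IT <digits> pattern are assumptions based on the sample data). Note that partitionBy produces one directory per name (it_name=IT_1234, it_name=IT_8876) rather than files literally called IT_1234:
import org.apache.spark.sql.functions.{col, concat, lit, regexp_extract}

// Derive an "IT_xxxx" name from the k7 segment of each record and let
// partitionBy write one output directory per name.
val withName = df.withColumn(
  "it_name",
  concat(lit("IT_"), regexp_extract(col("delimit_column"), "k7\\*IT ([0-9]+)", 1)))

withName.write
  .partitionBy("it_name")
  .text("split_output")   // output directory name is an assumption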

You can use sparkContext's wholeTextFiles function to read the whole file, then parse it to separate the records (here I have used #### as a distinct combination of characters that won't appear in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
Then loop over the resulting array to save each chunk to its own output file:
for (str <- rdd) {
  // saving code goes here
}
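The saving code itself could look something like the sketch below, assuming the chunks are small enough to write from the driver with plain java.io and that the IT name always sits in the second field of the k7 line (the UNKNOWN fallback and the .txt suffix are assumptions):
import java.io.PrintWriter

for (str <- rdd) {
  // "k7*IT 1234*P*234df~" -> "IT_1234"
  val itName = str.split("\r?\n")
    .find(_.startsWith("k7*"))
    .map(_.split("\\*")(1).replace(" ", "_"))
    .getOrElse("UNKNOWN")

  val writer = new PrintWriter(s"$itName.txt")   // local path; swap for an HDFS writer if needed
  try writer.write(str) finally writer.close()
}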

Related

Split data into good and bad rows and write to output files using a Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the commands below in spark-shell.
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the value can be seen in the spark shell:
good.first()
But when I try to write to an output file, I see the records below.
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;#1287b635
[Ljava.lang.String;#2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to object references instead of string values being written.
You should convert each array of strings back into a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
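As a side note, a hedged variant that avoids the split/mkString round trip by counting the tab characters directly (assuming 995 fields means exactly 994 separators) and writes the original lines back unchanged:
val lines = sc.textFile("/abc/abc.tsv.gz")
// 995 fields <=> 994 tab separators; lines are kept as-is, so no re-joining is needed
val goodLines = lines.filter(line => line.count(_ == '\t') == 994)
val badLines  = lines.filter(line => line.count(_ == '\t') < 994)

goodLines.saveAsTextFile("good_output")   // output paths are placeholders
badLines.saveAsTextFile("bad_output")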

How to read a constantly updating HDFS directory with Spark and split the output into multiple HDFS files based on a String (row)?

Elaborated scenario: an HDFS directory that is "fed" new log data from multiple types of bank account activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark-Scala, what's the best approach to read the input file/s in the HDFS directory and output multiple HDFS files, where each ActivityType is written to its own new file?
Adapted my first answer to the statement:
The location of the "key" string is random within the parent String;
the only thing that is guaranteed is that it contains that sub-string,
in this case "ActivityType" followed by some value.
The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
// pull "ACT_001" etc. out of "ActivityType=<ACT_001>" and keep the whole line as the value
val rdd2 = rdd.map { x =>
  val start = x.indexOfSlice("ActivityType=<") + 14
  (x.slice(start, x.indexOfSlice(">", start)), x)
}
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
The output is 3 files (partitions), where the key is not ActivityType= but rather ACT_001, etc. The key data is not stripped from the value; it is still there in the String. You can modify that if you want, as well as the output location and format.
You can use MultipleOutputFormat for this. Convert the RDD into key-value pairs such that the ActivityType is the key; Spark will then create different files for different keys, and based on the key you can decide where to place the files and what their names will be.
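A rough sketch of that approach, using the classic MultipleTextOutputFormat trick from the old Hadoop API (the class name, the ActivityType regex and the paths are placeholders, not tested against the exact input above):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a file named after its key (the ActivityType value).
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                          // don't prepend the key to each written line
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]                    // one output file per ActivityType
}

val pattern = "ActivityType=<([^>]+)>".r
val keyed = sc.textFile("/FileStore/tables/activitySO.txt")   // input path is an assumption
  .map(line => (pattern.findFirstMatchIn(line).map(_.group(1)).getOrElse("UNKNOWN"), line))

keyed.saveAsHadoopFile(
  "SO_QUESTION_MTOF",                           // output directory (placeholder)
  classOf[String], classOf[String], classOf[KeyBasedOutput])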
You can do something like this using RDDs, whereby I assume you have variable-length records, and then convert to a DataFrame:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
.keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I then get 3 files as output, in this case 1 for each record. It is a bit hard to show, as I did it in Databricks.
You can adjust your output format, location, etc.; partitionBy is the key here.

Reading and processing data in Spark: output is not delimited correctly

So my stored output looks like this; it is one column with
\N|\N|\N|8931|\N|1
where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({ s =>
  wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside of the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it correctly split into columns on the given delimiter. How can I accomplish this?
Current output: one column holding the delimited string
Expected output: six columns from the single delimited string
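One possible approach (a sketch, not a tested solution; the column names c1..c6 and the fixed count of six fields are assumptions) is to leave the existing map untouched and split the delimited string in a second map before converting to a DataFrame:
// the original map stays exactly as it is
val delimitedStrings = distData.map({ s =>
  wrapper.value.exec(s.toString)
})

// split afterwards; the -1 limit keeps any trailing empty fields
val columns = delimitedStrings.map { row =>
  val f = row.split("\\|", -1)
  (f(0), f(1), f(2), f(3), f(4), f(5))
}

columns.toDF("c1", "c2", "c3", "c4", "c5", "c6").write.parquet(outFolder)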

How to parse a specific portion of a text file using two delimiters or strings in Scala

I have a sample.txt file that contains logs with date and time.
For example:
10.10.2012:
erewwetrt=1
wrtertret=2
ertertert=3
;
10.10.2012:
asdafdfd=1
adadfadf=2
adfdafdf=3
;
10.12.2013:
adfsfsdfgg=1
sdfsdfdfg=2
sdfsdgsdg=3
;
12.12.2012:
asdasdas=1
adasfasdf=2
dfsdfsdf=3
;
I just want to retrieve only the year 2012 data, that is, the block between 12.12.2012: and ;.
How can I do this in Scala or Spark Scala?
Finally, I need to replace = with a comma and save it in CSV format.
How can I do that?
To extract that specific part you can use this:
def main(args: Array[String]): Unit = {
  val text = "10.10.2012:\nerewwetrt=1\nwrtertret=2\nertertert=3\n;\n10.10.2012:\nasdafdfd=1\nadadfadf=2\nadfdafdf=3\n;\n10.12.2013:\nadfsfsdfgg=1\nsdfsdfdfg=2\nsdfsdgsdg=3\n;\n12.12.2012:\nasdasdas=1\nadasfasdf=2\ndfsdfsdf=3\n;"
  val lines = text.split("\n")
  val extracted = lines.dropWhile(_ != "12.12.2012:").drop(1).takeWhile(_ != ";")
  extracted.foreach(println(_))
}
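If the data actually lives in sample.txt and Spark should do the work, a rough sketch along the same lines (the file path, the output path and the literal block markers are assumptions) could be:
// Read the log file, keep only the block between "12.12.2012:" and ";",
// then turn "key=value" into "key,value" and write it out as CSV-style text.
val lines = sc.textFile("sample.txt").collect()
val block = lines.dropWhile(_.trim != "12.12.2012:").drop(1).takeWhile(_.trim != ";")
val csvRows = block.map(_.replace("=", ","))

sc.parallelize(csvRows).saveAsTextFile("output_2012_csv")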

Remove pipe delimiter from data using Spark

I am new to Spark. I am using Scala to split a pipe-delimited file and save it to HDFS without the pipe delimiter; for that I have written this code:
object WordCount {
  def main(args: Array[String]) {
    val textfile = sc.textFile("/user/cloudera/xxxx/xxxx")
    val word = textfile.map(l => l.split("|"))
    word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
  }
}
But when I execute it I don't get any errors; however, in HDFS I get the data below.
[Ljava.lang.String;#10ed847f
[Ljava.lang.String;#4316ebe
[Ljava.lang.String;#495d7e18
[Ljava.lang.String;#19017f49
[Ljava.lang.String;#314b9e72
[Ljava.lang.String;#5b8f67a6
[Ljava.lang.String;#23ddf240
[Ljava.lang.String;#404b5a25
[Ljava.lang.String;#130b541d
[Ljava.lang.String;#4cbf45af
[Ljava.lang.String;#21780b86
[Ljava.lang.String;#503c9b94
[Ljava.lang.String;#3b0a3ab3
I don't know what I am doing wrong.
Please help.
That's because you are splitting each string into an Array of Strings. To save as a text file, you would need mkString(",") if you wish to concatenate the pieces back with a comma, but I don't see much purpose in that.
If you want to replace the pipe separator with a comma, you can use _.replaceAll("\\|", ",") instead and save it:
val word = textfile.map(_.replaceAll("\\|", ",").trim)
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
PS: You can replace the comma with anything you want, e.g. a whitespace, a word, etc.
So why does the pipe need to be escaped?
String.split expects a regular expression argument. An unescaped | is parsed as a regex meaning "empty string or empty string", which isn't what you mean.
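A quick way to see the difference in the Scala REPL (results shown as comments; the exact output may vary slightly between JVM versions):
"a|b|c".split("|").mkString(" ")    // "a | b | c"  -> the empty-string regex splits between every character
"a|b|c".split("\\|").mkString(" ")  // "a b c"      -> the escaped pipe splits on the literal | character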