How to parse a specific portion of a text file using two delimiters or strings in Scala

I have a sample.txt file.
The file contains logs with date and time.
For example:
10.10.2012:
erewwetrt=1
wrtertret=2
ertertert=3
;
10.10.2012:
asdafdfd=1
adadfadf=2
adfdafdf=3
;
10.12.2013:
adfsfsdfgg=1
sdfsdfdfg=2
sdfsdgsdg=3
;
12.12.2012:
asdasdas=1
adasfasdf=2
dfsdfsdf=3
;
I just want to retrieve the year 2012 data, that is, everything between 12.12.2012: and ;.
How can I do this in Scala or Spark Scala?
Finally, I need to replace = with a comma and save the result in CSV format.
How can I do that?

To extract that specific part you can use this:
def main(args: Array[String]): Unit = {
  val text = "10.10.2012:\nerewwetrt=1\nwrtertret=2\nertertert=3\n;\n10.10.2012:\nasdafdfd=1\nadadfadf=2\nadfdafdf=3\n;\n10.12.2013:\nadfsfsdfgg=1\nsdfsdfdfg=2\nsdfsdgsdg=3\n;\n12.12.2012:\nasdasdas=1\nadasfasdf=2\ndfsdfsdf=3\n;"
  val lines = text.split("\n")
  // skip everything up to and including the "12.12.2012:" header, then take lines until the ";" terminator
  val extracted = lines.dropWhile(_ != "12.12.2012:").drop(1).takeWhile(_ != ";")
  extracted.foreach(println)
}
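To also cover the rest of the question (keeping every 2012 block, replacing = with a comma, and saving as CSV), here is a minimal plain-Scala sketch along the same lines. The file names and the assumption that every block ends with a lone ; line are mine, not part of the answer above:
import scala.io.Source
import java.io.PrintWriter

val lines = Source.fromFile("sample.txt").getLines().toList

// Split the flat line list into blocks: a date header, its entries, then ";".
def blocks(ls: List[String]): List[List[String]] = ls match {
  case Nil => Nil
  case _ =>
    val (block, rest) = ls.span(_ != ";")
    block :: blocks(rest.drop(1))
}

val rows2012 = blocks(lines)
  .filter(_.headOption.exists(_.matches("""\d{2}\.\d{2}\.2012:""")))
  .flatMap(_.tail)              // drop the date header, keep the entries
  .map(_.replace("=", ","))     // "erewwetrt=1" -> "erewwetrt,1"

val out = new PrintWriter("output.csv")
try rows2012.foreach(out.println) finally out.close()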

Related

PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append to it all the filenames stored in the source:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# converting filenames to tuples
d = [(x,) for x in filenames]
print(d)
Read the data from multiple files and store it in a list:
# create data by reading from multiple files
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)

Reading and processing data in Spark: output is not delimited correctly

So my stored output looks like this; it is one column containing
\N|\N|\N|8931|\N|1
where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({ s =>
  wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it correctly split into columns by a given delimiter. How can I accomplish this?
Current output: one column which is a delimited string.
Expected output: six columns from the single delimited string.
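One hedged sketch of how the six columns could be recovered, assuming the delimiter really is | and inventing the column names c1..c6: split the string after the map, so nothing inside the map itself has to change.
import spark.implicits._   // spark: the active SparkSession (e.g. in spark-shell)

val parsed = distData
  .map(s => wrapper.value.exec(s.toString))   // unchanged; returns e.g. "\N|\N|\N|8931|\N|1"
  .map(_.split("\\|", -1))                    // "\\|" because split takes a regex; -1 keeps trailing empty fields
  .map { case Array(a, b, c, d, e, f) => (a, b, c, d, e, f) }
  .toDF("c1", "c2", "c3", "c4", "c5", "c6")

parsed.write.parquet(outFolder)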

Split a file into multiple files based on a string in Spark Scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want the output as two files, as below.
Based on the string abc, I want to split the file.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should be the IT name (from the line starting with k7), so file 1's name should be IT_1234 and the second file's name should be IT_8876.
There is a little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter of your Spark context for reading files. So you could do something like this:
val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
.map(x => (delimit ++ x))
.toDF("delimit_column")
.filter(col("delimit_column") !== delimit)
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method but it might help you !
Have a good day
PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.
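As a rough illustration of that last step (my own addition; the output directory and the IT-number regex are assumptions), each collected record could be written to its own file named after its k7 line. Collecting to the driver is only reasonable while the input file stays small:
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

val records = df.collect().map(_.getString(0))   // each record is one "abc ... ~" chunk

records.foreach { record =>
  // e.g. "k7*IT 1234*P*234df~" -> file name "IT_1234.txt"
  val name = "k7\\*IT (\\d+)".r
    .findFirstMatchIn(record)
    .map(m => s"IT_${m.group(1)}")
    .getOrElse("UNKNOWN")
  Files.write(Paths.get(s"/tmp/$name.txt"), record.getBytes(StandardCharsets.UTF_8))
}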
You can benefit from SparkContext's wholeTextFiles function to read the file, then parse it to separate the strings (here I have used #### as a distinct combination of characters that won't repeat in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
And then loop over the array to save each text to its output file:
for(str <- rdd){
//saving codes here
}

Remove pipe delimiter from data using Spark

I am new to Spark. I am using Scala to split a pipe-delimited file and save it in HDFS without the pipe delimiter; for that I have written this code.
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxxx/xxxx")
    val word = textfile.map(l => l.split("|"))
    word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
  }
}
But when I execute it I do not get any errors; however, in HDFS I get the data below.
[Ljava.lang.String;#10ed847f
[Ljava.lang.String;#4316ebe
[Ljava.lang.String;#495d7e18
[Ljava.lang.String;#19017f49
[Ljava.lang.String;#314b9e72
[Ljava.lang.String;#5b8f67a6
[Ljava.lang.String;#23ddf240
[Ljava.lang.String;#404b5a25
[Ljava.lang.String;#130b541d
[Ljava.lang.String;#4cbf45af
[Ljava.lang.String;#21780b86
[Ljava.lang.String;#503c9b94
[Ljava.lang.String;#3b0a3ab3
I don't know what I am doing wrong.
Please help.
That's because you are splitting each string into an Array of Strings. To save as a text file, you'll need to use mkString(",") if you wish to concatenate with a comma. But I don't see any purpose in that.
If you want to replace the pipe separator with a comma, you can use _.replaceAll("\\|", ",") instead and save it:
val word = textfile.map(_.replaceAll("\\|",",").replaceFirst(",","").trim)
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
PS: You can replace the comma with anything you want, e.g. a whitespace, a word, etc.
So why does the pipe need to be escaped?
A string split expects a regular expression argument. An unescaped | is parsed as a regex meaning "empty string or empty string", which isn't what you mean.
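A quick way to see the difference, for example in the Scala REPL:
val line = "a|b|c"

line.split("|")    // regex "|" matches the empty string -> Array(a, |, b, |, c)
line.split("\\|")  // escaped pipe matches the literal character -> Array(a, b, c)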

MATLAB: reading numbers that also contain characters

I have to read a text file which contains a list of company codes. The format of the text file is:
[1233A12; 1233B88; 2342Q85; 2266738]
Even once I have read the file, is it possible to compare these numbers with regular numbers? I have the codes from two different databases; one of them has regular firm numbers (no characters) and the other has characters inside the firm numbers.
By the way, the file is big (50+ MB).
Edit: I have added an additional number to the example because not all the numbers have a character inside.
If you want to compare part of a string with a number, you could do it as follows:
combiString = '1234AB56'
myNumber = 1234
str2num(combiString(1:4)) == myNumber   % compares 1234 with 1234 -> true
str2num(combiString(7:8)) == myNumber   % compares 56 with 1234 -> false
You can achieve this result by using regular expressions. For example, if str = '1233A12' you can write:
nums = regexp(str, '(\d+)[A-Z]*(\d+)', 'tokens');   % tokens: {'1233', '12'}
str1 = nums{1}(1);
num1 = str2num(str1{1});   % 1233
str2 = nums{1}(2);
num2 = str2num(str2{1});   % 12