how to apply pattern matching on the file name using spark(scala)

how to apply pattern matching on the file name using spark(scala) - scala

I am monitoring a directory in hdfs and if a file is put into it i want to retrieve the name of the file entering the directory and apply pattern matching so as to sort them based on names so far i have been able to retrieve the file path.which gives me the file name but i don't know how to proceed further with pattern matching
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
object path {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path :String = "/home/hduser/Desktop/foldername"
val text = sc.newAPIHadoopFile(path, fc ,kc, vc, sc.hadoopConfiguration)
println("++++++++++++++ "+text)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit((inputSplit, iterator) => {
val file = inputSplit.asInstanceOf[FileSplit]
println(">>>>>>>>>>>>>>>> "+file.getPath)
iterator.map(tup => (file.getPath, tup._2))
}
)
linesWithFileNames.collect()
}
}
this gives me the path like
/home/hduser/Desktop/folder_name/XYZ_123_pqr_abc_FILENAME
and i want to apply pattern matching based on the file name start....here it will based on XYZ.
Any help will be appreciated.

You are not specific enough, i think. From context of your question i assume XYZ are digits and you intend to sort files by that numbers, if that's the case consider the following example:
val matchStart = """^(\d{3})""".r.unanchored
val number = file match {
case matchStart(number, _*) => number.toInt
case _ => ???//error handling, not necessarily unimplemented error
}
Now having extracted the number from the begining of the file name you can perform sorting etc.
Edit
Following information in comment answer would be similar:
val matchStart = """^([a-zA-Z1-9]{3})""".r.unanchored
val prefix = file match {
case matchStart(a, _*) => a
case _ => ???//error handling, not necessarily unimplemented error
}
Having the prefix extracted you can do anything you want with it.

Related

Scala : List the files which are greater than a file based on its name with timestamp pattern in it

I have to list out all files which are greater than particular file based on its timestamp in naming pattern in scala. Below is the example.
Files available:
log_20200601T123421.log
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
Condition file:
log_20200601T123421.log - I need to list all the file names, which are greater than equal to 20200601T123421 in its name. The result would be,
Output list:
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
How to achieve this in scala? I was trying with apache common, but i couldn't see greater than equal to NameFileFilter for it.

Perhaps the following code snippet could be a starting point:
import java.io.File
def getListOfFiles(dir: File):List[File] = dir.listFiles.filter(x => x.getName > "log_20200601T123421.log").toList
val files = getListOfFiles(new File("/tmp"))
For the extended task to collect files from different sub-directories:
import java.io.File
def recursiveListFiles(f: File): Array[File] = {
val these = f.listFiles
these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
val files = recursiveListFiles(new File("/tmp")).filter(x => x.getName > "log_20200601T123421.log")

creating function for loading conf file and store all properties in case class

I have one app.conf file like below:
some_variable_a = some_infohere_a
some_variable_b = some_infohere_b
Now I need to write scala function to load this app.conf file and create a scala case class to store all these properties. with try-catch and file checking conditions and corner cases.
I am very new to scala do not have much knowledge on this please provide me a correct way to do this.
Whatever I tried I am writing below:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
import com.typesafe.config._
import java.nio.file.Paths
private def ReadConfFile(path: String) = {
val fileTemp = new File(path)
if (fileTemp.exists) {
val confFile = Paths.get(Path).toFile
val config = ConfigFactory.parseFile(confFile)
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
}
}

Assuming that app.conf is in root folder of your application, this should be enough to access those variables from config file:
val config = ConfigFactory.load("app.conf")
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
In case you need to load it from the file using absolute path, you can do that in this way:
val myConfigFile = new File("/home/user/location/app.conf")
val config = ConfigFactory.parseFile(myConfigFile)
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
Or something similar as you did. In your code there is a typo I guess in Paths.get(Path).toFile. "Path" should be "path". If you If you don't have some variable Path, you should get compiling error for that. If that is not the problem, then check if you are providing correct path:
private def ReadConfFile(path: String) = {
val fileTemp = new File(path)
if (fileTemp.exists) {
val confFile = Paths.get(path).toFile
val config = ConfigFactory.parseFile(confFile)
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
}
}
ReadConfFile("/home/user/location/app.conf")

how get element from List based on its name?

I have a path that contains subpath ,each subpath contains files
path="/data"
I implemented tow function to get csv files from each sub path
def getListOfSubDirectories(directoryName: String): Array[String] = {
(new File(directoryName))
.listFiles
.filter(_.isDirectory)
.map(_.getName)
}
def getListOfFiles(dir: String, extensions: List[String]): List[File] = {
val d = new File(dir)
d.listFiles.filter(_.isFile).toList.filter { file =>
extensions.exists(file.getName.endsWith(_))
}
}
each sub path contain 5 csv files : contextfile.csv,datafile.csv,datesfiles.csv,errors.csv,testfiles so my problem that i'll work with each file in a separate dataframe how I can get name of file for the right dataframe for example I want to get the name of files that concern context (i.e contextfile.csv). I worked like this but for each iteration the logic and the ranking in th List change
val dir=getListOfSubDirectories(path)
for (sup_path <- dir)
{ val Files = getListOfFiles(path + "//" + sup_path, List(".csv"))
val filename_context = Files(1).toString
val filename_datavalue = Files(0).toString
val filename_error = Files(3).toString
val filename_testresult = Files(4).toString
}
any help and thanks

I solve it just a simple filter
val filename_context = Files.filter(f =>f.getName.contains("context")).last.toString
val filename_datavalue = Files.filter(f =>f.getName.contains("data")).last.toString
val filename_error = Files.filter(f =>f.getName.contains("error")).last.toString
val filename_testresult = Files.filter(f =>f.getName.contains("test")).last.toString

Best way to convert online csv to dataframe scala

I am trying to figure out the most efficient way to accomplish putting this online csv file into a data frame in Scala.
To save a download, the csv file in the code looks like this:
"Symbol","Name","LastSale","MarketCap","ADR
TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....
From my research, I start by downloading the csv, and placing it into a list buffer (since you can't do this with a list because it's immutable):
import scala.collection.mutable.ListBuffer
val sc = new SparkContext(conf)
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-
industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList
So we have a list. You can basically get each value like this:
// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
Here is where I get stuck- how do I get this to a dataframe? The wrong approaches I have taken. I didn't put all the values in just yet- was a simple test.
case class StockMap(Symbol: String, Name: String)
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
StockMap(stockInfoNYSE_List(1).split(",")(1))).toDS()
caseClassDS.show()
The problem with the approach above: I can only figure out how to add one sequence (row) by hard coding it. I want every Row in the list.
My second failed attempt:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF
This will just give you the array, and I want to divide up the values.
Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],.......

case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String
| )
defined class TestClass
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
scala> demoDS.toDS.show
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol| Name|LastSale| MarketCap| ADR_TSO|IPOyear| Sector| Industry| Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
| DDD|3D Systems Corpor...| 18.09| 2058834640.41| n/a| n/a| Technology|Computer Software...|http://www.nasdaq...|
| MMM| 3M Company| 211.68|126423673447.68| n/a| n/a| Health Care|Medical/Dental In...|http://www.nasdaq...|

In case anyone is trying to get this example working, here is the code using the above solution:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import scala.collection.mutable.ListBuffer
import sqlContext.implicits._
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String )
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
demoDS.toDF().show

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong

Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()

For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD

What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

how to apply pattern matching on the file name using spark(scala) - scala

Related

Scala : List the files which are greater than a file based on its name with timestamp pattern in it

creating function for loading conf file and store all properties in case class

how get element from List based on its name?

Best way to convert online csv to dataframe scala

String filter using Spark UDF

Categories

Resources