How to create a log of which folders have been read in Scala Spark - scala

The HDFS folder layout is like:
/test/data/2020-03-01/{multiple csv files inside}
/test/data/2020-03-02/{multiple csv files}
/test/data/2020-03-03/{multiple csv files}
I want to read the data folder by folder, not all at once with
spark.read.csv("/test/data/*") // not in this manner
I want to read the folders one by one, so that I can make a log entry in a database recording that a date folder has been read, and can then skip that folder on the next day (or on the same day, if the program accidentally runs again):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// strOutput is the base directory, e.g. "/test/data"
val iterate = FileSystem.get(new URI(strOutput), conf).listLocatedStatus(new Path(strOutput))
while (iterate.hasNext) {
  val pathStr = iterate.next().getPath.toString
  println("log----> " + pathStr)
  val df = spark.read.text(pathStr) // read this date folder only
}
Try something like the above to read each folder as a DataFrame; if you want, you can union each new date's DataFrame with the previous one.
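As a minimal sketch of the logging idea, assuming hypothetical isAlreadyProcessed/markProcessed helpers backed by whatever database holds your log table (they are not part of the answer above), the per-folder loop could look like this:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helpers backed by your log database.
def isAlreadyProcessed(folder: String): Boolean = ??? // look up the folder in the log table
def markProcessed(folder: String): Unit = ???         // insert a log entry after a successful read

val basePath = "/test/data"
val fs = FileSystem.get(new URI(basePath), new Configuration())
val dateFolders = fs.listStatus(new Path(basePath)).filter(_.isDirectory).map(_.getPath.toString)

for (folder <- dateFolders if !isAlreadyProcessed(folder)) {
  println("log----> " + folder)
  val df = spark.read.csv(folder)  // read only this date folder
  // ... process df or union it with previously read data ...
  markProcessed(folder)            // record the folder so it is skipped on the next run
}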

Related

Merging too many small files into single large files in Datalake using Apache Spark

I have the following directory structure in HDFS:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt
I want to merge the files day-wise, so that each day ends up with a single file:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
I have used the code below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = "/user/hdfs/landing_zone/year=2021/month=11/"
val hadoopConf = spark.sparkContext.hadoopConfiguration
val hdfsConf = new Configuration()
val fs: FileSystem = FileSystem.get(hdfsConf)
val sc = spark.sparkContext
val baseFolder = new Path(inputDir)
// the day=01, day=02, ... folders under the month folder
val files = baseFolder.getFileSystem(sc.hadoopConfiguration).listStatus(baseFolder).map(_.getPath.toString)
for (path <- files) {
  var Folder_Path = fs.listStatus(new Path(path)).map(_.getPath).toList
  for (eachfolder <- Folder_Path) {
    var New_Folder_Path: String = eachfolder.toString
    var Fs1 = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    var FilePath = Fs1.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath).toList
    var NewFiles = Fs1.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath.getName).toList
  }
}
"FilePath" : Generating the List of Complete Path for all the files recursively.
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt)
"NewFiles" : - Generating the list of FileNames for all the files recursively
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
Can someone suggest/guide me on how I should modify the code so that it generates the files day-wise and merges the 3 files per day (1 day = 3 files) into a single file (1 day = 1 file) for all the days?
There are easier ways than getting into low-level manipulations. I would suggest "picking the table up and putting it back down".
Literally: create a table based on the files and write it out to a new table. This will concatenate the small files without you having to manipulate them yourself.
If you have created a Hive table based on this, you could use Hive to do the work:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
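If you prefer to stay in Spark rather than Hive, a sketch of the same "pick it up and put it back down" idea, assuming plain text files, that one file per day is acceptable, and a hypothetical merged output directory, would be:

import org.apache.hadoop.fs.{FileSystem, Path}

val monthDir = "/user/hdfs/landing_zone/year=2021/month=11/"
val outDir   = "/user/hdfs/landing_zone_merged/year=2021/month=11/"  // hypothetical output location

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dayFolders = fs.listStatus(new Path(monthDir)).filter(_.isDirectory).map(_.getPath)

for (day <- dayFolders) {
  spark.read.text(day.toString)  // "pick the table up" for this day
    .coalesce(1)                 // squeeze it into a single output file
    .write
    .mode("overwrite")
    .text(outDir + day.getName)  // "put it back down" as day=01, day=02, ...
}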

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000, as it seems Spark tries to load them all at the same time, and I need it to process them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()
ids.foreach(id => {
  println("Starting file " + id)
  var df = load_file(id)
  var values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
})

def get_files_IDs(): List[String] = {
  var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
  return ids_list
}

def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
  return values_id
}

def load_file(id: String): org.apache.spark.sql.DataFrame = {
  val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
  return df
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!
EDIT: Added the definitions. I hope this gives a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking way too long.
So I can say that with a for loop or foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache do help with performance.
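A sketch of that loop with explicit cache cleanup between files, reusing the asker's load_file and calculate_values helpers shown above (sqlContext.clearCache() is available on Spark 1.6):

import org.apache.spark.sql.SaveMode

val ids = get_files_IDs()
ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()           // drop any cached blocks belonging to this file
  sqlContext.clearCache()  // clear anything else cached during this iteration
}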

Generating a single output file for each processed input file in Apach Flink

I am using Scala and Apache Flink to build an ETL that periodically reads all the files under a directory in my local file system and writes the result of processing each file to a single output file under another directory.
So an example of this would be:
/dir/to/input/files/file1
/dir/to/input/files/file2
/dir/to/input/files/file3
and the output of the ETL would be exactly:
/dir/to/output/files/file1
/dir/to/output/files/file2
/dir/to/output/files/file3
I have tried various approaches, including reducing the parallelism to one when writing to the data sink, but I still can't achieve the required result.
This is my current code:
val path = "/path/to/input/files/"
val format = new TextInputFormat(new Path(path))
val socketStream = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10)
val wordsStream = socketStream.flatMap(value => value.split(",")).map(value => WordWithCount(value,1))
val keyValuePair = wordsStream.keyBy(_.word)
val countPair = keyValuePair.sum("count")
countPair.print()
countPair.writeAsText("/path/to/output/directory/" +
  DateTime.now().getHourOfDay.toString +
  DateTime.now().getMinuteOfHour.toString +
  DateTime.now().getSecondOfMinute.toString,
  FileSystem.WriteMode.NO_OVERWRITE)
// The first write method I tried:
val sink = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink.setBucketer(new DateTimeBucketer[WordWithCount]("yyyy-MM-dd--HHmm"))
// The second write method I tried:
val sink3 = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink3.setUseTruncate(false)
sink3.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink3.setWriter(new StringWriter[WordWithCount])
sink3.setBatchSize(3)
sink3.setPendingPrefix("file-")
sink3.setPendingSuffix(".txt")
Both writing methods fail to produce the wanted result.
Can someone with experience with Apache Flink guide me to the right approach, please?
I solved this issue by adding the following dependencies to run on my local machine:
hadoop-aws-2.7.3.jar
aws-java-sdk-s3-1.11.183.jar
aws-java-sdk-core-1.11.183.jar
aws-java-sdk-kms-1.11.183.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
joda-time-2.8.1.jar
httpcore-4.4.4.jar
httpclient-4.5.3.jar
You can review it on :
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html
Section "Provide S3 FileSystem Dependency"

Read multiple files from a directory using Spark

I am trying to solve this problem on Kaggle using Spark.
The input hierarchy is like this:
drivers/{driver_id}/trip#.csv
e.g., drivers/1/1.csv
drivers/1/2.csv
drivers/2/1.csv
I want to read the parent directory "drivers", and for each sub-directory I would like to create a pair RDD with (sub_directory, file_name) as the key and the content of the file as the value.
I checked this link and tried to use
val text = sc.wholeTextFiles("drivers")
text.collect()
This failed with the error:
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:591)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:283)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:243)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:267)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1779)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
But when I run the code below, it works:
val text = sc.wholeTextFiles("drivers/1")
text.collect()
But I don't want to do this, since then I would have to list the drivers directory, loop over the sub-directories, and call wholeTextFiles for each entry.
Instead of using
sc.textFile("path/*/**") or sc.wholeTextFiles("path/*")
you can use this piece of code. Spark internally lists every folder and subfolder matching the pattern, which can cost you time on large datasets; instead, you can use unions for the same purpose.
Pass a List object containing the locations to the following piece of code (note: sc is a SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
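If you specifically need the (sub_directory, file_name) keys from the question, a sketch of the same union idea with wholeTextFiles, assuming the number of sub-directories under drivers is small enough to list on the driver, would be:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Build ((sub_directory, file_name), content) pairs: one wholeTextFiles RDD per
// sub-directory, then union them all.
val fs = FileSystem.get(sc.hadoopConfiguration)
val subDirs = fs.listStatus(new Path("drivers")).filter(_.isDirectory).map(_.getPath)

val perDir: Seq[RDD[((String, String), String)]] = subDirs.toSeq.map { dir =>
  sc.wholeTextFiles(dir.toString).map { case (fullPath, content) =>
    ((dir.getName, new Path(fullPath).getName), content)
  }
}
val pairs = sc.union(perDir)  // RDD[((sub_directory, file_name), content)]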

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take forever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use the following approach.
First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]
  // S3 client and list-objects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()
  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)
  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())
  // Removing the base directory name
  files.remove(0)
  // Creating a Scala collection from the same
  files.asScala
}
Now pass this List object to the following piece of code (note: sc is a SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
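As a sketch of that suggestion, assuming each path ends in .../YYYY/MM/DD/filename.txt.gz and that wholeTextFiles can cope with listing this many objects in your setup (the caveat in the answer below about 20 million files still applies):

// Key each file by its YYYY/MM/DD path segment, then group so each day's
// bundle holds (filename, content) pairs.
val files = sc.wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")

val byDay = files
  .map { case (path, content) =>
    val parts = path.split("/")
    val day = parts.slice(parts.length - 4, parts.length - 1).mkString("-")  // "YYYY-MM-DD"
    (day, (parts.last, content))
  }
  .groupByKey()  // one record per day: (day, Iterable[(filename, content)])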
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz"), as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend maintaining a file that gets appended with the file path every time a file is uploaded to S3.
The same goes for the output: write it to HDFS first and then move it to S3 (although a direct copy might be equally efficient).