Scala code to read parquet files by passing dynamic values using widgets

I have a folder structure like process/YYYY/MM/DD and I need to write Scala code to read those files under process/YYYY, passing the month and day dynamically using widgets.
My current code:
val ReadDf = spark.read.format("parquet").option("header","true").load("/mnt/pnt/process/YYYY")

You can use the following code to get the month from a widget and then create a path to load from:
dbutils.widgets.text("Month", "1")
val widget_month = dbutils.widgets.get("Month").toInt
val path_month = "%02d".format(widget_month)
val pathToReadFrom = s"/mnt/pnt/process/yyyy=2020/mm=${path_month}"
Databricks output shows:
widget_month: Int = 1
path_month: String = 01
pathToReadFrom: String = /mnt/pnt/process/yyyy=2020/mm=01
Now if you want to pass arguments to a notebook through widgets, you can run it from another notebook using Notebook workflows. This is an example from the Notebook workflows documentation:
dbutils.notebook.run("notebook-name", 60, Map("argument" -> "data", "argument2" -> "data2", ...))
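Applying the same pattern to your case, here is a minimal sketch that also takes the day from a widget and then reads the resulting path (the yyyy=2020/mm=/dd= folder naming follows the answer above and is an assumption about your actual layout):
dbutils.widgets.text("Month", "1")
dbutils.widgets.text("Day", "1")

// Zero-pad the widget values so they match two-digit folder names
val path_month = "%02d".format(dbutils.widgets.get("Month").toInt)
val path_day = "%02d".format(dbutils.widgets.get("Day").toInt)

val pathToReadFrom = s"/mnt/pnt/process/yyyy=2020/mm=${path_month}/dd=${path_day}"

// Parquet files carry their own schema, so the header option is not needed
val readDf = spark.read.parquet(pathToReadFrom)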

Related

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I call the same notebook from an Azure Synapse pipeline it reports success, but the table is not created. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecation warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "_" + pSourceSchemaName + "_" + pFolderName

// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SqlAnalyticsConnector._

// Get list of "FileLocation" from control.FileLoadStatus
val fls: DataFrame = spark.read.
  synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
  select("FileLocation", "ProcessedDate")

// Read all parquet files in the folder into a data frame
// Add the file name as a column
val df: DataFrame = spark.read.
  parquet(s"/source/${pRootFolderName}/${pFolderName}/").
  withColumn("FileLocation", input_file_name())

// Join the parquet file data frame to the FileLoadStatus data frame
// Exclude rows where ProcessedDate is not null
val df2 = df.
  join(fls, Seq("FileLocation"), "left").
  where(fls("ProcessedDate").isNull)

// Write the data frame to a SQL table
df2.write.
  option(Constants.SERVER, s"${pServerName}.sql.azuresynapse.net").
  synapsesql(s"${pDatabaseName}.xtr.${pTableName}", Constants.INTERNAL)
This happens often. To get the output after a pipeline execution, follow these steps:
Pick up the Apache Spark application name from the pipeline output.
Navigate to Apache Spark applications under the Monitor tab and search for that application name.
Four tabs are available there: Diagnostics, Logs, Input data, Output data.
Go to Logs and check 'stdout' for the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
See the video link above for a detailed walkthrough of the procedure.

What input does my user defined function in spark dataframe take in?

I am trying to combine the two columns "Format Group" and "Format Subgroup" into a single column called Format.
The output in the final Format column should be in the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
These are the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Here's how it looks in IntelliJ IDEA.
Update: I changed the code from Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) => {
      dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
    }
  }
  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType"))
      , "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, the numberCheckoutRecordsPerFormat should return a DataFrame of Format and number of Checkouts for a given item - but I got this part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle
Thanks, people!
Doomdaam, you can try to use the concat_ws built-in function (always use built-in functions when possible). Your code will look like:
checkoutDF
  .na.drop()
  .join(dataDictionaryDF
    .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType"))
    , "ItemType")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .groupBy("ItemBarCode")
  .agg(count("ItemBarCode"))
  .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
  .select($"Format", $"CheckoutCount")
Otherwise, your UDF would have been:
val createFeatureVector = udf{(formatGroup:String, formatSubgroup:String) => Seq(formatGroup,formatSubgroup).mkString(":")}
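Either way, the new column is added in the same place in the chain above. As a sketch, applying the UDF instead of concat_ws (assuming the join has already brought Format_Group and Format_Subgroup alongside checkoutDF's columns):
// Sketch: same pipeline as above, but with the UDF instead of concat_ws
checkoutDF
  .na.drop()
  .join(dataDictionaryDF.select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
  .withColumn("Format", createFeatureVector($"Format_Group", $"Format_Subgroup"))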

Generating a single output file for each processed input file in Apache Flink

I am using Scala and Apache Flink to build an ETL that periodically reads all the files under a directory in my local file system and writes the result of processing each file to a single output file under another directory.
So an example would be:
/dir/to/input/files/file1
/dir/to/input/files/file2
/dir/to/input/files/file3
and the output of the ETL would be exactly:
/dir/to/output/files/file1
/dir/to/output/files/file2
/dir/to/output/files/file3
I have tried various approaches, including reducing the parallelism to one when writing to the data sink, but I still can't achieve the required result.
This is my current code:
val path = "/path/to/input/files/"
val format = new TextInputFormat(new Path(path))
val socketStream = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10)

val wordsStream = socketStream.flatMap(value => value.split(",")).map(value => WordWithCount(value, 1))
val keyValuePair = wordsStream.keyBy(_.word)
val countPair = keyValuePair.sum("count")

countPair.print()
countPair.writeAsText("/path/to/output/directory/" +
    DateTime.now().getHourOfDay.toString +
    DateTime.now().getMinuteOfHour.toString +
    DateTime.now().getSecondOfMinute.toString,
  FileSystem.WriteMode.NO_OVERWRITE)

// The first write method I tried:
val sink = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink.setBucketer(new DateTimeBucketer[WordWithCount]("yyyy-MM-dd--HHmm"))

// The second write method I tried:
val sink3 = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink3.setUseTruncate(false)
sink3.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink3.setWriter(new StringWriter[WordWithCount])
sink3.setBatchSize(3)
sink3.setPendingPrefix("file-")
sink3.setPendingSuffix(".txt")
Both writing methods fail to produce the wanted result.
Can someone with experience with Apache Flink guide me to the right approach, please?
I solved this issue by importing the following dependencies to run on a local machine:
hadoop-aws-2.7.3.jar
aws-java-sdk-s3-1.11.183.jar
aws-java-sdk-core-1.11.183.jar
aws-java-sdk-kms-1.11.183.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
joda-time-2.8.1.jar
httpcore-4.4.4.jar
httpclient-4.5.3.jar
You can review this at:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html
in the section "Provide S3 FileSystem Dependency".

How to dump nested list using SnakeYAML

I'm parsing this yaml file
View:
  from: 01.01.2007
  to: 04.01.2007
  driver: sun.jdbc.odbc.JdbcOdbcDriver
using SnakeYAML in Scala like this:
import java.io.FileWriter
import org.yaml.snakeyaml.Yaml
import scala.collection.JavaConverters._

val stream = getClass.getResourceAsStream("/config_view.yml")
var configMap = new Yaml().load(stream).asInstanceOf[java.util.Map[String, Any]].asScala
var view = configMap("View").asInstanceOf[java.util.LinkedHashMap[String, String]].asScala
view = view + ("from" -> "neu") // some test modification
and I dump it like this:
val fileWriter = new FileWriter(System.getProperty("user.home") + "\\Desktop\\test.yml")
new Yaml().dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)
which saves the new yaml file like this:
View: {driver: sun.jdbc.odbc.JdbcOdbcDriver, from: neu, to: 04.01.2007}
But I want it to save it like this:
View:
  driver: sun.jdbc.odbc.JdbcOdbcDriver
  from: neu
  to: 04.01.2007
How can I tell SnakeYAML to save it in the desired format you see above?
By default SnakeYAML uses DumperOptions.FlowStyle.FLOW, but it can be changed to DumperOptions.FlowStyle.BLOCK, which will dump the data in the desired format.
An example in Kotlin:
val options = DumperOptions()
options.indent = 2
options.defaultFlowStyle = DumperOptions.FlowStyle.BLOCK
Yaml(options).dump(yourObject)
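The same options in Scala, plugged into the dump call from the question (a sketch; only the syntax changes, the SnakeYAML API is identical):
import org.yaml.snakeyaml.{DumperOptions, Yaml}

val options = new DumperOptions()
options.setIndent(2)
options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK)

// Dump the same map as before, now in block style
new Yaml(options).dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)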
How about manually handling the indentation and key: value formatting:
view.map{ case (k,v) => s"\t$k: $v\n" }
In the case of nested maps you will want a method that accepts the current "level" of nesting, placing that many tabs in front of the output to give the proper nesting, and that checks each of the entries: if an entry is another collection type, the method needs to recursively invoke itself, which increases the indentation level, as sketched below.
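A rough sketch of that recursive approach (a hypothetical helper, independent of SnakeYAML):
// Hypothetical recursive dumper for nested Scala maps
def dumpMap(m: Map[String, Any], level: Int = 0): String = {
  val indent = "\t" * level
  m.map {
    case (k, nested: Map[String @unchecked, Any @unchecked]) =>
      s"$indent$k:\n" + dumpMap(nested, level + 1)
    case (k, v) =>
      s"$indent$k: $v\n"
  }.mkString
}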

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, I saw that it is very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take forever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this approach.
First, get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and list-objects request
  var s3Client = new AmazonS3Client();
  var objectListing: ObjectListing = null;
  var listObjectsRequest = new ListObjectsRequest();

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
  } while (objectListing.isTruncated());

  // Removing the base directory name
  files.remove(0)
  // Creating a Scala List from it
  files.asScala
}
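Usage would look something like this (bucket name and prefix taken from the question's path, as placeholders):
// Hypothetical bucket and prefix, just to show the call
val files = listFiles("mybucket", "path/txt/")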
Now pass this list to the following piece of code (note: sc is the SparkContext):
import org.apache.spark.rdd.RDD

// sc.textFile returns an RDD[String], so df is an RDD here
var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file. You can then map this so the key is the file date, and then groupByKey (see the sketch below)?
http://spark.apache.org/docs/latest/programming-guide.html
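A minimal sketch of that idea, assuming the day can be extracted from the path with a regex, and using the DataFrame writer's partitionBy instead of a manual groupByKey (the bucket, prefixes, and output location are placeholders):
// Sketch only: read (path, content) pairs and bundle them per day as compressed JSON lines
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical pattern matching s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz
val dayPattern = ".*/txt/(\\d{4})/(\\d{2})/(\\d{2})/.*".r

val bundled = spark.sparkContext
  .wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
  .map { case (path, content) =>
    val day = path match {
      case dayPattern(y, m, d) => s"$y-$m-$d"
      case _                   => "unknown"
    }
    (day, path, content)
  }
  .toDF("day", "id", "content")

// Pull each day's rows into one partition so each daily bundle is written as (roughly) a single file
bundled
  .repartition($"day")
  .write
  .partitionBy("day")
  .option("compression", "gzip")
  .json("s3n://mybucket/path/bundles/")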
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend maintaining a file that gets appended with the file path every time a file is uploaded to S3.
Same for the output: put it into HDFS and then move it to S3 (although a direct copy might be equally efficient).