What input does my user-defined function in a Spark DataFrame take in? - scala

I'm trying to combine the two columns "Format Group" and "Format SubGroup" into a single column called Format.
The output in the final Format column should be in the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
These are the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Here's how it looks in IntelliJ IDEA.
Updated the code: changed Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) => {
      dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
    }
  }
  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, numberCheckoutRecordsPerFormat should return a DataFrame of Format and the number of checkouts for a given item - but I've got this part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle.
Thanks, people!

Doomdaam, you can try to use the concat_ws built-in function (always prefer built-in functions when possible). Your code will look like:
checkoutDF
  .na.drop()
  .join(dataDictionaryDF
    .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .groupBy("ItemBarCode")
  .agg(count("ItemBarCode"))
  .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
  .select($"Format", $"CheckoutCount")
Otherwise your UDF would have been:
val createFeatureVector = udf { (formatGroup: String, formatSubgroup: String) => Seq(formatGroup, formatSubgroup).mkString(":") }
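For completeness, a hedged sketch of how that UDF would be applied: a UDF only ever receives the column values of the current row (two strings here), which is also why a whole DataFrame such as dataDictionaryDF cannot be used inside its body.

import org.apache.spark.sql.functions.udf

val createFeatureVector =
  udf { (formatGroup: String, formatSubgroup: String) => Seq(formatGroup, formatSubgroup).mkString(":") }

checkoutDF
  .na.drop()
  .join(dataDictionaryDF
    .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
  .withColumn("Format", createFeatureVector($"Format_Group", $"Format_Subgroup")) // the UDF sees plain column values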

Related

Add element to the xml string in scala

I have the following relatively simple scenario, but it isn't working.
I need to append to my XML string; here's the scenario:
val xmlStr = "<return> <numberPin> 123456 </numberPin> </return>"
I need some way to add the date element and return the string below. I would like a solution with a regular expression if possible:
"<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"
You can create a template XML up front that can be updated at runtime.
You can do something like the below:
def updateXml(xmlStr: String, dateContent: String) = {
  xmlStr.replace("DATE_DATA", dateContent)
}
val xmlStr = "<return> <numberPin> 123456 </numberPin> DATE_DATA </return>"
val dateData = "<date> 2019-09-04 00:00:00 </date>"
updateXml(xmlStr, dateData)
Another alternative is to create an XML template in a file (if the XML content is large). Read it in your code and insert the required data at run time, as shown in the above example (where I stuffed DATE_DATA into the template and replaced it at runtime using the method).
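Since the asker mentioned regular expressions, here is a minimal sketch that works on the original string with no placeholder, by inserting the new element just before the closing </return> tag; it assumes the string ends with a single </return>:

val originalXml = "<return> <numberPin> 123456 </numberPin> </return>"
val dateData = "<date> 2019-09-04 00:00:00 </date>"

// replaceFirst treats its first argument as a regex; "</return>" contains no special regex characters
val updated = originalXml.replaceFirst("</return>", dateData + " </return>")
// "<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"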

Optimize a piece of code that uses map action

The following piece of code takes a lot of time on 4Gb of raw data in a cluster:
df.select("type", "user_pk", "item_pk", "timestamp")
  .withColumn("date", to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select("type", "user_pk", "item_pk")
  .map { row =>
    val typef = row.get(0).toString
    val user = row.get(1).toString
    val item = row.get(2).toString
    (typef, user, item)
  }
The output should be of type Dataset[(String,String,String)].
I guess that map part takes a lot of time. Is there any way to optimize this piece of code?
I seriously doubt the map is the problem; nonetheless, I wouldn't use it at all and would go with the standard Dataset conversion:
import df.sparkSession.implicits._

df.select("type", "user_pk", "item_pk", "timestamp")
  .withColumn("date", to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String, String, String)]
You're creating the date column with Date type and then comparing it with a string?
I'd assume some implicit conversion is happening underneath (for each row while filtering).
Instead, I'd convert that string date to a timestamp and do a numeric comparison (since you're using from_unixtime, I assume the timestamp is stored as System.currentTimeMillis or similar):
val timestamp = some_to_timestamp_func("2018-04-14")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > timestamp)
  // ... etc
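A hedged, more concrete version of that idea, assuming the timestamp column holds epoch seconds (which is what from_unixtime expects) and that the built-in unix_timestamp function is acceptable as the conversion:

import org.apache.spark.sql.functions.{unix_timestamp, lit}

// Column holding the epoch seconds for 2018-04-14 00:00:00
val cutoff = unix_timestamp(lit("2018-04-14"), "yyyy-MM-dd")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > cutoff)  // numeric comparison, no per-row date conversion
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String, String, String)]   // as in the Dataset conversion shown above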

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below.
The Spark application reads a text file from an input HDFS directory.
Creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact replica of the input file kept in memory and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered DataFrame from the first DataFrame constructed in step 2. This DataFrame contains the unique account numbers, obtained with the distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two DataFrames constructed in steps 2 and 3, we get all the records that belong to one account number and apply some JSON parsing logic on top of the filtered data.
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally, the JSON-parsed data is put into an HBase table.
Here we are facing performance issues when calling the collect method on these DataFrames. collect fetches all the data onto a single node and then does the processing, so we lose the benefit of parallel processing.
Also, in the real scenario there will be around 10 billion records of data, and collecting all of them onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method, which fetches a limited number of records at a time, can be used in our case: we have to get all the unique account numbers from the whole data set, so I am not sure it suits our requirements.
I would appreciate any help to avoid calling collect and any other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "accountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))

val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))

var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()

distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  // Json parsing logic
  {
    .....
    .....
  }
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism in Spark);
the function opens a connection to HBase (thus making it one per partition) and sends all the contained values through this connection.
This means that every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus never needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath)          // Using DataFrameReader, but your way works too
  .foreachPartition { rows =>
    val conn = ???                 // Create HBase connection
    for (row <- rows) {            // Loop over the iterator
      val data = parseJson(row)    // Your parsing logic
      ???                          // Use 'conn' to save 'data'
    }
  }
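For illustration only, a hedged sketch of what those placeholders could look like with the standard HBase client API; the table name, column family/qualifier, and row-key choice are made-up placeholders, not anything taken from the question:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

spark.read.csv(inputPath)
  .foreachPartition { rows =>
    // One connection per partition, created on the executor and never serialized
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("my_table"))       // placeholder table name
    try {
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))         // placeholder row key: first column
        // Stand-in payload; in practice this would be the parsed JSON
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(row.mkString(",")))
        table.put(put)
      }
    } finally {
      table.close()
      conn.close()
    }
  }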
You can avoid collect in your code if you have a large data set.
collect returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()

distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  // Json parsing logic
  {
    .....
    .....
  }
}
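If the JSON logic only needs the rows of one account at a time, a hedged alternative sketch (assuming Spark 2.x and that accountnumber is one of the string columns) is to group directly, which keeps everything distributed and avoids referencing eqpDF inside executor-side code:

import eqpDF.sparkSession.implicits._

val jsonPerAccount = eqpDF
  .groupByKey(row => row.getAs[String]("accountnumber"))
  .mapGroups { (accountNumber, rows) =>
    val result = new JSONObject()
    result.put("jsonSchemaVersion", "1.0")
    // ... JSON parsing logic over `rows` for this account ...
    (accountNumber, result.toString)
  }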

how to update a value in a dataframe and drop a row on the basis of a given value in scala

I need to update the value and if the value is zero then drop that row. Here is the snapshot.
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  return pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use all the available transformations. So when you want to change, for example, the contents of a column in a DataFrame, just add a new column with the updated contents using functions like this:
df.withColumn(...)
DataFrames are immutable; you cannot update a value, but rather create a new DF every time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I'm not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
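A minimal sketch of that approach, assuming the value in question is a numeric column named "value" and that the xxx/yyy conditions can be expressed as column expressions (the column name and conditions below are placeholders, not the asker's real logic):

import org.apache.spark.sql.functions.when
// assumes spark.implicits._ is in scope for the $ syntax

val updatedDf2 = df2
  .withColumn("value",
    when($"value" > 100, $"value" - 1)      // stands in for the "xxx" branch
      .when($"value" < 0, $"value" + 1)     // stands in for the "yyy" branch
      .otherwise($"value"))
  .filter($"value" =!= 0)                   // drop the rows whose updated value is zero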

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, for example, I saw that it was very easy to read the files:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this approach. First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and ListObjects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Converting to a Scala list
  files.asScala
}
Now pass this list to the following piece of code (note: sc is the SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file. You can then map this so that the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
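A hedged sketch of that idea, assuming the day can be recovered from the .../YYYY/MM/DD/... part of each path (the regex below is only illustrative):

val dayPattern = ".*/(\\d{4})/(\\d{2})/(\\d{2})/[^/]+$".r

val bundlesByDay = sc.wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
  .map { case (path, content) =>
    val dayPattern(year, month, day) = path      // extract the day from the file path
    (s"$year-$month-$day", (path, content))
  }
  .groupByKey()                                   // one group per day: all (filename, content) pairs

Each group can then be written out as one bundle per day, for example as the '\n'-separated JSON lines described in the question.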
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, the creation of this file list will be a bottleneck. I'd recommend creating a file that gets appended with the file path every time a file is uploaded to S3.
Do the same for the output: put it into HDFS and then move it to S3 (although a direct copy might be equally efficient).