I am currently importing the dataset from an Excel sheet which has a column name containing a dot character, like "abc.xyz".
I went through a couple of Stack Overflow questions, and they say that such column names can be escaped by enclosing them in backticks, like this: `abc.xyz`. So I renamed every column name containing a dot to the same name enclosed in backticks:
// withColumnRenamed returns a new DataFrame, so the result has to be captured (df is a var here)
df = df.columns.foldLeft(df) { (renamed, item) =>
  if (item.contains(".")) renamed.withColumnRenamed(item, s"`$item`")
  else renamed
}
Now when I pass this DataFrame to the ConstraintSuggestionRunner like this:
val suggestionResult = ConstraintSuggestionRunner()
.onData(df)
.addConstraintRules(Rules.DEFAULT)
.setKLLParameters(KLLParameters(sketchSize = 2048, shrinkingFactor = 0.64, numberOfBuckets = 10))
.run()
I am getting errors like:
ERROR Main: org.apache.spark.sql.AnalysisException: cannot resolve
'`abc.xyz`' given input columns:
How can I resolve this error?
The escaping should be handled inside Deequ, but that issue is still open. What you did here is add the backticks as part of the column names, not escape them.
You can try replacing the dots with another character such as an underscore _, then pass the DataFrame with the renamed columns to the ConstraintSuggestionRunner:
val df1 = df.toDF(df.columns.map(_.replaceAll("[.]+", "_")):_*)
val suggestionResult = ConstraintSuggestionRunner()
.onData(df1)
.addConstraintRules(Rules.DEFAULT)
.setKLLParameters(KLLParameters(sketchSize = 2048, shrinkingFactor = 0.64, numberOfBuckets = 10))
.run()
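For illustration, here is a minimal end-to-end sketch of this workaround on a toy DataFrame (the dotted column name comes from the question; spark is assumed to be your SparkSession, and the suggestion-printing loop follows the pattern in the Deequ README):
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
// Toy DataFrame with a dotted column name, mirroring the question
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "abc.xyz")
// Replace the dots with underscores before handing the frame to Deequ
val df1 = df.toDF(df.columns.map(_.replaceAll("[.]+", "_")): _*)
val suggestionResult = ConstraintSuggestionRunner()
  .onData(df1)
  .addConstraintRules(Rules.DEFAULT)
  .run()
// Print the suggested constraints per (renamed) column
suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach(s => println(s"$column: ${s.description}"))
}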
Currently I have a configuration file like this:
project {
inputs {
baseFile {
paths = ["project/src/test/resources/inputs/parquet1/date=2020-11-01/"]
type = parquet
applyConversions = false
}
}
}
And I want to change the date "2020-11-01" to another one at run time. I read that I need to build a new config object since configs are immutable. I'm trying the following, but I'm not sure how to edit paths, since it's a list rather than a String, and it definitely needs to stay a list, or else it will complain that no path is configured for the parquet input.
val newConfig = config.withValue("project.inputs.baseFile.paths"(0),
ConfigValueFactory.fromAnyRef("project/src/test/resources/inputs/parquet1/date=2020-10-01/"))
But I'm getting a:
Error com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path 'project.inputs.baseFile.': path has a leading, trailing, or two adjacent period '.' (use quoted "" empty string if you want an empty element)
What's the correct way to set the new path?
One option you have is to override the entire array:
import scala.collection.JavaConverters._
val mergedConfig = config.withValue("project.inputs.baseFile.paths",
ConfigValueFactory.fromAnyRef(Seq("project/src/test/resources/inputs/parquet1/date=2020-10-01/").asJava))
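As a small variation (worth double-checking against your Typesafe Config version), ConfigValueFactory.fromIterable does the same conversion a bit more directly:
import scala.collection.JavaConverters._
import com.typesafe.config.ConfigValueFactory
val mergedConfig = config.withValue(
  "project.inputs.baseFile.paths",
  ConfigValueFactory.fromIterable(
    Seq("project/src/test/resources/inputs/parquet1/date=2020-10-01/").asJava))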
But a more elegant way to do this (IMHO) is to create a new config and use the existing one as a fallback.
For example, we can create a new config:
val newJsonString =
  """project {
    |  inputs {
    |    baseFile {
    |      paths = ["project/src/test/resources/inputs/parquet1/date=2020-10-01/"]
    |    }
    |  }
    |}""".stripMargin
val newConfig = ConfigFactory.parseString(newJsonString)
And now to merge them:
val mergedConfig = newConfig.withFallback(config)
The output of:
println(mergedConfig.getList("project.inputs.baseFile.paths"))
println(mergedConfig.getString("project.inputs.baseFile.type"))
is:
SimpleConfigList(["project/src/test/resources/inputs/parquet1/date=2020-10-01/"])
parquet
As expected.
You can read more about Merging config trees. Code run at Scastie.
I didn't find any way to replace one element of the array with withValue.
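One workaround sketch for that: read the list out, patch the element on the Scala side, and write the whole list back (the path and dates are taken from the question):
import scala.collection.JavaConverters._
import com.typesafe.config.ConfigValueFactory
// Read the current list and replace the date in its first element...
val paths = config.getStringList("project.inputs.baseFile.paths").asScala
val patched = paths.updated(0, paths.head.replace("2020-11-01", "2020-10-01"))
// ...then override the whole array, as above
val patchedConfig = config.withValue(
  "project.inputs.baseFile.paths",
  ConfigValueFactory.fromIterable(patched.asJava))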
I am trying to combine the two columns "Format Group" and "Format Subgroup" into a single column called Format.
The output in the final Format column should have the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
This is the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Update: I changed the UDF input type from Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) =>
    dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
  }
  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, numberCheckoutRecordsPerFormat should return a DataFrame of Format and the number of checkouts for a given item, but I've got that part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle.
Thanks, people!
Doomdaam, you can try to use the concat_ws built-in function (always prefer built-in functions when possible). Your code will look like:
checkoutDF
  .na.drop()
  .join(dataDictionaryDF
    .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .groupBy("Format") // group by Format (not ItemBarCode) so the final select can resolve it
  .agg(count("ItemBarCode"))
  .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
  .select($"Format", $"CheckoutCount")
Otherwise your UDF would have been:
val createFeatureVector = udf { (formatGroup: String, formatSubgroup: String) => Seq(formatGroup, formatSubgroup).mkString(":") }
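A quick sanity check of concat_ws on the dictionary row from the question (a sketch; spark is assumed to be your SparkSession, and the column names are the underscored variants used in the join above):
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._
val dict = Seq(("acdvd", "DVD: Adult/YA", "ItemType", "Media", "Video Disc"))
  .toDF("Code", "Description", "Code_Type", "Format_Group", "Format_Subgroup")
dict.withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .select("Format")
  .show(false)
// +----------------+
// |Format          |
// +----------------+
// |Media:Video Disc|
// +----------------+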
I wrote this:
df.select(col("colname")).distinct().collect.map(_.toString()).toList
the result is
List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]")
Whereas I want to get :
List("2019-06-24", "2019-06-22", "2019-06-23")
How can I change this, please?
You need to change .map(_.toString()) to .map(_.getAs[String]("colname")). With .map(_.toString()) you are calling org.apache.spark.sql.Row.toString, which is why the output looks like List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]"). The correct way is:
val list = df.select("colname").distinct().collect().map(_.getAs[String]("colname")).toList
Output will be:
List("2019-06-24", "2019-06-22", "2019-06-23")
Sample data:
val df=sc.parallelize(Seq(("2019-06-24"),( "2019-06-22"),("2019-06-23"))).toDF("cn")
Now select the column, then map each row to its first index value, then add quotes and convert to string:
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString)
//res36: Array[String] = Array("2019-06-24", "2019-06-22", "2019-06-23")
(or)
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString).toList
//res37: List[String] = List("2019-06-24", "2019-06-22", "2019-06-23")
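As a side note, the typed Dataset API gives the same result without touching Row at all (a sketch, assuming spark.implicits._ is in scope; element order after distinct is not guaranteed):
import spark.implicits._
val list = df.select($"cn".as[String]).distinct().collect().toList
// List[String] = List(2019-06-24, 2019-06-22, 2019-06-23)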
I want to use the value of the myActionID variable. How do I do that?
If I pass a static value like "actionId": 1368201 to myActionID then it works, but if I use "actionId": ${myActionID} it gives an error.
Here's the relevant code:
class LaunchWorkflow_Act extends Simulation {
  val scenarioRepeatCount = 1
  val userCount = 1
  val myActionID = "13682002351"

  val scn = scenario("LaunchMyFile")
    .repeat(scenarioRepeatCount) {
      exec(session => session.set("counter", globalVar.getAndIncrement + " " + timeStamp.toString()))
        .exec(http("LaunchRequest")
          .post("""/api/test""")
          .headers(headers_0)
          .body(StringBody(
            """{ "actionId": ${myActionID} ,
            "jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
            }""")))
        .pause(pause)
    }

  // setUp must live inside the Simulation class
  setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
Everything works fine if I put the value 13682002351 instead of myActionID. While executing this script in Gatling I am getting this error:
ERROR i.g.http.action.HttpRequestAction - 'httpRequest-3' failed to
execute: No attribute named 'myActionID' is defined
Scala has various mechanisms for String Interpolation (see docs), which can be used to embed variables in strings. All of them can be used in conjunction with the triple quotes """ used to create multi-line strings.
In this case, you can use:
val counter = 12
val myActionID = "13682002351"
val str = s"""{
"actionId": $myActionID ,
"jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
}"""
Notice the s prepended to the string literal, and the dollar sign prepended to the variable names.
Using an s-interpolated string we can do this easily:
s"""Hello Word , Welcome Back!
How are you doing ${userName}"""
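One caveat when applying this to the Gatling script above: ${counter} there is a Gatling session attribute, not a Scala val, so it must survive the interpolation. Inside an s-string, $$ emits a literal $, which leaves the placeholder intact for Gatling to resolve at runtime (a sketch with simplified field names):
// "actionId" is filled in at compile time by Scala's s-interpolator, while
// "$${counter}" emits the literal ${counter} for Gatling to resolve per session.
.body(StringBody(
  s"""{ "actionId": $myActionID,
     |  "name": "$${counter}" }""".stripMargin))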
I am new to Scala and I need to read the contents of a text file into a string while removing certain lines at the same time. The lines to be removed can be identified by a substring match. I came up with the following solution, which almost works; the only problem is that the newlines are removed:
val fileAsFilteredString = io.Source.fromFile("file.txt").getLines.filter(s => !(s contains "filter these")).mkString
How can I keep the newlines?
Add a separator argument to mkString:
val fileAsFilteredString = io.Source.fromFile("file.txt").getLines
.filter(s => !(s contains "filter these")).mkString("\n")
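Note that mkString("\n") only inserts separators between lines; if a trailing newline matters, the three-argument overload supplies it. On Scala 2.13+ you can also wrap the source in scala.util.Using so the file handle is closed (a sketch):
import scala.util.Using
// mkString(start, sep, end): no prefix, "\n" between lines, trailing "\n"
val fileAsFilteredString = Using.resource(io.Source.fromFile("file.txt")) { src =>
  src.getLines().filter(s => !(s contains "filter these")).mkString("", "\n", "\n")
}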