Scala: oneSet.diff(otherSet) not working

I have a function that finds new columns to add to a Cassandra table:
val inputSet: Set[String] = inputColumns.map {
  cht => cht.stringLabel.toLowerCase()
}.distinct.toSet
logger.debug("\n\ninputSet\n" + inputSet.mkString(", "))

val extantSet: Set[String] = extantColumns.map {
  e => e._1.toLowerCase()
}.toSet
logger.debug("\n\nextantSet\n" + inputSet.mkString(" * "))

inputSet.diff(extantSet)
I want the values that are ONLY in the input set; I will then create those columns in the Cassandra table.
The returned set (i.e., inputSet.diff(extantSet)), however, includes columns that are in both sets.
From my log files:
inputSet
incident, funnel, v_re-evaluate, adj_in-person, accident, v_create, ...
extantSet
incident * funnel * v_re-evaluate * adj_in-person * accident * v_create ...
returned set:
funnel | v_re-evaluate | adj_in-person | v_explain | v_devise | dmepos | ...
Which in the end throws
com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name adj_in-person because it conflicts with an existing column
What have I done wrong?
Any help would be deeply appreciated.

This is what I have tried, which gives me the output below.
object ABC extends App {
  val x = List("A", "B", "c", "d", "e", "a", "b").map(_.toLowerCase)
  val y = List("a", "b", "C").map(_.toLowerCase)

  println(s"${x diff y} List diff")
  println(s"${x.toSet diff y.toSet} Set diff")
}
Output:
List(d, e, a, b) List diff
Set(e, d) Set diff
I think you are looking for the set difference. As you can see, when we take the diff of the two Lists we still get duplicates in the answer (the extra a and b), but after calling .toSet on both sides those duplicates are gone, so the Set diff should work for you too.
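For reference, here is a quick sketch of Set#diff semantics: an element that compares equal on both sides can never survive the diff, so if a value like adj_in-person still shows up in your result, the two strings must differ in some invisible way. The trim/normalise step below is only an assumption for illustration; the logs in the question don't confirm hidden whitespace.

// Set#diff keeps only elements of the left set with no equal element on the right.
val input  = Set("incident", "funnel", "adj_in-person")
val extant = Set("incident", "adj_in-person")

println(input.diff(extant))   // Set(funnel) -- shared values never appear

// Normalising both sides rules out hidden whitespace/case differences
// (purely illustrative, not taken from the question's data).
println(input.map(_.trim.toLowerCase).diff(extant.map(_.trim.toLowerCase)))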

Related

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (already done), i.e. take the fields from inside the struct and expose them as top-level columns of the dataset, but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter it by the name of the column, and extract the matching element from the resulting array. The problem is that the element I find has type StructField, which again offers no way to extract its inner fields, whereas what I would really like is to get hold of a StructType element, which has the .getFields method that does exactly what I want (it gives me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want). I know no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution: I just needed to select all the columns the way I was already doing, then compare the result back to the original dataframe to figure out which columns were new. Here is the final result. I also made it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, etc., but that is simple; anyone can tweak the parameters. The usage remains the same.
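For illustration, a minimal usage sketch (the toy DataFrame and the myColumn struct below are made up for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy dataframe with a struct column named "myColumn" (hypothetical data)
val df = Seq((1, "a", "b")).toDF("id", "x", "y")
  .select(col("id"), struct(col("x"), col("y")).as("myColumn"))

// With the implicit class above in scope:
df.explodeStruct("myColumn").printSchema()
// schema: id, myColumn_x, myColumn_y -- the prefixed columns sit where the struct used to be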
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)

Merging n CSV strings ignoring headers from every string except the first one

val csvDataWithHeader1 =
  s"""SubId,RouteId
     |332214238915,423432344323
     |332214238915,423432344323""".stripMargin

val csvDataWithHeader2 =
  s"""SubId,RouteId
     |332214238915,423432344323
     |332214238915,423432344323""".stripMargin

val csvHeaders = List(csvDataWithHeader1, csvDataWithHeader2)
I am reading 'n' CSV files of the same type as strings and trying to get rid of the extra headers when merging them.
Should I eliminate the headers before merging the strings, or after merging (by splitting and eliminating duplicates)? Is there a significant performance benefit of one approach over the other?
IMHO, from a performance standpoint it is clearly better to eliminate the header of each individual CSV string first and then merge them. To eliminate a header you can drop the first element of the list of lines, which is an O(1) operation.
Whereas, to remove duplicates from the merged list with list.distinct, you pay the additional overhead of building a HashSet internally to track the duplicates:
/** Builds a new $coll from this $coll without any duplicate elements.
 *  $willNotTerminateInf
 *
 *  @return A new $coll which contains the first occurrence of every element of this $coll.
 */
def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}
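Coming back to the merge itself, here is a minimal sketch of the drop-headers-first approach, assuming '\n' line separators and that the first string's header should be kept (mergeCsv is just an illustrative name):

// Keep the first CSV string as-is, strip line 1 (the header) from the rest,
// then concatenate everything back together.
def mergeCsv(csvs: List[String]): String = csvs match {
  case Nil           => ""
  case first :: rest =>
    (first :: rest.map(_.split("\n").drop(1).mkString("\n"))).mkString("\n")
}

val merged = mergeCsv(List(csvDataWithHeader1, csvDataWithHeader2))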

Linear Regression with Apache Beam

How might one go about fitting a large number of linear regressions in a Beam pipeline? I have a large CSV, and I want to normalize every column (about 500) against two columns A and B. That is, for each column X in the CSV I would like to get the standard residuals of X ~ A + B.
That's an interesting use case. You can do something like so:
INDEX_A = # Something
INDEX_B = # Something else

parsed_rows = (pipeline
               | beam.io.ReadFromText(my_csv)
               | beam.Map(parse_each_line))

def column_paired_rows(row):
    for idx, val in enumerate(row):
        if idx in (INDEX_A, INDEX_B): continue
        # Yield the values keyed with the independent + dependent variable indices
        yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
                                'independent_var_idx': INDEX_A,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})
        yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
                                'independent_var_idx': INDEX_B,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})

column_pairs = parsed_rows | beam.FlatMap(column_paired_rows) | beam.GroupByKey()
The column_pairs PCollection will group all your elements by independent, dependent variable pairs, and then you can run the analysis.
def perform_linear_regression(elm):
    key = elm[0]     # A tuple of (independent variable index, dependent variable index)
    values = elm[1]  # An iterable with the data points that you need
    pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
    model = linear_regression(pairs)
    return (key, model)

models = column_pairs | beam.Map(perform_linear_regression)
Let me know if you'd like me to add further detail.

get one random letter from each tuple then return them all as a string

I have 3 tuples in a list:
val l = List(("a","b"),("c","d"),("e","f"))
I want to choose one element from each tuple and return the resulting 3-letter word each time, for example: fca or afd or cbf ...
How can I do this? It is the same as:
echo {a,b}{c,d}{e,f}|xargs -n1|shuf -n1|sed 's/\B/\n/g'|shuf|paste -sd ''
Working with tuples can be a bit of a pain. You can't easily index them and tuples of different sizes are considered different types in the type system.
val ts = List(("a","b"),("c","d"),("e","f"))
val str = ts.map{t =>
t.productElement(util.Random.nextInt(t.productArity))
}.mkString("")
Every time I run this I get a different result: bde, acf, bdf, etc.
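For comparison, a sketch that sidesteps productElement by pattern matching on the pairs directly (this assumes every tuple is a pair of Strings, as in the question):

import scala.util.Random

val ts = List(("a", "b"), ("c", "d"), ("e", "f"))

// Pick the left or right element of each pair at random, then concatenate.
val word = ts.map { case (l, r) => if (Random.nextBoolean()) l else r }.mkString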

Map word ngrams to counts in scala

I'm trying to create a map which goes through all the ngrams in a document and counts how often they appear. Ngrams are sets of n consecutive words in a sentence (so in the last sentence, (Ngrams, are) is a 2-gram, (are, sets) is the next 2-gram, and so on). I already have code that creates a document from a file and parses it into sentences. I also have a function to count the ngrams in a sentence, ngramsInSentence, which returns Seq[Ngram].
I'm getting stuck syntactically on how to create my counts map. I am iterating through all the ngrams in the document in the for loop, but don't know how to map the ngrams to the count of how often they occur. I'm fairly new to Scala and the syntax is evading me, although I'm clear conceptually on what I need!
def getNGramCounts(document: Document, n: Int): Counts = {
  for (sentence <- document.sentences; ngram <- nGramsInSentence(sentence, n))
    // I need code here to map ngram -> count of how many times ngram appears in the document
}
The type Counts above, as well as Ngram, are defined as:
type Counts = Map[NGram, Double]
type NGram = Seq[String]
Does anyone know the syntax to map the ngrams from the for loop to a count of how often they occur? Please let me know if you'd like more details on the problem.
If I'm correctly interpreting your code, this is a fairly common task.
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] = for {
    sentence <- document.sentences
    ngram    <- nGramsInSentence(sentence, n)
  } yield ngram

  allNGrams.groupBy(identity).mapValues(_.size.toDouble)
}
The allNGrams variable collects a list of all the NGrams appearing in the document.
You should eventually turn to Streams if the document is big and you can't hold the whole sequence in memory.
The following groupBy creates a Map[NGram, List[NGram]]: it groups your values by their identity (the argument to the method defines the grouping key) and collects the corresponding values in a list.
You then only need to map each value (a List[NGram]) to its size to get how many occurrences there were of each NGram.
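As a tiny worked example of those two steps (with short Seq[String] values standing in for real NGrams):

val grams: Seq[Seq[String]] =
  Seq(Seq("the", "cat"), Seq("cat", "sat"), Seq("the", "cat"))

// Step 1: Map[NGram, Seq[NGram]] - each distinct n-gram mapped to its occurrences
val grouped = grams.groupBy(identity)

// Step 2: replace every occurrence list with its size
val counts = grouped.mapValues(_.size.toDouble).toMap
// Map(List(the, cat) -> 2.0, List(cat, sat) -> 1.0)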
I took for granted that:
NGram has the expected correct implementation of equals + hashcode
document.sentences returns a Seq[...]. If not you should expect allNGrams to be of the corresponding collection type.
UPDATED based on the comments:
I wrongly assumed that groupBy(_) would pass the input value through unchanged; use the identity function explicitly instead.
I converted the count to a Double.
Appreciate the help - I have the correct code now using the suggestions above. The following returns the desired result:
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] =
    (for (sentence <- document.sentences;
          ngram <- ngramsInSentence(sentence, n))
     yield ngram)

  allNGrams.groupBy(l => l).map(t => (t._1, t._2.length.toDouble))
}