flatMapValues (whole list, an element of the list) - pyspark

I have an RDD whose keys are integers. For each key I have a list of strings. Example: [(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'])]
What I want is to get a new RDD like this:
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'],'transworld')]
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'],'systems')]
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'],'inc')] etc
I think that I need flatMapValues but can't find the way to use it. Can anybody help?

Perhaps this is useful - I'm not sure about the use case. Written in Scala:
val rdd = spark.sparkContext.parallelize(Seq(
  (0, Seq("transworld", "systems", "inc", "trying", "collect", "debt", "mine", "owed", "inaccurate"))
))
rdd.flatMap { case (i, seq) =>
  Seq.fill(seq.length)((i, seq)).zip(seq).map(x => (x._1._1, x._1._2, x._2))
}.foreach(println)
/**
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),transworld)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),systems)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),inc)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),trying)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),collect)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),debt)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),mine)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),owed)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),inaccurate)
*/
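The same triples can also be produced with a single, more direct flatMap - a minimal sketch, equivalent to the code above. The same flatMap-based approach carries over to PySpark; flatMapValues alone only yields (key, word) pairs, so it cannot emit the flat (key, list, word) triples directly.

// Sketch: map every word of the list to a triple carrying the key and the whole list.
rdd.flatMap { case (key, words) =>
  words.map(word => (key, words, word))
}.foreach(println)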

Locating Scala Array deep documentation?

An SO answer by Jerry includes this use of deep:
println(k.deep)
Works as described:
scala> println(Array(10, 20, 30, 40).deep)
Array(10, 20, 30, 40)
I am looking for documentation on deep for an Array. I go to the Scala Standard Library 2.13.0 Array page, do a search of the page for deep, and get no matches.
Is my search sequence incorrect?
It seems deep has been removed in Scala 2.13, according to https://github.com/scala/bug/issues/10985:
It's a hacky ugly testing utility to print values in (nested) arrays.
If you feel strongly about it, we can add it deprecated.
You can still find it in 2.12 docs and in 2.12 branch:
/** Creates a possible nested `IndexedSeq` which consists of all the elements
* of this array. If the elements are arrays themselves, the `deep` transformation
* is applied recursively to them. The `stringPrefix` of the `IndexedSeq` is
* "Array", hence the `IndexedSeq` prints like an array with all its
* elements shown, and the same recursively for any subarrays.
*
* Example:
* {{{
* Array(Array(1, 2), Array(3, 4)).deep.toString
* }}}
* prints: `Array(Array(1, 2), Array(3, 4))`
*
* @return An possibly nested indexed sequence of consisting of all the elements of the array.
*/
def deep: scala.collection.IndexedSeq[Any]
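Since deep is gone in 2.13, here is a minimal sketch of a replacement (my own helper, not part of the standard library) that reproduces the same kind of nested-array output:

// Recursively renders (nested) arrays the way Array.deep used to.
def deepString(a: Any): String = a match {
  case arr: Array[_] => arr.map(deepString).mkString("Array(", ", ", ")")
  case other         => String.valueOf(other)
}

println(deepString(Array(Array(1, 2), Array(3, 4))))  // Array(Array(1, 2), Array(3, 4))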

How to join vectors with prometheus?

It's probably something obvious, but I can't seem to find a solution for joining two vectors in Prometheus.
sum(
  rabbitmq_queue_messages{queue=~".*"}
) by (queue)
*
on (queue) group_left max(
  label_replace(
    kube_deployment_labels{label_daemon_name!=""},
    "queue",
    "$1",
    "label_daemon_queue_name",
    "(.*)"
  )
) by (deployment, queue)
Below a picture of the output of the two separate vectors.
group_left expects the "many" side on the left of the operator, so you've got the factors of the * the wrong way around: put the max(label_replace(...)) ... by (deployment, queue) expression on the left and the sum(rabbitmq_queue_messages) by (queue) expression on the right. Try it the other way around.

PostgreSQL - How is the cost of Sort Node in the Query Plan calculated?

I have the following query plan in PostgreSQL:
Unique  (cost=487467.14..556160.88 rows=361546 width=1093)
  ->  Sort  (cost=487467.14..488371.00 rows=361546 width=1093)
        Sort Key: (..)
        ->  Append  (cost=0.42..108072.53 rows=361546 width=1093)
              ->  Index Scan using (..)  (cost=0.42..27448.06 rows=41395 width=1093)
                    Index Cond: (..)
                    Filter: (..)
              ->  Seq Scan on (..)  (cost=0.00..77009.02 rows=320151 width=1093)
                    Filter: (..)
I just wonder how exactly the two cost values of the Sort node are calculated. I understand how it works for the scans and the Append, but I can't find anything about the Sort cost calculation.
Something like the formula for the Seq Scan, which is:
(disk pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost)
The query for the plan was basically something like this (not exactly, because it contained a view, but you get the idea):
SELECT * FROM (
  SELECT *, true AS storniert
  FROM auftragsposition
  WHERE mengestorniert > 0::numeric AND auftragbestaetigt = true
  UNION
  SELECT *, false AS storniert
  FROM auftragsposition
  WHERE mengestorniert < menge AND auftragbestaetigt = true
) AS bla
It is implemented (and documented - source code is often the only documentation) in src/backend/optimizer/path/costsize.c, in the function cost_sort(). The basic cost is roughly N*log(N) compare operations for an in-memory sort; a disk-based sort may be slower, and its extra costs are estimated too.
This N*log(N) is expected (https://en.wikipedia.org/wiki/Sorting_algorithm#Efficient_sorts: "general sorting algorithms are almost always based on an algorithm with average time complexity ... O(n log n)"):
https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/optimizer/path/costsize.c#L1409
/*
* cost_sort
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*
* If the total volume exceeds sort_mem, we switch to a tape-style merge
* algorithm. There will still be about t*log2(t) tuple comparisons in
* total, but we will also need to write and read each tuple once per
* merge pass. We expect about ceil(logM(r)) merge passes where r is the
* number of initial runs formed and M is the merge order used by tuplesort.c.
* Since the average initial run should be about sort_mem, we have
* disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
* cpu = comparison_cost * t * log2(t)
*
* If the sort is bounded (i.e., only the first k result tuples are needed)
* and k tuples can fit into sort_mem, we use a heap method that keeps only
* k tuples in the heap; this will require about t*log2(k) tuple comparisons.
*
* The disk traffic is assumed to be 3/4ths sequential and 1/4th random
* accesses (XXX can't we refine that guess?)
*
* By default, we charge two operator evals per tuple comparison, which should
* be in the right ballpark in most cases. The caller can tweak this by
* specifying nonzero comparison_cost; typically that's used for any extra
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
* 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
* 'sort_mem' is the number of kilobytes of work memory allowed for the sort
* 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
*
* NOTE: some callers currently pass NIL for pathkeys because they
* can't conveniently supply the sort keys. Since this routine doesn't
* currently do anything with pathkeys anyway, that doesn't matter...
* but if it ever does, it should react gracefully to lack of key data.
* (Actually, the thing we'd most likely be interested in is just the number
* of sort keys, which all callers *could* supply.)
*/
Parts of the actual calculation follow - the disk-based sort, the bounded heap sort, and plain quicksort. There seem to be no estimates for parallel sort yet (https://wiki.postgresql.org/wiki/Parallel_Internal_Sort, https://wiki.postgresql.org/wiki/Parallel_External_Sort).
...
    path->rows = tuples;

    /*
     * We want to be sure the cost of a sort is never estimated as zero, even
     * if passed-in tuple count is zero. Besides, mustn't do log(0)...
     */
    if (tuples < 2.0)
        tuples = 2.0;

    /* Include the default cost-per-comparison */
    comparison_cost += 2.0 * cpu_operator_cost;
..
    if (output_bytes > sort_mem_bytes)
    {
...
        /*
         * We'll have to use a disk-based sort of all the tuples
         */

        /*
         * CPU costs
         *
         * Assume about N log2 N comparisons
         */
        startup_cost += comparison_cost * tuples * LOG2(tuples);

        /* Disk costs */

        /* Compute logM(r) as log(r) / log(M) */
        if (nruns > mergeorder)
            log_runs = ceil(log(nruns) / log(mergeorder));
        else
            log_runs = 1.0;
        npageaccesses = 2.0 * npages * log_runs;
        /* Assume 3/4ths of accesses are sequential, 1/4th are not */
        startup_cost += npageaccesses *
            (seq_page_cost * 0.75 + random_page_cost * 0.25);
    }
    else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
    {
        /*
         * We'll use a bounded heap-sort keeping just K tuples in memory, for
         * a total number of tuple comparisons of N log2 K; but the constant
         * factor is a bit higher than for quicksort. Tweak it so that the
         * cost curve is continuous at the crossover point.
         */
        startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
    }
    else
    {
        /* We'll use plain quicksort on all the input tuples */
        startup_cost += comparison_cost * tuples * LOG2(tuples);
    }

    /*
     * Also charge a small amount (arbitrarily set equal to operator cost) per
     * extracted tuple. We don't charge cpu_tuple_cost because a Sort node
     * doesn't do qual-checking or projection, so it has less overhead than
     * most plan nodes. Note it's correct to use tuples not output_tuples
     * here --- the upper LIMIT will pro-rate the run cost so we'd be double
     * counting the LIMIT otherwise.
     */
    run_cost += cpu_operator_cost * tuples;
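For the plan above you can at least sanity-check part of the numbers against the quoted code. A rough worksheet-style sketch in Scala, assuming the default planner constants (cpu_operator_cost = 0.0025) and taking the row estimate and the Append cost straight from the plan:

// Sketch only: the constants are assumed defaults, not read from the server.
val cpuOperatorCost = 0.0025        // default cpu_operator_cost
val tuples          = 361546.0      // rows estimate of the Sort node
val inputCost       = 108072.53     // total cost of the Append child

def log2(x: Double): Double = math.log(x) / math.log(2)

val comparisonCost = 2.0 * cpuOperatorCost                  // default cost per comparison
val cpuSortCost    = comparisonCost * tuples * log2(tuples) // N * log2(N) comparison cost
val runCost        = cpuOperatorCost * tuples               // charged per extracted tuple

println(f"run cost          ~ $runCost%.2f")                         // roughly 903.9
println(f"cpu sort cost     ~ $cpuSortCost%.2f")                     // roughly 33000
println(f"in-memory startup ~ ${inputCost + cpuSortCost}%.2f")

The run cost (the difference between the Sort node's total and startup cost, 488371.00 - 487467.14 = 903.86) matches cpu_operator_cost * tuples almost exactly. The startup cost of 487467.14 is considerably higher than the input cost plus the N*log2(N) comparison cost, which suggests a disk-based sort whose page-access term depends on work_mem, seq_page_cost and random_page_cost and cannot be reproduced from the plan alone.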

How to get number of days between two XMLGregorianCalendar Scala

In Scala, is it possible to get the number of days between two XMLGregorianCalendar instances? I cannot find any method in this class that gives the range between two dates. If not, how are you doing it?
Ah, I guess not. I just went with this instead and it works fine:
val range:Long = (toDate.toGregorianCalendar.getTimeInMillis - fromDate.toGregorianCalendar.getTimeInMillis) / (1000 * 60 * 60 * 24)
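An alternative sketch using java.time, which avoids the manual millisecond arithmetic and handles time zones for you; fromDate and toDate are assumed to be the same XMLGregorianCalendar values as above:

import java.time.temporal.ChronoUnit

// Convert each XMLGregorianCalendar to a ZonedDateTime and let java.time count the days.
val range: Long = ChronoUnit.DAYS.between(
  fromDate.toGregorianCalendar.toZonedDateTime,
  toDate.toGregorianCalendar.toZonedDateTime
)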

Human readable size units (file sizes) for scala code (like Duration)

Are there any libraries that provide an object/class with implicit conversions (from Int, Long, Float) for human-readable file size units (like Duration)?
With Duration you can do this:
11.millis
1.5.minutes
10.hours
I wonder if there is some library that would allow me to do:
1.gigabyte
1024.megabytes
10.gibibytes
10.GB
50.GiB
I know I could implement this myself, but I'm trying to not reinvent the wheel.
Squants is a good solution, especially if you need more than just the human-readable byte-size conversion from the lib, but another possibility is to use this simple solution ported from an old SO Java answer. You may not need ZB and YB today, but maybe in the future ;)
/**
 * @see https://stackoverflow.com/questions/3263892/format-file-size-as-mb-gb-etc
 * @see https://en.wikipedia.org/wiki/Zettabyte
 * @param fileSize Up to Exabytes
 * @return
 */
def humanReadableByteSize(fileSize: Long): String = {
  if (fileSize <= 0) return "0 B"
  // kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
  val units: Array[String] = Array("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
  val digitGroup: Int = (Math.log10(fileSize) / Math.log10(1024)).toInt
  f"${fileSize / Math.pow(1024, digitGroup)}%3.3f ${units(digitGroup)}"
}
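For example (values chosen so the division comes out exact):

humanReadableByteSize(1536L * 1024)               // "1.500 MB"
humanReadableByteSize(10L * 1024 * 1024 * 1024)   // "10.000 GB"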
I've just stumbled upon squants. As stated on their own site:
Squants is a framework of data types and a domain specific language
(DSL) for representing Quantities, their Units of Measure, and their
Dimensional relationships. The API supports typesafe dimensional
analysis, improved domain models and more. All types are immutable and
thread-safe.
With squants you can do:
10.kib
10.kibibytes
50.mib
100.gib
Although I didn't like that the unit symbols are all lowercase (i.e. gib instead of GiB).
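If the lowercase symbols bother you, here is a minimal sketch of the Duration-style syntax with uppercase binary units - my own code, not from any library, and it assumes the humanReadableByteSize helper from the earlier answer is in scope for pretty-printing:

object ByteSizeSyntax {
  final case class ByteSize(bytes: Long) {
    override def toString: String = humanReadableByteSize(bytes)
  }
  // Extension methods so Long values get .KiB / .MiB / .GiB
  implicit class ByteSizeOps(n: Long) {
    def KiB: ByteSize = ByteSize(n * 1024L)
    def MiB: ByteSize = ByteSize(n * 1024L * 1024L)
    def GiB: ByteSize = ByteSize(n * 1024L * 1024L * 1024L)
  }
}

import ByteSizeSyntax._
println(50L.GiB)   // 50.000 GB (the helper above prints SI-style symbols)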