I am reading a file that contains 277272 lines of Int triples (s, p, o), like:
2,126088,113358
2,126101,126102
2,126106,126107
2,126111,126112
2,126128,126129
2,126136,126137
2,126141,126142
2,126146,126147
4,1,3
4,7,8
4,15,16
4,26,27
4,41,94825
4,63,15764
4,68,69
4,89,94836
4,90,91
4,93,94
4,105,94844
These triples are ordered by s and then by p. I wrote the following code, which searches for a value by s; once it finds the value, a conditional filters by p.
def SP_O(yi: Int, zi: Int): List[Vector[Int]] = {
val file = new RandomAccessFile("patSPO.dat", "r")
// band, firsttime, boolad, boolat, pointerad, pointerat and S_POlistbuffer
// are assumed to be declared in the enclosing scope
val lengthfile = file.length().toInt
var initpointer = lengthfile / 2
while (band) {
var p = bs.firstvalue(initpointer,"patSPO.dat") // p is the list with the predicate value and its corresponding pointer
var newpointer = p(0)
var subj = p(1)
// Start the analysis to see whether the predicate equals the one from the BGP
if (subj == yi){
// The values of the current line must be printed
if (firsttime == true){
val fields = bs.getvalueSP_O(newpointer,zi)
if (fields.size != 0){
S_POlistbuffer += fields
//yld(Record(fields, schema))
}
firsttime = false
pointerad = newpointer
pointerat = newpointer
} // Now check whether there are matching values before or after this line
while (boolad || boolat) {
if (boolat) {
// Scan backwards
pointerat -= 6
if (pointerat <= 0) {
pointerat = 0
boolat = false
}
p = bs.firstvalue(pointerat, "patSPO.dat") // p is the list with the subject value at pointerat and its corresponding pointer
if (p(1) == yi) {
val fields = bs.getvalueSP_O(p(0),zi)
if (fields.size != 0){
S_POlistbuffer += fields
//yld(Record(fields, schema))
}
} else {
boolat = false
}
}
if (boolad) { // scan forwards
p = bs.nextvalue(pointerad,"patSPO.dat")
if (p(1)==yi){
val fields = bs.getvalueSP_O(p(0),zi)
pointerad = p(0)
if (fields.size != 0){
S_POlistbuffer += fields
//yld(Record(fields, schema))
}
}else{
boolad = false
}
}
if (boolad == false && boolat == false){
band = false
}
}
}
else if ( subj > yi ){
initpointer = initpointer/2
}
else if ( subj < yi) {
initpointer = (initpointer*6)/4
}
}
val listf:List[Vector[Int]] = S_POlistbuffer.toList
listf
}
The general idea is that the code starts looking for the value from the middle of the file, using RandomAccessFile. It extracts the first value of the line and compares it with the value I need. Once I find the correct value, I check whether the next line also has the correct value, and in parallel I check the line before the one I chose to see whether it matches as well. For each matching line I test whether the second value is the correct one; if it is, I extract the o value and store it in a List.
The problem is that printing the result takes a very long time. However, I developed this other solution, which runs over the whole file:
while (in.hasNext) {
val s = in.next(',').toInt
val p = in.next(',').toInt
val o = in.next('\n').toInt
//val fields = schema map (n => in.next(if (n == schema.last) '\n' else ','))
if (p == yi && s == xi){
val fields = schema map (n => o)
yld(Record(fields, schema))
}
}
With this code I run over the whole file and get the results faster than with the first code. My big question is: if, in the best case, the first code only reads a portion of the file, why is it so much slower than the second code, which reads the whole file? Is there another way to write this code with better performance?
The first code takes about 750 seconds to execute; the second code takes about 10 seconds.
There are two important pieces of information that I think you missed here:
There's no such thing as reading a file "in parallel". Even nowadays, a lot of disk reading amounts to spinning the disk until the head is in the correct position, then reading. So you can multi-thread all you want: the more concurrent disk accesses you do, the worse your read performance gets. A simplified view is that every time you go "backwards", you actually make the disk do a full spin forward to get back to that previous line!
Reading from disk sequentially is multiple orders of magnitude faster than random disk access. This is correlated with point 1, but really, reading a file line by line is the perfect way to access a disk, and as it turns out, that is exactly what your naive version does!
Which leads me to believe those are the logical next steps :
Apply your smart algorithm to the file having previously loaded it in memory
Compare again with the naive version
Find no significant difference
Conclude that your workload is definitely I/O bound, and stop trying to optimize the part of it that doesn't matter!
Note that your mileage may vary if you have an SSD (I assumed not, since I can't think of any way an SSD would need 750 seconds to random-access lines in a file).
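A minimal sketch of the first step, in Python for brevity (the function and variable names here are mine, not from the question): parse the file once into memory, then answer each (s, p) lookup with binary search instead of disk seeks.

```python
import bisect

def load_triples(path):
    """Parse the whole file once; the rows are already sorted by (s, p)."""
    with open(path) as f:
        return [tuple(map(int, line.split(","))) for line in f]

def sp_o(triples, s, p):
    """Return every o for the given (s, p) via binary search: O(log n + k)."""
    lo = bisect.bisect_left(triples, (s, p, -1))
    hi = bisect.bisect_right(triples, (s, p, float("inf")))
    return [o for (_, _, o) in triples[lo:hi]]

# A few rows adapted from the question (one duplicate (s, p) added to
# show that all matching o values come back):
triples = [(2, 126088, 113358), (4, 1, 3), (4, 7, 8), (4, 7, 9), (4, 15, 16)]
print(sp_o(triples, 4, 7))  # -> [8, 9]
```

The point of the exercise is that once the data is in memory, the "smart" and the naive versions should perform comparably, confirming the workload is I/O bound.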
Related
I need to find the shortest set of paths to connect each element of Set A with at least one element of Set B. Repetitions in A OR B are allowed (but not both), and no element can be left unconnected. Something like this:
I'm representing the elements as integers, so the "cost" of a connection is just the absolute value of the difference. I also have a cost for crossing paths, so if Set A = [60, 64] and Set B = [63, 67], then (60 -> 67) incurs an additional cost. There can be any number of elements in either set.
I've calculated the table of transitions and costs (distances and crossings), but I can't find the algorithm to find the lowest-cost solution. I keep ending up with either too many connections (i.e., repetitions in both A and B) or greedy solutions that omit elements (e.g., when A and B are non-overlapping). I haven't been able to find examples of precisely this kind of problem online, so I hoped someone here might be able to help, or at least point me in the right direction. I'm not a graph theorist (obviously!), and I'm writing in Swift, so code examples in Swift (or pseudocode) would be much appreciated.
UPDATE: The solution offered by @Daniel is almost working, but it does occasionally add unnecessary duplicates. I think this may be something to do with the sorting of the priorityQueue -- the duplicates always involve identical elements with identical costs. My first thought was to add some kind of "positional encoding" (yes, Transformer-speak) to the costs, so that the costs are offset by their positions (though of course, this doesn't guarantee unique costs). I thought I'd post my Swift version here, in case anyone has any ideas:
public static func voiceLeading(from chA: [Int], to chB: [Int]) -> Set<[Int]> {
var result: Set<[Int]> = Set()
let im = intervalMatrix(chA, chB: chB)
if im.count == 0 { return [[0]] }
let vc = voiceCrossingCostsMatrix(chA, chB: chB, cost: 4)
// NOTE: cm contains the weights
let cm = VectorUtils.absoluteAddMatrix(im, toMatrix: vc)
var A_links: [Int:Int] = [:]
var B_links: [Int:Int] = [:]
var priorityQueue: [Entry] = []
for (i, a) in chA.enumerated() {
for (j, b) in chB.enumerated() {
priorityQueue.append(Entry(a: a, b: b, cost: cm[i][j]))
if A_links[a] != nil {
A_links[a]! += 1
} else {
A_links[a] = 1
}
if B_links[b] != nil {
B_links[b]! += 1
} else {
B_links[b] = 1
}
}
}
priorityQueue.sort { $0.cost > $1.cost }
while priorityQueue.count > 0 {
let entry = priorityQueue[0]
if A_links[entry.a]! > 1 && B_links[entry.b]! > 1 {
A_links[entry.a]! -= 1
B_links[entry.b]! -= 1
} else {
result.insert([entry.a, (entry.b - entry.a)])
}
priorityQueue.remove(at: 0)
}
return result
}
Of course, since the duplicates have identical scores, it shouldn't be a problem to just remove the extras, but it feels a bit hackish...
UPDATE 2: Slightly less hackish (but still a bit!); since the requirement is that my result should have cardinality equal to max(|A|, |B|), I can actually just stop adding entries to my result when I've reached the target cardinality. Seems okay...
UPDATE 3: Resurrecting this old question, I've recently had some problems arise from the fact that the above algorithm doesn't fulfill my requirement |S| == max(|A|, |B|) (where S is the set of pairings). If anyone knows of a simple way of ensuring this it would be much appreciated. (I'll obviously be poking away at possible changes.)
This is an easy task:
Add all edges of the graph in a priority_queue, where the biggest priority is the edge with the biggest weight.
Look at each edge e = (u, v, w) in the priority_queue, where u is in A, v is in B, and w is the weight.
If removing e from the graph doesn't leave u or v isolated, remove it.
Otherwise, e is part of the answer.
This should be enough for your case:
#include <bits/stdc++.h>
using namespace std;
struct edge {
int u, v, w;
edge(){}
edge(int up, int vp, int wp){u = up; v = vp; w = wp;}
void print(){ cout<<"("<<u<<", "<<v<<")"<<endl; }
bool operator<(const edge& rhs) const {return w < rhs.w;}
};
vector<edge> E; //edge set
priority_queue<edge> pq;
vector<edge> ans;
int grade[5] = {3, 3, 2, 2, 2}; // remaining degree of each vertex (A = {0, 1}, B = {2, 3, 4})
int main(){
E.push_back(edge(0, 2, 1)); E.push_back(edge(0, 3, 1)); E.push_back(edge(0, 4, 4));
E.push_back(edge(1, 2, 5)); E.push_back(edge(1, 3, 2)); E.push_back(edge(1, 4, 0));
for(int i = 0; i < E.size(); i++) pq.push(E[i]);
while(!pq.empty()){
edge e = pq.top();
if(grade[e.u] > 1 && grade[e.v] > 1){
grade[e.u]--; grade[e.v]--;
}
else ans.push_back(e);
pq.pop();
}
for(int i = 0; i < ans.size(); i++) ans[i].print();
return 0;
}
Complexity: O(E lg(E)).
I think this problem is "minimum weighted bipartite matching" (although searching for "maximum weighted bipartite matching" would also be relevant; it's just the opposite).
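For comparison, the same greedy pruning can be written compactly in Python with heapq (a sketch; the function name is mine, and the edge data is taken from the C++ example above):

```python
import heapq
from collections import Counter

def prune_matching(edges):
    """Greedily drop the heaviest edges whose endpoints both still have
    another connection left; whatever cannot be dropped is the answer."""
    degree = Counter()
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    heap = [(-w, u, v) for u, v, w in edges]  # max-heap via negated weights
    heapq.heapify(heap)
    answer = []
    while heap:
        neg_w, u, v = heapq.heappop(heap)
        if degree[u] > 1 and degree[v] > 1:
            # Both endpoints stay connected without this edge: drop it.
            degree[u] -= 1
            degree[v] -= 1
        else:
            answer.append((u, v, -neg_w))
    return answer

edges = [(0, 2, 1), (0, 3, 1), (0, 4, 4), (1, 2, 5), (1, 3, 2), (1, 4, 0)]
print(prune_matching(edges))  # -> [(0, 2, 1), (0, 3, 1), (1, 4, 0)]
```

Same O(E log E) behavior as the priority_queue version, and it keeps every vertex connected by construction.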
I hope this question will please functional programming lovers. Could I ask for a way to translate the following fragment of code into a pure functional implementation in Scala, with a good balance between readability and execution speed?
Description: for each element in a sequence, produce a sub-sequence containing the elements that come after the current element (including itself) with a distance smaller than a given threshold. Once the threshold is crossed, it is not necessary to process the remaining elements.
def getGroupsOfElements(input : Seq[Element]) : Seq[Seq[Element]] = {
val maxDistance = 10 // put any number you may
var outerSequence = Seq.empty[Seq[Element]]
for (index <- 0 until input.length) {
var anotherIndex = index + 1
var innerSequence = Seq(input(index))
// assume a separate function computes the distance; check bounds before indexing
while (anotherIndex < input.length && (input(index) - input(anotherIndex)) < maxDistance) {
innerSequence = innerSequence :+ input(anotherIndex)
anotherIndex += 1
}
outerSequence = outerSequence :+ innerSequence
}
outerSequence
}
You know, this would be a ton easier if you added a description of what you're trying to accomplish along with the code.
Anyway, here's something that might get close to what you want.
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] =
input.tails.filter(_.nonEmpty).map(x => x.takeWhile(y => distance(x.head, y) < maxDistance)).toSeq
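For what it's worth, the same tails-and-takeWhile idea reads naturally in Python too (a sketch assuming the elements are plain numbers and the distance is the absolute difference; names are mine):

```python
from itertools import takewhile

def groups_of_elements(xs, max_distance=10):
    """For each element, take the run of following elements (itself included)
    whose distance from it stays below the threshold."""
    return [list(takewhile(lambda y: abs(xs[i] - y) < max_distance, xs[i:]))
            for i in range(len(xs))]

print(groups_of_elements([1, 3, 8, 20, 24]))  # -> [[1, 3, 8], [3, 8], [8], [20, 24], [24]]
```

Because takewhile stops at the first element over the threshold, the remaining elements of each suffix are never touched, matching the early-exit requirement.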
I'm attempting to submit the HackerRank Day 6 Challenge for 30 Days of Code.
I'm able to complete the task without issue in an Xcode Playground, however HackerRank's site says there is no output from my method. I encountered an issue yesterday due to browser flakiness, but cleaning caches, switching from Safari to Chrome, etc. don't seem to resolve the issue I'm encountering here. I think my problem lies in inputString.
Task
Given a string, S, of length N that is indexed from 0 to N-1, print its even-indexed and odd-indexed characters as 2 space-separated strings on a single line (see the Sample below for more detail).
Input Format
The first line contains an integer, T (the number of test cases).
Each of the T subsequent lines contains a String, S.
Constraints
1 <= T <= 10
2 <= length of S < 10,000
Output Format
For each String S_j (where 0 <= j <= T-1), print S_j's even-indexed characters, followed by a space, followed by S_j's odd-indexed characters.
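The transformation itself is just a matter of striding over the indices; a quick sketch of that step alone (Python here, not a full HackerRank submission):

```python
def even_odd(s):
    """Even-indexed characters, a space, then odd-indexed characters."""
    return s[::2] + " " + s[1::2]

print(even_odd("Hacker"))  # -> "Hce akr"
```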
This is the code I'm submitting:
import Foundation
let inputString = readLine()!
func tweakString(string: String) {
// split string into an array of lines based on char set
var lineArray = string.components(separatedBy: .newlines)
// extract the number of test cases
let testCases = Int(lineArray[0])
// remove the first line containing the number of unit tests
lineArray.remove(at: 0)
/*
Satisfy constraints specified in the task
*/
guard lineArray.count >= 1 && lineArray.count <= 10 && testCases == lineArray.count else { return }
for line in lineArray {
switch line.characters.count {
// to match constraint specified in the task
case 2...10000:
let characterArray = Array(line.characters)
let evenCharacters = characterArray.enumerated().filter({$0.0 % 2 == 0}).map({$0.1})
let oddCharacters = characterArray.enumerated().filter({$0.0 % 2 == 1}).map({$0.1})
print(String(evenCharacters) + " " + String(oddCharacters))
default:
break
}
}
}
tweakString(string: inputString)
I think my issue lies in inputString. I'm taking it "as-is" and formatting it within my method. I've found solutions for Day 6, but I can't seem to find any current ones in Swift.
Thank you for reading. I welcome thoughts on how to get this thing to pass.
readLine() reads a single line from standard input, which means that your inputString contains only the first line of the input data. You have to call readLine() in a loop to get the remaining input data.
So your program could look like this:
func tweakString(string: String) -> String {
// For a single input string, compute the output string according to the challenge rules ...
return result
}
let N = Int(readLine()!)! // Number of test cases
// For each test case:
for _ in 1...N {
let input = readLine()!
let output = tweakString(string: input)
print(output)
}
(The forced unwraps are acceptable here because the format of the input data is documented in the challenge description.)
Hi Adrian, you should call readLine()! for every row. Here is an example answer for that challenge:
import Foundation
func letsReview(str:String){
var evenCharacters = ""
var oddCharacters = ""
var index = 0
for char in str.characters{
if index % 2 == 0 {
evenCharacters += String(char)
}
else{
oddCharacters += String(char)
}
index += 1
}
print (evenCharacters + " " + oddCharacters)
}
let rowCount = Int(readLine()!)!
for _ in 0..<rowCount {
letsReview(str: readLine()!)
}
int rq_begin = 0, rq_end = 0;
int av_begin = 0, av_end = 0;
#define MAX_DUR 10
#define RQ_DUR 5
proctype Writer() {
do
:: (av_end < rq_end) -> av_end++;
if
:: (av_end - av_begin) > MAX_DUR -> av_begin = av_end - MAX_DUR;
:: else -> skip;
fi
printf("available span: [%d,%d]\n", av_begin, av_end);
od
}
proctype Reader() {
do
:: d_step {
rq_begin++;
rq_end = rq_begin + RQ_DUR;
}
printf("requested span: [%d,%d]\n", rq_begin, rq_end);
(rq_begin >= av_begin && rq_end <= av_end);
printf("got requested span\n");
od
}
init {
run Writer();
run Reader();
}
This system (only an example) should model a reader/writer queue where the reader requests a certain span of frames [rq_begin,rq_end], and the writer should then make at least this span available. [av_begin,av_end] is the span of available frames.
The 4 values are absolute frame indices; rq_begin is incremented indefinitely as the reader reads the next span of frames.
The system cannot be directly verified because the values are unbounded (generating infinitely many states). Does Promela/Spin (or similar software) have support for verifying a system like this, automatically transforming it so that it becomes finite?
For example, if all 4 values were incremented by the same amount, the situation would still be the same. Or the model could be reformulated with variables for the differences between these values, for example av_end - rq_end.
I'm using Promela/Spin to verify a more complex queuing system which uses absolute frame indices like this.
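One quick sanity check of the uniform-shift observation: every guard in the model depends only on differences between the four counters, so adding the same offset to all of them cannot change which branches are enabled. A small sketch (Python, with MAX_DUR and RQ_DUR as in the model; the function name is mine):

```python
MAX_DUR, RQ_DUR = 10, 5

def guards(rq_begin, rq_end, av_begin, av_end):
    """The three conditions the Promela model branches on."""
    return (av_end < rq_end,                            # Writer may produce
            av_end - av_begin > MAX_DUR,                # available span must be trimmed
            rq_begin >= av_begin and rq_end <= av_end)  # Reader is unblocked

state = (3, 3 + RQ_DUR, 1, 6)
for shift in (0, 7, 1000):
    shifted = tuple(v + shift for v in state)
    assert guards(*shifted) == guards(*state)
print("guards are shift-invariant")
```

This is why abstracting the model to difference variables (e.g. av_end - rq_end and rq_begin - av_begin, which stay within bounded ranges) can collapse the state space to a finite one.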
I'm writing a program for school that should compress text. As a first step, I want to build a kind of dictionary from a large number of texts, to use for compression later.
My idea is that wherever I have a pair of characters, I want to replace it with a single one. So first I build a TreeMap with all the pairs that occur in my String.
So for example: String s = "Hello";
He -> 1
el -> 1
ll -> 1
lo -> 1
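In Python terms, the pair counting I have in mind looks like this (a sketch; Python strings iterate over whole code points, so there is no surrogate bookkeeping here):

```python
def pair_counts(s):
    """Count overlapping pairs of adjacent code points."""
    counts = {}
    for a, b in zip(s, s[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1
    return counts

print(pair_counts("Hello"))  # -> {'He': 1, 'el': 1, 'll': 1, 'lo': 1}
```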
At the end, the values in my TreeMap are of different sizes, and above a given threshold I want to write a rule into my dictionary. For example:
He -> x
el -> y
lo -> z
So here is the point: I want to start the "new signs" at the Unicode code point 65536 and increase it by 1 for every rule.
When I re-analyze my text into pairs, I think I get an error, but I'm not sure about this.
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
char[] text = s.toCharArray();
String signPair = "";
// search sign in map
for (int i = 0; i < s.length()-1; i++) {
// check whether the 1st code point is > 65535 -> it occupies 2 chars
if (Character.codePointAt(text, i) > 65535) {
// check whether the 2nd code point is > 65535 -> it occupies 2 chars
if (Character.codePointAt(text, i + 2) > 65535) {
signPair = s.substring(i, i + 4);
// compensate additional chars
i += 2;
// if not there
if (!map.containsKey(signPair)) {
// create the key, set the value to 1
map.put(signPair, 1);
} else {
// key already exists -> increment the value by 1
int value = map.get(signPair);
value++;
map.put(signPair, value);
}
At the end, when I print my map to the console, I only get � signs paired with a second character, and later I also get a lot of 𐃰-type signs that I can't interpret. The code points in my output text are mostly between 5000 and 60000; none is higher than 65535...
Is it wrong to index and substring the chars like this, or is it a mistake in how I get the code point from them?
Thanks for help!
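The underlying issue is a UTF-16 one: in Java a char is a single UTF-16 code unit, and code points from U+10000 (65536) upward occupy two chars (a surrogate pair), so substring boundaries computed in raw char offsets can cut a pair in half, which is exactly what the � replacement signs indicate. The effect is easy to demonstrate (Python here, since its strings are whole code points; the UTF-16 length shows what Java's String.length() sees; the helper name is mine):

```python
def utf16_units(s):
    """Number of UTF-16 code units Java's String.length() would report."""
    return len(s.encode("utf-16-le")) // 2

bmp = "H"            # U+0048, below 65536: one UTF-16 code unit
astral = chr(65536)  # U+10000: a surrogate pair, two UTF-16 code units
print(utf16_units(bmp), utf16_units(astral))  # -> 1 2

# Cutting an astral character after one UTF-16 unit yields a lone surrogate,
# which decodes to the replacement character U+FFFD.
broken = astral.encode("utf-16-le")[:2].decode("utf-16-le", errors="replace")
print(broken == "\ufffd")  # -> True
```

In the Java code, advancing the index with Character.charCount(codePoint) instead of fixed +1/+2 offsets avoids ever splitting a pair.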