I have employee data and want to find out whether some people worked in a department more than once.
The data looks as follows:
emp_id begin end dept_id
1. 07a0fcf5 30.06.2021 30.06.2021 1443
2. 07a0fcf5 01.07.2021 01.07.2021 1443
3. 07a0fcf5 02.07.2021 02.07.2021 1269
4. 07a0fcf5 03.07.2021 11.07.2021 1269
5. 07a0fcf5 12.07.2021 14.07.2021 1269
6. 07a0fcf5 15.07.2021 15.07.2021 1273
7. 07a0fcf5 16.07.2021 30.08.2021 1273
8. 07a0fcf5 31.08.2021 05.10.2021 1273
9. 07a0fcf5 06.10.2021 21.02.2022 1269
10. 07a0fcf5 24.02.2022 23.06.2022 1269
11. 07a0fcf5 24.06.2022 01.01.9999 1269
12. 07d06bee 28.06.2021 29.06.2021 1273
13. 07d06bee 30.06.2021 30.06.2021 1287
14. 07d06bee 01.07.2021 26.07.2021 1443
15. 07d06bee 27.07.2021 27.07.2021 1287
16. 07d06bee 28.07.2021 01.08.2021 1443
17. 07d06bee 02.08.2021 01.01.9999 1287
18. 07d1fdd3 25.05.2021 25.05.2021 1256
19. 07d1fdd3 26.05.2021 26.05.2021 1256
20. 07d1fdd3 27.05.2021 27.05.2021 1256
21. 07d1fdd3 28.05.2021 06.06.2021 1256
22. 07d1fdd3 07.06.2021 18.06.2021 1256
23. 07d1fdd3 19.06.2021 20.06.2021 1256
24. 07d1fdd3 21.06.2021 21.06.2021 1256
25. 07d1fdd3 22.06.2021 06.07.2021 1256
26. 07d1fdd3 07.07.2021 13.07.2021 1098
27. 07d1fdd3 14.07.2021 16.08.2021 1098
28. 07d1fdd3 17.08.2021 25.08.2021 1098
29. 07d1fdd3 26.08.2021 26.08.2021 1098
30. 07d1fdd3 27.08.2021 06.09.2021 1098
Thus, the desired result is something like this:
emp_id dept_id times_worked
1. 07a0fcf5 1443 1
2. 07a0fcf5 1269 2
3. 07a0fcf5 1273 1
4. 07d06bee 1273 1
5. 07d06bee 1287 3
6. 07d06bee 1443 2
7. 07d1fdd3 1256 1
8. 07d1fdd3 1098 1
Of course, the aggregation itself is not the problem; what I want to figure out is how to handle the repeated appearances of dept_ids in the raw data. Since I have to use PySpark in a cloud environment with several executors, the data is spread across the executors, so I cannot simply sort and inspect consecutive data lines. I already tried a window function, but with no useful result.
I think I know what you are trying to do: the count only increases if the employee actually changes department. You can do this with a window: check whether dept_id changes, then sum the changes.
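For reference, here is a minimal sketch of that window approach. It assumes the raw data is already in a Spark DataFrame sdf and that begin/end have been parsed to dates (including replacing the 01.01.9999 sentinel), so ordering by end is chronological:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("emp_id").orderBy("end")
res = (sdf
    # department of the previous stint within each employee
    .withColumn("last_dept", F.lag("dept_id").over(w))
    # a new stint starts whenever dept_id differs from its predecessor;
    # eqNullSafe makes the first row per employee count as a change
    .withColumn("change_dept", (~F.col("dept_id").eqNullSafe(F.col("last_dept"))).cast("int"))
    .groupBy("emp_id", "dept_id")
    .agg(F.sum("change_dept").alias("times_worked")))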
I think this will be easier with Fugue though, which lets you port Python and Pandas code to Spark or Dask. Let's concern ourselves with one employee first. The logic looks like this in Pandas:
import pandas as pd
df = pd.DataFrame({"emp_id": ["07d06bee"] * 6,
                   "begin": ["28.06.2021", "30.06.2021", "01.07.2021", "27.07.2021", "28.07.2021", "02.08.2021"],
                   "end": ["29.06.2021", "30.06.2021", "26.07.2021", "27.07.2021", "01.08.2021", "01.01.9999"],
                   "dept_id": [1273, 1287, 1443, 1287, 1443, 1287]})
def one_emp_count(df: pd.DataFrame) -> pd.DataFrame:
    # replace 01.01.9999 with end of year because it's not a valid datetime
    df["end"] = df["end"].replace("01.01.9999", "31.12.2022")
    # convert to datetime
    df['begin'] = pd.to_datetime(df['begin'], format="%d.%m.%Y")
    df['end'] = pd.to_datetime(df['end'], format="%d.%m.%Y")
    # sort partition to guarantee order
    df = df.sort_values("end", inplace=False)
    # calculate change
    df["last_dept"] = df["dept_id"].shift(1)
    df["change_dept"] = (df["dept_id"] != df["last_dept"]).astype(int)
    # construct result
    res = df.groupby("dept_id").agg({"change_dept": sum}).reset_index()
    res["emp_id"] = df.iloc[0]["emp_id"]
    return res[["emp_id", "dept_id", "change_dept"]]
This gives me:
emp_id dept_id change_dept
0 07d06bee 1273 1
1 07d06bee 1287 3
2 07d06bee 1443 2
We can then bring it to Spark with Fugue using the transform() function:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
from fugue import transform
res = transform(sdf,
                one_emp_count,
                schema="emp_id:str, dept_id:int, change_dept:int",
                partition={"by": "emp_id"},
                engine=spark)
res.show()
This will apply the one_emp_count function on each partition, where each partition is one employee id. If the engine is None, it runs on Pandas; passing the SparkSession as the engine runs it on Spark.
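As a quick local sanity check (a sketch, reusing the Pandas df from above): calling transform() with no engine executes the same logic on Pandas and returns a Pandas DataFrame, so the function can be verified before going to the cluster.
local_res = transform(df,
                      one_emp_count,
                      schema="emp_id:str, dept_id:int, change_dept:int",
                      partition={"by": "emp_id"})
print(local_res)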
I'm studying for my biomechanics course and need to numerically reproduce the results given in the book "Biomechanics and Motor Control of Human Movement" by David A. Winter (Chapter 7, pages 186-187) using Octave or Matlab.
I'm unable to solve the matrix equation GAS == GA (both matrices are defined in the code below).
What I know about these matrices is that they are orthogonal (resulting from the Kardan rotation sequence x-y'-z'').
In the aforementioned matrices, terms named s1, s2, s3 are the sines of theta1, theta2, and theta3, respectively and c1, c2, and c3 are the cosines of theta1, theta2, and theta3, respectively.
Below is the code I've tried in Octave-Online, using the symbolic toolbox and the solve function.
GA = [0.5974 -0.7873 0.1515;
0.800 0.5969 -0.00550;
-0.0472 0.1544 0.9868]
% syms t1 t2 t3 real
% GAS = [(cosd(t2)*cosd(t3)) (sind(t3)*cosd(t1)+sind(t1)*sind(t2)*cosd(t3)) (sind(t1)*sind(t3)-cosd(t1)*sind(t2)*cosd(t3)); (-cosd(t2)*sind(t3)) (cosd(t1)*cosd(t3)-sind(t1)*sind(t2)*sind(t3)) (sind(t1)*cosd(t3)+cosd(t1)*sind(t2)*sind(t3)); (sind(t2)) (-sind(t1)*cosd(t2)) (cosd(t1)*cosd(t2))]
% [t1, t2, t3] = solve(GAS == GA, t1, t2, t3)
syms c1 c2 c3 s1 s2 s3 real
GAS = [c2*c3 s3*c1+s1*s2*c3 s1*s3-c1*s2*c3; -c2*s3 c1*c3-s1*s2*s3 s1*c3+c1*s2*s3; s2 -s1*c2 c1*c2]
S = solve(GAS == GA, c1, c2, c3, s1, s2, s3)
Here is the result from Octave-Online (both versions of the code result in an empty structure):
GA =
0.5974000 -0.7873000 0.1515000
0.8000000 0.5969000 -0.0055000
-0.0472000 0.1544000 0.9868000
Symbolic pkg v2.8.0: Python communication link active, SymPy v1.5.
GAS = (sym 3×3 matrix)
⎡c₂⋅c₃ c₁⋅s₃ + c₃⋅s₁⋅s₂ -c₁⋅c₃⋅s₂ + s₁⋅s₃⎤
⎢ ⎥
⎢-c₂⋅s₃ c₁⋅c₃ - s₁⋅s₂⋅s₃ c₁⋅s₂⋅s₃ + c₃⋅s₁ ⎥
⎢ ⎥
⎣ s₂ -c₂⋅s₁ c₁⋅c₂ ⎦
warning: passing floating-point values to sym is dangerous, see "help sym"
warning: called from
double_to_sym_heuristic at line 50 column 7
sym at line 379 column 13
numeric_array_to_sym at line 36 column 16
sym at line 359 column 7
eq at line 91 column 5
(this warning block is repeated nine times in the output)
S = {}(0x0)
Cheers,
SirDickens
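A note on why solve() comes back empty here: it searches for an exact solution, but the measured GA is only approximately orthogonal, so GAS == GA has no exact solution. Since each entry of GAS sits in a known position as a product of sines and cosines, the angles can instead be read off numerically; a minimal sketch in plain Octave (no symbolic toolbox needed, angles in degrees), using the element positions of GAS above:
t2 = asind(GA(3,1))             % GAS(3,1) = s2
t1 = atan2d(-GA(3,2), GA(3,3))  % GAS(3,2) = -s1*c2, GAS(3,3) = c1*c2
t3 = atan2d(-GA(2,1), GA(1,1))  % GAS(2,1) = -c2*s3, GAS(1,1) = c2*c3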
I've gotten stuck on the following problem and was hoping for some help. I've tried a few things using information found on Stack Overflow (such as "How to apply max function for each row in KDB?" and "Iterate over current row values in kdb query", and flipping then sliding windows as per my previous post), and flicked through my copy of Q for Mortals again, but for some reason I have hit a brick wall.
In my table, the first column is a date column, the rest are numbers. From this I'm trying to generate a table that only has the n maximum numbers of a row left, the rest set to zero or 0N (or, if you like, where the m bottom values have been discarded).
Example:
Starting table:
q)t:([] td:2001.01.01 2001.01.02 2001.01.03 2001.01.04 2001.01.05 2001.01.06;
    AA:121.5 125.0 127.0 126.0 129.2 130.0;
    BB:111.0 115.3 117.0 116.0 119.2 120.0;
    CC:120.0 126.0 125.5 128.8 135.0 130.0;
    DD:120.1 123.3 128.4 128.3 127.5 126.0;
    NN:122.0 125.5 126.0 116.0 109.0 100.5)
td AA BB CC DD NN
----------------------------------------
2001.01.01 121.5 111 120 120.1 122
2001.01.02 125 115.3 126 123.3 125.5
2001.01.03 127 117 125.5 128.4 126
2001.01.04 126 116 128.8 128.3 116
2001.01.05 129.2 119.2 135 127.5 109
2001.01.06 130 120 130 126 100.5
The desired end result when identifying the 2 maximums per row and blanking the rest (with either 0 or 0n):
td AA BB CC DD NN
-------------------------------------
2001.01.01 121.5 122
2001.01.02 126 125.5
2001.01.03 127 128.4
2001.01.04 128.8 128.3
2001.01.05 129.2 135
2001.01.06 130 130
To take row 1 as an example, the top 2 values in AA and NN of that row have been left whilst the two others in BB and CC have been blanked out.
To get only the max value, i.e. the single top value, I can do the below and use the newly added column in a follow-up update statement. However, the problem here is that I need to find the n maxes and discard the rest.
q)update maxes:max(AA;BB;CC;DD;NN) from t
As an example of what I have tried (not sure if it's of any interest): using a tip from another Stack Overflow post, I can sort of get there when working on the values themselves, but not in a table format:
q)nthMax:{x (idesc x)[y-1]}
q)nthMax[(121.5 111 120 120.1 122);1]
122f
q)nthMax[(121.5 111 120 120.1 122);2]
121.5
However, when I try to use this as part of an update or select, it doesn't work; it also strikes me as a non-q approach, so I'm interested in what folks have to say about solving the above.
Another thing I tried was flipping the table and then using mmax; however, as the dates are at the top, they "survive". This also seems a bit clunky, as I would have to repeat it n times per column if I'm interested in n maxes, or keep dropping the bottom values until only the n max numbers remain.
Kind regards,
Sven
If you don't need the columns to stay in the same order, the following will get you the correct result:
key[kt]!(uj/) value {enlist (2#idesc x)#x}each kt:1!t
Results in:
td | NN AA CC DD
----------| -----------------------
2001.01.01| 122 121.5
2001.01.02| 125.5 126
2001.01.03| 127 128.4
2001.01.04| 128.8 128.3
2001.01.05| 129.2 135
2001.01.06| 130 130
You could fix the order afterwards with xcols if it's important to you, or do this (which is a little longer but preserves columns which are never in the top n):
q)key[kt]!(uj/) value {enlist (key[x]!count[x]#0n),(2#idesc x)#x}each kt:1!t
td | AA BB CC DD NN
----------| --------------------------
2001.01.01| 121.5 122
2001.01.02| 126 125.5
2001.01.03| 127 128.4
2001.01.04| 128.8 128.3
2001.01.05| 129.2 135
2001.01.06| 130 130
Here's another option, possibly a little tidier:
q)0!{key[x]#(2#idesc x)#x}'[1!t]
td AA BB CC DD NN
-------------------------------------
2001.01.01 121.5 122
2001.01.02 126 125.5
2001.01.03 127 128.4
2001.01.04 128.8 128.3
2001.01.05 129.2 135
2001.01.06 130 130
This works on the assumption that the first column is the only one you don't want to consider for the maximums. It's similar to the other two answers in its use of idesc. One part to note here is key[x]#, which essentially adds null entries to a dictionary to ensure all keys are present. As an example of this:
q)`a`b`c#`a`c!1 2
a| 1
b|
c| 2
Note how b is in the resultant dictionary but not in the original dictionary. This is used to make sure the dictionary generated for each line conforms with the others, thus resulting in a table (which, after all, is just a list of conforming dictionaries).
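If you want the cutoff as a parameter rather than a hard-coded 2, the same idea generalises; a sketch (topn is an assumed name, not a built-in):
q)topn:{[n;t] 0!{[d;m] key[d]#(m#idesc d)#d}[;n]'[1!t]}
q)topn[2;t]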
Here is an ugly bit of code that should work for your example:
{x,'flip y!flip{?[x>idesc y;y;0N]}[z]each flip x y}[t;`AA`BB`CC`DD`NN;2]
td AA BB CC DD NN
-------------------------------------
2001.01.01 121.5 122
2001.01.02 126 125.5
2001.01.03 127 128.4
2001.01.04 128.8 128.3
2001.01.05 129.2 135
2001.01.06 130 130
The function allows you to specify which columns should be included and how many values to keep in each row.
I want to first apologize for the biological nature of this post; I thought I should give some background. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that, for each gene, all subject sequences that hit end up in one file, with duplicate sequences removed (keeping one copy of each), while ensuring I get the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( @ARGV != 2 ) {
    print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
    print "    infile = tab-delimited blastn text file\n";
    print "    transcriptomes = fasta file of all transcriptomes\n";
    print "exiting...\n";
    exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my @temparray;
my @templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
    print " --> $seqfname exists. overwriting\n";
    unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
    $com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
    # print "$com\n";
    system("$com");
    system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I want. Here are the two issues I cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and the script in its current form prints out duplicates of these sequences; I only need one. How can I modify my script to retain one copy of each sequence and remove the other duplicates? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second is that the script is not quite extracting the right base pairs. It's super close, off by one or two, but it's not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
which is what my script outputs; the sequence straight from the FASTA file is what I'm expecting. You will also notice that the sequence output by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: a solution was offered that works for some of the infiles. However, some infiles were causing the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below the answer section
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still the occasional output with fewer sequences than expected, although this happens less often after incorporating the modifications to my script from the Edit 1 suggestion (i.e., accounting for the reverse direction). I cannot figure out why the script outputs fewer sequences in these cases. Below is the infile in question; the output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
You can use a hash to remove duplicates.
The code below removes duplicates based on their subject length (it keeps the rows with the larger subject length).
Just replace your # iterate through the individual hits part with:
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
    $subject_length = $templine[9] - $templine[8];
    if (exists $filterhash{$templine[2]}) {
        if ($filterhash{$templine[2]} < $subject_length) {
            $filterhash{$templine[2]} = $subject_length;
        }
    }
    else {
        $filterhash{$templine[2]} = $subject_length;
    }
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
    $subject_length = $templine[9] - $templine[8];
    if (not exists $printhash{$templine[2]}) {
        $printhash{$templine[2]} = 1;
        if (exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length) {
            $com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
            # print "$com\n";
            system("$com");
            system("cat temp.3 >> $seqfname");
        }
    }
    else {
        if (exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length) {
            $com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
            # print "$com\n";
            system("$com");
            system("cat temp.3 >> $seqfname");
        }
    }
} # end for ($jter=0; $jter<$t1...
Hope this will help you.
Edit:
For the negative strand you need to replace
$subject_length = $templine[9] -$templine[8];
with
if ($templine[8] > $templine[9]) {
    $subject_length = $templine[8] - $templine[9];
} else {
    $subject_length = $templine[9] - $templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.
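If you end up handling that in Perl rather than inside extract_from_genome2, the usual reverse-complement idiom is short; a sketch, assuming $seq holds the forward-strand slice that was extracted:
if ($templine[8] > $templine[9]) {
    $seq = reverse $seq;            # reverse the sequence
    $seq =~ tr/ACGTacgt/TGCAtgca/;  # complement each base
}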
I'm trying to pivot some trade data in KDB/q. Although my data are only slightly different from the working example on the website (see the general pivot function: http://code.kx.com/q/cookbook/pivoting-tables/),
I can't get the function to work, even after several hours of trying (I'm very new to KDB).
Put simply, I'm trying to go from this table:
q)5# trades_agg
date sym time exchange buysell| shares
--------------------------------------| ------
2009.01.05 aaca 09:30 BATS B | 484
2009.01.05 aaca 09:30 BATS S | 434
2009.01.05 aaca 09:30 NASDAQ B | 235
2009.01.05 aaca 09:30 NASDAQ S | 429
2009.01.05 aaca 09:30 NYSE B | 309
to this one:
date sym time | BATSsharesB BATSsharesS NASDAQsharesB ...
----------------------| -----------------------------------------------
2009.01.05 aaca 09:30 | 484 434 235 ...
... | ...
I'll provide a working example to illustrate things:
// Create data
qpd:5*2*4*"i"$16:00-09:30
date:raze(100*qpd)#'2009.01.05+til 5
sym:(raze/)5#enlist qpd#'100?`4
sym:(neg count sym)?sym
time:"t"$raze 500#enlist 09:30:00+15*til qpd
time+:(count time)?1000
exchange:raze 500#enlist raze(qpd div 3)#enlist`NYSE`NASDAQ`BATS
buysell:raze 500#enlist raze(qpd div 2)#enlist`B`S
shares:(500*qpd)?100
trades:([]date;sym;time;exchange;buysell;shares)
//I then aggregate the data into equal sized buckets
trades_agg: select sum shares by date, sym, time: 15 xbar time.minute, exchange, buysell from trades
// pivot function from the code.kx.com website
piv:{[t;k;p;v;f;g]
v:(),v;
G:group flip k!(t:.Q.v t)k;
F:group flip p!t p;
count[k]!g[k;P;C]xcols 0!key[G]!flip(C:f[v]P:flip value flip key F)!raze
{[i;j;k;x;y]
a:count[x]#x 0N;
a[y]:x y;
b:count[x]#0b;
b[y]:1b;
c:a i;
c[k]:first'[a[j]#'where'[b j]];
c}[I[;0];I J;J:where 1<>count'[I:value G]]/:\:[t v;value F]}
I subsequently apply this pivot function to the example with the functions f and g set to their default (::) values, but I get an error message:
piv[`trades_agg;`date`sym`time;`exchange`buysell;`shares;(::);(::)]
Even when I use the suggested f and g functions, it doesn't work:
f:{[v;P]`$raze each string raze P[;0],'/:v,/:\:P[;1]}
g:{[k;P;c]k,(raze/)flip flip each 5 cut'10 cut raze reverse 10 cut asc c}
I don't get why this is not working correctly since it is so close to the example on the website.
This is a self-contained version that's easier to use:
tt:1000#0!trades_agg
piv:{[t;k;p;v]
/ controls new columns names
f:{[v;P]`${raze " " sv x} each string raze P[;0],'/:v,/:\:P[;1]};
v:(),v; k:(),k; p:(),p; / make sure args are lists
G:group flip k!(t:.Q.v t)k;
F:group flip p!t p;
key[G]!flip(C:f[v]P:flip value flip key F)!raze
{[i;j;k;x;y]
a:count[x]#x 0N;
a[y]:x y;
b:count[x]#0b;
b[y]:1b;
c:a i;
c[k]:first'[a[j]#'where'[b j]];
c}[I[;0];I J;J:where 1<>count'[I:value G]]/:\:[t v;value F]};
q)piv[`tt;`date`sym`time;`exchange`buysell;enlist `shares]
date sym time | BATS shares B BATS shares S NASDAQ shares B NASDAQ sha..
---------------------| ------------------------------------------------------..
2009.01.05 adkk 09:30| 577 359 499 452 ..
2009.01.05 adkk 09:45| 882 501 339 467 ..
2009.01.05 adkk 10:00| 620 513 411 128 ..
2009.01.05 adkk 10:15| 501 544 272 544 ..
2009.01.05 adkk 10:30| 291 594 363 331 ..
2009.01.05 adkk 10:45| 867 500 498 536 ..
2009.01.05 adkk 11:00| 624 632 694 493 ..
2009.01.05 adkk 11:15| 99 704 600 299 ..
2009.01.05 adkk 11:30| 269 394 280 392 ..
2009.01.05 adkk 11:45| 635 744 758 597 ..
2009.01.05 adkk 12:00| 562 354 498 405 ..
2009.01.05 adkk 12:15| 416 437 303 492 ..
2009.01.05 adkk 12:30| 447 699 370 302 ..
2009.01.05 adkk 12:45| 336 647 512 245 ..
2009.01.05 adkk 13:00| 692 457 497 553 ..
Your table is keyed, so unkey it:
trades_agg:0!select sum shares by date, sym, time: 15 xbar time.minute,exchange,buysell from trades
And define your g as:
g:{[k;P;c]k,c}
The best way to figure out what f and g need to be is to define them with a breakpoint and then investigate the variables:
g:{[k;P;c]break}
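With the table unkeyed and that simpler g, the call from the question should then run; a sketch, assuming the f defined in the question above:
q)piv[`trades_agg;`date`sym`time;`exchange`buysell;`shares;f;g]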
I found it difficult to understand the original piv function in Ryan's answer, so I updated it by adding some comments and more readable variable names. HTH.
piv:{[table; rows; columns; vals]
    / make sure args are lists
    vals: (),vals;
    rows: (),rows;
    columns: (),columns;
    / Get columns of table corresponding to those of row labels and calculate groups
    / group returns a dict whose keys are the unique row labels and whose values are the row indices of each group, e.g. (0 1 3; 2 4; ...)
    rowGroups: group rows#table;
    rowGroupIdxs: value rowGroups;
    rowValues: key[rowGroups];
    / Similarly, get columns of table corresponding to those of column labels and calculate groups
    colGroups: group columns#table;
    colGroupIdxs: value colGroups;
    colValues: key colGroups;
    getPivotCol: {[rowGroupStartIdx; nonSingleRowGroups; nonSingleRowGroupsIdx; vals; colGroupIdxs]
        / vals: the list of values for this particular value-column combination
        / colGroupIdxs: the list of indices for this particular column group
        / We only care about vals that should belong in this pivot column - we need to filter out vals not part of this column group
        filteredValues: count[vals]#vals[0N];
        filteredValues[colGroupIdxs]: vals[colGroupIdxs];
        / Equivalent to filteredValues <> 0N
        hasValue: count[vals]#0b;
        hasValue[colGroupIdxs]: 1b;
        / Seed the pivot column with the first (filtered) value of each row group
        / This will be correct for row groups of size 1 as no aggregation needs to occur
        pivotCol: filteredValues[rowGroupStartIdx];
        / Otherwise, for the row groups larger than 1, get the first (filtered) value
        pivotCol[nonSingleRowGroupsIdx]: first'[filteredValues[nonSingleRowGroups]#'where'[hasValue[nonSingleRowGroups]]];
        pivotCol
        };
    / Groups with more than 1 row (these are the ones that will need aggregating)
    nonSingleRowGroupsIdx: where 1 <> count'[rowGroupIdxs];
    / Get resulting pivot column for each combination of column and value fields
    pivotCols: raze getPivotCol[rowGroupIdxs[;0]; rowGroupIdxs[nonSingleRowGroupsIdx]; nonSingleRowGroupsIdx] /:\: [table[vals]; colGroupIdxs];
    / Column names are the cross-product of column and value fields
    colNames: `${raze "" sv x} each string raze (flip value flip colValues),'/:vals;
    / Finally, stitch together row and column headings with pivot data to obtain the final table
    rowValues!flip colNames!pivotCols
    };
By the way, I also made a small change to the formatting of the column names for my needs.
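For completeness, a usage sketch for this version: it takes the table value rather than its name, so unkey trades_agg first:
q)piv[0!trades_agg; `date`sym`time; `exchange`buysell; `shares]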
00: 599
01: 298
02: 738
03: 598
04: 297
05: 395
06: 730
07: 825
08: 597
09: 295
10: 717
11: 597
12: 196
13: 397
14: 592
15: 393
16: 600
17: 598
18: 902
19: 598
20: 196
21: 398
22: 594
23: 397
24: 600
25: 593
26: 196
27: 393
28: 595
29: 604
30: 593
31: 717
32: 598
33: 196
34: 398
35: 594
36: 397
37: 600
38: 000
91: 005
92: 000 // DAT 000
93: 000 // Counter
94: 002 // DAT 002
96: 001 // DAT 001 - plus 1
97: 002 // DAT 002 - dividor
98: 002 // DAT 001 - incrementor
99: 050 // DAT 10 - max
Hi guys,
I have code that finds the prime numbers between 1 and 100, but I'm struggling to turn it into a program that finds only the primes between two numbers given as user input.
I had a plan to subtract one number from another, and then to divide that number by 2, 3, 4 and 5.
Do you guys have any advice how to go about this? I apologize for the lack of comments.
Disclaimer: I don't know what your original code does, as I don't read numeric codes that well. The following is from primes.lmc, which I wrote myself.
Code (heavily commented):
# Prime number finder. Prints all prime numbers between the numbers the user inputs (min, then max).
# Min
INP
SUB ONE
STA NUM
# Max
INP
STA MAX
# Main checking loop. Check each number from NUM to MAX.
TLOOP LDA NUM
# Have we done all MAX numbers?
SUB MAX
BRZ HALT
# Increment to next number to check.
LDA NUM
ADD ONE
STA NUM
# Reset divisor.
LDA ONE
STA DIV
# Check NUM for primeness by dividing all numbers from 2 to NUM - 1 into it.
DLOOP LDA DIV
# Increment to next divisor.
ADD ONE
STA DIV
# Have we checked up to the number itself?
LDA DIV
SUB NUM
BRZ PRIME
# Setup for divide function.
LDA NUM
# Modulus function: accumulator % DIV.
MODULUS SUB DIV
BRP MODULUS
ADD DIV
# Remainder is now in the accumulator. If it's zero, NUM is not prime.
BRZ NPRIME
BRA DLOOP
# If it's prime, print it.
PRIME LDA NUM
OUT
# Go back to the top.
NPRIME BRA TLOOP
# End of program.
HALT HLT
NUM DAT 1
DIV DAT 1
ONE DAT 1
MAX DAT
First user input is the minimum, second is the maximum (inclusive).
Running (on specter, from 13 to 23):