How to interchange columns and rows in a file using Spark (Scala)

How can I convert columns to rows and rows to columns, similar to transposing a matrix, for the data in a file?
For example:
Input file:
aa ab ac ad ae af ag
ba bb bc bd be bf bg
ca cb cc cd ce cf cg
Output file:
aa ba ca
ab bb cb
ac bc cc
ad bd cd
ae be ce
af bf cf
ag bg cg
Thank you :)

This is similar to the question here:
How to transpose an RDD in Spark
You can convert the DataFrame back to an RDD by calling df.rdd and then follow the steps given in that post about RDDs.
If the DataFrame is small enough to fit in memory on the driver, the first example in that post, a simple transpose using collect(), would work.
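As a rough illustration of the idea, here is a minimal sketch in plain Python (not Spark, and not a quote of the linked post). It assumes the usual distributed approach: tag every cell with its column index, group cells by that index, then restore row order inside each group. The function name transpose_rows is hypothetical, and each step is commented with the RDD operation it is assumed to correspond to.

```python
from collections import defaultdict


def transpose_rows(rows):
    """Transpose equal-length rows by keying every cell on its column index."""
    # Step 1 (roughly flatMap): emit (column_index, (row_index, value)) per cell.
    cells = [
        (col_idx, (row_idx, value))
        for row_idx, row in enumerate(rows)
        for col_idx, value in enumerate(row)
    ]
    # Step 2 (roughly groupByKey): collect all cells that share a column index.
    by_column = defaultdict(list)
    for col_idx, cell in cells:
        by_column[col_idx].append(cell)
    # Step 3 (roughly sortByKey): order the columns, and order each column's
    # cells by their original row index so the output is deterministic.
    return [
        [value for _, value in sorted(by_column[col_idx])]
        for col_idx in sorted(by_column)
    ]


# Sample data from the question, split on whitespace.
lines = [
    "aa ab ac ad ae af ag",
    "ba bb bc bd be bf bg",
    "ca cb cc cd ce cf cg",
]
rows = [line.split() for line in lines]
output_lines = [" ".join(row) for row in transpose_rows(rows)]
print("\n".join(output_lines))
```

In Spark, the same three steps would run across partitions, so the sort after grouping matters: grouping alone does not guarantee row order.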


How to default highest version filter in tableau dashboard?

I am trying to build a dashboard which displays record count by product, category and versions. One product can have multiple versions and one version will have multiple categories. I want to build this dashboard by defaulting it to show the record count by category for the highest version of the selected product whenever any user opens the Tableau dashboard. How do I set my filter so that initially, the dashboard only displays record count for the highest version for each category for the selected product?
I tried computing the max version for each product and using that as a filter, but when I change the product filter the dashboard goes blank, because the selected max version does not apply to the newly selected product.
Data:
Product Versions Category Count
P1 1 C1 15
P1 1 C3 20
P1 1 C4 150
P1 1 C5 200
P1 2 C1 60
P1 2 C3 50
P1 2 C4 10
P1 2 C5 25
P2 8 C1 1500
P2 8 C3 2001
P2 8 C4 1505
P2 8 C5 250
P2 12 C1 600
P2 12 C3 550
P2 12 C4 160
P2 12 C5 258
I expect the output when the user opens the dashboard as:
Filter selection: Product: P2
Category Version Count
C1 12 600
C3 12 550
C4 12 160
C5 12 258
Filter selection: Product: P1
Category Version Count
C1 2 60
C3 2 50
C4 2 10
C5 2 25
There are multiple ways of solving this. One would be to change the product filter to a context filter and keep using the MAX function as the aggregation of the variable count. You do this by right-clicking the filter pill and selecting "Add to context". This way the product filter is applied first and the version filter is applied afterward.
This is probably the simplest approach, in that it does not involve creating calculated fields and is just two clicks away.
You can read more about Tableau's order of operations here.
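To see why the order matters, here is a minimal sketch in plain Python (not Tableau) using the data from the question. It assumes the context filter behaves like a first pass that restricts the rows to the selected product, so the max version is computed only within that product. The function name latest_version_counts is hypothetical.

```python
# (product, version, category, count) rows copied from the question.
DATA = [
    ("P1", 1, "C1", 15), ("P1", 1, "C3", 20), ("P1", 1, "C4", 150), ("P1", 1, "C5", 200),
    ("P1", 2, "C1", 60), ("P1", 2, "C3", 50), ("P1", 2, "C4", 10), ("P1", 2, "C5", 25),
    ("P2", 8, "C1", 1500), ("P2", 8, "C3", 2001), ("P2", 8, "C4", 1505), ("P2", 8, "C5", 250),
    ("P2", 12, "C1", 600), ("P2", 12, "C3", 550), ("P2", 12, "C4", 160), ("P2", 12, "C5", 258),
]


def latest_version_counts(rows, product):
    """Return (category, version, count) for the highest version of one product."""
    # Context filter first: only rows of the selected product survive.
    in_context = [row for row in rows if row[0] == product]
    # The max version is then computed inside that context, so it always
    # belongs to the selected product.
    top_version = max(version for _, version, _, _ in in_context)
    return [
        (category, version, count)
        for _, version, category, count in in_context
        if version == top_version
    ]


for selected in ("P2", "P1"):
    print(selected, latest_version_counts(DATA, selected))
```

If the max version were computed before the product filter, it would be 12 for every selection, and choosing P1 would leave nothing to display, which matches the blank dashboard described in the question.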

Identify missing Dimension Value in Tableau

I have student name, subject and marks in three columns. My use case is to identify the subjects in which a selected student has not enrolled. I can do this very easily in SQL, but the requirement is to do it in Tableau. I have tried LOD expressions and traditional filters, but they did not help.
**Sample Data**-
Name Subject Marks
Rob A 90
Rob B 95
Rob C 98
Ted B 86
Ted D 70
**Desired Output**-
Name Subject
Rob D
Ted A
Ted C
If a graphical solution is also an option, you can put the Subjects on, e.g., the X axis and the Student Names on the Y axis. Then set the mark type to, e.g., text and display the Number of Records in each cell. This way, you should get a matrix of all Subjects vs. all Students, with 0 or 1 in each cell (intersection).
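The matrix idea amounts to comparing every possible (student, subject) pair against the pairs that actually occur. A minimal sketch of that logic in plain Python (not Tableau), using the sample data from the question; the function name missing_subjects is hypothetical:

```python
from itertools import product

# (name, subject, marks) rows copied from the question's sample data.
RECORDS = [
    ("Rob", "A", 90),
    ("Rob", "B", 95),
    ("Rob", "C", 98),
    ("Ted", "B", 86),
    ("Ted", "D", 70),
]


def missing_subjects(records):
    """List (name, subject) pairs that never appear in the records."""
    names = sorted({name for name, _, _ in records})
    subjects = sorted({subject for _, subject, _ in records})
    enrolled = {(name, subject) for name, subject, _ in records}
    # Every cell of the name x subject matrix that has no record is "missing".
    return [pair for pair in product(names, subjects) if pair not in enrolled]


missing = missing_subjects(RECORDS)
for name, subject in missing:
    print(name, subject)
```

The empty cells of the matrix are exactly the pairs this returns, which correspond to the desired output in the question.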

Adding a column to a .csv in MATLAB

I have an nxn .csv file in which I am finding the cumulative sum of one column. I need to append this column, with the header cumsum, to the end of the existing .csv file to make it nx(n+1). How could this be done? I am attaching a sample:
filename A B
aa 23 34
aa 56 98
aa 8 90
aa 7 89
I am finding the cumsum of column A
23
79
87
94
I need this column appended to the end of .csv as
filename A B cumsum
aa 23 34 23
aa 56 98 79
aa 8 90 87
aa 7 89 94
I have 2 problems here:
1. I am extracting column A every time to perform the cumsum operation. How do I compute it directly from the table for a single column without extraction?
2. How do I create a new column at the end of the existing table to add the cumsum column with the header 'cumsum'?
For point 1: You can use csvread to read a specific column directly from a .csv file without loading the whole thing. For your example, you would do this:
A = csvread('your_file.csv', 1, 1, [1 1 nan 1]);
The nan allows it to read all rows until the end (although I'm not sure this is documented anywhere).
The use of csvread is applicable to files containing numeric data, although it works fine for the above example even with character entries in the first row and first column of the .csv file. However, it appears to fail if the part of your file that you want to read is followed by columns containing character entries. A more general solution using xlsread is as follows:
A = xlsread('your_file.csv', 'B:B');
For point 2: Built-in functions like csvwrite or dlmwrite don't appear able to append new columns, just new rows. You can however use xlswrite, even though it is a .csv file. Here's how it would work for your example:
xlswrite('your_file.csv', [{'cumsum'}; num2cell(cumsum(A))], 1, 'D1');
And here's what the contents of your_file.csv would look like:
filename,A,B,cumsum
aa,23,34,23
aa,56,98,79
aa,8,90,87
aa,7,89,94
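For comparison, the same read, accumulate, append-column flow can be sketched outside MATLAB. The following is a minimal Python sketch, not a translation of the MATLAB calls above; it assumes a comma-separated file with a header row and integer values in the summed column, works on in-memory text so it is self-contained, and uses a hypothetical function name append_cumsum.

```python
import csv
import io
from itertools import accumulate


def append_cumsum(csv_text, column="A", new_header="cumsum"):
    """Return csv_text with a running total of `column` appended as a new last column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    col_idx = header.index(column)
    # Running total of the chosen column (assumed to hold integers).
    running = accumulate(int(row[col_idx]) for row in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header + [new_header])
    for row, total in zip(rows, running):
        writer.writerow(row + [total])
    return out.getvalue()


original = "filename,A,B\naa,23,34\naa,56,98\naa,8,90\naa,7,89\n"
updated = append_cumsum(original)
print(updated)
```

Because a CSV file stores rows one after another, adding a column means rewriting every line, which is why this sketch rebuilds the whole file rather than appending to it.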

Break a command into several lines in do-file in Stata

I want to run the keep command in a do-file in Stata 12:
keep a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4
What I want is to do the following:
keep {a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4}
I know the {} brackets don't do the trick but I'm looking for the command that does it. Using #delimiter ; does not work either.
I want to do this because subgroups of variables have a relation among themselves (which I intended to signal above by using a, b and c) and I want to have that clear in my code. I permanently add and delete variables. Note that I don't want to use the drop command (in which case the solution is trivial).
There are several ways. One is using ///. An example:
clear all
set more off
*----- example data -----
set obs 1
forvalues i = 1/25 {
    gen var`i' = `i'
}
describe
*----- what you want -----
keep var1 var2 ///
var3-var7 ///
var8 var11
describe
#delimit will work if used correctly. An example:
<snip>
*----- what you want -----
#delimit ;
keep var1 var2
var3-var7
var8 var11 ;
#delimit cr
describe
There is still another way. help delimit (which you already knew about) states:
See [U] 16.1.3 Long lines in do-files for more information.
That manual entry points you directly to the relevant information.
I suspect lack of research/effort in this case. A Google search (with "stata + break lines in do files") would have easily gotten you there. I'm not recommending this be your first strategy when trying to solve problems in Stata. Rather, start with Stata resources: I recommend reading
[U] 3 Resources for learning and using Stata
[U] 4 Stata’s help and search facilities.
This is just a very simple trick to complement the real solutions by Roberto. Since you have so many variables, one thing I have sometimes found useful is to use local macros to group variables, especially if you can reuse the grouping on more than one occasion.
local a a1 a2 a3 a4 a5
local b b1 b2 b3 b4 b5
local c c1 c2 c3 c4
keep `a' `b' `c'

Matlab reading hex values from text file with non hex values interspersed?

I have a text file that looks something like what's pasted below: several hex values, followed by "XX" lines, followed by more hex values. The pattern repeats ~1M times. I'm looking for a good way to read out just the hex values while ignoring the "XX" lines. textscan seems interesting, but doesn't support hex. fscanf is great, but it chokes as soon as it hits the first "XX" in the file. I wrote a clunky script that reads everything as a string, omits the "XX"s and uses hex2dec, but this is painfully slow (obviously). Any suggestions?
7F
55
8A
9B
6E
XX
XX
XX
XX
FF
DE
BE
EF
XX
XX
XX
04
88
.
.
.
This solution reads 1 million 2-character lines in less than a second on my laptop:
fid = fopen('test.txt');
A = textscan(fid,'%2c','CommentStyle','XX');
fclose(fid);
A = hex2dec(A{:});
Note the 'CommentStyle' option that skips those lines that start with XX.
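The filtering logic itself is small: treat every line that starts with the marker as a comment and convert the rest from base 16. A minimal Python sketch of that logic (not MATLAB, and not a benchmark), using the sample lines from the question; the function name read_hex_values and the skip_marker parameter are hypothetical.

```python
def read_hex_values(lines, skip_marker="XX"):
    """Convert two-character hex lines to integers, skipping marker lines."""
    values = []
    for line in lines:
        token = line.strip()
        # Skip blank lines and lines starting with the marker, much like
        # textscan's 'CommentStyle' option does in the answer above.
        if not token or token.upper().startswith(skip_marker):
            continue
        values.append(int(token, 16))
    return values


sample = [
    "7F", "55", "8A", "9B", "6E",
    "XX", "XX", "XX", "XX",
    "FF", "DE", "BE", "EF",
    "XX", "XX", "XX",
    "04", "88",
]
values = read_hex_values(sample)
print(values)
```

To use it on a file, pass an open file object instead of the list, since iterating over a file yields its lines.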