Scala - Remove first row of Spark DataFrame

I know dataframes are supposed to be immutable and everything, and I know it's not a great idea to try to change them. However, the file I'm receiving has a useless header row of 4 columns (the whole file has 50+ columns). So what I'm trying to do is just get rid of the very top row, because it throws everything off.
I've tried a number of different solutions (mostly found on here) like using .filter() and map replacements, but haven't gotten anything to work.
Here's an example of how the data looks:
H | 300 | 23098234 | N
D | 399 | 54598755 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 654 | 65465465 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 198 | 02982093 | Y | 09983 | 09823 | 02983 | ... | 0987098
Any ideas?

The cleanest way I've seen so far is something along the lines of filtering out the first row:
val csvRows = sc.textFile("path_to_csv")
val skippableFirstRow = csvRows.first()
val usefulCsvRows = csvRows.filter(row => row != skippableFirstRow)
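If the header value could ever repeat later in the file, a slightly more defensive variant is to drop the line by position rather than by value. This is only a sketch, assuming the same csvRows RDD as above:
// Drop only the physical first line, whatever its contents.
val withoutHeader = csvRows
  .zipWithIndex()                        // pair each line with its index
  .filter { case (_, idx) => idx > 0 }   // keep everything except index 0
  .map { case (line, _) => line }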

Related

SPSS group by rows and concatenate string into one variable

I'm trying to export SPSS metadata to a custom format using SPSS syntax. The dataset with value labels contains one or more labels for the variables.
However, now I want to concatenate the value labels into one string per variable. For example, for the variable SEX I want to combine or group the rows F/Female and M/Male into one string F=Female;M=Male;. I have already concatenated the code and label into a new variable using COMPUTE CodeValueLabel = concat(Code,'=',ValueLabel).
So the starting point, the source dataset, looks like this:
+--------------+------+----------------+------------------+
| VarName | Code | ValueLabel | CodeValueLabel |
+--------------+------+----------------+------------------+
| SEX | F | Female | F=Female |
| SEX | M | Male | M=Male |
| ICFORM | 1 | Yes | 1=Yes |
| LIMIT_DETECT | 0 | Too low | 0=Too low |
| LIMIT_DETECT | 1 | Normal | 1=Normal |
| LIMIT_DETECT | 2 | Too high | 2=Too high |
| LIMIT_DETECT | 9 | Not applicable | 9=Not applicable |
+--------------+------+----------------+------------------+
The goal is to get a dataset something like this:
+--------------+-------------------------------------------------+
| VarName | group_and_concatenate |
+--------------+-------------------------------------------------+
| SEX | F=Female;M=Male; |
| ICFORM | 1=Yes; |
| LIMIT_DETECT | 0=Too low;1=Normal;2=Too high;9=Not applicable; |
+--------------+-------------------------------------------------+
I tried using CASESTOVARS, but that creates several separate variables rather than one single string variable. I'm starting to suspect that I'm running up against the limits of what SPSS can do, although maybe it's possible using some AGGREGATE or OMS trickery. Any ideas on how to do this?
First I recreate your example here to demonstrate on:
data list list/varName CodeValueLabel (2a30).
begin data
"SEX" "F=Female"
"SEX" "M=Male"
"ICFORM" "1=Yes"
"LIMIT_DETECT" "0=Too low"
"LIMIT_DETECT" "1=Normal"
"LIMIT_DETECT" "2=Too high"
"LIMIT_DETECT" "9=Not applicable"
end data.
Now to work:
* sorting to make sure all labels are bunched together.
sort cases by varName CodeValueLabel.
string combineall (a300).
* adding ";" .
compute combineall=concat(rtrim(CodeValueLabel), ";").
* if this is the same varname as last row, attach the two together.
if $casenum>1 and varName=lag(varName)
combineall=concat(rtrim(lag(combineall)), " ", rtrim(combineall)).
exe.
*now to select only relevant lines - first I identify them.
match files /file=* /last=selectthis /by varName.
*now we can delete the rest.
select if selectthis=1.
exe.
NOTE: make combineall wide enough to contain all the values of your most populated variable.

Combine multiple rows into single row in Google Data Prep

I have a table which has multiple payload values in separate rows. I want to combine those rows into a single row to have all the data together. Table looks something like this.
+------------+--------------+------+----+----+----+----+
| Date | Time | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A | 2 | | | |
| 2020-04-15 | 05:39:45 UTC | A | | 5 | | |
| 2020-04-15 | 05:39:45 UTC | A | | | 8 | |
| 2020-04-15 | 05:39:45 UTC | A | | | | 7 |
+------------+--------------+------+----+----+----+----+
And I want to convert it to something like this.
+------------+--------------+------+----+----+----+----+
| Date | Time | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A | 2 | 5 | 8 | 7 |
+------------+--------------+------+----+----+----+----+
I tried "set" and "aggregate" but they didn't work as I wanted them to and I am not sure how to go forward.
Any help would be appreciated.
Thanks.
tl;dr:
Use the fill() function to fill all the empty values within each of the D1-D4 columns in the desired group (i.e., grouped by the Date, Time and User columns), then dedup/aggregate to your heart's content.
Long version
The quickest way to do this is with a window function called fill().
For each given field in a column, it essentially says:
"Look down, look up, find the closest non-empty value, and copy it!"
You can of course limit its sight (look only 3 rows above, for example), but this example doesn't need the limitation, so your fill function will look like this:
FILL($col, -1, -1)
Here "$col" references each of the chosen columns, and the "-1" means "unlimited sight".
Finally, the "~" in the column selection says "from column D1 to column D4".
So the full transform looks like this: [screenshot of the fill step, not reproduced here].
Which in turn will make your columns look like this: [screenshot of the filled columns, not reproduced here].
Now you can use the "dedup" transformation to remove any duplicates, and only one copy of each "group" will remain.
Alternatively, if you still want to use "group by", you can do that as well.
Hope this helps =]
P.S.
There are more ways to do this, which entail using the "pivot" transformation and array unnesting, but in the process you'll lose your columns' names and will need to rename them.
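(This is not Dataprep syntax, but for readers coming from the Spark questions in this collection, the same fill-within-group-then-dedup idea can be sketched with Spark window functions. It assumes a DataFrame named df with the column names from the example table above.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Back/forward-fill D1..D4 within each (Date, Time, User) group,
// then keep a single row per group -- the same idea as fill() + dedup.
val grp = Window
  .partitionBy("Date", "Time", "User")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val filled = Seq("D1", "D2", "D3", "D4").foldLeft(df) { (acc, c) =>
  acc.withColumn(c, first(col(c), ignoreNulls = true).over(grp))
}

val combined = filled.dropDuplicates("Date", "Time", "User")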

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so on.
Each resulting dataframe needs to be saved as a different csv file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where each part of the data is selected in a loop and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the bad thing about this solution is that every selection of data by every key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and, during that pass, put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You need to partition by the column when you write and save as csv. Each key's partition is written out separately.
yourDF
.write
.partitionBy("col_key")
.csv("/path/to/save")
Why don't you try this?
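Two practical notes, offered as a hedged sketch rather than a definitive recipe: partitionBy writes one subdirectory per key (e.g. /path/to/save/col_key=AA/) and omits col_key from the file contents, since the value is encoded in the directory name; and repartitioning by the key first is one way to end up with a single output file per key.
import org.apache.spark.sql.functions.col

// Sketch: shuffle each key's rows into the same task before writing,
// so each col_key=... directory ends up with a single csv file.
yourDF
  .repartition(col("col_key"))
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")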

How to find the nearest point in POSTGIS?

table A:
lat | long | the_geom | code | sign
13.8433095 | 100.6360357 | 0101000020E61.... | ABC | start_point
13.7544738 | 100.5459646 | 0101000020E6..... | ABC | end_point
13.4124215 | 100.6232332 | 0101000020E61.... | DEF | start_point
13.2423438 | 100.2324426 | 0101000020E6..... | DEF | end_point
table B:
lat | long | the_geom | code
13.7546285 | 100.5458729 | 0101000020E.... | ABC
13.7546698 | 100.5458513 | 0101000020E.... | ABC
13.7547107 | 100.5458233 | 0101000020E.... | DEF
...
I would like to find, for each point in table A (start and end point), the nearest point among all the points with the same code in table B.
What's the best PostGIS function/PostgreSQL query to solve this ?
If you are using recent versions of the software, you can quickly find the K nearest neighbors to a point in PostGIS using KNN-GiST techniques. Small values of K are fastest, and 1 is about as small as they get, so this should work very well for you. I've only used KNN-GiST with text trigrams, but I know they work with PostGIS, too -- I just don't know the best page to read to get started with it. A web search for "postgis knn gist" shows lots of likely candidates.
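For example, a sketch of such a KNN query, assuming the tables are named table_a and table_b with the column names from the question, PostgreSQL 9.3+ for LATERAL, and a GiST index on table_b.the_geom:
-- For each row of table A, pick the single closest point in table B with the same code.
-- The <-> ordering lets the planner use the GiST index for the nearest-neighbour search.
SELECT a.code, a.sign, nb.*
FROM table_a a
CROSS JOIN LATERAL (
    SELECT b.*, ST_Distance(a.the_geom, b.the_geom) AS dist
    FROM table_b b
    WHERE b.code = a.code
    ORDER BY b.the_geom <-> a.the_geom
    LIMIT 1
) nb;
On older PostGIS versions the <-> ordering is based on bounding-box centroids rather than true distance, so results may need re-checking with ST_Distance.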
A long time after asking this question, I just found the solution for "Nearest Neighbor":
http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_nearest_neighbor_generic

Copying a range remotely

I've got a table named FOO with the column ("Porc" |- 3 7 15 50 15 7 3) and I'm copying the numbers to another table, shown below. I'm doing it the hard way, cell for cell, but I was wondering if there is a way to copy that range of the remote table (A2 to the bottom) in a single command.
| Pr (%) | ROE de A | ROE de B |
|--------+----------+----------|
| 3 | -11.43 | -34.29 |
| 7 | 0. | -11.43 |
| 15 | 3.43 | 0. |
| 50 | 12. | 17.14 |
| 15 | 20.57 | 34.29 |
| 7 | 24. | 41.14 |
| 3 | 30.86 | 54.86 |
|--------+----------+----------|
| Média | 11.86 | 16.41 |
| Desvio | 8.37 | 17.61 |
#+TBLFM: @2$1=remote(FOO, A2)::@3$1=remote(FOO, A3)::@4$1=remote(FOO, A4)::etc
Thanks
It seems your answer is in the org-mode manual:
$3 = remote(FOO, @@#$2)
copy column 2 from table FOO into column 3 of the current table. For the second example, table FOO must have at least as many rows as the current table. Inefficient for large number of rows.
A Kind of Corollary: Copying all fields in a given row
So just as:
$3 = remote(FOO, @@#$2)
copies all the fields from a given column (col2) into column three of the new table, then:
@3 = remote(FOO, @1$$#)
copies all the fields from a given row (row1) into row 3.
There's something about how this standard reference form @r$c interacts with the @# and $# notation that makes this seem a bit abstruse. E.g. this is all the org manual has to say about this remote reference syntax:
@# and $# can be used to get the row or column number of the field where the formula result goes.
Umm…?
Posting this example here because I found it all a bit mystifying, and I hope this helps someone else save a few minutes when dealing with rows and tables in the awesome org-mode.
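Applied back to the original question, a hedged guess at a single range formula (assuming FOO keeps its values in rows 2-8 of its first column, lining up with the data rows of the current table) would replace the cell-by-cell version with:
#+TBLFM: @2$1..@8$1=remote(FOO, @@#$1)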