I have a database where not all columns will be filled. I want to select only the columns that contain at least one value, so a column that is empty from top to bottom should not be selected. All the answers I could find search along the rows, so that if one column is empty the whole row is excluded.
╔═══════════════╦════════════╦══════════════╦════════════╦═════════════╦════════════════╦════════════════╗
║ "receiver_id" ║ "gps_week" ║ "gps_second" ║ "latitude" ║ "longitude" ║ "altitude_msl" ║ "altitude_hae" ║
╠═══════════════╬════════════╬══════════════╬════════════╬═════════════╬════════════════╬════════════════╣
║ 1 ║ ║ ║ 38.0517465 ║ 15.660851 ║ ║ 691.379883 ║
║ 1 ║ ║ ║ 38.0517465 ║ 15.660851 ║ ║ 691.389404 ║
║ 1 ║ ║ ║ 38.0517465 ║ 15.660851 ║ ║ 691.402344 ║
║ 1 ║ ║ ║ 38.0517465 ║ 15.6608509 ║ ║ 691.413818 ║
║ 1 ║ ║ ║ 38.0517465 ║ 15.6608508 ║ ║ 691.425659 ║
╚═══════════════╩════════════╩══════════════╩════════════╩═════════════╩════════════════╩════════════════╝
In the example table I want to return
receiver_id, latitude, longitude, altitude_hae
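One possible approach, as a minimal sketch: `COUNT(col)` ignores NULLs, so one aggregate query can report which columns hold at least one value, and a second query can then select only those. The example uses Python's built-in `sqlite3` module; the table name `positions`, the helper `non_empty_columns`, and the sample rows are assumptions for illustration, and it assumes empty cells are stored as NULL rather than as empty strings.

```python
import sqlite3

def non_empty_columns(conn, table):
    """Return the names of columns in `table` holding at least one non-NULL value."""
    # Read the column names from the table itself instead of hard-coding them.
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [d[0] for d in cur.description]
    # COUNT(col) counts only non-NULL values, so 0 means the column is empty.
    counts_sql = ", ".join(f'COUNT("{c}")' for c in columns)
    counts = conn.execute(f'SELECT {counts_sql} FROM "{table}"').fetchone()
    return [c for c, n in zip(columns, counts) if n > 0]

# Hypothetical sample data mirroring the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions (receiver_id, gps_week, gps_second, "
    "latitude, longitude, altitude_msl, altitude_hae)"
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, None, None, 38.0517465, 15.660851, None, 691.379883),
        (1, None, None, 38.0517465, 15.6608509, None, 691.413818),
    ],
)

kept = non_empty_columns(conn, "positions")
select_sql = "SELECT {} FROM positions".format(", ".join(f'"{c}"' for c in kept))
rows = conn.execute(select_sql).fetchall()
print(kept)
```

The column list has to be built in a first step because plain SQL cannot vary the set of selected columns based on the data.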
I need to convert the following dataframe:
╔══════╦════════╦════════╦════════╗
║ Year ║ ColA ║ ColB ║ ColC ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║ 1 ║ 2 ║ 3 ║
║ 2018 ║ 4 ║ 5 ║ 6 ║
║ 2019 ║ 7 ║ 8 ║ 9 ║
╚══════╩════════╩════════╩════════╝
Into this:
╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║ ColA ║ 1 ║
║ 2017 ║ ColB ║ 2 ║
║ 2017 ║ ColC ║ 3 ║
║ 2018 ║ ColA ║ 4 ║
║ 2018 ║ ColB ║ 5 ║
║ 2018 ║ ColC ║ 6 ║
║ 2019 ║ ColA ║ 7 ║
║ 2019 ║ ColB ║ 8 ║
║ 2019 ║ ColC ║ 9 ║
╚══════╩════════╩═══════╝
This needs to support any number of columns besides the first "Year" one, which could be 1 or many. And it should be a generic solution, meaning it should not use hard-coded column names anywhere, but it should read the column names directly from the original dataframe.
I'm using Databricks with a notebook written in Scala. Very new to both Spark and Scala.
UPDATE
I've found this solution in Python that works well, but I'm having a hard time converting it to Scala.
from pyspark.sql import functions as F

def columnsToRows(df, by):
    # Filter dtypes and split into column names and type description.
    # Only get columns not in "by".
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Create and explode an array of (column_name, column_value) structs
    kvs = F.explode(F.array([
        F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])
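You can use stack to transpose the data:
import org.apache.spark.sql.functions.expr
val fixedColumns = Seq("Year", "FixedColumn")
// Build a "'name', name" pair for every column that is not fixed
val cols = df.columns
  .filter(c => !(fixedColumns.contains(c)))
  .map(c => (s"'${c}', ${c}"))
// stack(n, 'name1', name1, 'name2', name2, ...) emits one row per pair
val exp = cols.mkString(s"stack(${cols.size}, ", ", ", ") as (Point, Value)")
df.select($"Year", expr(exp))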
Output:
+----+------+-----+
|Year|Point |Value|
+----+------+-----+
|2017|PointA|1 |
|2017|PointB|2 |
|2017|PointC|3 |
|2018|PointA|4 |
|2018|PointB|5 |
|2018|PointC|6 |
|2019|PointA|7 |
|2019|PointB|8 |
|2019|PointC|9 |
+----+------+-----+
Your Python code translates like this:
import org.apache.spark.sql.functions._
val colsToKeep = Seq("year").map(col)
val colsToTransform = Seq("colA","colB","colC")
df.select((colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("colName"), col(c).alias("colValue"))): _*)
    ).as("NameValue")): _*)
  .select((colsToKeep :+ $"nameValue.colName" :+ $"nameValue.colValue"): _*)
  .show()
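To make the reshaping itself concrete without a Spark cluster, here is a plain-Python sketch of the same wide-to-long transformation on a list of dicts. It only illustrates the logic (keep the fixed columns, emit one row per remaining column) and is not Spark code; the function name `columns_to_rows` and the sample rows are assumptions.

```python
def columns_to_rows(rows, by):
    """Unpivot each row: keep the `by` keys, emit one output row per other column."""
    out = []
    for row in rows:
        fixed = {k: row[k] for k in by}
        for name, value in row.items():
            if name not in by:
                # Column names are read from the data, never hard-coded.
                out.append({**fixed, "ColName": name, "Value": value})
    return out

wide = [
    {"Year": 2017, "ColA": 1, "ColB": 2, "ColC": 3},
    {"Year": 2018, "ColA": 4, "ColB": 5, "ColC": 6},
]
long_rows = columns_to_rows(wide, by=["Year"])
for r in long_rows:
    print(r)
```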
I posted a question on this here, but then I realized I want more than what I was asking.
I actually need the results to be DISTINCT on the name column, keeping the row with the highest ts_rank. My code is:
SELECT name
,ts_rank(to_tsvector(name), query) + ts_rank(to_tsvector(content), query2) AS rank
FROM users
INNER JOIN microposts ON users.id = microposts.user_id
,plainto_tsquery('re') query
,plainto_tsquery('comics') query2
WHERE users.name @@ query
OR microposts.content @@ query2
ORDER BY rank DESC;
Gives
╔════════════════╤═════════════════════════════════════════╤═══════════╗
║ name │ content │ rank ║
╠════════════════╪═════════════════════════════════════════╪═══════════╣
║ Dawson Kreiger │ dc comics dc comics dc comics dc comics │ 0.0919062 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Kaylin Green │ dc comics dc comics dc comics │ 0.0889769 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Dawson Kreiger │ dc comics dc comics │ 0.0827456 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Kaylin Green │ dc comics │ 0.0759909 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Dawson Kreiger │ I went to the beach dc comics │ 0.0607927 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Dawson Kreiger │ I went to the beach dc comics │ 0.0607927 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Kaylin Green │ I went to the beach dc comics │ 0.0607927 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Kaylin Green │ I went to the beach dc comics │ 0.0607927 ║
╚════════════════╧═════════════════════════════════════════╧═══════════╝
So I need the output to be this,
╔════════════════╤═════════════════════════════════════════╤═══════════╗
║ name │ content │ rank ║
╠════════════════╪═════════════════════════════════════════╪═══════════╣
║ Dawson Kreiger │ dc comics dc comics dc comics dc comics │ 0.0919062 ║
╟────────────────┼─────────────────────────────────────────┼───────────╢
║ Kaylin Green │ dc comics dc comics dc comics │ 0.0889769 ║
╚════════════════╧═════════════════════════════════════════╧═══════════╝
So I need to select a record that is distinct by name and has the highest rank. But how will the code know how to select the distinct user with the highest ts_rank?
EDIT
For instance if I do this
SELECT name
, ts_rank(to_tsvector(name), query) + ts_rank(to_tsvector(content), query2) AS rank
FROM
(
SELECT DISTINCT name FROM users WHERE rank = MAX(rank)
)
INNER JOIN microposts ON users.id=microposts.user_id
, plainto_tsquery('re') query
,plainto_tsquery('comics') query2
WHERE users.name @@ query
OR microposts.content @@ query2
ORDER BY rank DESC;
I get the error: column "rank" does not exist
You could do a GROUP BY with a MAX.
SELECT name
,MAX(ts_rank(to_tsvector(name), query) + ts_rank(to_tsvector(content), query2)) AS rank
FROM users
INNER JOIN microposts ON users.id = microposts.user_id
,plainto_tsquery('re') query
,plainto_tsquery('comics') query2
WHERE users.name @@ query
OR microposts.content @@ query2
GROUP BY name
ORDER BY rank DESC;
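The GROUP BY query returns only name and rank. If the matching content is also wanted, as in the desired output, one option is to join the per-name maximum back to the scored rows. The sketch below shows that shape using Python's built-in `sqlite3`; because SQLite has no `ts_rank`, a hypothetical table `scored` with a precomputed `score` column stands in for the ranked result of the original query. Note that a name whose top score is tied across several rows would come back more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `scored` is a stand-in for the ranked rows produced by the original query.
conn.execute("CREATE TABLE scored (name TEXT, content TEXT, score REAL)")
conn.executemany("INSERT INTO scored VALUES (?, ?, ?)", [
    ("Dawson Kreiger", "dc comics dc comics dc comics dc comics", 0.0919062),
    ("Kaylin Green", "dc comics dc comics dc comics", 0.0889769),
    ("Dawson Kreiger", "dc comics dc comics", 0.0827456),
    ("Kaylin Green", "dc comics", 0.0759909),
])

# Find the best score per name, then join back to recover the full row.
top_per_name = conn.execute("""
    SELECT s.name, s.content, s.score
    FROM scored s
    JOIN (SELECT name, MAX(score) AS best FROM scored GROUP BY name) m
      ON m.name = s.name AND m.best = s.score
    ORDER BY s.score DESC
""").fetchall()

for row in top_per_name:
    print(row)
```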
Table one - workorder
╔══════════╦══════════════╦══════════════╗
║ id ║ wpeople ║ start_date ║
╠══════════╬══════════════╬══════════════╣
║ 1 ║ 1,2,4 ║ 02.08.2016 ║
║ 2 ║ 4,5 ║ 28.09.2016 ║
╚══════════╩══════════════╩══════════════╝
Table two - employees
╔══════════╦═════════════════╗
║ id ║ name ║
╠══════════╬═════════════════╣
║ 1 ║ John ║
║ 2 ║ Ben ║
║ 3 ║ Ian ║
║ 4 ║ Hank ║
║ 5 ║ George ║
╚══════════╩═════════════════╝
Desired output: who needs to work on each project
╔══════════╦════════════════╦════════════╗
║ 1 ║ John,Ben,Hank ║ 02.08.2016 ║
║ 2 ║ Hank,George ║ 28.09.2016 ║
╚══════════╩════════════════╩════════════╝
I have tried with GROUP_CONCAT and FIND_IN_SET:
SELECT w.id,
GROUP_CONCAT(e.name ORDER BY e.id) workorder
FROM workorder w
INNER JOIN employees e
ON FIND_IN_SET(e.id, w.wpeople) > 0
GROUP BY w.id
But the output is
╔══════════╦════════════════╦════════════╗
║ 1 ║ John ║ 02.08.2016 ║
║ 1 ║ Ben ║ 02.08.2016 ║
║ 1 ║ Hank ║ 02.08.2016 ║
║ 2 ║ Hank ║ 28.09.2016 ║
║ 2 ║ George ║ 28.09.2016 ║
╚══════════╩════════════════╩════════════╝
I searched on Google for this, and the solution seems to be GROUP_CONCAT with FIND_IN_SET. It may be that I didn't understand these functions very well.
Thanks for your time!
Stefan
For anyone who will need this:
I added a new table werkbon_employee
╔══════════╦═══════════════════╦═══════════════╗
║ id ║ workorder_id ║ employee_id ║
╠══════════╬═══════════════════╬═══════════════╣
║ 1 ║ 1 ║ 1 ║
║ 2 ║ 1 ║ 2 ║
║ 3 ║ 1 ║ 4 ║
║ 4 ║ 2 ║ 4 ║
║ 5 ║ 2 ║ 5 ║
╚══════════╩═══════════════════╩═══════════════╝
I used this query to select:
SELECT *,
GROUP_CONCAT(e.name ORDER BY e.id) ename
FROM werkbon
LEFT JOIN werkbon_employee we ON werkbon.id = we.werkbon_id
INNER JOIN employees e ON FIND_IN_SET(e.id, we.employee_id) > 0
GROUP BY werkbon.id DESC LIMIT 1
Now the result is
Werk mensen
John,Ben,Hank
Datum
02.08.2016
Thanks to @MarcB for the help
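For reference, once each employee id sits in its own row of a junction table, a plain equality join is enough and FIND_IN_SET is no longer needed. Here is a small sketch of that shape using Python's built-in `sqlite3` rather than MySQL. The table name `workorder_employee` and its columns follow the tables shown above and are assumptions about the real schema. SQLite's `GROUP_CONCAT` does not guarantee the order of the names, whereas MySQL lets you write `ORDER BY` inside `GROUP_CONCAT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workorder (id INTEGER PRIMARY KEY, start_date TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE workorder_employee (workorder_id INTEGER, employee_id INTEGER);
    INSERT INTO workorder VALUES (1, '02.08.2016'), (2, '28.09.2016');
    INSERT INTO employees VALUES
        (1, 'John'), (2, 'Ben'), (3, 'Ian'), (4, 'Hank'), (5, 'George');
    INSERT INTO workorder_employee VALUES (1, 1), (1, 2), (1, 4), (2, 4), (2, 5);
""")

# One row per work order, with the employee names collapsed into one string.
rows = conn.execute("""
    SELECT w.id, GROUP_CONCAT(e.name) AS names, w.start_date
    FROM workorder w
    JOIN workorder_employee we ON we.workorder_id = w.id
    JOIN employees e ON e.id = we.employee_id
    GROUP BY w.id, w.start_date
    ORDER BY w.id
""").fetchall()

for row in rows:
    print(row)
```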
I have two Google spreadsheets. Three columns on the second spreadsheet are being imported through the IMPORTRANGE() formula. It looks like this:
Spreadsheet 1
╔════════╦════════╦════════╦════════╗
║ title1 ║ title2 ║ title3 ║ title4 ║
╠════════╬════════╬════════╬════════╣
║ input1 ║ input4 ║ input7 ║ ║
║ input2 ║ input5 ║ input8 ║ ║
║ input3 ║ input6 ║ input9 ║ ║
╚════════╩════════╩════════╩════════╝
Spreadsheet 2
╔════════╦════════╦════════╗
║ title1 ║ title2 ║ title3 ║
╠════════╬════════╬════════╣
║ input1 ║ input4 ║ input7 ║
║ input2 ║ input5 ║ input8 ║
║ input3 ║ input6 ║ input9 ║
╚════════╩════════╩════════╝
The thing is, I only want the data to be imported if the corresponding cell in the title4 column is populated. Like this:
If Spreadsheet 1 looks like this
╔════════╦════════╦════════╦═════════╗
║ title1 ║ title2 ║ title3 ║ title4 ║
╠════════╬════════╬════════╬═════════╣
║ input1 ║ input4 ║ input7 ║ ║
║ input2 ║ input5 ║ input8 ║ input11 ║
║ input3 ║ input6 ║ input9 ║ ║
╚════════╩════════╩════════╩═════════╝
then Spreadsheet 2 should look like this
╔════════╦════════╦════════╗
║ title1 ║ title2 ║ title3 ║
╠════════╬════════╬════════╣
║ input2 ║ input5 ║ input8 ║
╚════════╩════════╩════════╝
I figured this out.
I used the formula:
=IF("Sheet Title 1!F1:F1000"<>"", IMPORTRANGE("Spreadsheet Key","Sheet Title 1!E1:E1000"), "")
So the contents of Column E in the first sheet are only copied to the second sheet if the cell in the same row in Column F is not empty.
From the table below, if a Project/Date group has both a Fail and a Success row, I'd like to keep the Fail row; if the group has a single row (like the rest), keep that row regardless of Status. For example, I'd like to keep the first row, discard the second, and keep the rest.
╔═════════╦══════════╦═════════╗
║ PROJECT ║ DATE ║ STATUS ║
╠═════════╬══════════╬═════════╣
║ HLM ║ 20130422 ║ Fail ║
║ HLM ║ 20130422 ║ Success ║
║ HLM ║ 20130423 ║ Fail ║
║ HLM ║ 20130424 ║ Success ║
║ HLM ║ 20130425 ║ Fail ║
║ HLM ║ 20130426 ║ Success ║
╚═════════╩══════════╩═════════╝
WITH records
AS
(
SELECT [Project], [Date], [Status],
ROW_NUMBER() OVER (PARTITION BY [Project], [Date]
ORDER BY [Status]) rn
FROM TableName
)
SELECT [Project], [Date], [Status]
FROM records
WHERE rn = 1
OUTPUT
╔═════════╦══════════╦═════════╗
║ PROJECT ║ DATE ║ STATUS ║
╠═════════╬══════════╬═════════╣
║ HLM ║ 20130422 ║ Fail ║
║ HLM ║ 20130423 ║ Fail ║
║ HLM ║ 20130424 ║ Success ║
║ HLM ║ 20130425 ║ Fail ║
║ HLM ║ 20130426 ║ Success ║
╚═════════╩══════════╩═════════╝
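The query above picks Fail first because 'Fail' sorts before 'Success' alphabetically. If other status values were possible, the priority could be stated explicitly with a CASE expression. The sketch below runs that variant with Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer) rather than the original database; the CASE ordering is an assumption about the intent, and `TableName` follows the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (Project TEXT, Date TEXT, Status TEXT)")
conn.executemany("INSERT INTO TableName VALUES (?, ?, ?)", [
    ("HLM", "20130422", "Fail"),
    ("HLM", "20130422", "Success"),
    ("HLM", "20130423", "Fail"),
    ("HLM", "20130424", "Success"),
    ("HLM", "20130425", "Fail"),
    ("HLM", "20130426", "Success"),
])

# Number the rows in each Project/Date group, putting Fail rows first,
# then keep only the first row of every group.
query = """
WITH records AS (
    SELECT Project, Date, Status,
           ROW_NUMBER() OVER (
               PARTITION BY Project, Date
               ORDER BY CASE Status WHEN 'Fail' THEN 0 ELSE 1 END
           ) AS rn
    FROM TableName
)
SELECT Project, Date, Status
FROM records
WHERE rn = 1
ORDER BY Date
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```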