I have an input file as follows:
file1 : A
file2 : B
file3 : C
file4 : D
file5 : E
file6 : F
and I want my target file to look like this:
file1 | file2 | file3 | file4 | file5 | file6
A | B | C | D | E | F
I'm trying to turn the column values into the header of the target file. Is this possible with tMap? If so, could someone post the answer here?
You can check the tPivotToColumnsDelimited component documentation.
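If you want to see the reshaping itself, independent of Talend, here is a minimal Python sketch of the same pivot; the input lines are taken straight from the question:

```python
# Turn "name : value" input rows into a header line and a value line.
lines = ["file1 : A", "file2 : B", "file3 : C",
         "file4 : D", "file5 : E", "file6 : F"]

pairs = [line.split(" : ") for line in lines]
header = " | ".join(name for name, _ in pairs)
values = " | ".join(value for _, value in pairs)

print(header)  # file1 | file2 | file3 | file4 | file5 | file6
print(values)  # A | B | C | D | E | F
```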
I have an index:
--------------------------------------------------
| id | name | folder | tag1 | tag2 | topic |
--------------------------------------------------
| 1 | file 1 | AAA | [1,2] | [3,4] | 1 |
--------------------------------------------------
| 2 | file 2 | BBB | [1] | [4,5] | 1 |
--------------------------------------------------
| 3 | file 3 | AAA | [2] | [4] | 2 |
--------------------------------------------------
What I need to do is filter files by tag, and the following query works fine:
SELECT id, name,
ANY(var=1 FOR var IN tag1) as tag_filter_1,
ANY(var=5 FOR var IN tag2) as tag_filter_2,
GROUP_CONCAT(id) as files
FROM index
WHERE tag_filter_1 = 1 AND tag_filter_2 = 1
GROUP BY topic;
Now I need to modify the query so the tag1 filter applies only to files from the AAA folder, while at the same time keeping the tag2 filter across all folders.
I was thinking about an OR condition, but it's not supported. Another option was to use GROUP_CONCAT(tag1) and then filter the Sphinx results in PHP, but tag1 is JSON, not a scalar.
I am wondering if it's possible to solve this using the SNIPPET or IF function, and how. Or any other ideas?
It's maybe a little difficult to see, but you can just do...
SELECT id, name,
ANY(var=1 FOR var IN tag1) OR NOT folder='AAA' AS tag_filter_1,
ANY(var=5 FOR var IN tag2) AS tag_filter_2
FROM index WHERE tag_filter_1 = 1 AND tag_filter_2 = 1;
tag_filter_1 becomes: the tag1 filter matches, OR the document is not in the AAA folder.
So a document in the AAA folder needs to match the tag1 filter (because it can't match the second part)...
...but a document in any other folder will always match because of the OR, so the tag1 filter is ignored.
(There isn't a != string operator, so you need to invert the use of the = operator. :)
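To see why the OR NOT folder='AAA' trick works, here is the same condition evaluated in plain Python against the rows of the example index (this only illustrates the boolean logic, not Sphinx itself):

```python
# Rows mirror the index table from the question.
rows = [
    {"id": 1, "folder": "AAA", "tag1": [1, 2], "tag2": [3, 4]},
    {"id": 2, "folder": "BBB", "tag1": [1],    "tag2": [4, 5]},
    {"id": 3, "folder": "AAA", "tag1": [2],    "tag2": [4]},
]

matches = [
    r["id"] for r in rows
    # tag1 filter applies only inside AAA; other folders pass automatically
    if ((1 in r["tag1"]) or r["folder"] != "AAA")
    and (5 in r["tag2"])
]
print(matches)  # [2]
```

Only file 2 survives: file 1 fails the tag2 filter, and file 3 is in AAA but fails the tag1 filter.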
I'm a beginner with Spark, and I have to regroup all the data stored in several files into one.
Note: I have already used Talend, and my goal is to do the same thing with Spark (Scala).
Example :
File 1:
id | attr1.1 | attr1.2 | attr1.3
1 | aaa | aab | aac
2 | aad | aae | aaf
File 2:
id | attr2.1 | attr2.2 | attr2.3
1 | lll | llm | lln
2 | llo | llp | llq
File 3:
id | attr3.1 | attr3.2 | attr3.3
1 | sss | sst | ssu
2 | ssv | ssw | ssx
Desired output:
id |attr1.1|attr1.2|attr1.3|attr2.1|attr2.2|attr2.3|attr3.1|attr3.2|attr3.3
1 | aaa | aab | aac | lll | llm | lln | sss | sst | ssu
2 | aad | aae | aaf | llo | llp | llq | ssv | ssw | ssx
I have 9 files about orders, customers, items, and so on, with several hundred thousand lines, which is why I have to use Spark. Fortunately, the data can be tied together with ids.
File format is .csv.
Final objective: to produce some visualizations from the file generated by Spark.
Question: can you give me some clues on how to do this, please? I saw several approaches with RDDs or DataFrames, but I am completely lost...
Thanks
You didn't specify anything about the original file formats, so assuming you've got them in DataFrames f1, f2, ..., you can create a unified DataFrame by joining them:
val unified = f1.join(f2, f1("id") === f2("id")).join(f3, f1("id") === f3("id"))...
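The join being described is an ordinary inner join on id. As a quick runnable illustration of the resulting wide table (using SQLite in place of Spark, and only the first attribute column of each file, so this is a sketch of the semantics rather than Spark code):

```python
import sqlite3

# Data mirrors File 1-3 from the question (first attribute column only).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE f1 (id INTEGER, attr1_1 TEXT);
CREATE TABLE f2 (id INTEGER, attr2_1 TEXT);
CREATE TABLE f3 (id INTEGER, attr3_1 TEXT);
INSERT INTO f1 VALUES (1,'aaa'), (2,'aad');
INSERT INTO f2 VALUES (1,'lll'), (2,'llo');
INSERT INTO f3 VALUES (1,'sss'), (2,'ssv');
""")

unified = con.execute("""
SELECT f1.id, attr1_1, attr2_1, attr3_1
FROM f1
JOIN f2 ON f1.id = f2.id
JOIN f3 ON f1.id = f3.id
ORDER BY f1.id
""").fetchall()
print(unified)  # [(1, 'aaa', 'lll', 'sss'), (2, 'aad', 'llo', 'ssv')]
```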
This is folders table:
+------------+
| id name |
|------------|
| 1 dir1 |
| 2 dir2 |
+------------+
This is files table:
+------------------------------+
| id folder_id name deleted |
|------------------------------|
| 1 1 file1 true |
| 2 1 file2 false |
| 1 2 file3 true |
| 2 2 file4 true |
+------------------------------+
As you can see, dir1 and dir2 each have 2 files. But in dir1 only one file is deleted and one is still available, whereas in dir2 both files are deleted, so no files remain there.
What I am trying to do is query the folders table for dirs that don't have any files left (all deleted).
So far I tried this, but it did not work:
SELECT
"dirs".*
FROM
"dirs"
INNER JOIN "files" ON ( "files"."folder_id" = "folders"."id" )
WHERE
( "files"."deleted" IS FALSE )
GROUP BY
"folders".ID
HAVING
COUNT ( files.ID ) != 0 /* or > 0 or == 0 */
Expected result:
+------------+
| id name |
|------------|
| 2 dir2 |
+------------+
Because dir2 has files, but all of them are deleted.
Use where not exists:
select *
from folders d
where not exists (
select 1
from files f
where folder_id = d.id
and not deleted)
Working example in rextester.
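The same query can also be checked quickly with SQLite (a sketch, with the deleted booleans stored as 1/0):

```python
import sqlite3

# Tables mirror the folders/files data from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE folders (id INTEGER, name TEXT);
CREATE TABLE files (id INTEGER, folder_id INTEGER, name TEXT, deleted BOOLEAN);
INSERT INTO folders VALUES (1,'dir1'), (2,'dir2');
INSERT INTO files VALUES (1,1,'file1',1), (2,1,'file2',0),
                         (1,2,'file3',1), (2,2,'file4',1);
""")

# A folder qualifies when no non-deleted file exists for it.
result = con.execute("""
SELECT * FROM folders d
WHERE NOT EXISTS (
    SELECT 1 FROM files f
    WHERE f.folder_id = d.id AND NOT f.deleted)
""").fetchall()
print(result)  # [(2, 'dir2')]
```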
Suppose below sed command:
$ seq 4 | sed 'p;n;'
1
1
2
3
3
4
I can't understand why 2 and 4 are printed only once, given that the docs say:
The "n" command will print out the current pattern space...
and p; prints the current pattern space earlier than n; does.
Let me show you my thoughts (O: output, PS: pattern space):
+------------+---------+-----------+
| Current PS | `p;` | `n;` |
+------------+---------+-----------+
| 1 | O=1 | O=1 PS=2 |
+------------+---------+-----------+
| 2 | O=2 | O=2 PS=3 |
+------------+---------+-----------+
| 3 | O=3 | O=3 PS=4 |
+------------+---------+-----------+
| 4 | O=4 | O=4 PS=4 |
+------------+---------+-----------+
What am I missing in the definition of n here? I expected 2 and 4 to be output twice as well.
This is what happens:
1 is read into PS.
p : 1 is printed.
n : 1 is printed again, 2 is read into PS.
End of iteration, 2 is printed.
3 is read into PS.
p : 3 is printed.
etc.
Modify the string to see why it's being printed:
$ seq 4 | sed 'p;s/$/ n command/;n;s/$/ end/'
1
1 n command
2 end
3
3 n command
4 end
I'm trying to find a way to mark duplicated cases, similar to this question.
However, instead of counting occurrences of duplicated values, I'd like to mark them as 0 and 1, for duplicated and unique cases respectively. This is very similar to SPSS's Identify Duplicate Cases function. For example, if I have a dataset like:
Name State Gender
John TX M
Katniss DC F
Noah CA M
Katniss CA F
John SD M
Ariel FL F
And if I wanted to flag those with duplicated names, the output would be something like this:
Name State Gender Dup
John TX M 1
Katniss DC F 1
Noah CA M 1
Katniss CA F 0
John SD M 0
Ariel FL F 1
A bonus would be a query statement that will handle which case to pick when determining the unique case.
SELECT name, state, gender
, NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid) AS Is_not_a_dup
FROM names na
;
Explanation: [NOT] EXISTS(...) results in a boolean value (which can be converted to an integer). Casting to an integer requires an extra pair of parentheses, though:
SELECT name, state, gender
, (NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid))::integer AS is_not_a_dup
FROM names na
;
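For anyone who wants to try the idea outside Postgres, here is a sketch of the same flag in SQLite, where the implicit rowid stands in for ctid (an assumption on my part; ctid itself is Postgres-specific, and rowid orders by insertion rather than physical location). In SQLite, NOT EXISTS already yields 1/0, so no cast is needed:

```python
import sqlite3

# Data mirrors the names table from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE names (name TEXT, state TEXT, gender TEXT);
INSERT INTO names VALUES
    ('John','TX','M'), ('Katniss','DC','F'), ('Noah','CA','M'),
    ('Katniss','CA','F'), ('John','SD','M'), ('Ariel','FL','F');
""")

# A row is flagged 1 unless an earlier row (smaller rowid) has the
# same name and gender.
rows = con.execute("""
SELECT name, state, gender,
       NOT EXISTS (SELECT 1 FROM names nx
                   WHERE nx.name = na.name
                     AND nx.gender = na.gender
                     AND nx.rowid < na.rowid) AS dup
FROM names na
""").fetchall()
for r in rows:
    print(r)  # e.g. ('John', 'TX', 'M', 1) ... ('Katniss', 'CA', 'F', 0)
```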
Results:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 6
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | t
Katniss | DC | F | t
Noah | CA | M | t
Katniss | CA | F | f
John | SD | M | f
Ariel | FL | F | t
(6 rows)
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | 1
Katniss | DC | F | 1
Noah | CA | M | 1
Katniss | CA | F | 0
John | SD | M | 0
Ariel | FL | F | 1
(6 rows)