Next command behavior on pattern space - sed

Consider the sed command below:
$ seq 4 | sed 'p;n;'
1
1
2
3
3
4
I can't understand why 2 and 4 are printed only once, given that the documentation says:
The "n" command will print out the current pattern space...
and p; prints the current pattern space before n; runs.
Let me show you my thoughts (O: output, PS: pattern space):
+------------+---------+-----------+
| Current PS | `p;`    | `n;`      |
+------------+---------+-----------+
| 1          | O=1     | O=1  PS=2 |
+------------+---------+-----------+
| 2          | O=2     | O=2  PS=3 |
+------------+---------+-----------+
| 3          | O=3     | O=3  PS=4 |
+------------+---------+-----------+
| 4          | O=4     | O=4  PS=4 |
+------------+---------+-----------+
What am I missing in the definition of n? I expected 2 and 4 to be output twice as well.

This is what happens:
1 is read into PS.
p : 1 is printed.
n : 1 is printed again, 2 is read into PS.
End of iteration: 2 is printed once by the automatic print. The script is not restarted for 2, so p never runs on it.
3 is read into PS.
p : 3 is printed.
etc.
Append a marker after each command to see which step prints each line:
$ seq 4 | sed 'p;s/$/ n command/;n;s/$/ end/'
1
1 n command
2 end
3
3 n command
4 end
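The cycle described above can also be traced with a small simulation. This is only a sketch of the behaviour of 'p;n' without the -n option, assuming GNU sed semantics when the input runs out (the pattern space is printed and sed exits); the function name sed_p_n is made up for illustration.

```python
def sed_p_n(lines):
    """Simulate `sed 'p;n'` (no -n option) over an iterable of input lines."""
    out = []
    it = iter(lines)
    for pattern_space in it:           # start of a cycle: read a line into PS
        out.append(pattern_space)      # p: print PS
        nxt = next(it, None)           # n: look for the next input line
        if nxt is not None:
            out.append(pattern_space)  # n: print the current PS first...
            pattern_space = nxt        # ...then replace PS with the next line
        # If there was no next line, GNU sed prints PS and exits,
        # which is the same output as the end-of-cycle print below.
        out.append(pattern_space)      # end of cycle: automatic print of PS
    return out


print("\n".join(sed_p_n(["1", "2", "3", "4"])))
```

Lines 2 and 4 are only ever appended by the end-of-cycle print, which is why they appear once.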


how to get multiple rows from one row in spark scala [duplicate]

I have a dataframe in Spark like the one below, and I want to turn each code column into its own row, keyed by the first column, id.
+----------------------------------+
| id code1 code2 code3 code4 code5 |
+----------------------------------+
| 1 A B C D E |
| 1 M N O P Q |
| 1 P Q R S T |
| 2 P A C D F |
| 2 S D F R G |
+----------------------------------+
I want the output like below format
+-------------+
| id code |
+-------------+
| 1 A |
| 1 B |
| 1 C |
| 1 D |
| 1 E |
| 1 M |
| 1 N |
| 1 O |
| 1 P |
| 1 Q |
| 1 P |
| 1 Q |
| 1 R |
| 1 S |
| 1 T |
| 2 P |
| 2 A |
| 2 C |
| 2 D |
| 2 F |
| 2 S |
| 2 D |
| 2 F |
| 2 R |
| 2 G |
+-------------+
Can anyone please show me how to get the above output with Spark and Scala?
Using the array, explode, and drop functions should give you the desired output:
import org.apache.spark.sql.functions.{array, explode}
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
  .drop("code1", "code2", "code3", "code4", "code5")
OR
as suggested by undefined_variable, you can just use select
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
df.select(col("id"), explode(split(concat_ws(",", col("code1"), col("code2"), col("code3"), col("code4"), col("code5")), ",")).as("code"))
The basic idea is to first concatenate all the required columns, split the result back into an array, and then explode it.
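For readers who want to see what the explode step does to the data, here is a plain-Python sketch (no Spark involved) that performs the same row flattening on the sample rows from the question. The function name flatten_codes is hypothetical.

```python
# Sample rows from the question: (id, code1, code2, code3, code4, code5)
rows = [
    (1, "A", "B", "C", "D", "E"),
    (1, "M", "N", "O", "P", "Q"),
    (1, "P", "Q", "R", "S", "T"),
    (2, "P", "A", "C", "D", "F"),
    (2, "S", "D", "F", "R", "G"),
]


def flatten_codes(rows):
    """Emit one (id, code) pair per code column, keeping the input order."""
    return [(row[0], code) for row in rows for code in row[1:]]


for pair in flatten_codes(rows):
    print(*pair)
```

Each input row yields five output rows, so the five sample rows produce 25 pairs, in the same order as the desired output.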

What is the fastest way to extract all n-grams of lengths 1, 2, and 3 from a body of text in PostgreSQL?

I have many bodies of text, and for each of them, I want to extract all unigrams, bigrams, and trigrams (words, not characters) and insert the counts and ngram lengths into another table.
Right now I am thinking of unnesting a regexp-split body of text using WITH ORDINALITY, and then using multiple subqueries for the bigrams and trigrams, but that requires ordering. However, I think this might be an inefficient way of going about it, since this sort of positional data should normally be accessed by index.
I am currently implementing this in Python, and a huge bottleneck is the dictionary insertion and searching of dictionaries/sets for stopwords.
Here is a very basic example:
Input:
This is a small, small sentence.
Output
ngram | count | length
-------------------------------------
this | 1 | 1
is | 1 | 1
a | 1 | 1
small | 2 | 1
sentence | 1 | 1
this is | 1 | 2
is a | 1 | 2
a small | 1 | 2
small small | 1 | 2
small sentence | 1 | 2
this is a | 1 | 3
is a small | 1 | 3
a small small | 1 | 3
small small sentence | 1 | 3
Stripping the punctuation/handling lowercase is not an issue here, but getting the proper counts is important.
As a preliminary or intermediate step, I would also be removing stopwords, which in this case are this, a, and is.
ngram | count | length
--------------------------------------
small | 2 | 1
sentence | 1 | 1
small small | 1 | 2
small sentence | 1 | 2
small small sentence | 1 | 3
In the above example
Use the window function lead() to generate bigrams and trigrams, and unions to place all ngrams in a single list. In fact, the most difficult part was keeping the result set in the same order as the starting sentence.
with my_table(sentence) as (
    values ('This is a small, small sentence.')
),
words as (
    select id, word
    from my_table,
        regexp_split_to_table(lower(sentence), '[^a-zA-Z]+') with ordinality as t(word, id)
    where word <> ''
)
select ngram, count(*), length
from (
    select distinct on(id, ngram) id, ngram, length
    from (
        select id, word as ngram, 1 as length
        from words
        union all
        select id, concat_ws(' ', word, lead(word, 1) over w), 2
        from words
        window w as (order by id)
        union all
        select id, concat_ws(' ', word, lead(word, 1) over w, lead(word, 2) over w), 3
        from words
        window w as (order by id)
    ) s
    order by id, ngram, length
) s
group by ngram, length
order by length, min(id);
ngram | count | length
----------------------+-------+--------
this | 1 | 1
is | 1 | 1
a | 1 | 1
small | 2 | 1
sentence | 1 | 1
this is | 1 | 2
is a | 1 | 2
a small | 1 | 2
small small | 1 | 2
small sentence | 1 | 2
this is a | 1 | 3
is a small | 1 | 3
a small small | 1 | 3
small small sentence | 1 | 3
(14 rows)
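Since the question mentions an existing Python implementation, the following sketch may be useful for cross-checking the counts returned by the SQL. It assumes words are runs of ASCII letters and that stopwords are dropped before the n-grams are built; the function name ngram_counts and its parameters are made up for this example.

```python
import re
from collections import Counter


def ngram_counts(text, max_n=3, stopwords=frozenset()):
    """Count word n-grams of length 1..max_n, keyed by (ngram, length)."""
    # Lowercase, split on anything that is not a letter, drop empties and stopwords.
    words = [w for w in re.split(r"[^a-z]+", text.lower())
             if w and w not in stopwords]
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words over the sentence.
        for i in range(len(words) - n + 1):
            counts[(" ".join(words[i:i + n]), n)] += 1
    return counts


sentence = "This is a small, small sentence."
for (ngram, length), count in ngram_counts(sentence).items():
    print(f"{ngram} | {count} | {length}")
```

Passing stopwords={"this", "is", "a"} leaves only the five n-grams built from small, small, and sentence.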
You can do this with a recursive query:
with recursive words as (
    select id, translate(word, '.,', '') as word
    from my_table,
        regexp_split_to_table(lower(sentence), '\s+') with ordinality as t(word, id)
    where word <> ''
), ngrams (id, ngram) as (
    select id, array[word]
    from words
    where word not in ('this', 'a', 'is') -- remove stop words
    union all
    select c.id, p.ngram||c.word
    from words c
        join ngrams p on p.id + 1 = c.id
            and cardinality(p.ngram) <= 2 -- limit to 3 words
)
select array_to_string(ngram, ' '),
    count(*) over (partition by ngram) as "count",
    cardinality(ngram) as length
from ngrams
order by cardinality(ngram);
For the sample 'This is a small, small sentence.' this returns:
ngram | count | length
---------------------+-------+-------
a | 1 | 1
is | 1 | 1
sentence | 1 | 1
small | 2 | 1
small | 2 | 1
this | 1 | 1
this is | 1 | 2
small small | 1 | 2
is a | 1 | 2
small sentence | 1 | 2
a small | 1 | 2
is a small | 1 | 3
this is a | 1 | 3
a small small | 1 | 3
small small sentence | 1 | 3
And with stop words removed:
ngram | count | length
---------------------+-------+-------
sentence | 1 | 1
small | 2 | 1
small | 2 | 1
small sentence | 1 | 2
small small | 1 | 2
small small sentence | 1 | 3
Not sure how fast this is going to be though.
Online example: http://rextester.com/CPPU86582

How to set sequence number of sub-elements in TSQL using same element as parent?

I need to assign a sequence number in T-SQL: the first column contains a repeating sequence marker, and another column defines the ordering.
It is hard to explain, so here is an example.
This is what I need:
|------------|-------------|----------------|
| Group Col | Order Col | Desired Result |
|------------|-------------|----------------|
| D | 1 | NULL |
| A | 2 | 1 |
| C | 3 | 1 |
| E | 4 | 1 |
| A | 5 | 2 |
| B | 6 | 2 |
| C | 7 | 2 |
| A | 8 | 3 |
| F | 9 | 3 |
| T | 10 | 3 |
| A | 11 | 4 |
| Y | 12 | 4 |
|------------|-------------|----------------|
So my marker is A: each time I encounter A, I must start a new group in my result. All rows before the first A must be set to NULL.
I know I can achieve this with a loop, but that would be slow, and I need to update a lot of rows (sometimes several thousand).
Is there a way to achieve this without a loop?
You can use the window version of COUNT to get the desired result:
SELECT [Group Col], [Order Col],
       COUNT(CASE WHEN [Group Col] = 'A' THEN 1 END)
           OVER (ORDER BY [Order Col]) AS [Desired Result]
FROM mytable
COUNT returns 0 for the rows before the first A. If you need those rows set to NULL, use SUM instead of COUNT, because SUM over only NULL inputs returns NULL.
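The running count over the ordered rows can be sketched outside the database as well. The following Python snippet mirrors the SUM variant (rows before the first marker get None, standing in for NULL); the function name group_sequence is hypothetical, and the input is assumed to be already sorted by the order column.

```python
def group_sequence(groups, marker="A"):
    """Return the running number of markers seen so far, or None before the first."""
    result = []
    seq = None
    for group in groups:               # rows are assumed ordered by [Order Col]
        if group == marker:
            seq = 1 if seq is None else seq + 1   # a marker starts a new group
        result.append(seq)
    return result


groups = ["D", "A", "C", "E", "A", "B", "C", "A", "F", "T", "A", "Y"]
print(group_sequence(groups))
```

This is a single pass over the data, which is also how the window aggregate avoids the row-by-row loop the question wants to get rid of.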

How to compute the dot product of two columns (treating each full column as a vector)?

Given this table:
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 2 | 2 | |
I want to get the dot product of the two columns a and b. The result should be equal to (3*4)+(1*2)+(1*3)+(2*2), which is 21.
I don't want to use the clumsy formula (B1*B2+C1*C2+D1*D2+E1*E2), because I actually have a large table to calculate.
I know that Emacs's Calc tool has a "vprod" function that can do this sort of thing, but I don't know how to turn a full column into a vector.
Can anybody tell me how to achieve this? I'd appreciate it!
In emacs-calc, the simple product of 2 vectors calculates the dot product.
This works (I put the result in @6$3; also, the parentheses can be omitted):
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 2 | 2 | |
|---+---+----|
| | | 21 |
#+TBLFM: @6$3=(@I$1..@II$1)*(@I$2..@II$2)
@I and @II refer to the first and second hlines, so each range spans the rows between them.
This can be solved using babel and R in org-mode:
#+name: mytable
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 2 | 2 | |
#+begin_src R :var mytable=mytable
sum(mytable$a * mytable$b)
#+end_src
#+RESULTS:
: 21
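As a quick sanity check of the expected value, the same column-wise sum of products can be written in a few lines of Python. This is only an illustration using the data from the question's table; the function name dot is made up.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("columns must have the same length")
    return sum(x * y for x, y in zip(u, v))


a = [3, 1, 1, 2]   # column a from the question
b = [4, 2, 3, 2]   # column b from the question
print(dot(a, b))   # (3*4)+(1*2)+(1*3)+(2*2) = 21
```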

how can i decipher dns messages?

I'm writing a program that receives DNS messages and responds with an appropriate answer (a simple DNS server that only replies with A records).
But the messages I receive don't look like the format described in RFC 1035.
For example, this is a DNS query generated by nslookup:
'\xe1\x0c\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x06google\x03com\x00\x00\x01\x00\x01'
I know about the DNS header fields and bits defined in RFC 1035, but why is the message in hex?
Should I treat the bytes as hex numbers or as their UTF-8 equivalents?
Should my responses use this format too?
It's coming out as hexadecimal because it is a raw binary request, but you are presumably trying to print it out as a string. That is apparently how non-printable characters are displayed by whatever you are using to print it out; it escapes them as hex sequences.
You don't interpret this as "hex" or UTF-8 at all; you need to interpret the binary format described by the RFC. If you mention what language you're using, I (or someone else) might be able to describe to you how to handle data in a binary format like this.
Until then, let's take a look at RFC 1035 and see how to interpret your query by hand:
The header contains the following fields:
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Each row there is 16 bits, and there are six rows, so the header is 12 bytes. Let's fill our first 12 bytes in:
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   ID = e10c                   |  \xe1 \x0c
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| 0| Opcode=0  | 0| 0| 1| 0|  Z=0   |  RCODE=0  |  \x01 \x00
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                  QDCOUNT = 1                  |  \x00 \x01
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                  ANCOUNT = 0                  |  \x00 \x00
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                  NSCOUNT = 0                  |  \x00 \x00
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                  ARCOUNT = 0                  |  \x00 \x00
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
So we have a query with ID = e10c (just an arbitrary number so the client can match queries up with responses). QR = 0 indicates that it's a query, and Opcode = 0 indicates that it's a standard query. AA and TC are for responses. RD = 1 indicates that recursion is desired (we are making a recursive query to our local nameserver). Z is reserved for future use, and RCODE is a response code for responses. QDCOUNT = 1 indicates that we have 1 question; the remaining counts are the numbers of records in the answer, authority, and additional sections of a response.
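The question does not say which language is in use, but the byte-string notation looks like Python, so here is a minimal sketch, under that assumption, of unpacking the 12-byte header with the standard struct module. The function name parse_header and the dictionary keys are chosen for illustration only.

```python
import struct

# The query from the question, as raw bytes.
query = (b"\xe1\x0c\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
         b"\x06google\x03com\x00\x00\x01\x00\x01")


def parse_header(message):
    """Unpack the six 16-bit big-endian header fields and split out the flag bits."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12])
    return {
        "id": ident,
        "qr": (flags >> 15) & 0x1,      # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,  # 0 = standard query
        "aa": (flags >> 10) & 0x1,
        "tc": (flags >> 9) & 0x1,
        "rd": (flags >> 8) & 0x1,       # recursion desired
        "ra": (flags >> 7) & 0x1,
        "z": (flags >> 4) & 0x7,
        "rcode": flags & 0xF,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


header = parse_header(query)
print(hex(header["id"]), header["rd"], header["qdcount"])
```

The "!" in the format string selects network (big-endian) byte order, which is the order RFC 1035 uses for multi-byte fields.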
Now we come to the questions. Each has the following format:
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                     QNAME                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QTYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QCLASS                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
QNAME is the name the query is about. The format is a sequence of labels, each preceded by one octet giving the length of that label, terminated by a label with 0 length.
So we have:
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|        LEN = 6        |           g           |  \x06 g
|           o           |           o           |  o o
|           g           |           l           |  g l
|           e           |        LEN = 3        |  e \x03
|           c           |           o           |  c o
|           m           |        LEN = 0        |  m \x00
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   QTYPE = 1                   |  \x00 \x01
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                  QCLASS = 1                   |  \x00 \x01
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
This indicates that the name we are looking up is google.com (sometimes written as google.com., with the empty label at the end made explicit). QTYPE = 1 is an A (IPv4 address) record. QCLASS = 1 is an IN (internet) query. So this is asking for the IPv4 address of google.com.
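The same walk through the question section can be sketched in Python, again assuming that is the language in use. The function name parse_question is made up, and the sketch deliberately ignores RFC 1035 name compression pointers, which a plain query like this one does not use.

```python
import struct

# The query from the question, as raw bytes.
query = (b"\xe1\x0c\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
         b"\x06google\x03com\x00\x00\x01\x00\x01")


def parse_question(message, offset=12):
    """Read one question starting at offset (12 = just past the header).

    Returns (name, qtype, qclass, offset_after_question).
    Assumption: the name is uncompressed (no pointer labels).
    """
    labels = []
    while True:
        length = message[offset]       # one octet: length of the next label
        offset += 1
        if length == 0:                # zero-length label terminates the name
            break
        labels.append(message[offset:offset + length].decode("ascii"))
        offset += length
    # QTYPE and QCLASS are two 16-bit big-endian integers.
    qtype, qclass = struct.unpack("!2H", message[offset:offset + 4])
    return ".".join(labels), qtype, qclass, offset + 4


print(parse_question(query))
```

For this query the returned offset equals the message length, which confirms there is nothing after the single question.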