Splunk: Group by certain entry in log file - group-by

I did this query in Splunk:
source="/log/ABCD/cABCDXYZ/xyz.log" doSomeTasks|timechart partial=f span=1h count as "#XYZ doSomeTasks" |fillnull
This query works fine. Now I would like to group these results by another entry in my log file, taskType. taskType can be One, Two or Three, and that value appears right after taskType in each log entry.
How could I do this?

Add a by clause to your timechart:
source="/log/ABCD/cABCDXYZ/xyz.log" doSomeTasks
| timechart partial=f span=1h count as "#XYZ doSomeTasks" by taskType
| fillnull
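
To make clear what the by clause changes, here is a minimal Python sketch of the same bucketing: one count per hour and per taskType instead of one count per hour. The events list is hypothetical sample data, not taken from the log in the question, and unlike timechart with fillnull the sketch does not emit zero rows for empty buckets.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed events: (timestamp, taskType).
events = [
    (datetime(2024, 1, 1, 9, 5), "One"),
    (datetime(2024, 1, 1, 9, 40), "Two"),
    (datetime(2024, 1, 1, 9, 55), "One"),
    (datetime(2024, 1, 1, 10, 10), "Three"),
]


def count_by_hour_and_type(events):
    """Count events per (hour bucket, taskType), similar to span=1h ... by taskType."""
    counts = Counter()
    for ts, task_type in events:
        # Truncate the timestamp to the start of its hour to get the bucket.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        counts[(bucket, task_type)] += 1
    return counts


counts = count_by_hour_and_type(events)
for (bucket, task_type), n in sorted(counts.items()):
    print(bucket, task_type, n)
```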

Related

Count or Find Unique Values for a specific key

I feel like this shouldn't be that hard, but I'm having trouble getting a data structure in mind that will give me what I want. I have a large amount of data and I need to find the instances where there are multiple Secondary Identifiers as defined below.
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
That might not be a great example to use, but basically only two of the columns matter. What I'd really like is to return the Primary Identifiers where the unique count of Secondary Identifiers is greater than 1. I'm thinking a hashtable might be my best bet, but I tried doing something in a pipeline-oriented way and failed, so I'm wondering if there is an easier method or cmdlet that I haven't tried.
The final array (or hashtable) would be something like this:
ID Count of Secondary ID
----- ---------------------
11111 1
22222 1
33333 2
At that point, getting the instances of multiple would be as easy as $array | Where-Object {$_."Count of Secondary ID" -gt 1}
If this example sucks or what I'm after doesn't make sense, let me know and I can rewrite it; but it's almost like I need an implementation of Select-Object -Unique that would allow you to use two or more input objects/columns. Basically the same as Excel's remove duplicates and then selecting which headers to include, except there are too many rows to open in Excel.
Use Group-Object twice: first to group the objects by common Primary Identifier, then again to count the number of distinct Secondary Identifier values within each group:
$data = @'
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
'@ |ConvertFrom-Csv
$data |Group-Object 'Primary Identifier' |ForEach-Object {
    [pscustomobject]@{
        # Primary Identifier value will be the name of the group, since that's what we grouped by
        'Primary Identifier' = $_.Name
        # Use `Group-Object -NoElement` to count unique values - you could also use `Sort-Object -Unique`
        'Count of distinct Secondary Identifiers' = @($_.Group |Group-Object 'Secondary Identifier' -NoElement).Count
    }
}
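
For comparison, the same two-level grouping can be sketched outside PowerShell. The Python version below is only an illustration of the idea (collect the set of Secondary Identifiers seen per Primary Identifier, then take the size of each set); the function and variable names are made up for this example, and the data is the sample from the question.

```python
import csv
import io
from collections import defaultdict

CSV_TEXT = """Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
"""


def distinct_secondary_counts(csv_text):
    """Map each Primary Identifier to its number of distinct Secondary Identifiers."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen[row["Primary Identifier"]].add(row["Secondary Identifier"])
    return {primary: len(secondaries) for primary, secondaries in seen.items()}


counts = distinct_secondary_counts(CSV_TEXT)
# Keep only the Primary Identifiers with more than one distinct Secondary Identifier.
multiple = [primary for primary, n in counts.items() if n > 1]
print(counts, multiple)
```

Because only a set per Primary Identifier is kept in memory, the file can be streamed row by row, which matters when there are too many rows for Excel.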

Limit results on OR condition in Sphinx

I am trying to limit results by somehow grouping them.
This query attempt should make things clear:
@namee ("Cameras") limit 5| @namee ("Mobiles") limit 5| @namee ("Washing Machine") limit 5| @namee ("Graphic Cards") limit 5
where namee is the column.
Basically, I am trying to limit the results for each specific criterion.
Is this possible? Is there any alternative way of doing what I want to do?
I am on sphinx 2.2.9
There is no Sphinx syntax to do this directly.
The easiest would be to just run 4 separate queries and 'UNION' them in the application itself. Performance isn't going to be terrible.
... If you REALLY want to do it in Sphinx, you can exploit a couple of tricks to get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!), each with the same data, but with the field called something different. (They duplicate each other!) You would also need an attribute on each one (more on why later).
source str1 {
    sql_query = SELECT id, namee AS field1, 1 as idx FROM ...
    sql_attr_uint = idx
}
source str2 {
    sql_query = SELECT id, namee AS field2, 2 as idx FROM ...
    sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
Then you can run a single query to get all results kinda magically unioned...
MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
(The @@relaxed is important, as the fields are different. The matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute, the attribute identifies which term matched.
In Sphinx, there is a nice GROUP N BY where you only get a certain number of results from each attribute value, so you could do... (putting all that together)
SELECT *,WEIGHT() AS weight
FROM dist_index
WHERE MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
simples eh?
(Note: it only works if you want the same number of results, 4 here, from each index. If you want different limits, it is much more complicated!)
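
The simpler option mentioned at the start of this answer (separate queries, combined in the application) could look roughly like the Python sketch below. It makes no claim about any Sphinx client API: search is a hypothetical callable that takes a term and returns a ranked list of hits, and fake_search with its NAMES list is a stand-in so the example runs on its own.

```python
def limit_per_term(terms, search, limit=5):
    """Run one query per term and concatenate the first `limit` hits of each."""
    combined = []
    for term in terms:
        combined.extend(search(term)[:limit])
    return combined


# Stand-in for a real search call: substring match over a hypothetical list of names.
NAMES = ["Cameras A", "Cameras B", "Cameras C", "Mobiles A", "Graphic Cards A"]


def fake_search(term):
    return [name for name in NAMES if term in name]


results = limit_per_term(["Cameras", "Mobiles"], fake_search, limit=2)
print(results)
```

With this approach each term can also get its own limit, which is the case the GROUP N BY trick does not handle.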

How to match rows with one or more words in query, but without any words not in query?

I have a table in a MySQL database that has a list of comma separated tags in it.
I want users to be able to enter a list of comma separated tags and then use Sphinx or MySQL to select rows that have at least one of the tags in the query but not any tags the query doesn't have.
The query can have additional tags that are not in the rows, but the rows should not be matched if they have tags not in the query.
I either want to use Sphinx or MySQL to do the searching.
Here's an example:
creatures:
----------------------------
| name | tags |
----------------------------
| cat | wily,hairy |
| dog | cute,hairy |
| fly | ugly |
| bear | grumpy,hungry |
----------------------------
Example searches:
wily,hairy <-- should match cat
cute,hairy,happy <-- should match dog
happy,cute <-- no match (dog has hairy)
ugly,yuck,gross <-- should match fly
hairy <-- no match (dog has cute, cat has wily)
grumpy <-- no match (bear has hungry)
grumpy,hungry <-- should match bear
wily,grumpy,hungry <-- should match bear
Is it possible to do this with Sphinx or MySQL?
To reiterate, the query will be a list of comma separated tags and rows that have at least one of the entered tags but not any tags the query doesn't have should be selected.
Sphinx's expression ranker should be able to do this.
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1') AND w > 0
OPTION ranker=expr('IF(word_count>=tags_len,1,0)');
Basically, you want the number of matched tags never to be less than the number of tags in the row.
Note this just gives every matching document a weight of 1. If you want more elaborate ranking (e.g. to also match other keywords), it gets more complicated.
You need index_field_lengths enabled on the index to get the tags_len attribute.
(The same concept is obviously possible in MySQL, probably using FIND_IN_SET to do the matching, and either a second column to store the number of tags, or computing the number of tags with, say, the REPLACE function.)
Edit to add details about multiple fields:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);
The SUM function is run for each field in turn, so you need to use the user_weight system to be able to distinguish which field is currently being enumerated.
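
The rule behind the single-field ranker expression (a row matches when it has at least one tag and every one of its tags appears in the query) can be checked against the examples from the question with a small Python sketch. This only illustrates the matching logic, not how Sphinx evaluates it; matching_rows and CREATURES are names made up for the example.

```python
# The sample table from the question: name -> comma separated tags.
CREATURES = {
    "cat": "wily,hairy",
    "dog": "cute,hairy",
    "fly": "ugly",
    "bear": "grumpy,hungry",
}


def matching_rows(query, rows=CREATURES):
    """Return names whose tags are a non-empty subset of the query tags."""
    wanted = {tag.strip() for tag in query.split(",") if tag.strip()}
    matches = []
    for name, tags in rows.items():
        row_tags = {tag for tag in tags.split(",") if tag}
        # Every tag on the row must be in the query; extra query tags are fine.
        if row_tags and row_tags <= wanted:
            matches.append(name)
    return matches


print(matching_rows("cute,hairy,happy"))
print(matching_rows("hairy"))
```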

Return rows where words match in two columns OR in match in one column and the other column is empty?

This is a follow-up to another question I recently asked.
I currently have a SphinxQL query like this:
SELECT * FROM my_index
WHERE MATCH(\'@field1 "a few words"/1 @field2 "more text here"/1\')
However, I would still like it to match rows in the case where one of the fields in the row is empty.
For example, let's say the following rows exist in the database:
field1        | field2
-----------------------------
words in here | text in here
              | text in here
The above query would match the first row, but it would not match the second row because the quorum operator specifies that there has to be one or more matches for each field.
Is what I'm asking possible?
The actual query I'm trying to make this work with was provided in Barry Hunter's answer to my previous question:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);
The first problem is that Sphinx doesn't index "empty", so you can't search for it. (Well, actually the field_len attribute will be zero, but it can be hard to combine an attribute filter with MATCH().)
... so arrange for empty to be something that gets indexed:
sql_query = SELECT id,...,IF(tags='','_empty_',tags) AS tags FROM ...
Then modify the query. As it happens, your quorum search is easy!
@field1 "a few words _empty_"/1
It's just another word. A more complex query would just have to be OR'ed with the word.
Then there is making it work within your complex query. But as luck would have it, it's really easy. _empty_ is just another word, and when the field is empty, that one word will match. (i.e. the field contains no words that are not in the query)
So just add _empty_ into the two quorums and you're done!
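
A rough Python sketch of the placeholder idea, under simplifying assumptions: a field is "indexed" as a set of whitespace-separated words, an empty field is indexed as the single word _empty_, and the quorum /1 is modelled as "at least one query word is present". The helper names are hypothetical and this does not reproduce Sphinx tokenization or ranking.

```python
EMPTY = "_empty_"


def index_words(field_text):
    """Words stored for a field; an empty field is stored as the placeholder word."""
    words = field_text.split()
    return set(words) if words else {EMPTY}


def quorum_match(field_text, query_words, minimum=1):
    """True when at least `minimum` query words (plus the placeholder) are in the field."""
    indexed = index_words(field_text)
    return len(indexed & (set(query_words) | {EMPTY})) >= minimum


def row_matches(row, queries):
    """Every queried field must satisfy its own quorum."""
    return all(quorum_match(row[field], words) for field, words in queries.items())


queries = {"field1": ["a", "few", "words"], "field2": ["more", "text", "here"]}
rows = [
    {"field1": "words in here", "field2": "text in here"},
    {"field1": "", "field2": "text in here"},
    {"field1": "other stuff", "field2": "text in here"},
]
print([row_matches(row, queries) for row in rows])
```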

CASE in JOIN not working PostgreSQL

I got the following tables:
Teams
Matches
I want to get an output like:
matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
1 AMERICA CRUZ AZUL AMERICA
1 SANTOS MORELIA MORELIA
1 LEON CHIVAS LEON
The two teams.nom_equipo columns come from matches.num_eqpo_loc and matches.num_eqpo_vis, each of which references the teams table so that the name of each team can be looked up by its id.
Edit: I have used the following:
SELECT m.semana, t_loc.nom_equipo AS LOCAL, t_vis.nom_equipo AS VISITANTE,
CASE WHEN m.goles_loc > m.goles_vis THEN 'home'
WHEN m.goles_vis > m.goles_loc THEN 'visitor'
ELSE 'tie'
END AS Vencedor
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
But as you can see in the Matches table, row #5 has 2 in both the goles_loc (home team) and goles_vis (visitor) columns, which is a tie, yet when I run the code I get something that is not a tie:
Matches' score
Resultset from Select:
I also noticed that from row #5 onward the names of both teams (visitor and home) in the matches are not correct.
So the SELECT brings back correct data, but in a different order from the original order of the matches table.
The order for the second week must be:
row | matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
5   | 2              | CRUZ AZUL        | TOLUCA           | TIE
6   | 2              | MORELIA          | LEON             | LEON
7   | 2              | CHIVAS           | SANTOS           | TIE
Row 8 from the Resultset must be Row #5, and so on.
Any help would be greatly appreciated!
When doing a SELECT which includes null for a column, that's the value it will always be, so winner in your case will never be populated.
Something like this is probably more along the lines of what you want:
SELECT m.semana, t_loc.nom_equipo AS loc_equipo, t_vis.nom_equipo AS vis_equipo,
       CASE WHEN m.goles_loc - m.goles_vis > 0 THEN t_loc.nom_equipo
            WHEN m.goles_vis - m.goles_loc > 0 THEN t_vis.nom_equipo
            ELSE NULL
       END AS winner
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
Untested, but this should provide the general approach. Basically, you JOIN to the teams table twice, but using different conditions, and then you need to calculate the scores. I'm using NULL to indicate a tie, here.
Edit in response to comment from OP:
It's the same table, teams, but the JOINs produce different results because the query uses a different JOIN condition in each JOIN.
The first JOIN, for t_loc, compares m.num_eqpo_loc to t_loc.num_eqpo. This means it gets the teams row for the home team.
The second JOIN, for t_vis, compares m.num_eqpo_vis to t_vis.num_eqpo. This means it gets the teams row for the visiting team.
Therefore, in the CASE expression, t_loc refers to the home team and t_vis refers to the visiting one, so the correct name can be picked for the winner.
Edit in response to follow-up comment from OP:
My original query was sorting by m.semana only, which means rows with the same semana can appear in any order (essentially whichever Postgres feels is most efficient).
If you want the result to be sorted exactly the same way as the matches table, then order by the same tuple of columns that the matches table is ordered by.
So, the ORDER BY clause would then become:
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
Basically, the PRIMARY KEY tuple of the matches table.
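
To try the double JOIN and the extended ORDER BY without a Postgres server, here is a small self-contained sketch using Python's built-in sqlite3 module as a stand-in. The table and column names follow the question; the team ids and the two sample matches are invented for the example, and 'TIE' is used instead of NULL for a draw.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (num_eqpo INTEGER PRIMARY KEY, nom_equipo TEXT);
CREATE TABLE matches (
    semana INTEGER, num_eqpo_loc INTEGER, num_eqpo_vis INTEGER,
    goles_loc INTEGER, goles_vis INTEGER
);
-- Hypothetical ids and scores, for illustration only.
INSERT INTO teams VALUES (1, 'AMERICA'), (2, 'CRUZ AZUL'), (3, 'TOLUCA');
INSERT INTO matches VALUES (1, 1, 2, 3, 1), (2, 2, 3, 2, 2);
""")

# teams is joined twice under two aliases: once for the home side, once for the visitor.
rows = conn.execute("""
SELECT m.semana, t_loc.nom_equipo, t_vis.nom_equipo,
       CASE WHEN m.goles_loc > m.goles_vis THEN t_loc.nom_equipo
            WHEN m.goles_vis > m.goles_loc THEN t_vis.nom_equipo
            ELSE 'TIE'
       END AS winner
FROM matches AS m
JOIN teams AS t_loc ON m.num_eqpo_loc = t_loc.num_eqpo
JOIN teams AS t_vis ON m.num_eqpo_vis = t_vis.num_eqpo
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
""").fetchall()

for row in rows:
    print(row)
```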